🧠 Conversational AI & LLMs

Guardrails for Voice Agents: A Pragmatic Take

Guardrails are the rules that prevent your voice agent from doing things it shouldn't — agreeing to refunds it can't authorize, giving medical advice, leaking PII, or making up policies.

Tyler Weitzman
Tyler Weitzman
January 17, 2026 · 6 min read
Speechify

Guardrails are the rules that prevent your voice agent from doing things it shouldn't — agreeing to refunds it can't authorize, giving medical advice, leaking PII, or making up policies. The space is over-hyped (most production agents don't need a separate "guardrails platform") and under-implemented (most production agents don't have enough). This is the practical middle.

TL;DR

  • The first guardrail is your system prompt. Use it ruthlessly.
  • The second guardrail is structured function calling. The agent can't approve a refund if there's no approve_refund function.
  • For high-stakes use cases, a separate guardrails layer (a smaller model that screens responses) helps.
  • Don't reach for fancy tooling before exhausting prompt + tool design.

The three layers of guardrails

Layer 1: prompt rules

Most guardrails belong in the system prompt. Examples:

- Never quote a price unless you've called get_pricing.
- Never agree to a refund. Always escalate refund requests
  to a human via transfer_to_human.
- Never give medical, legal, or financial advice. If asked,
  recommend the caller speak to a qualified professional.
- Never claim to be a human. If asked directly, disclose
  that you are an AI assistant.

These work because LLMs are pretty good at following clear instructions when the instructions are explicit and the use case is bounded.

What makes prompt rules unreliable:

  • Vague language ("be careful with promises")
  • Conflicting rules
  • Long lists where rules at the bottom get less attention
  • No examples of correct behavior

Keep rules short, explicit, and ideally with one example each.

Layer 2: structured function calling

The structural guardrail. The agent can only do what your functions let it do.

Examples:

  • No cancel_subscription function → the agent can't cancel a subscription. The worst it can do is say it'll do it (which is bad) but it physically can't.
  • approve_refund(amount) with a parameter limit (max_amount: 100) → the agent can't approve refunds over $100.
  • All sensitive actions require a confirmation step.

This is your most reliable guardrail. The LLM might be persuaded to say it'll do something it shouldn't, but it can't actually execute it without a function.

For the design pattern, see function calling for voice agents: a practical guide.

Layer 3: external guardrails

For high-stakes use cases, a separate model screens the LLM's outputs before they reach the user.

Examples:

  • A small classifier checks whether the agent's reply contains PII it shouldn't.
  • A topic detector flags responses about legal/medical/financial advice.
  • A sentiment monitor flags responses that sound dismissive or hostile.

When triggered, the response gets blocked or modified, and the call may escalate.

Use this layer if:

  • Your use case has serious legal exposure (healthcare, finance).
  • You've seen prompt-only guardrails fail in production.
  • You need auditable evidence that screening happened.

Don't use it if:

  • Your use case is bounded enough that prompt + function-call guardrails suffice.
  • You're early in your build (don't add complexity prematurely).

What guardrails actually fail at

Honest list of patterns where guardrails struggle:

Adversarial users. A user who actively tries to jailbreak the agent ("ignore previous instructions"). Modern LLMs are pretty resistant to this but not bulletproof.

Edge cases the prompt didn't anticipate. "Can you waive the late fee just this once?" — a prompt that says "never give discounts" might allow this if the rule wasn't tight enough.

Subtle policy drift. The agent gradually starts saying things slightly outside policy. Drift is hard to detect without continuous evaluation.

Conflicting instructions. "Be empathetic" and "Don't make commitments" can conflict in upset-customer scenarios. The agent picks one or fudges both.

A pragmatic rule set

For most voice agents in 2026, a baseline guardrail set:

  1. Identity disclosure. "If asked, disclose that you are an AI assistant for [Company]."
  2. Scope discipline. "Only handle [defined topics]. For other topics, escalate."
  3. No off-the-cuff commitments. "Never agree to refunds, discounts, or policy exceptions. Escalate."
  4. PII handling. "If the caller shares PII not needed for the task, don't store or repeat it back."
  5. Safety topics. "If the caller mentions self-harm, abuse, or emergency, immediately escalate to a human or 911."
  6. Professional advice. "Do not provide medical, legal, or financial advice. Recommend qualified professionals."
  7. Brand-specific risks. Anything specific to your industry (HIPAA, fair lending, etc.).

Plus the structural guardrail of well-designed functions.

That's enough for most production agents. Add more as you observe specific failures.

Detecting guardrail failures

Three signals to monitor:

Prompt rule violations. Did the agent break a rule from the system prompt? Catch with eval grading.

Out-of-scope responses. Did the agent answer a question it shouldn't have? Tag in your eval set.

Sensitive topic mentions. Did the agent discuss something flagged (medical, legal, etc.)? Detect with a classifier on the transcript.

When you detect failures, look at the call. Was it a one-off LLM blip? A prompt gap? A user actively trying to break the agent? Each requires a different fix.

Red-teaming

For high-stakes deployments, periodically red-team the agent. Have someone try to:

  • Get it to make commitments it shouldn't.
  • Extract info it shouldn't share.
  • Convince it to ignore prompt rules.
  • Get it to behave inappropriately in edge scenarios.

Document what works; tighten the prompt or add structural guardrails. Repeat quarterly.

For more, see red-teaming your voice agent.

FAQ

Do I need a dedicated guardrails platform? For most use cases, no. Prompt + function design covers it. For HIPAA/finance use cases, maybe.

What's the most common guardrail failure? The agent making a commitment it shouldn't ("yes, I can refund that"). Almost always preventable with explicit prompt rules + no approve_refund function.

How do I keep the agent from giving advice? Explicit prompt rule + redirect language ("I can't advise on that — I'd recommend you speak with a qualified professional"). Works well.

Can the LLM bypass its own guardrails if pushed? Modern LLMs are reasonably robust. The structural guardrails (no function = can't do it) are more reliable than prompt-level rules.

Should I disclose to users what guardrails are in place? For high-stakes use cases, yes — "I can't approve refunds; let me get you to someone who can" sets clear expectations.

Tyler Weitzman
Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.

More from Tyler Weitzman

View all →

Related reading

Voice AI, twice a month.

Get the best of the SIMBA resources hub — new articles, trend notes, and operator guides. No spam.