A voice agent that works in a demo is a different product from one that works in production. The demo only has to handle the happy path with a friendly tester. Production has to handle 3am calls from frustrated customers, edge cases your prompt didn't anticipate, third-party API outages, and the long tail of weird inputs that real life produces. This is the checklist for the gap between the two.

TL;DR

Production-ready voice agents have an evaluation harness, robust escalation, monitoring, and someone whose job includes operating them.
Three failure modes kill most pilots: no graceful escalation, no logging, no owner.
The technical bar is doable in 2026; the operational bar is what separates winners from churners.

The 12-item checklist

Walk through these before flipping any voice agent to live traffic.

1. Defined success criteria

Before deployment, you should have a one-paragraph answer to: what does this agent need to do, for what percentage of calls, with what fallback? Without this, you'll never know if it's working.

2. An evaluation harness

A way to grade real calls. The minimum: a sample of 20–50 calls per week, scored on a rubric (correctness, tone, escalation appropriateness, latency). The bar isn't fancy tooling; it's discipline. See how to A/B test voice agent prompts.
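A spreadsheet is enough, but the weekly roll-up is easy to script. The sketch below assumes each reviewed call gets a 1 to 5 score on the four rubric dimensions named above; `CallScore`, `weekly_summary`, and the pass mark of 4 are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from statistics import mean

# Rubric dimensions taken from the checklist item above.
RUBRIC = ("correctness", "tone", "escalation", "latency")


@dataclass
class CallScore:
    call_id: str
    scores: dict[str, int]  # rubric dimension -> reviewer score, 1 to 5


def weekly_summary(calls: list[CallScore], pass_mark: int = 4) -> dict[str, float]:
    """Average each dimension and report the share of calls that pass all of them."""
    if not calls:
        raise ValueError("no scored calls this week")
    summary = {dim: round(mean(c.scores[dim] for c in calls), 2) for dim in RUBRIC}
    passing = sum(all(c.scores[dim] >= pass_mark for dim in RUBRIC) for c in calls)
    summary["pass_rate"] = round(passing / len(calls), 2)
    return summary
```

Tracking the per-dimension averages week over week is what makes a prompt change that helps tone but hurts correctness visible.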

3. Graceful escalation

When the agent can't handle a call, it should hand off cleanly. Escalation criteria written down. Transfer mechanism tested. Context summary generated for the human agent receiving the transfer. Most agent failures are escalation failures, not understanding failures.
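As a rough illustration of "criteria written down" and "context summary generated," the sketch below encodes two assumed escalation rules and builds the payload a human agent would receive. The field names, the two-failed-turns limit, and the reason labels are hypothetical; your own criteria go here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallState:
    caller_id: str
    intent: Optional[str] = None
    failed_turns: int = 0          # consecutive turns the agent could not resolve
    asked_for_human: bool = False
    transcript: list = field(default_factory=list)


def escalation_reason(state: CallState, max_failed_turns: int = 2) -> Optional[str]:
    """Return why the call should be handed off, or None to keep going."""
    if state.asked_for_human:
        return "caller_requested_human"
    if state.failed_turns >= max_failed_turns:
        return "agent_stuck"
    return None


def handoff_context(state: CallState, reason: str, last_n: int = 4) -> dict:
    """Build the summary the receiving human sees before picking up."""
    return {
        "caller_id": state.caller_id,
        "intent": state.intent or "unknown",
        "reason": reason,
        "recent_turns": state.transcript[-last_n:],
    }
```

Keeping the criteria in one function means they can be reviewed, tested, and changed without touching the prompt.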

4. Logging and observability

Every call gets a transcript, a duration, a status, and any errors logged centrally. Tool-call payloads recorded. Latency tracked at p50/p95/p99. If something goes wrong, you can pull up the call and see what happened.
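If your logging stack doesn't compute percentiles for you, they are simple to derive from raw per-turn latencies. This is a minimal sketch using the nearest-rank method (one of several percentile definitions); the function names are illustrative.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize response latencies at the three percentiles worth tracking."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

The p99 is the number callers feel: a healthy median can hide a tail of multi-second pauses.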

5. Cost monitoring

Track per-minute cost broken down by component (telephony, STT, LLM, TTS), with an alert when it crosses a budget. It is easy to overspend without realizing it.
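A per-component breakdown can be as small as the sketch below. The per-minute rates are made-up placeholders, not real vendor pricing, and the function names are hypothetical; substitute the figures from your own invoices.

```python
from typing import Optional

# Placeholder per-minute rates in USD. These are assumptions, not real prices.
RATES_PER_MINUTE = {"telephony": 0.010, "stt": 0.006, "llm": 0.020, "tts": 0.015}


def cost_breakdown(minutes: float, rates: dict = RATES_PER_MINUTE) -> dict:
    """Cost of one call by component, plus the total."""
    breakdown = {component: minutes * rate for component, rate in rates.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


def budget_alert(breakdown: dict, budget_per_call: float) -> Optional[str]:
    """Return an alert message naming the costliest component, or None if within budget."""
    if breakdown["total"] <= budget_per_call:
        return None
    worst = max((c for c in breakdown if c != "total"), key=breakdown.get)
    return (
        f"call cost ${breakdown['total']:.2f} exceeds budget "
        f"${budget_per_call:.2f}; largest component: {worst}"
    )
```

Naming the largest component in the alert saves a step when deciding whether to shorten prompts, change models, or renegotiate telephony.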

6. Failure-mode handling

What happens when:

The LLM endpoint is slow? (Cap timeout, escalate.)
The CRM lookup fails? (Tell the caller, log, escalate.)
The audio quality is bad? (Detect, ask the caller to repeat or switch channels.)
The caller's account isn't found? (Predefined response.)

Each failure mode needs an explicit handler. Without them, the agent's behavior is whatever the LLM hallucinates in the moment.
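One way to make the handlers explicit is a lookup table of predefined replies plus a hard timeout around the model call. The sketch below uses `asyncio.wait_for` for the cap; the reply wording, the 2-second default, and which failures escalate are assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Reply:
    say: str
    escalate: bool


# One predefined reply per failure mode, so the model never improvises here.
FALLBACKS = {
    "llm_timeout": Reply("I'm having trouble on my end. Let me connect you with a colleague.", True),
    "crm_lookup_failed": Reply("I can't reach your account details right now. I'll transfer you.", True),
    "bad_audio": Reply("I'm having trouble hearing you. Could you repeat that?", False),
    "account_not_found": Reply("I couldn't find an account with those details.", False),
}


async def reply_or_fallback(
    generate: Callable[[], Awaitable[str]], timeout_s: float = 2.0
) -> Reply:
    """Wait for the model up to timeout_s, then fall back to the scripted reply."""
    try:
        text = await asyncio.wait_for(generate(), timeout=timeout_s)
        return Reply(text, escalate=False)
    except asyncio.TimeoutError:
        return FALLBACKS["llm_timeout"]
```

The same pattern applies to the CRM lookup: wrap the call, catch the failure, and return the scripted entry instead of letting the error reach the prompt.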

7. Compliance and disclosure

For inbound calls, you need to disclose that the call may be recorded. For outbound, you need TCPA compliance (or your jurisdiction's equivalent). Have legal review the disclosure language. See TCPA compliance for AI-powered outbound calls.

8. Data handling and retention

What gets stored, for how long, with what access controls. PII redaction in logs. Audio retention policy documented. If you're in healthcare, HIPAA compliance assessed.
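For PII redaction in logs, a first pass can be a pair of regular expressions applied before a transcript is written. The patterns below are deliberately simple assumptions: they catch written-out emails and digit-based phone numbers, and will miss numbers spoken as words, names, and addresses.

```python
import re

# Illustrative patterns only; real redaction needs broader coverage and review.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{8,}\d"), "[PHONE]"),
]


def redact(text: str) -> str:
    """Replace matched PII with a placeholder before the text reaches the logs."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("Reach me at jane@example.com or 415-555-0132."))
# prints: Reach me at [EMAIL] or [PHONE].
```

Redacting at write time, rather than at read time, means the raw values never land in central storage in the first place.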

9. A specific owner

One person whose job description includes operating this agent. Not "the team." A person. Without an owner, quality drifts within weeks.

10. A regression playbook

When the agent's behavior changes (you tweaked the prompt, swapped the model), how do you verify nothing broke? At minimum, replay 20 historical calls through the new version and compare.
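A replay can be sketched as running the same historical inputs through both versions and listing where they disagree. Because model output text varies from run to run, this sketch assumes each agent version is wrapped to return a coarse outcome label (for example "resolve" or "escalate") rather than raw text; the `Agent` type and the dictionary keys are hypothetical.

```python
from typing import Callable

# An agent version, wrapped so it maps a caller utterance to an outcome label.
Agent = Callable[[str], str]


def replay(calls: list[dict], old: Agent, new: Agent) -> list[dict]:
    """Run historical utterances through both versions and collect the disagreements."""
    changes = []
    for call in calls:
        before = old(call["utterance"])
        after = new(call["utterance"])
        if before != after:
            changes.append({"call_id": call["call_id"], "before": before, "after": after})
    return changes
```

Not every difference is a regression, so the output is a review queue for a human rather than a pass/fail verdict.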

11. Channel and load testing

The agent works on a quiet test line. Will it work at 200 calls/hour? On a flaky cell connection? When the LLM provider has a regional outage? Test before launch, not after.

12. A kill switch

A way to immediately revert to the prior version (or to "all calls go to humans") if quality drops. Tested in staging. Triggerable by someone other than the engineer who built it.
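At its simplest, a kill switch is a routing flag that the call entry point checks before anything else. The sketch below keeps the flag in memory for illustration; in practice it would live in shared config or a feature-flag service so that someone other than the builder can flip it. The class and mode names are assumptions.

```python
from datetime import datetime, timezone
from enum import Enum


class Mode(Enum):
    CURRENT = "current_version"
    PREVIOUS = "previous_version"
    HUMANS_ONLY = "humans_only"


class KillSwitch:
    def __init__(self) -> None:
        self.mode = Mode.CURRENT
        self.audit_log: list[dict] = []

    def trip(self, mode: Mode, triggered_by: str, reason: str) -> None:
        """Change routing immediately and record who did it and why."""
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.mode.value,
            "to": mode.value,
            "by": triggered_by,
            "reason": reason,
        })
        self.mode = mode

    def route(self) -> str:
        """Where the next incoming call should go."""
        return self.mode.value
```

The audit log matters as much as the flag: after an incident you want to know when traffic moved and on whose call.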

The three failure modes that kill pilots

Across many launches, three recurring patterns sink projects:

No graceful escalation. The agent gets stuck and doesn't know it. Caller has to demand a human. Human gets the call with no context. Caller is angry. Internal team blames the AI.

No logging. Something goes wrong but no one can debug it because the call wasn't captured. Quality regresses silently. Stakeholders lose faith.

No owner. The agent ships, the launch team moves on, no one's tracking quality. Three months later it's degraded and no one noticed.

Solve these three and you've solved 80% of what kills voice agent deployments.

What "good enough" looks like

A reasonable production-ready bar for a typical inbound support agent in 2026:

Resolves 60%+ of calls without human handoff
Median latency under 600ms
p99 latency under 1.5 seconds
CSAT within 5–10 points of human-handled calls
Less than 1% of calls have a clear failure (audio glitch, agent confusion, wrong info)
Per-call cost under $0.50
Owner reviewing 20 calls/week

If you fall significantly short of any of these, you have work to do before scaling traffic.
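The bar above can be written down as a small gate that runs against your weekly metrics. The thresholds mirror the list; the metric names, and reading "within 5–10 points" as a CSAT gap of at most 10, are assumptions of this sketch.

```python
import operator

# (comparison, limit) per metric, taken from the list above.
THRESHOLDS = {
    "resolution_rate": (">=", 0.60),
    "median_latency_ms": ("<", 600),
    "p99_latency_ms": ("<", 1500),
    "csat_gap_points": ("<=", 10),
    "failure_rate": ("<", 0.01),
    "cost_per_call_usd": ("<", 0.50),
    "calls_reviewed_per_week": (">=", 20),
}

OPS = {">=": operator.ge, "<": operator.lt, "<=": operator.le}


def unmet_criteria(metrics: dict) -> list[str]:
    """Names of the thresholds the current metrics fail to meet."""
    return [
        name
        for name, (op, limit) in THRESHOLDS.items()
        if not OPS[op](metrics[name], limit)
    ]
```

An empty list means the agent clears the bar; anything else is the to-do list before scaling traffic.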

What you don't need on day one

A few things teams overinvest in early:

A perfect persona. Good enough is fine; iterate post-launch.
A massive knowledge base. Start with the top 20 questions; add more based on what actually gets asked.
Multi-language support. Pick one language. Add more once the first works.
Custom voice cloning. A stock voice from your TTS provider is fine for the pilot.
Real-time analytics dashboards. A weekly CSV is plenty until you have volume.

Ship narrow, learn fast, expand from there.

FAQ

How long does production-readying a voice agent take? For a single bounded use case, 4–8 weeks from "demo works" to "running at scale." Most of that is operational, not technical.

Can I skip the eval harness? Yes, and you'll regret it. Without evals you can't tell when changes hurt quality.

What's the simplest eval setup? A spreadsheet with 50 real call transcripts, a 1–5 score per turn on accuracy and tone, and a weekly review.

Do I need a human in the loop? For most use cases, only for escalations. Some teams keep a "shadow mode" where humans review every Nth call for the first month after launch.

What if I don't have an owner to assign? You're not ready to deploy a voice agent. Either find one or postpone.
