How to Measure Voice Agent Quality
Most voice agent teams measure the wrong things. They watch deflection rate and call duration; they ignore the quality of what happened inside the call. The result: agents that look good on dashboards and feel bad on the phone. This is the small set of metrics that actually correlate with quality, and the workflows to capture them.
TL;DR
- Quality has three dimensions: correctness, conversational feel, and operational health.
- The single most useful metric: a 1–5 human-graded score on a sample of real calls per week.
- Automated metrics (latency, deflection, AHT) are necessary but not sufficient.
- Measure CSAT gap between AI and human calls, not absolute CSAT.
The three quality dimensions
Quality breaks into:
Correctness. Did the agent do the right thing? Did it look up the right account, book the right time, give the right answer?
Conversational feel. Did it sound natural? Did it handle interruptions gracefully? Did it know when to escalate?
Operational health. Latency, error rate, cost per call, p99s.
Each needs different instrumentation.
Correctness metrics
The hardest to automate. The pragmatic approach:
Human-graded sample. Pull 20–50 calls per week. Score each turn (or each call) from 1 (clearly wrong) to 5 (could not have been better), based on:
- Did the agent call the right tools?
- Did it give correct information?
- Did it handle the user's actual intent?
Track the median score over time.
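A minimal sketch of tracking the weekly median, assuming grader scores are exported as (week, call id, score) tuples. The record layout and the sample data are hypothetical, not a format any particular platform produces.

```python
from collections import defaultdict
from statistics import median

# Hypothetical grading records: (ISO week label, call id, score from 1 to 5).
grades = [
    ("2026-W01", "call-001", 4),
    ("2026-W01", "call-002", 2),
    ("2026-W01", "call-003", 5),
    ("2026-W02", "call-004", 3),
    ("2026-W02", "call-005", 4),
]


def weekly_median_scores(records):
    """Group 1-5 grader scores by week and return the median for each week."""
    by_week = defaultdict(list)
    for week, _call_id, score in records:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        by_week[week].append(score)
    return {week: median(scores) for week, scores in sorted(by_week.items())}


print(weekly_median_scores(grades))
```

The median is less sensitive than the mean to one disastrous call, which is why it suits a small weekly sample.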
Resolution rate. Percentage of calls where the agent achieved the call's goal (booking made, ticket resolved, lead qualified) without human handoff. Should be 60–80% for mature use cases.
Escalation appropriateness. Of the calls that did escalate, were they correctly escalated? (Not "could the agent have handled it" but "given the agent's confidence, was escalating the right move?")
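Resolution rate and escalation appropriateness can both be computed from per-call outcome labels. The sketch below assumes a hypothetical `CallOutcome` record in which a human grader has marked each escalated call as appropriate or not; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallOutcome:
    goal_achieved: bool                            # booking made, ticket resolved, etc.
    escalated: bool                                # handed off to a human
    escalation_appropriate: Optional[bool] = None  # grader label, escalated calls only


def resolution_rate(calls):
    """Share of all calls where the goal was achieved without a human handoff."""
    if not calls:
        return 0.0
    resolved = sum(1 for c in calls if c.goal_achieved and not c.escalated)
    return resolved / len(calls)


def escalation_appropriateness(calls):
    """Share of graded escalations judged correct; None if none were graded."""
    graded = [c for c in calls if c.escalated and c.escalation_appropriate is not None]
    if not graded:
        return None
    return sum(1 for c in graded if c.escalation_appropriate) / len(graded)


sample = [
    CallOutcome(goal_achieved=True, escalated=False),
    CallOutcome(goal_achieved=True, escalated=False),
    CallOutcome(goal_achieved=True, escalated=False),
    CallOutcome(goal_achieved=False, escalated=True, escalation_appropriate=True),
    CallOutcome(goal_achieved=False, escalated=True, escalation_appropriate=False),
]
print(resolution_rate(sample), escalation_appropriateness(sample))  # prints: 0.6 0.5
```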
Conversational feel metrics
Harder to quantify but critical to track.
Repair rate. How many turns per call did the user have to correct the agent? Lower is better.
Backchannel handling. Did the agent keep talking when the user said "uh-huh" or "right"? (Stopping is a false positive: backchannels aren't real interruptions.)
Awkward silences. Number of times in a call where the agent took >2 seconds to respond. Lower is better.
Tone match. Subjective: does the agent's tone match the brand? Score weekly on the human-grader sample.
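Repair rate and awkward silences are the two feel metrics that are easiest to count from a call log. The sketch below assumes each turn carries start and end timestamps plus a hypothetical `is_repair` flag set by a grader or classifier; none of these field names come from a specific platform.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str             # "user" or "agent"
    start_s: float           # seconds from call start
    end_s: float
    is_repair: bool = False  # hypothetical label: the user corrected the agent


def count_awkward_silences(turns, threshold_s=2.0):
    """Count agent replies that began more than threshold_s after the user stopped."""
    silences = 0
    for prev, cur in zip(turns, turns[1:]):
        user_then_agent = prev.speaker == "user" and cur.speaker == "agent"
        if user_then_agent and cur.start_s - prev.end_s > threshold_s:
            silences += 1
    return silences


def count_repairs(turns):
    """Count user turns flagged as corrections of the agent."""
    return sum(1 for t in turns if t.speaker == "user" and t.is_repair)


call = [
    Turn("agent", 0.0, 2.0),
    Turn("user", 2.5, 5.0),
    Turn("agent", 5.4, 8.0),
    Turn("user", 8.5, 10.0, is_repair=True),
    Turn("agent", 12.6, 15.0),
]
print(count_awkward_silences(call), count_repairs(call))  # prints: 1 1
```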
Operational health metrics
Easy to automate; necessary baseline.
Latency. End-to-end response time. Track median, p95, p99. Median should be under 700ms; p99 under 1500ms.
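A small sketch of the latency check, using a nearest-rank percentile so it needs nothing beyond the standard library. The sample latencies are made up, and the targets are the 700ms and 1500ms figures above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms, median_target_ms=700, p99_target_ms=1500):
    """Summarize end-to-end response times against the median and p99 targets."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p99_ms": p99,
        "median_ok": p50 < median_target_ms,
        "p99_ok": p99 < p99_target_ms,
    }


# Hypothetical per-turn response times in milliseconds.
latencies = [520, 610, 480, 700, 650, 590, 1600, 560, 630, 540]
print(latency_report(latencies))
```

With only ten samples, p95 and p99 both land on the single slow turn, which is a reminder that tail percentiles need a large sample to mean much.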
Error rate. Calls that failed to complete due to system errors (LLM timeout, telephony failure, etc.). Should be under 0.5%.
Cost per call. Total infrastructure cost / call count. Should be under $0.50 for a typical 3-minute support call.
Audio quality issues. Calls where audio glitches or recognition failures occurred. Should be under 2%.
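The remaining operational checks are simple ratios against thresholds. This sketch assumes weekly totals are available as a hypothetical `WeeklyOps` record; the default limits are the figures quoted above.

```python
from dataclasses import dataclass


@dataclass
class WeeklyOps:
    total_calls: int
    system_error_calls: int   # failed on LLM timeout, telephony failure, etc.
    audio_issue_calls: int    # audio glitches or recognition failures
    infra_cost_usd: float     # total infrastructure cost for the week


def ops_health(week, max_error_rate=0.005, max_cost_per_call=0.50,
               max_audio_issue_rate=0.02):
    """Compute error rate, cost per call, and audio issue rate with pass/fail flags."""
    if week.total_calls <= 0:
        raise ValueError("total_calls must be positive")
    error_rate = week.system_error_calls / week.total_calls
    cost_per_call = week.infra_cost_usd / week.total_calls
    audio_issue_rate = week.audio_issue_calls / week.total_calls
    return {
        "error_rate": error_rate,
        "cost_per_call": cost_per_call,
        "audio_issue_rate": audio_issue_rate,
        "error_ok": error_rate < max_error_rate,
        "cost_ok": cost_per_call < max_cost_per_call,
        "audio_ok": audio_issue_rate < max_audio_issue_rate,
    }


# Hypothetical week: 2,000 calls, 8 system errors, 50 audio issues, $900 spend.
week = WeeklyOps(total_calls=2000, system_error_calls=8,
                 audio_issue_calls=50, infra_cost_usd=900.0)
print(ops_health(week))
```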
What CSAT actually tells you
Customer satisfaction (CSAT) is the ultimate quality metric, but raw CSAT for AI agents is misleading. The right metric is the CSAT gap: the difference between CSAT on human-handled calls and CSAT on AI-handled calls.
If your humans get 4.3 CSAT and your AI gets 4.0, that's a 0.3 gap. Acceptable for many use cases.
If your humans get 4.3 and your AI gets 3.2, that's a 1.1 gap. You have a quality problem.
The absolute number depends on your audience, channel, and survey methodology. The gap is the actionable signal.
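The gap itself is one subtraction, shown here as a sketch that assumes both groups are surveyed on the same scale with the same methodology. The survey scores are invented for illustration.

```python
from statistics import mean


def csat_gap(human_scores, ai_scores):
    """Mean CSAT on human-handled calls minus mean CSAT on AI-handled calls."""
    if not human_scores or not ai_scores:
        raise ValueError("both groups need at least one survey response")
    return mean(human_scores) - mean(ai_scores)


# Hypothetical survey responses on the same 1-5 scale.
human = [5, 4, 4, 5, 4]
ai = [4, 4, 3, 5, 4]
print(f"CSAT gap: {csat_gap(human, ai):.2f}")  # prints: CSAT gap: 0.40
```

A positive gap means humans are rated higher. If the two groups handle different call types, compare within a call type so the gap reflects the agent rather than the mix of calls.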
How to actually do this
A practical weekly cadence:
Monday. Pull 30 random calls from the previous week. Skim transcripts. Listen to 5 of them.
Tuesday. Score the 30 on the human-grader rubric. Note specific failure patterns.
Wednesday. Pull operational health dashboards. Investigate any spikes.
Thursday. Iterate the system prompt or escalation rules based on what you saw.
Friday. Run evals: replay 50 historical calls through the new prompt; compare to the old.
This is roughly 4 hours of work per week per agent, after the first month. It's the minimum to maintain a production-quality agent.
For the eval workflow specifically, see how to A/B test voice agent prompts.
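A minimal sketch of the Friday replay comparison. It assumes you supply two callables: `run_agent(prompt, call)`, which replays a historical call under a given prompt, and `score(call, reply)`, which grades the result. Both are hypothetical stand-ins for your own replay harness and rubric, and the stubs below exist only to make the example run.

```python
from statistics import mean


def compare_prompts(calls, run_agent, score, old_prompt, new_prompt):
    """Replay each call under both prompts and summarize per-call score changes."""
    old_scores, new_scores = [], []
    for call in calls:
        old_scores.append(score(call, run_agent(old_prompt, call)))
        new_scores.append(score(call, run_agent(new_prompt, call)))
    wins = sum(1 for o, n in zip(old_scores, new_scores) if n > o)
    losses = sum(1 for o, n in zip(old_scores, new_scores) if n < o)
    return {
        "old_mean": mean(old_scores),
        "new_mean": mean(new_scores),
        "wins": wins,      # calls where the new prompt scored higher
        "losses": losses,  # calls where the new prompt scored lower
        "ties": len(old_scores) - wins - losses,
    }


# Stub replay and scoring functions, for illustration only.
def fake_run_agent(prompt, call):
    return f"[{prompt}] reply to {call}"


def fake_score(call, reply):
    return 1.0 if reply.startswith("[v2]") and "refund" in call else 0.5


history = ["refund request", "book appointment", "cancel order"]
result = compare_prompts(history, fake_run_agent, fake_score, "v1", "v2")
print(result["wins"], result["losses"], result["ties"])  # prints: 1 0 2
```

Looking at wins and losses per call, not just the two means, makes regressions visible: a new prompt can raise the average while breaking a handful of calls that used to work.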
What metrics to ignore (or de-prioritize)
A few popular metrics that are misleading:
Average Handle Time. Lower isn't always better: a fast agent that escalates everything has low AHT but bad outcomes.
Deflection rate alone. "We deflect 70% of calls" sounds great but doesn't tell you whether the deflected calls were resolved or just dropped on the floor.
Token count. Mostly useful for cost; not for quality.
Word Error Rate (WER). Useful for STT diagnostics; not predictive of overall agent quality.
What to measure first
If you're starting fresh, measure these in this order:
- Resolution rate (does the agent finish the job?)
- Human-graded score on weekly sample (does it do it well?)
- Latency p50/p99 (does it feel responsive?)
- Cost per resolved issue (does the math work?)
- CSAT gap (does the customer notice?)
Add more once these are stable.
Related reading
- What Is a Voice Agent? A 2026 Primer
- First-Time Builder's Guide to Voice Agents
- Why Voice AI Will Transform Phone Channels by 2030
- Voice Agent Use Cases: A Field Guide
- How Voice Agents Recover from Misunderstandings
FAQ
How small can my sample size be for human grading? 20 calls per week is the floor for catching clear failures. 50/week gets you statistical signal on changes. Below 20, you're guessing.
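One rough way to see why sample size matters is to treat each graded call as pass or fail and look at the 95% margin of error on the pass rate. This uses the normal approximation for a proportion, which is a simplification and not the article's own method.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n calls."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


for n in (20, 50):
    print(f"n={n}: +/- {margin_of_error(n):.1%}")
# prints:
# n=20: +/- 21.9%
# n=50: +/- 13.9%
```

At 20 calls, only large swings in quality stand out from noise; at 50, moderate changes start to show.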
Should I use AI to grade my AI? You can. LLM-as-judge approaches work for specific rubrics. They're a useful supplement to human grading, not a replacement.
What about real-time quality monitoring? Some platforms support real-time sentiment scoring during calls. Useful for live escalation triggers; less useful for quality measurement (which benefits from after-the-fact human review).
How often should I refresh my eval set? Add 10–20 new calls each week to your eval set. Replace older calls as your use cases evolve.
Can I trust the platform's built-in metrics? Verify them once. After that, trust but spot-check periodically. Most platforms compute these correctly; some have subtle bugs.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
