🎙️ Voice AI Fundamentals

How to Measure Voice Agent Quality

Most voice agent teams measure the wrong things. They watch deflection rate and call duration; they ignore the quality of what happened inside the call. The result: agents that look good on dashboards and feel bad on the phone.

Tyler Weitzman
January 12, 2026 · 5 min read
Speechify

This article covers the small set of metrics that actually correlate with quality, and the workflows to capture them.

TL;DR

  • Quality has three dimensions: correctness, conversational feel, and operational health.
  • The single most useful metric: a 1–5 human-graded score on a sample of real calls per week.
  • Automated metrics (latency, deflection, AHT) are necessary but not sufficient.
  • Measure CSAT gap between AI and human calls, not absolute CSAT.

The three quality dimensions

Quality breaks into:

Correctness. Did the agent do the right thing? Did it look up the right account, book the right time, give the right answer?

Conversational feel. Did it sound natural? Did it handle interruptions gracefully? Did it know when to escalate?

Operational health. Latency, error rate, cost per call, p99s.

Each needs different instrumentation.

Correctness metrics

The hardest to automate. The pragmatic approach:

Human-graded sample. Pull 20–50 calls per week. Score each turn (or each call) from 1 (clearly wrong) to 5 (could not have been better) on:

  • Did the agent call the right tools?
  • Did it give correct information?
  • Did it handle the user's actual intent?

Track the median score over time.
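A minimal sketch of the tracking step, assuming each graded call is stored as a (week, score) pair. The data layout and the name weekly_median are illustrative, not part of any particular platform.

```python
from collections import defaultdict
from statistics import median

# Hypothetical graded sample: (ISO week label, 1-5 score) per call.
graded_calls = [
    ("2026-W01", 4), ("2026-W01", 5), ("2026-W01", 2), ("2026-W01", 4),
    ("2026-W02", 3), ("2026-W02", 4), ("2026-W02", 3), ("2026-W02", 5),
]

def weekly_median(calls):
    """Group scores by week and return {week: median score}."""
    by_week = defaultdict(list)
    for week, score in calls:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        by_week[week].append(score)
    return {week: median(scores) for week, scores in sorted(by_week.items())}

medians = weekly_median(graded_calls)
for week, value in medians.items():
    print(week, value)
```

The median keeps one disastrous call from swinging the weekly number, which is why it pairs well with reading the low-scoring calls individually.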

Resolution rate. Percentage of calls where the agent achieved the call's goal (booking made, ticket resolved, lead qualified) without human handoff. Should be 60–80% for mature use cases.

Escalation appropriateness. Of the calls that did escalate, were they correctly escalated? (Not "could the agent have handled it" but "given the agent's confidence, was escalating the right move?")
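Both rates reduce to simple counts once each call carries an outcome label. The sketch below assumes a hypothetical CallOutcome record in which a reviewer has marked whether each escalation was the right move. The field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallOutcome:
    goal_achieved: bool                        # booking made, ticket resolved, lead qualified
    escalated: bool                            # handed off to a human
    escalation_correct: Optional[bool] = None  # reviewer judgment, escalated calls only

def resolution_rate(calls):
    """Share of calls where the goal was achieved without a human handoff."""
    if not calls:
        return 0.0
    resolved = sum(1 for c in calls if c.goal_achieved and not c.escalated)
    return resolved / len(calls)

def escalation_appropriateness(calls):
    """Share of reviewed escalations judged correct; None if none were reviewed."""
    reviewed = [c for c in calls if c.escalated and c.escalation_correct is not None]
    if not reviewed:
        return None
    return sum(1 for c in reviewed if c.escalation_correct) / len(reviewed)

calls = [
    CallOutcome(True, False),
    CallOutcome(True, False),
    CallOutcome(True, False),
    CallOutcome(True, True, escalation_correct=True),
    CallOutcome(False, True, escalation_correct=False),
]

print(resolution_rate(calls), escalation_appropriateness(calls))
```

Note that a call resolved only after a handoff does not count toward resolution rate here, matching the "without human handoff" definition.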

Conversational feel metrics

Harder to quantify but critical to track.

Repair rate. How many turns per call did the user have to correct the agent? Lower is better.

Backchannel handling. Did the agent wrongly stop talking when the user said "uh-huh" or "right"? These aren't real interruptions, so every such stop is a false positive. Lower is better.

Awkward silences. Number of times in a call where the agent took >2 seconds to respond. Lower is better.

Tone match. Subjective: does the agent's tone match the brand? Score weekly on the human-grader sample.
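Repair rate and awkward silences can be counted from a turn-level transcript with timestamps. This is a sketch under stated assumptions: the Turn record and its is_correction label are hypothetical, and the correction label would come from a human grader or a classifier you trust.

```python
from dataclasses import dataclass

SILENCE_THRESHOLD_S = 2.0  # from the text: responses slower than 2 seconds

@dataclass
class Turn:
    speaker: str                 # "user" or "agent"
    start_s: float               # seconds from call start
    end_s: float
    is_correction: bool = False  # hypothetical label: user corrected the agent

def repair_count(turns):
    """Number of user turns spent correcting the agent."""
    return sum(1 for t in turns if t.speaker == "user" and t.is_correction)

def awkward_silences(turns, threshold_s=SILENCE_THRESHOLD_S):
    """Count agent turns that start more than threshold_s after the user stops."""
    count = 0
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            if cur.start_s - prev.end_s > threshold_s:
                count += 1
    return count

call = [
    Turn("agent", 0.0, 3.0),
    Turn("user", 3.4, 6.0),
    Turn("agent", 6.6, 9.0),                       # 0.6 s gap
    Turn("user", 9.5, 12.0, is_correction=True),   # user corrects the agent
    Turn("agent", 14.5, 18.0),                     # 2.5 s gap: awkward
]

print(repair_count(call), awkward_silences(call))
```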

Operational health metrics

Easy to automate; necessary baseline.

Latency. End-to-end response time. Track median, p95, p99. Median should be under 700ms; p99 under 1500ms.
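A small sketch of the percentile check, using the nearest-rank definition and the two targets quoted above. The sample latencies are made up, and with only ten samples p95 and p99 both collapse to the maximum, so a real dashboard needs far more data points.

```python
import math

MEDIAN_BUDGET_MS = 700   # targets quoted in the text
P99_BUDGET_MS = 1500

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical end-to-end response times for ten agent turns.
latencies_ms = [420, 510, 480, 650, 700, 530, 610, 900, 1400, 560]

summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
within_budget = summary[50] < MEDIAN_BUDGET_MS and summary[99] < P99_BUDGET_MS
print(summary, within_budget)
```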

Error rate. Calls that failed to complete due to system errors (LLM timeout, telephony failure, etc.). Should be under 0.5%.

Cost per call. Total infrastructure cost / call count. Should be under $0.50 for a typical 3-minute support call.

Audio quality issues. Calls where audio glitches or recognition failures occurred. Should be under 2%.
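The remaining thresholds are ratios over the week's call count, so one small function can check them together. The function name and the weekly totals below are illustrative assumptions. The limits are the ones stated above.

```python
# Targets quoted in the text above.
LIMITS = {
    "error_rate": 0.005,        # under 0.5% of calls
    "audio_issue_rate": 0.02,   # under 2% of calls
    "cost_per_call_usd": 0.50,  # under $0.50 per call
}

def health_report(calls, system_errors, audio_issues, infra_cost_usd):
    """Return (metrics, names of metrics at or above their limit)."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    metrics = {
        "error_rate": system_errors / calls,
        "audio_issue_rate": audio_issues / calls,
        "cost_per_call_usd": infra_cost_usd / calls,
    }
    breaches = [name for name, value in metrics.items() if value >= LIMITS[name]]
    return metrics, breaches

# Hypothetical weekly totals.
metrics, breaches = health_report(
    calls=4000, system_errors=12, audio_issues=100, infra_cost_usd=1480.0
)
print(metrics, breaches)
```

In this made-up week the error rate and cost are within limits, while audio issues (2.5% of calls) are the one breach worth investigating.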

What CSAT actually tells you

Customer satisfaction (CSAT) is the ultimate quality metric, but raw CSAT for AI agents is misleading. The right metric is the CSAT gap: CSAT on human-handled calls minus CSAT on AI-handled calls.

If your humans get 4.3 CSAT and your AI gets 4.0, that's a 0.3 gap. Acceptable for many use cases.

If your humans get 4.3 and your AI gets 3.2, that's a 1.1 gap. You have a quality problem.

The absolute number depends on your audience, channel, and survey methodology. The gap is the actionable signal.
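As a worked example, the gap is just the difference of two means over survey responses. The sketch assumes both groups answer the same 1-to-5 survey. The response lists are invented to reproduce the 4.3 versus 4.0 case above.

```python
from statistics import mean

def csat_gap(human_scores, ai_scores):
    """Mean human CSAT minus mean AI CSAT; positive means the AI trails humans."""
    if not human_scores or not ai_scores:
        raise ValueError("need survey responses for both groups")
    return mean(human_scores) - mean(ai_scores)

# Hypothetical survey responses on the same 1-5 scale.
human = [5, 4, 4, 5, 4, 4, 4, 5, 4, 4]   # mean 4.3
ai = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]      # mean 4.0

gap = csat_gap(human, ai)
print(f"CSAT gap: {gap:.1f}")
```

Compare like with like where possible: the same call types, channel, and survey wording for both groups. Otherwise the gap reflects the call mix as much as the agent.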

How to actually do this

A practical weekly cadence:

Monday. Pull 30 random calls from the previous week. Skim transcripts. Listen to 5 of them.

Tuesday. Score the 30 on the human-grader rubric. Note specific failure patterns.

Wednesday. Pull operational health dashboards. Investigate any spikes.

Thursday. Iterate the system prompt or escalation rules based on what you saw.

Friday. Run evals: replay 50 historical calls through the new prompt; compare to the old.

This is roughly 4 hours of work per week per agent, after the first month. It's the minimum to maintain a production-quality agent.
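Monday's pull is easy to make reproducible. The sketch below assumes call IDs are available as a list, for example from a platform export. The ID format and the weekly_sample helper are hypothetical.

```python
import random

def weekly_sample(call_ids, k=30, seed=None):
    """Pick up to k distinct call IDs uniformly at random.

    Passing a seed makes the draw reproducible, so two reviewers
    can pull the same sample.
    """
    ids = list(call_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))

# Hypothetical call IDs for last week.
last_week = [f"call-{n:04d}" for n in range(1, 501)]

sample = weekly_sample(last_week, k=30, seed=2026)
listen_to = sample[:5]   # the subset to listen to, not just skim
print(len(sample), listen_to)
```

Sampling uniformly matters: hand-picking "interesting" calls biases the score toward whatever caught your eye that week.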

For the eval workflow specifically, see how to A/B test voice agent prompts.
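For Friday's comparison, one simple approach is a paired diff: score the same replayed calls under the old and new prompts and count what moved. This is a sketch, not the method from the linked article. It assumes scores are 1-to-5 grades keyed by call ID, and all names are illustrative.

```python
def compare_runs(old_scores, new_scores):
    """Paired comparison; both dicts map call_id -> 1-5 score."""
    shared = sorted(old_scores.keys() & new_scores.keys())
    deltas = [new_scores[c] - old_scores[c] for c in shared]
    return {
        "n": len(shared),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
        "unchanged": sum(d == 0 for d in deltas),
        "mean_delta": sum(deltas) / len(deltas) if deltas else 0.0,
    }

# Hypothetical grades for four replayed calls.
old = {"c1": 3, "c2": 4, "c3": 2, "c4": 5}
new = {"c1": 4, "c2": 4, "c3": 4, "c4": 4}

result = compare_runs(old, new)
print(result)
```

Read the regressions individually before shipping. A positive mean delta can hide one call type that the new prompt breaks.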

What metrics to ignore (or de-prioritize)

A few popular metrics that are misleading:

Average Handle Time. Lower isn't always better: a fast agent that escalates everything has low AHT but bad outcomes.

Deflection rate alone. "We deflect 70% of calls" sounds great but doesn't tell you whether the deflected calls were resolved or just dropped on the floor.

Token count. Mostly useful for cost; not for quality.

Word Error Rate (WER). Useful for STT diagnostics; not predictive of overall agent quality.

What to measure first

If you're starting fresh, measure these in this order:

  1. Resolution rate (does the agent finish the job?)
  2. Human-graded score on weekly sample (does it do it well?)
  3. Latency p50/p99 (does it feel responsive?)
  4. Cost per resolved issue (does the math work?)
  5. CSAT gap (does the customer notice?)

Add more once these are stable.
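Cost per resolved issue differs from cost per call because unresolved calls still cost money. A quick illustration with invented numbers:

```python
def cost_per_resolved(total_cost_usd, resolved_calls):
    """Total spend divided by calls the agent actually resolved."""
    if resolved_calls == 0:
        return float("inf")
    return total_cost_usd / resolved_calls

# Hypothetical week: 1,000 calls at $0.40 each, 700 of them resolved.
total_cost_usd = 400.0
resolved_calls = 700

print(round(cost_per_resolved(total_cost_usd, resolved_calls), 2))
```

Here a $0.40 call becomes roughly $0.57 per resolved issue, which is the number to compare against the cost of a human-handled resolution.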

FAQ

How small can my sample size be for human grading? 20 calls per week is the floor for catching clear failures. 50/week gets you statistical signal on changes. Below 20, you're guessing.
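To get a feel for what a given sample size can detect, a rough back-of-the-envelope check helps. The sketch treats each call as a pass (say, graded 4 or 5) or a fail, which is an assumption rather than the rubric above, and uses the normal approximation for a proportion, which is crude at small n.

```python
import math

def ci_half_width(pass_rate, n, z=1.96):
    """Approximate 95% half-width for a proportion (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (20, 50, 200):
    print(n, round(ci_half_width(0.7, n), 3))
```

With a 70% pass rate, the interval is roughly ±20 points at 20 calls and ±13 points at 50, so a single week mostly reveals large shifts. Pooling several weeks narrows it.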

Should I use AI to grade my AI? You can. LLM-as-judge approaches work for specific rubrics. They're a useful supplement to human grading, not a replacement.
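One way to decide how much to lean on an automated judge is to score the same calls both ways and measure agreement. The sketch below makes no assumption about which model or API produces the judge scores. It only compares two lists of 1-to-5 grades, and the sample grades are invented.

```python
def agreement(human, judge):
    """Compare paired 1-5 scores for the same calls."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    pairs = list(zip(human, judge))
    exact = sum(h == j for h, j in pairs) / len(pairs)
    within_one = sum(abs(h - j) <= 1 for h, j in pairs) / len(pairs)
    bias = sum(j - h for h, j in pairs) / len(pairs)  # > 0: judge is more lenient
    return {"exact": exact, "within_one": within_one, "bias": bias}

# Hypothetical grades for the same eight calls.
human_scores = [5, 4, 2, 3, 4, 1, 5, 3]
judge_scores = [5, 4, 3, 3, 5, 3, 5, 3]

stats = agreement(human_scores, judge_scores)
print(stats)
```

A positive bias, as in this made-up sample, means the judge is grading more leniently than your humans, and the low-score disagreements are the ones to read first.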

What about real-time quality monitoring? Some platforms support real-time sentiment scoring during calls. Useful for live escalation triggers; less useful for quality measurement (which benefits from after-the-fact human review).

How often should I refresh my eval set? Add 10–20 new calls each week to your eval set. Replace older calls as your use cases evolve.

Can I trust the platform's built-in metrics? Verify them once. After that, trust but spot-check periodically. Most platforms compute these correctly; some have subtle bugs.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
