Quality Assurance for AI Voice Support
Quality assurance for AI voice support is mostly the same as QA for human contact centers — but with different staffing, different tools, and a much higher possible cadence. Done well, AI QA closes the loop between observation and prompt iteration in days instead of months. Done poorly, it's a vanity exercise that produces dashboards no one acts on.
TL;DR
- QA for AI agents should grade real calls weekly against a clear rubric.
- Combine human QA (highest signal) with LLM-as-judge (scalable).
- The output of QA should be a prompt change or a process change. If it produces neither, the QA isn't useful.
- AI QA can audit 10-100x more calls than human QA at similar cost.
What QA is for
Three purposes:
1. Catch quality drift. Prompts and tools change; quality can regress silently. QA flags it.
2. Inform iteration. Specific failure patterns guide prompt changes.
3. Compliance evidence. For regulated industries, documented QA is a requirement.
If your QA serves none of these, it's wasted effort.
The rubric
Keep it tight. 5-7 dimensions max:
- Correctness. Did the agent give correct info / call the right tools? (1-5)
- Tone. Did the agent sound appropriate? (1-5)
- Concision. Was the reply appropriately brief? (1-5)
- Recovery. When something went wrong, did the agent handle it well? (1-5)
- Escalation. When escalation was needed, was it handled correctly? (1-5)
- Compliance. Did the agent follow disclosure / verification rules? (1-5)
- Customer outcome. Did the customer get what they needed? (1-5)
Score each per call (or per critical turn). Average across the sample.
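As a minimal sketch of the scoring step, the snippet below validates one set of rubric scores per call and averages each dimension across a sample. The snake_case dimension keys and the use of None for "this dimension did not apply to the call" (for example, no escalation was needed) are assumptions for illustration, not something the rubric prescribes.

```python
from statistics import mean
from typing import Optional

# Dimension names follow the rubric above; the snake_case keys are an assumption.
RUBRIC = (
    "correctness", "tone", "concision", "recovery",
    "escalation", "compliance", "customer_outcome",
)

Scores = dict[str, Optional[int]]


def validate_scores(scores: Scores) -> None:
    """Raise ValueError unless every dimension is present and is None or an int 1-5."""
    for dim in RUBRIC:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        value = scores[dim]
        if value is None:  # assumption: None means "not applicable on this call"
            continue
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")


def average_by_dimension(graded_calls: list[Scores]) -> dict[str, Optional[float]]:
    """Average each dimension over the calls where it applied."""
    for scores in graded_calls:
        validate_scores(scores)
    averages: dict[str, Optional[float]] = {}
    for dim in RUBRIC:
        applicable = [s[dim] for s in graded_calls if s[dim] is not None]
        averages[dim] = mean(applicable) if applicable else None
    return averages
```

Skipping non-applicable dimensions keeps a week with few escalations from dragging the escalation average toward an arbitrary default.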
Sampling
Random sampling is the default. Stratify when certain call types are higher-stakes, for example:
- 70% random sample of all calls
- 20% sample of escalated calls
- 10% sample of negative-sentiment calls
This catches both drift in routine cases and spikes in problem cases.
Volume: 30-50 calls per week minimum for human QA. 200-500/week for LLM QA.
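One way to draw the 70/20/10 mix is sketched below. It fills the escalated and negative-sentiment quotas first, then tops up with a random draw from all remaining calls. The call fields `id`, `escalated`, and `negative_sentiment` are hypothetical names; substitute whatever your call log actually records.

```python
import random
from typing import Optional


def draw_weekly_sample(calls: list[dict], sample_size: int,
                       seed: Optional[int] = None) -> list[dict]:
    """Return up to sample_size distinct calls using the 70/20/10 split.

    Each call is a dict with hypothetical keys 'id', 'escalated',
    and 'negative_sentiment'.
    """
    rng = random.Random(seed)
    chosen: dict[str, dict] = {}

    def take(pool: list[dict], n: int) -> None:
        # Never pick the same call twice, and never ask for more than exist.
        candidates = [c for c in pool if c["id"] not in chosen]
        for call in rng.sample(candidates, min(n, len(candidates))):
            chosen[call["id"]] = call

    take([c for c in calls if c["escalated"]], round(sample_size * 0.20))
    take([c for c in calls if c["negative_sentiment"]], round(sample_size * 0.10))
    # The remainder (about 70%) is a plain random draw from every call.
    take(calls, sample_size - len(chosen))
    return list(chosen.values())
```

Passing a fixed `seed` makes a week's sample reproducible, which helps if someone later asks how the calls were chosen.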
Human grading workflow
A weekly cadence:
Monday. QA pulls 30 random calls from the previous week. Listens to a sample; reads transcripts of the rest.
Tuesday. Scores against the rubric. Notes specific failure patterns.
Wednesday. Reviews scores with operations team. Identifies prompt changes needed.
Thursday. Implements prompt changes. Tests against eval set.
Friday. Ships changes. Logs what was changed and why.
This is roughly one half-day per week of QA work after the system is set up.
LLM-as-judge
For scale, use an LLM to grade calls against the rubric. Same rubric, automated:
Given this call transcript and rubric, score each dimension
1-5 with brief justification:
[transcript]
[rubric]
Return JSON: { correctness: N, tone: N, ... }
Run on 200-500 calls per week. Aggregate. Flag outliers.
LLM-as-judge is noisier than human grading but scales. Best practice: validate against human grading on 10% of cases.
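A rough sketch of the plumbing around the judge: build the prompt, parse and validate the JSON the model returns, and flag low-scoring calls. The model call itself is deliberately left out because it depends on your provider. The dimension keys, the quoted-key JSON format, and the "any score of 2 or below" outlier rule are assumptions for illustration.

```python
import json

# Assumed snake_case keys for the rubric dimensions.
RUBRIC = (
    "correctness", "tone", "concision", "recovery",
    "escalation", "compliance", "customer_outcome",
)


def build_judge_prompt(transcript: str, rubric_text: str) -> str:
    """Fill the judge prompt with one transcript and the rubric text."""
    keys = ", ".join(f'"{dim}": N' for dim in RUBRIC)
    return (
        "Given this call transcript and rubric, score each dimension "
        "1-5 with brief justification.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Return JSON: {{ {keys} }}"
    )


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Pull the JSON object out of the reply and check every score is an int 1-5."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(reply[start:end + 1])
    scores: dict[str, int] = {}
    for dim in RUBRIC:
        value = data.get(dim)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {dim}: {value!r}")
        scores[dim] = value
    return scores


def flag_outliers(scored_calls: dict[str, dict[str, int]], threshold: int = 2) -> list[str]:
    """Return IDs of calls where any dimension scored at or below the threshold."""
    return sorted(
        call_id for call_id, scores in scored_calls.items()
        if min(scores.values()) <= threshold
    )
```

Rejecting malformed replies outright, rather than guessing at missing scores, keeps judge noise from leaking into the weekly averages.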
What QA finds
Common patterns:
Verbal padding. "Sure, no problem, let me see what I can do for you" → "Let me check."
Missed escalations. Agent grinding through cases that should have been transferred.
Confirm-back gaps. Agent acting on info without verifying.
Tone mismatch. Agent too casual / too formal for the brand.
Function-call bugs. Agent calling the wrong tool in specific scenarios.
Each pattern points to a specific fix.
Acting on QA findings
The discipline: every QA finding maps to either a prompt change, a tool change, a knowledge base update, or a process change.
If the finding doesn't map to any of these, it's not actionable — recategorize or drop.
Track actions over time:
- Findings identified
- Findings acted on
- Time from finding to fix
- Recurrence rate
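The four tracking numbers can be computed from a simple findings log. The sketch below assumes a hypothetical `Finding` record with a found date, an optional fix date, and a flag for whether the pattern reappeared after the fix shipped. It reports the median time to fix, though a mean would work as well.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Finding:
    pattern: str                      # e.g. "verbal padding"
    found_on: date
    fixed_on: Optional[date] = None   # None until a change ships
    recurred: bool = False            # seen again after the fix shipped


def summarize(findings: list[Finding]) -> dict:
    """Compute findings identified, acted on, time to fix, and recurrence rate."""
    acted = [f for f in findings if f.fixed_on is not None]
    days_to_fix = [(f.fixed_on - f.found_on).days for f in acted]
    return {
        "identified": len(findings),
        "acted_on": len(acted),
        "median_days_to_fix": median(days_to_fix) if days_to_fix else None,
        "recurrence_rate": sum(f.recurred for f in acted) / len(acted) if acted else None,
    }
```

A widening gap between `identified` and `acted_on` is an early sign of grading without acting.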
QA for compliance
For regulated industries (healthcare, finance, legal):
- Document the QA process and rubric.
- Retain QA scores and findings.
- Demonstrate QA was acted on.
- Show QA covers all relevant compliance requirements (disclosure, verification, etc.).
Auditors want evidence. QA provides it.
Specific compliance checks
For voice agents in regulated contexts, additional QA dimensions:
- Did the agent disclose AI status?
- Did the agent verify identity before sharing PII?
- Did the agent provide required disclosures (recording consent, terms reference)?
- Did the agent handle PHI / PCI per policy?
Build these into the rubric for those use cases.
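Compliance checks are usually pass/fail rather than 1-5, so it can help to record them separately from the scored dimensions. The sketch below assumes each reviewed call is a dict with a `call_id` and one boolean per check (None when the check did not apply, for instance when no PII was requested). The key names are hypothetical.

```python
from typing import Optional

# Assumed keys for the four checks listed above.
COMPLIANCE_CHECKS = (
    "disclosed_ai_status",
    "verified_identity_before_pii",
    "gave_required_disclosures",
    "handled_phi_pci_per_policy",
)


def compliance_pass_rates(reviews: list[dict]) -> dict[str, Optional[float]]:
    """Pass rate per check, counting only calls where the check applied."""
    rates: dict[str, Optional[float]] = {}
    for check in COMPLIANCE_CHECKS:
        applicable = [r[check] for r in reviews if r.get(check) is not None]
        rates[check] = sum(applicable) / len(applicable) if applicable else None
    return rates


def failing_calls(reviews: list[dict]) -> dict[str, list[str]]:
    """Map each call ID with at least one failed check to the checks it failed."""
    failures: dict[str, list[str]] = {}
    for review in reviews:
        failed = [c for c in COMPLIANCE_CHECKS if review.get(c) is False]
        if failed:
            failures[review["call_id"]] = failed
    return failures
```

Keeping the list of failing call IDs alongside the rates gives you something concrete to hand over when asked for evidence.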
QA cost
Approximate cost for a mid-sized AI deployment (50,000 calls/month):
- Human QA (50 calls/week × 0.5 hrs/call): 25 hrs/week of QA staff.
- LLM QA (500 calls/week): ~$50/week in LLM costs.
Total: roughly 25 hours a week of QA staff time (a little over half an FTE), plus minimal LLM costs.
For most deployments, this pays back many times over in caught regressions.
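To adapt these figures to your own volumes, the arithmetic can be sketched as below. The 40-hour week used to convert hours into an FTE fraction and the 52-week year are assumptions, not figures from the deployment described here.

```python
HOURS_PER_FTE_WEEK = 40  # assumption: one full-time week
WEEKS_PER_YEAR = 52


def weekly_qa_cost(human_calls: int, hours_per_call: float,
                   llm_calls: int, llm_spend: float,
                   monthly_call_volume: int) -> dict[str, float]:
    """Turn weekly QA volumes into staff hours, LLM cost, and audit coverage."""
    human_hours = human_calls * hours_per_call
    weekly_volume = monthly_call_volume * 12 / WEEKS_PER_YEAR
    return {
        "human_hours_per_week": human_hours,
        "fte_fraction": human_hours / HOURS_PER_FTE_WEEK,
        "llm_cost_per_call": llm_spend / llm_calls,
        "llm_cost_per_year": llm_spend * WEEKS_PER_YEAR,
        "share_of_calls_audited": (human_calls + llm_calls) / weekly_volume,
    }


# The figures quoted above: 50 human-graded and 500 LLM-graded calls per week.
estimate = weekly_qa_cost(
    human_calls=50, hours_per_call=0.5,
    llm_calls=500, llm_spend=50.0,
    monthly_call_volume=50_000,
)
```

With these inputs the human side comes to 25 hours a week, the LLM side to about $0.10 per graded call, and the combined sample covers a little under 5% of weekly calls.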
Common QA mistakes
Grading without acting. Scores in a spreadsheet that never trigger changes.
Rubric drift. Definitions change; scores aren't comparable over time.
Cherry-picking. Reviewing only the bad calls (selection bias).
No calibration. Different graders score differently; no inter-rater reliability checks.
For the broader measurement framework, see how to measure voice agent quality.
Related reading
- CSAT for AI Agents: Benchmarks and Frameworks
- What Is AI Deflection (and How to Measure It)
- The Definitive Guide to AI Customer Support in 2026
- Building a Tier-1 AI Support Agent Step by Step
- Why "Human-in-the-Loop" Beats "Fully Autonomous" for Most Teams
FAQ
How big should the QA sample be? 30-50 calls/week minimum for human grading. 200+ for LLM grading.
Should QA be done by the agent's builder or someone else? Someone else, ideally. The builder has blind spots.
Can I skip QA in early pilot? You can — and you'll regret it within weeks.
What about real-time QA (during the call)? Possible (sentiment monitoring, real-time escalation triggers). Useful as a complement, not replacement.
How do I calibrate QA scores across graders? Periodic inter-rater reliability sessions. 5-10 calls graded by everyone; compare; align.
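For the calibration session, a simple starting point is to compare graders pairwise and list the calls where they disagree most. The sketch below uses mean absolute difference on a single rubric dimension. That is one reasonable choice among several agreement measures, and the "more than 1 point apart" discussion threshold is an assumption.

```python
from itertools import combinations
from statistics import mean


def pairwise_mean_abs_diff(scores_by_grader: dict[str, list[int]]) -> dict[tuple[str, str], float]:
    """Mean absolute score gap for every pair of graders.

    scores_by_grader maps a grader name to their scores for the same
    calls, in the same order.
    """
    lengths = {len(scores) for scores in scores_by_grader.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every grader must score the same non-empty set of calls")
    return {
        (a, b): mean(abs(x - y) for x, y in zip(scores_by_grader[a], scores_by_grader[b]))
        for a, b in combinations(sorted(scores_by_grader), 2)
    }


def calls_needing_discussion(scores_by_grader: dict[str, list[int]],
                             max_spread: int = 1) -> list[int]:
    """Indexes of calls where the highest and lowest score differ by more than max_spread."""
    per_call = zip(*scores_by_grader.values())
    return [i for i, scores in enumerate(per_call) if max(scores) - min(scores) > max_spread]
```

Starting the session with the flagged calls focuses the alignment conversation on real disagreements instead of the calls everyone already scored alike.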

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
