Quality assurance for AI voice support is mostly the same as QA for human contact centers, but with different staffing, different tools, and a much faster possible cadence. Done well, AI QA closes the loop between observation and prompt iteration in days instead of months. Done poorly, it's a vanity exercise that produces dashboards no one acts on.

TL;DR

QA for AI agents should grade real calls weekly against a clear rubric.
Combine human QA (highest signal) with LLM-as-judge (scalable).
The output of QA should be a prompt, tool, knowledge base, or process change. If it produces none of these, the QA isn't useful.
LLM-as-judge grading can audit 10-100x more calls than human QA at similar cost.

What QA is for

Three purposes:

1. Catch quality drift. Prompts and tools change; quality can regress silently. QA flags it.

2. Inform iteration. Specific failure patterns guide prompt changes.

3. Compliance evidence. For regulated industries, documented QA is a requirement.

If your QA serves none of these, it's wasted effort.

The rubric

Keep it tight. 5-7 dimensions max:

Correctness. Did the agent give correct info / call the right tools? (1-5)
Tone. Did the agent sound appropriate? (1-5)
Concision. Was the reply appropriately brief? (1-5)
Recovery. When something went wrong, did the agent handle it well? (1-5)
Escalation. When escalation was needed, was it handled correctly? (1-5)
Compliance. Did the agent follow disclosure / verification rules? (1-5)
Customer outcome. Did the customer get what they needed? (1-5)

Score each dimension per call (or per critical turn), then average each dimension across the sample.
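
The scoring and averaging step can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed implementation: the snake_case dimension keys are assumptions, and so is the convention of recording None when a dimension does not apply to a call (for example, Recovery on a call where nothing went wrong).

```python
from statistics import mean

# Dimension names follow the rubric; the snake_case keys are an assumption.
DIMENSIONS = (
    "correctness", "tone", "concision", "recovery",
    "escalation", "compliance", "customer_outcome",
)

def validate(card: dict) -> None:
    """Reject scorecards with unknown dimensions or scores outside 1-5."""
    for dim, score in card.items():
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        if score is not None and score not in (1, 2, 3, 4, 5):
            raise ValueError(f"{dim} must be 1-5 or None, got {score!r}")

def average_by_dimension(cards: list) -> dict:
    """Average each dimension across the sample, skipping None (not applicable)."""
    for card in cards:
        validate(card)
    averages = {}
    for dim in DIMENSIONS:
        scores = [c[dim] for c in cards if c.get(dim) is not None]
        averages[dim] = round(mean(scores), 2) if scores else None
    return averages

# Two hypothetical scorecards; the first call needed no recovery or escalation.
sample = [
    {"correctness": 5, "tone": 4, "concision": 3, "recovery": None,
     "escalation": None, "compliance": 5, "customer_outcome": 4},
    {"correctness": 3, "tone": 4, "concision": 4, "recovery": 2,
     "escalation": 4, "compliance": 5, "customer_outcome": 2},
]
print(average_by_dimension(sample))
```

Averaging per dimension, rather than collapsing everything into one number, keeps a weak Recovery score from hiding behind strong Tone scores.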

Sampling

Random sample is the default. Stratify if certain call types are higher-stakes, for example:

70% random sample of all calls
20% sample of escalated calls
10% sample of negative-sentiment calls

This catches both drift in routine cases and spikes in problem cases.

Volume: at least 30-50 calls per week for human QA, and 200-500 per week for LLM QA.
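
One way to draw the 70/20/10 mix is sketched below. The call fields ("id", "escalated", "negative_sentiment") are hypothetical, and filling the two targeted strata first and then topping up with random calls is one reasonable reading of the mix, not the only one.

```python
import random

# Shares taken from the mix above; the remainder is filled with random calls.
TARGETED_SHARES = {"escalated": 0.20, "negative_sentiment": 0.10}

def stratified_sample(calls, total, seed=None):
    """Return {call id: stratum label} for `total` calls, with no call picked twice.

    `calls` is a list of dicts with hypothetical keys "id", "escalated" (bool)
    and "negative_sentiment" (bool).
    """
    rng = random.Random(seed)
    picked = {}
    # Fill the targeted strata first so the random draw cannot use them up.
    for stratum, share in TARGETED_SHARES.items():
        pool = [c["id"] for c in calls if c[stratum] and c["id"] not in picked]
        quota = min(round(total * share), len(pool))
        for call_id in rng.sample(pool, quota):
            picked[call_id] = stratum
    # Top up with a random draw from everything not yet picked.
    rest = [c["id"] for c in calls if c["id"] not in picked]
    for call_id in rng.sample(rest, min(total - len(picked), len(rest))):
        picked[call_id] = "random"
    return picked

# Hypothetical week of 1,000 calls.
calls = [
    {"id": i, "escalated": i % 10 == 0, "negative_sentiment": i % 25 == 0}
    for i in range(1000)
]
week = stratified_sample(calls, total=50, seed=7)
print(sorted(set(week.values())))
```

Keeping the stratum label with each call matters: averages over the escalated and negative-sentiment calls will skew low by construction, so drift is easier to read from the random stratum on its own.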

Human grading workflow

A weekly cadence:

Monday. QA pulls 30 random calls from the previous week. Listens to a sample; reads transcripts of the rest.

Tuesday. Scores against the rubric. Notes specific failure patterns.

Wednesday. Reviews scores with operations team. Identifies prompt changes needed.

Thursday. Implements prompt changes. Tests against eval set.

Friday. Ships changes. Logs what was changed and why.

This is roughly one half-day per week of QA work after the system is set up.

LLM-as-judge

For scale, use an LLM to grade calls against the rubric. Same rubric, automated:

Given this call transcript and rubric, score each dimension
1-5 with brief justification:
[transcript]
[rubric]
Return JSON: { correctness: N, tone: N, ... }

Run on 200-500 calls per week. Aggregate. Flag outliers.
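
The prompt, the reply parsing, and the outlier flag can be sketched as follows. No particular LLM provider is assumed, so the model call itself is left out; the function names, the snake_case keys, and the "any dimension at 2 or below" outlier rule are all illustrative assumptions.

```python
import json

# Keys mirror the rubric dimensions; the exact key names are an assumption.
DIMENSIONS = (
    "correctness", "tone", "concision", "recovery",
    "escalation", "compliance", "customer_outcome",
)

def build_prompt(transcript: str, rubric: str) -> str:
    """Assemble the judge prompt in the shape shown above."""
    keys = ", ".join(f'"{d}": N' for d in DIMENSIONS)
    return (
        "Given this call transcript and rubric, score each dimension\n"
        "1-5 with brief justification:\n"
        f"{transcript}\n"
        f"{rubric}\n"
        f"Return JSON: {{ {keys} }}"
    )

def parse_scores(raw: str) -> dict:
    """Parse the judge's reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"{dim}: expected an integer 1-5, got {value!r}")
        scores[dim] = value
    return scores

def flag_outliers(graded: dict, floor: int = 2) -> list:
    """Return ids of calls where any dimension scored at or below `floor`."""
    return sorted(cid for cid, s in graded.items() if min(s.values()) <= floor)

reply = ('{"correctness": 5, "tone": 4, "concision": 3, "recovery": 4, '
         '"escalation": 5, "compliance": 5, "customer_outcome": 4}')
print(parse_scores(reply))
```

Rejecting malformed replies outright, instead of guessing at missing scores, keeps bad judge output from silently shifting the weekly averages.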

LLM-as-judge is noisier than human grading but scales. Best practice: have human graders re-score about 10% of the LLM-graded calls and compare the two sets of scores.
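
A simple way to compare LLM scores with human scores on the overlap set is sketched here. The two summary statistics (mean absolute difference, and the share of calls within one point) are illustrative choices, and the data layout is an assumption.

```python
def judge_agreement(human: dict, llm: dict, tolerance: int = 1) -> dict:
    """Compare scores on calls graded by both a human and the LLM.

    Both arguments map call id -> {dimension: score}. Returns, per dimension,
    the number of shared calls, the mean absolute difference, and the share
    of calls where the two scores differ by at most `tolerance`.
    """
    shared = sorted(set(human) & set(llm))
    dimensions = sorted({d for c in shared for d in human[c]})
    report = {}
    for dim in dimensions:
        diffs = [
            abs(human[c][dim] - llm[c][dim])
            for c in shared
            if dim in human[c] and dim in llm[c]
        ]
        if diffs:
            report[dim] = {
                "n": len(diffs),
                "mean_abs_diff": sum(diffs) / len(diffs),
                "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
            }
    return report

# Hypothetical scores: the LLM graded three calls, a human re-graded two.
human_scores = {"c1": {"tone": 4, "correctness": 5},
                "c2": {"tone": 2, "correctness": 3}}
llm_scores = {"c1": {"tone": 5, "correctness": 5},
              "c2": {"tone": 5, "correctness": 3},
              "c3": {"tone": 4, "correctness": 4}}
print(judge_agreement(human_scores, llm_scores))
```

A dimension where the LLM is consistently off (Tone, in this made-up data) is a signal to tighten that part of the rubric text or to keep that dimension human-graded.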

What QA finds

Common patterns:

Verbal padding. "Sure, no problem, let me see what I can do for you" → "Let me check."

Missed escalations. Agent grinding through cases that should have been transferred.

Confirm-back gaps. Agent acting on info without verifying.

Tone mismatch. Agent too casual / too formal for the brand.

Function-call bugs. Agent calling the wrong tool in specific scenarios.

Each pattern maps to a concrete fix.

Acting on QA findings

The discipline: every QA finding maps to either a prompt change, a tool change, a knowledge base update, or a process change.

If the finding doesn't map to any of these, it's not actionable — recategorize or drop.

Track actions over time:

Findings identified
Findings acted on
Time from finding to fix
Recurrence rate
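
These four numbers fall out of a simple findings log. The sketch below assumes a hypothetical record shape (the Finding fields are illustrative) and uses the median for time to fix, which is one choice among several.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional

@dataclass
class Finding:
    pattern: str                      # e.g. "missed escalation"
    found_on: date
    action: Optional[str] = None      # "prompt", "tool", "knowledge_base" or "process"
    fixed_on: Optional[date] = None   # None until a fix ships
    recurred: bool = False            # seen again after the fix shipped

def summarize(findings: list) -> dict:
    """Compute the four tracking metrics from a list of findings."""
    acted = [f for f in findings if f.fixed_on is not None]
    days = [(f.fixed_on - f.found_on).days for f in acted]
    return {
        "identified": len(findings),
        "acted_on": len(acted),
        "median_days_to_fix": median(days) if days else None,
        "recurrence_rate": sum(f.recurred for f in acted) / len(acted) if acted else None,
    }

# Hypothetical log entries.
log = [
    Finding("verbal padding", date(2024, 3, 4), "prompt", date(2024, 3, 8)),
    Finding("missed escalation", date(2024, 3, 4), "prompt", date(2024, 3, 14), recurred=True),
    Finding("wrong tool on refunds", date(2024, 3, 11)),
]
print(summarize(log))
```

A growing gap between "identified" and "acted_on" is the early sign of grading without acting.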

QA for compliance

For regulated industries (healthcare, finance, legal):

Document the QA process and rubric.
Retain QA scores and findings.
Demonstrate QA was acted on.
Show QA covers all relevant compliance requirements (disclosure, verification, etc.).

Auditors want evidence. QA provides it.

Specific compliance checks

For voice agents in regulated contexts, additional QA dimensions:

Did the agent disclose AI status?
Did the agent verify identity before sharing PII?
Did the agent provide required disclosures (recording consent, terms reference)?
Did the agent handle PHI / PCI per policy?

Build these into the rubric for those use cases.
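
Because these checks are yes/no questions, one option is to record them as pass/fail alongside the 1-5 scores. That recording convention is an assumption rather than something the rubric prescribes, and the check names below are illustrative.

```python
# Check names paraphrase the four questions above; the identifiers are assumptions.
CHECKS = (
    "disclosed_ai_status",
    "verified_identity_before_pii",
    "gave_required_disclosures",
    "handled_phi_pci_per_policy",
)

def compliance_failures(results: dict) -> dict:
    """Return {call id: [failed checks]} for calls with at least one failure.

    `results` maps call id -> {check: True, False or None}. None (or a missing
    key) means the check did not apply, e.g. no PII came up on the call.
    """
    failures = {}
    for call_id, checks in results.items():
        failed = [name for name in CHECKS if checks.get(name) is False]
        if failed:
            failures[call_id] = failed
    return failures

# Hypothetical graded calls.
graded = {
    "c1": {"disclosed_ai_status": True, "verified_identity_before_pii": True,
           "gave_required_disclosures": True, "handled_phi_pci_per_policy": None},
    "c2": {"disclosed_ai_status": True, "verified_identity_before_pii": False,
           "gave_required_disclosures": True, "handled_phi_pci_per_policy": True},
}
print(compliance_failures(graded))
```

Listing failures per call, rather than averaging them away, gives auditors a record they can trace back to individual calls.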

QA cost

Approximate cost for a mid-sized AI deployment (50,000 calls/month):

Human QA (50 calls/week × 0.5 hrs/call): 25 hrs/week of QA staff time.
LLM QA (500 calls/week): ~$50/week in LLM costs.

Total: roughly 0.6 FTE of QA time (25 hrs/week, assuming a 40-hour week), plus about $2,600 per year in LLM costs.

For most deployments, this pays back many times over in caught regressions.
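
The arithmetic behind these estimates, worked through under two stated assumptions (a 40-hour full-time week and 52 weeks per year); every other figure comes from the numbers above.

```python
# Figures from the estimate above.
CALLS_PER_MONTH = 50_000
HUMAN_CALLS_PER_WEEK = 50
HOURS_PER_CALL = 0.5
LLM_CALLS_PER_WEEK = 500
LLM_COST_PER_WEEK = 50.0  # USD

# Assumptions, not from the estimate above.
FULL_TIME_HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

human_hours_per_week = HUMAN_CALLS_PER_WEEK * HOURS_PER_CALL    # 25.0
fte = human_hours_per_week / FULL_TIME_HOURS_PER_WEEK           # 0.625
llm_cost_per_call = LLM_COST_PER_WEEK / LLM_CALLS_PER_WEEK      # 0.1
llm_cost_per_year = LLM_COST_PER_WEEK * WEEKS_PER_YEAR          # 2600.0

# Share of all calls that each method actually reviews.
calls_per_week = CALLS_PER_MONTH * 12 / WEEKS_PER_YEAR
human_coverage = HUMAN_CALLS_PER_WEEK / calls_per_week
llm_coverage = LLM_CALLS_PER_WEEK / calls_per_week

print(f"Human QA: {human_hours_per_week} hrs/week ({fte:.2f} FTE)")
print(f"LLM QA: ${llm_cost_per_call:.2f}/call, ${llm_cost_per_year:,.0f}/year")
print(f"Coverage: human {human_coverage:.2%}, LLM {llm_coverage:.2%}")
```

At this volume, human review touches well under 1% of calls and LLM review roughly 4%, which is why the sampling mix matters as much as the sample size.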

Common QA mistakes

Grading without acting. Scores in a spreadsheet that never trigger changes.

Rubric drift. Definitions change; scores aren't comparable over time.

Cherry-picking. Reviewing only the bad calls (selection bias).

No calibration. Different graders score differently; no inter-rater reliability checks.

For the broader measurement framework, see how to measure voice agent quality.

FAQ

How big should the QA sample be? 30-50 calls/week minimum for human grading. 200+ for LLM grading.

Should QA be done by the agent's builder or someone else? Someone else, ideally. The builder has blind spots.

Can I skip QA in early pilot? You can — and you'll regret it within weeks.

What about real-time QA (during the call)? Possible (sentiment monitoring, real-time escalation triggers). Useful as a complement, not replacement.

How do I calibrate QA scores across graders? Periodic inter-rater reliability sessions. 5-10 calls graded by everyone; compare; align.
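
For a calibration session, it helps to walk in with a list of the places where graders disagreed. The sketch below flags any call and dimension where scores differ by more than one point; the data layout, the grader names, and the one-point threshold are all assumptions.

```python
def calibration_gaps(grades: dict, max_spread: int = 1) -> list:
    """Find (call id, dimension, scores by grader) where graders disagree.

    `grades` maps grader name -> {call id -> {dimension: score}}. Only calls
    and dimensions scored by every grader are compared.
    """
    if not grades:
        return []
    shared_calls = set.intersection(*(set(g) for g in grades.values()))
    gaps = []
    for call_id in sorted(shared_calls):
        shared_dims = set.intersection(*(set(g[call_id]) for g in grades.values()))
        for dim in sorted(shared_dims):
            by_grader = {name: g[call_id][dim] for name, g in grades.items()}
            spread = max(by_grader.values()) - min(by_grader.values())
            if spread > max_spread:
                gaps.append((call_id, dim, by_grader))
    return gaps

# Three hypothetical graders scoring the same two calls.
session = {
    "ana": {"c1": {"tone": 4, "escalation": 2}, "c2": {"tone": 3, "escalation": 4}},
    "ben": {"c1": {"tone": 4, "escalation": 5}, "c2": {"tone": 3, "escalation": 4}},
    "cy":  {"c1": {"tone": 5, "escalation": 4}, "c2": {"tone": 3, "escalation": 4}},
}
for call_id, dim, scores in calibration_gaps(session):
    print(call_id, dim, scores)
```

The flagged items are the agenda: each one usually traces back to a rubric definition that graders read differently.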

