LLM Evaluation for Conversational Agents
You can't tune what you can't measure. Evaluation is the unsexy work that separates voice agent teams shipping production-quality work from teams flying blind. Most teams underinvest here for the first few months, then have a wake-up moment when something breaks. This is the eval setup that keeps you from getting there.
TL;DR
- Evaluate at three levels: turn-level, call-level, and outcome-level.
- Combine human grading (hardest, highest signal) with automated grading (cheaper, scales).
- Eval set should be 100–500 calls, refreshed quarterly.
- The single best practice: evaluate every prompt change before shipping.
Why evals are different for voice
Voice evals have three constraints that text evals don't:
Audio quality matters. A turn where the agent said the right thing but the audio glitched is still a bad turn.
Latency matters. A response that's correct but slow is worse than one that's faster but slightly less complete.
Conversational rhythm matters. Did the agent stop when interrupted? Was there awkward dead air? Did it bridge slow operations?
Pure text evaluation misses these.
Three levels of evaluation
Turn-level
Score each turn against a rubric. Useful for diagnosing specific failures.
Rubric example (1–5 each):
- Did the agent understand the user's intent?
- Was the response factually correct?
- Was the tone appropriate?
- Was the latency acceptable?
Aggregate by averaging across turns and calls.
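As a rough sketch of that aggregation, the snippet below averages each criterion within a call first and then across calls, so a long call does not outweigh a short one. That ordering is an assumption, not something the rubric prescribes, and the criterion keys (`intent`, `correctness`, `tone`, `latency`) are hypothetical names for the four questions above.

```python
from statistics import mean

# Hypothetical keys, one per rubric question above (each scored 1-5).
CRITERIA = ("intent", "correctness", "tone", "latency")

def call_average(turns):
    """Average each criterion across the turns of a single call."""
    return {c: mean(t[c] for t in turns) for c in CRITERIA}

def overall_average(calls):
    """Average the per-call averages so long calls don't dominate."""
    per_call = [call_average(turns) for turns in calls]
    return {c: mean(pc[c] for pc in per_call) for c in CRITERIA}

calls = [
    [  # call 1: two turns
        {"intent": 5, "correctness": 4, "tone": 5, "latency": 3},
        {"intent": 3, "correctness": 4, "tone": 5, "latency": 5},
    ],
    [  # call 2: one turn
        {"intent": 2, "correctness": 2, "tone": 4, "latency": 4},
    ],
]

summary = overall_average(calls)
print(summary)
```

Keeping the per-call averages around (rather than only the grand mean) makes it easier to find the specific calls that drag a criterion down.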
Call-level
Score the call as a whole. Useful for measuring user-facing quality.
Rubric:
- Did the call achieve the caller's goal?
- Did the call escalate appropriately?
- Would the caller be satisfied?
Outcome-level
Did the actual business outcome happen? This is the most expensive to measure but the most meaningful.
Examples:
- For booking agents: did the appointment get kept?
- For sales agents: did the lead convert?
- For support agents: did the customer call back about the same issue?
You only get outcome data 24+ hours later, so this is a slower feedback loop.
Human grading
The gold standard. Have a human listen to or read 20–50 calls per week and score them on the rubric.
Practical setup:
- A spreadsheet or simple internal tool with the rubric questions.
- A shared queue of "calls to grade this week."
- A weekly meeting to review the scores and any concerning patterns.
The grader should be someone who knows the use case well, ideally a domain expert, not a developer. Tone judgment requires context.
LLM-as-judge
For scale, use an LLM to grade calls automatically. Prompt the judge LLM with:
- The call transcript
- The rubric
- A request to score each criterion with brief justification
Pros: scales to thousands of calls. Cons: slightly noisier than human grading; can miss subtle issues.
Best practice: use LLM-as-judge for the bulk + human grading for a sample. They calibrate each other.
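A minimal sketch of the judge loop, under a few stated assumptions: the judge is asked to reply in JSON (a format chosen here for easy parsing), `fake_judge` is a stand-in for whatever model call you actually use, and the rubric keys are hypothetical. The last helper shows one simple way to compare judge scores against a human grade for the same call.

```python
import json

# Hypothetical rubric: key -> question shown to the judge.
RUBRIC = {
    "intent": "Did the agent understand the user's intent?",
    "correctness": "Was the response factually correct?",
    "tone": "Was the tone appropriate?",
}

def build_judge_prompt(transcript, rubric=RUBRIC):
    """Assemble transcript + rubric + scoring request into one prompt."""
    lines = [
        "You are grading a voice agent call transcript.",
        "Score each criterion from 1 to 5 with a brief justification.",
        'Reply with JSON only: {"<criterion>": {"score": <int>, "why": "<text>"}}',
        "",
        "Rubric:",
    ]
    lines += [f"- {name}: {question}" for name, question in rubric.items()]
    lines += ["", "Transcript:", transcript]
    return "\n".join(lines)

def parse_judge_reply(reply, rubric=RUBRIC):
    """Extract integer scores and reject anything outside 1-5."""
    data = json.loads(reply)
    scores = {}
    for name in rubric:
        score = int(data[name]["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} outside 1-5")
        scores[name] = score
    return scores

def mean_abs_gap(human, judge):
    """Average per-criterion gap between a human grade and the judge grade."""
    return sum(abs(human[k] - judge[k]) for k in human) / len(human)

def fake_judge(prompt):
    # Stand-in for a real model call; returns a canned reply.
    return json.dumps({
        "intent": {"score": 5, "why": "Identified the reschedule request."},
        "correctness": {"score": 4, "why": "Offered a valid slot."},
        "tone": {"score": 3, "why": "Slightly abrupt."},
    })

transcript = "User: I need to move my appointment.\nAgent: Sure, Tuesday at 3pm works."
prompt = build_judge_prompt(transcript)
judge_scores = parse_judge_reply(fake_judge(prompt))
human_scores = {"intent": 5, "correctness": 4, "tone": 5}
gap = mean_abs_gap(human_scores, judge_scores)
print(judge_scores, round(gap, 2))
```

If the gap on the human-graded sample grows over time, that is a hint to revisit the judge prompt or the rubric wording before trusting the bulk scores.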
Building an eval set
A good eval set has 100–500 calls covering:
- Common happy paths (40–60%)
- Common failure modes (15–25%)
- Edge cases you've seen in production (10–20%)
- Calls where escalation was correct (5–10%)
- Calls where escalation was wrong (5–10%)
Refresh quarterly. Add 10–20 new calls per week to keep up with new patterns.
Store your eval set as call transcripts (not audio). This lets you replay them through prompt changes without rerunning the audio pipeline.
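One way to keep the mix honest is to label each stored call with a category and check the shares against the target ranges above. This is a sketch only; the category names are hypothetical labels for the five buckets listed.

```python
from collections import Counter

# Target share of the eval set per category, as (min, max) fractions.
TARGETS = {
    "happy_path": (0.40, 0.60),
    "failure_mode": (0.15, 0.25),
    "edge_case": (0.10, 0.20),
    "escalation_correct": (0.05, 0.10),
    "escalation_wrong": (0.05, 0.10),
}

def composition_report(labels):
    """Return {category: (share, within_target)} for a list of call labels."""
    if not labels:
        raise ValueError("eval set is empty")
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for category, (low, high) in TARGETS.items():
        share = counts[category] / total
        report[category] = (round(share, 3), low <= share <= high)
    return report

# Example: a 200-call set that is short on wrong-escalation calls.
labels = (
    ["happy_path"] * 110
    + ["failure_mode"] * 40
    + ["edge_case"] * 30
    + ["escalation_correct"] * 16
    + ["escalation_wrong"] * 4
)
report = composition_report(labels)
for category, (share, ok) in report.items():
    print(f"{category:20s} {share:.1%} {'ok' if ok else 'OUT OF RANGE'}")
```

Running this check as part of the quarterly refresh shows which bucket the new calls should come from.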
A/B testing prompts
The eval workflow for prompt changes:
- Save the current prompt (version A).
- Edit the prompt (version B).
- Replay the eval set through both versions.
- For each call, score both versions on the rubric.
- Compare aggregate scores; investigate any divergence.
- Ship version B only if it's better on average AND no critical regressions.
Most prompt changes look like wins on the use cases you tested manually but introduce regressions on adjacent cases. The eval set catches this.
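The ship rule can be sketched as a small comparison over per-call scores. The assumptions here: each call has a single 1–5 score per version (for example, the rubric average), and a "critical regression" is defined as a drop of 2 or more points on any call. That threshold is illustrative; pick one that fits your rubric.

```python
from statistics import mean

def compare_versions(scores_a, scores_b, critical_drop=2):
    """Compare two prompt versions scored on the same eval calls.

    scores_a / scores_b: dict of call_id -> score (1-5).
    critical_drop: assumed threshold for a per-call critical regression.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    deltas = {cid: scores_b[cid] - scores_a[cid] for cid in shared}
    regressions = [cid for cid, d in deltas.items() if d <= -critical_drop]
    mean_a = mean(scores_a[cid] for cid in shared)
    mean_b = mean(scores_b[cid] for cid in shared)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "critical_regressions": regressions,
        # Ship only if B is better on average AND nothing regressed badly.
        "ship_b": mean_b > mean_a and not regressions,
    }

scores_a = {"call-1": 4, "call-2": 3, "call-3": 5, "call-4": 3}
scores_b = {"call-1": 5, "call-2": 5, "call-3": 2, "call-4": 4}

result = compare_versions(scores_a, scores_b)
print(result)
```

In this example version B wins on average (4.0 vs 3.75) but is blocked because `call-3` dropped by three points, which is exactly the adjacent-case regression the eval set exists to catch.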
For more, see how to A/B test voice agent prompts.
What automated metrics actually measure
A few that are worth tracking:
Function-call accuracy. Did the agent call the right function with the right arguments? Easy to automate; high signal.
Resolution rate. Did the call complete its goal? Sometimes automatable from the function-call log.
Escalation rate. What percentage of calls escalated? Useful as a leading indicator; doesn't tell you if escalation was correct.
Latency stats. P50, P95, P99 of response time. Always automate.
Cost per call. Tracked from logs.
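Two of these are simple enough to sketch directly. The snippet below computes latency percentiles with the nearest-rank method (one of several common definitions) and function-call accuracy as an exact match on name and arguments. The function names and log shape are hypothetical; adapt them to whatever your call logs actually contain.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def function_call_accuracy(expected, actual):
    """Share of calls whose logged (function name, arguments) match exactly."""
    matches = sum(
        1 for call_id, want in expected.items() if actual.get(call_id) == want
    )
    return matches / len(expected)

# Synthetic response times: 10 ms, 20 ms, ..., 1000 ms.
latencies_ms = [10 * i for i in range(1, 101)]
stats = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}

# Hypothetical expected vs logged function calls, keyed by call id.
expected = {
    "call-1": ("book_appointment", {"day": "tuesday", "time": "15:00"}),
    "call-2": ("cancel_appointment", {"booking_id": "b-17"}),
    "call-3": ("transfer_to_human", {}),
    "call-4": ("book_appointment", {"day": "friday", "time": "09:00"}),
}
actual = {
    "call-1": ("book_appointment", {"day": "tuesday", "time": "15:00"}),
    "call-2": ("cancel_appointment", {"booking_id": "b-71"}),  # wrong argument
    "call-3": ("transfer_to_human", {}),
    "call-4": ("book_appointment", {"day": "friday", "time": "09:00"}),
}
accuracy = function_call_accuracy(expected, actual)

print(stats, accuracy)
```

Exact matching is strict on purpose: a right function with a wrong argument counts as a miss, since that is usually a user-visible failure.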
What's hard to automate:
- Conversational tone
- Whether the agent made the user feel heard
- Whether the agent's reply was the best possible (vs just acceptable)
These need humans.
Common eval mistakes
Three patterns to avoid:
Eval set drift. Your eval set was built 6 months ago; the use case has evolved; you're optimizing for old patterns.
Cherry-picking metrics. "Resolution rate is up 5%!" while CSAT is down 0.5 points and escalations of legitimate issues dropped: you've made the agent worse and called it better.
No eval before shipping. Changes go live based on "it worked when I tested it manually." Production has more variance than your manual tests.
What "good enough" looks like
Reasonable eval cadence for a production agent:
- Weekly: 30 calls human-graded; 200 LLM-graded.
- Monthly: full eval set (200–500 calls) replayed through current prompt.
- On every prompt change: full eval set replayed.
- Quarterly: refresh eval set with new examples.
Total cost: roughly 4 hours of human time per week per agent + LLM eval costs (~$5–$20/run).
Related reading
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- How LLMs Decide What to Say Next in a Voice Conversation
- Building a Conversation Memory Layer for Voice Agents
FAQ
How big should my eval set be? 100 calls minimum, 500 ideal. Below 100 you don't catch regressions reliably.
Can I evaluate without grading every call? Yes. Random sampling is fine. The point is to catch patterns, not score every interaction.
Should I publish my eval rubric? Internally, yes. Externally, depends on competitive considerations. Most teams keep it internal.
Can I evaluate latency separately? Yes. Latency evals are cheap to automate and run continuously. Quality evals run on a slower cadence.
What's the difference between an eval and a regression test? Evals score quality (1–5 per criterion); regression tests check specific behaviors ("when the user says X, the agent should call function Y"). Use both.
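To make the regression-test half concrete, here is a small sketch. `run_agent` is a placeholder for your real agent (here a keyword stub so the example runs), and the function names are hypothetical. Each case pins one behavior: a given utterance must produce a given function call.

```python
def run_agent(user_utterance):
    """Placeholder agent: returns the function call it would make."""
    if "cancel" in user_utterance.lower():
        return {"name": "cancel_appointment", "arguments": {}}
    return {"name": "book_appointment", "arguments": {}}

# (utterance, function the agent is expected to call)
REGRESSION_CASES = [
    ("I need to cancel my appointment", "cancel_appointment"),
    ("Can I book a cleaning for Friday?", "book_appointment"),
]

def run_regression_suite(agent, cases):
    """Return a list of (utterance, expected, called) for every failing case."""
    failures = []
    for utterance, expected_function in cases:
        called = agent(utterance)["name"]
        if called != expected_function:
            failures.append((utterance, expected_function, called))
    return failures

failures = run_regression_suite(run_agent, REGRESSION_CASES)
print("failures:", failures)
```

Unlike the rubric scores, these cases are pass/fail, so they fit naturally into a check that runs on every prompt change.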

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.