LLM Evaluation for Conversational Agents

Tyler Weitzman
January 16, 2026 · 6 min read
Speechify

You can't tune what you can't measure. Evaluation is the unsexy work that separates voice agent teams shipping production-quality work from teams flying blind. Most teams underinvest here for the first few months, then have a wake-up moment when something breaks. This is the eval setup that keeps you from getting there.

TL;DR

  • Evaluate at three levels: turn-level, call-level, and outcome-level.
  • Combine human grading (hardest, highest signal) with automated grading (cheaper, scales).
  • Eval set should be 100–500 calls, refreshed quarterly.
  • The single best practice: evaluate every prompt change before shipping.

Why evals are different for voice

Voice evals have three constraints that text evals don't:

Audio quality matters. A turn where the agent said the right thing but the audio glitched is still a bad turn.

Latency matters. A response that's correct but slow is worse than one that's faster but slightly less complete.

Conversational rhythm matters. Did the agent stop when interrupted? Was there awkward dead air? Did it bridge slow operations?

Pure text evaluation misses these.

Three levels of evaluation

Turn-level

Score each turn against a rubric. Useful for diagnosing specific failures.

Rubric example (1–5 each):

  • Did the agent understand the user's intent?
  • Was the response factually correct?
  • Was the tone appropriate?
  • Was the latency acceptable?

Aggregate by averaging across turns and calls.
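
A minimal sketch of that aggregation, assuming each turn has already been scored 1–5 on the four rubric questions. The criterion keys and the choice to average per call first (so long calls don't dominate) are illustrative assumptions, not a prescribed method.

```python
from statistics import mean

# Hypothetical keys, one per rubric question above.
CRITERIA = ("intent", "correctness", "tone", "latency")

def call_score(turns):
    """Average each criterion over the turns of a single call."""
    return {c: mean(t[c] for t in turns) for c in CRITERIA}

def aggregate(calls):
    """Average the per-call means, so every call carries equal weight."""
    per_call = [call_score(turns) for turns in calls]
    return {c: mean(s[c] for s in per_call) for c in CRITERIA}

# Two graded calls: the first has two turns, the second has one.
calls = [
    [
        {"intent": 5, "correctness": 4, "tone": 5, "latency": 3},
        {"intent": 3, "correctness": 4, "tone": 5, "latency": 5},
    ],
    [
        {"intent": 2, "correctness": 2, "tone": 4, "latency": 4},
    ],
]

summary = aggregate(calls)
print(summary)
```

Keeping the per-criterion breakdown, rather than collapsing everything into one number, makes it easier to see which part of the rubric moved.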

Call-level

Score the call as a whole. Useful for measuring user-facing quality.

Rubric:

  • Did the call achieve the caller's goal?
  • Did the call escalate appropriately?
  • Would the caller be satisfied?

Outcome-level

Did the actual business outcome happen? This is the most expensive to measure but the most meaningful.

Examples:

  • For booking agents: did the appointment get kept?
  • For sales agents: did the lead convert?
  • For support agents: did the customer call back about the same issue?

You only get outcome data 24+ hours later, so this is a slower feedback loop.

Human grading

The gold standard. Have a human listen to or read 20–50 calls per week and score them on the rubric.

Practical setup:

  • A spreadsheet or simple internal tool with the rubric questions.
  • A shared queue of "calls to grade this week."
  • A weekly meeting to review the scores and any concerning patterns.

The grader should be someone who knows the use case well, ideally a domain expert rather than a developer. Tone judgment requires context.

LLM-as-judge

For scale, use an LLM to grade calls automatically. Prompt the judge LLM with:

  • The call transcript
  • The rubric
  • A request to score each criterion with brief justification
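
One way to assemble those three pieces is sketched below. The rubric keys, the prompt wording, and the JSON reply format are all assumptions made for illustration. The call to the judge model itself is deliberately left out, so a canned reply stands in for it.

```python
import json

# Hypothetical rubric: criterion id -> question shown to the judge.
RUBRIC = {
    "intent": "Did the agent understand the user's intent?",
    "correctness": "Was the response factually correct?",
    "tone": "Was the tone appropriate?",
}

def build_judge_prompt(transcript: str, rubric: dict) -> str:
    """Combine transcript, rubric, and scoring request into one prompt."""
    criteria = "\n".join(f"- {key}: {question}" for key, question in rubric.items())
    return (
        "You are grading a voice agent call.\n"
        "Score each criterion from 1 to 5 with a one-sentence justification.\n"
        'Reply with JSON only: {"<criterion>": {"score": <int>, "why": "<text>"}}\n\n'
        f"Criteria:\n{criteria}\n\nTranscript:\n{transcript}"
    )

def parse_judge_reply(reply: str, rubric: dict) -> dict:
    """Return {criterion: score}; raise ValueError if a score is missing or out of range."""
    data = json.loads(reply)  # malformed JSON also raises a ValueError subclass
    scores = {}
    for key in rubric:
        score = data.get(key, {}).get("score")
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"bad score for {key!r}: {score!r}")
        scores[key] = score
    return scores

prompt = build_judge_prompt(
    "Caller: I need to move my appointment.\nAgent: Sure, which day works?", RUBRIC
)
# Canned reply standing in for the judge model's output.
canned_reply = (
    '{"intent": {"score": 5, "why": "..."},'
    ' "correctness": {"score": 4, "why": "..."},'
    ' "tone": {"score": 5, "why": "..."}}'
)
scores = parse_judge_reply(canned_reply, RUBRIC)
print(scores)
```

Rejecting replies with missing or out-of-range scores, instead of silently defaulting them, keeps a flaky judge from polluting the aggregate.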

Pros: scales to thousands of calls. Cons: slightly noisier than human grading; can miss subtle issues.

Best practice: use LLM-as-judge for the bulk of calls and human grading for a sample. The two calibrate each other.
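
A rough way to check that calibration is to compare the two graders on the calls both have scored. The sketch below assumes one overall 1–5 score per call. The call IDs and the "within one point" threshold are hypothetical choices.

```python
def calibration(human: dict, judge: dict) -> dict:
    """Compare human and LLM-judge scores on the calls both have graded."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no overlapping calls to compare")
    diffs = [abs(human[c] - judge[c]) for c in shared]
    return {
        "n": len(shared),
        "mean_abs_diff": sum(diffs) / len(diffs),
        # Share of calls where the judge lands within one point of the human.
        "within_one": sum(d <= 1 for d in diffs) / len(diffs),
    }

human = {"call-1": 5, "call-2": 3, "call-3": 2, "call-4": 4}
judge = {"call-1": 4, "call-2": 3, "call-3": 4, "call-4": 4, "call-5": 5}

report = calibration(human, judge)
print(report)
```

Calls where the two disagree by two or more points are good candidates for the weekly review, since they point at either a judge prompt problem or an ambiguous rubric.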

Building an eval set

A good eval set has 100–500 calls covering:

  • Common happy paths (40–60%)
  • Common failure modes (15–25%)
  • Edge cases you've seen in production (10–20%)
  • Calls where escalation was correct (5–10%)
  • Calls where escalation was wrong (5–10%)
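
The mix above can be checked mechanically if each call in the eval set carries a category label. This is a sketch under that assumption; the label names are hypothetical, and the ranges are the ones listed above.

```python
from collections import Counter

# Target share per category (low, high), taken from the list above.
TARGET_MIX = {
    "happy_path": (0.40, 0.60),
    "failure_mode": (0.15, 0.25),
    "edge_case": (0.10, 0.20),
    "escalation_correct": (0.05, 0.10),
    "escalation_wrong": (0.05, 0.10),
}

def mix_report(labels):
    """Return {category: (share, in_range)} for a list of per-call labels."""
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for category, (low, high) in TARGET_MIX.items():
        share = counts[category] / total
        report[category] = (share, low <= share <= high)
    return report

# A hypothetical 200-call eval set that is short on wrong-escalation calls.
labels = (
    ["happy_path"] * 110
    + ["failure_mode"] * 40
    + ["edge_case"] * 30
    + ["escalation_correct"] * 16
    + ["escalation_wrong"] * 4
)

report = mix_report(labels)
out_of_range = [c for c, (_, ok) in report.items() if not ok]
print(out_of_range)
```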

Refresh quarterly. Add 10–20 new calls per week to keep up with new patterns.

Store your eval set as call transcripts (not audio). This lets you replay them through prompt changes without rerunning the audio pipeline.

A/B testing prompts

The eval workflow for prompt changes:

  1. Save the current prompt (version A).
  2. Edit the prompt (version B).
  3. Replay the eval set through both versions.
  4. For each call, score both versions on the rubric.
  5. Compare aggregate scores; investigate any divergence.
  6. Ship version B only if it scores better on average and introduces no critical regressions.
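
Steps 4 through 6 can be sketched as a simple gate, assuming each call already has one overall 1–5 score per prompt version. What counts as a "critical regression" is not defined here, so the two-point drop used below is an assumption to adjust for your own rubric.

```python
from statistics import mean

def compare(scores_a: dict, scores_b: dict, critical_drop: int = 2) -> dict:
    """Compare per-call scores for prompt versions A and B on the same eval calls."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    mean_delta = mean(scores_b[c] for c in shared) - mean(scores_a[c] for c in shared)
    # Assumption: a call counts as a critical regression if B drops by critical_drop or more.
    regressions = [c for c in shared if scores_a[c] - scores_b[c] >= critical_drop]
    return {
        "mean_delta": mean_delta,
        "critical_regressions": regressions,
        "ship": mean_delta > 0 and not regressions,
    }

scores_a = {"c1": 3, "c2": 4, "c3": 5, "c4": 3}
scores_b = {"c1": 5, "c2": 5, "c3": 3, "c4": 4}

result = compare(scores_a, scores_b)
print(result)
```

In this example version B is better on average but still fails the gate, because one call dropped from 5 to 3. That is the case an average alone would hide.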

Most prompt changes look like wins on the use cases you tested manually but introduce regressions on adjacent cases. The eval set catches this.

For more, see how to A/B test voice agent prompts.

What automated metrics actually measure

A few that are worth tracking:

Function-call accuracy. Did the agent call the right function with the right arguments? Easy to automate; high signal.

Resolution rate. Did the call complete its goal? Sometimes automatable from the function-call log.

Escalation rate. What percentage of calls escalated? Useful as a leading indicator; doesn't tell you if escalation was correct.

Latency stats. P50, P95, P99 of response time. Always automate.

Cost per call. Tracked from logs.
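
Two of these are straightforward to compute from logs. The sketch below assumes each logged function call is a (name, arguments) pair and that latencies are in milliseconds; the function names and sample values are made up. Percentiles use the nearest-rank method, which is one of several common definitions.

```python
import math

def function_call_accuracy(expected, actual):
    """Share of calls where both the function name and its arguments match exactly."""
    hits = sum(e == a for e, a in zip(expected, actual))
    return hits / len(expected)

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical expected vs. logged function calls for three eval calls.
expected = [
    ("book_appointment", {"day": "tue"}),
    ("cancel_appointment", {"id": 7}),
    ("lookup_order", {"id": 3}),
]
actual = [
    ("book_appointment", {"day": "tue"}),
    ("cancel_appointment", {"id": 8}),  # right function, wrong argument
    ("lookup_order", {"id": 3}),
]
accuracy = function_call_accuracy(expected, actual)

# Hypothetical response latencies in milliseconds.
latencies_ms = [420, 380, 510, 950, 460, 400, 1800, 430, 470, 440]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(accuracy, p50, p95)
```

With only ten samples, P95 and P99 both collapse to the slowest response, so tail percentiles are only meaningful over larger windows.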

What's hard to automate:

  • Conversational tone
  • Whether the agent made the user feel heard
  • Whether the agent's reply was the best possible (vs just acceptable)

These need humans.

Common eval mistakes

Three patterns to avoid:

Eval set drift. Your eval set was built 6 months ago; the use case has evolved; you're optimizing for old patterns.

Cherry-picking metrics. "Resolution rate is up 5%!" while CSAT is down 0.5 points and escalations of legitimate issues dropped: you've made the agent worse and called it better.

No eval before shipping. Changes go live based on "it worked when I tested it manually." Production has more variance than your manual tests.

What "good enough" looks like

Reasonable eval cadence for a production agent:

  • Weekly: 30 calls human-graded; 200 LLM-graded.
  • Monthly: full eval set (200–500 calls) replayed through current prompt.
  • On every prompt change: full eval set replayed.
  • Quarterly: refresh eval set with new examples.

Total cost: roughly 4 hours of human time per week per agent + LLM eval costs (~$5–$20/run).

FAQ

How big should my eval set be? 100 calls minimum, 500 ideal. Below 100 you don't catch regressions reliably.

Can I evaluate without grading every call? Yes, random sampling is fine. The point is to catch patterns, not score every interaction.

Should I publish my eval rubric? Internally, yes. Externally, depends on competitive considerations. Most teams keep it internal.

Can I evaluate latency separately? Yes. Latency evals are cheap to automate and can run continuously; quality evals run on a slower cadence.

What's the difference between an eval and a regression test? Evals score quality (1–5 per criterion); regression tests check specific behaviors ("when the user says X, the agent should call function Y"). Use both.
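
To make the regression-test half concrete, here is a minimal sketch. The `fake_agent` function is a stand-in for a real agent, and the utterances and function names are hypothetical; the point is only the shape of the check, a fixed input paired with the function the agent must call.

```python
def fake_agent(user_text: str) -> str:
    """Stand-in for the real agent: returns the name of the function it would call."""
    if "reschedule" in user_text.lower():
        return "reschedule_appointment"
    return "none"

# Each case pairs a user utterance with the function the agent is expected to call.
REGRESSION_CASES = [
    ("I need to reschedule my visit", "reschedule_appointment"),
    ("What are your opening hours?", "none"),
]

def run_regressions(agent, cases):
    """Return (utterance, expected, got) for every case the agent gets wrong."""
    failures = []
    for text, expected in cases:
        got = agent(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

failures = run_regressions(fake_agent, REGRESSION_CASES)
print(failures)
```

Unlike an eval score, a regression case is pass or fail, which makes it suitable for running on every prompt change.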

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
