How to A/B Test Voice Agent Prompts

Tyler Weitzman
January 21, 2026 · 5 min read
Speechify

Most teams don't A/B test voice agent prompts. They tweak the prompt, listen to a few calls, and ship if it "feels better." This works until it doesn't — until a tweak that helps one use case silently breaks another. A real A/B testing workflow catches the regressions before they reach customers.

TL;DR

  • The eval workflow: replay your eval set through both prompt versions; compare scores.
  • Don't ship a prompt change that wins on average but causes critical regressions on specific cases.
  • For live A/B tests, route 10–20% of traffic to the new prompt; compare CSAT and resolution rate.
  • Eval set quality matters more than eval volume.

Two flavors of A/B testing

Offline (replay). Take your eval set of historical call transcripts; run each through both prompt versions; score the resulting agent behavior. Cheap, fast, controlled.

Online (live). Route a fraction of real calls to each variant; collect outcome metrics over days/weeks. Expensive, slower, real-world.

Use offline for nearly every change. Use online when offline can't capture what you care about (long-term outcomes, customer perception).

Building an eval set

Cover what matters:

  • 30–50 happy-path calls.
  • 15–25 calls with common failure modes.
  • 10–20 calls with edge cases you've seen.
  • 5–10 calls where escalation was correct.
  • 5–10 calls where escalation was wrong.

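Total: roughly 65–115 calls to start, going by the ranges above. Add 10–20 new calls per week so the set grows beyond 100, and refresh quarterly by retiring calls that no longer reflect current traffic.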

Store as text transcripts (with timing if available). You don't need audio — replaying the LLM behavior on the transcript is enough for most prompt changes.
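One possible way to store an eval case as a text transcript with optional timing is sketched below. The field names (`call_id`, `category`, `turns`, `start_ms`), the category labels, and the JSON Lines layout are assumptions made for illustration, not a format the article prescribes.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Turn:
    speaker: str                    # "caller" or "agent"
    text: str
    start_ms: Optional[int] = None  # timing is optional; None when unavailable


@dataclass
class EvalCase:
    call_id: str
    category: str                   # hypothetical labels, e.g. "happy_path", "edge_case"
    turns: list[Turn] = field(default_factory=list)


def to_jsonl_line(case: EvalCase) -> str:
    """Serialize one eval case as a single JSON Lines record."""
    return json.dumps(asdict(case), ensure_ascii=False)


def from_jsonl_line(line: str) -> EvalCase:
    """Rebuild an eval case from one JSON Lines record."""
    raw = json.loads(line)
    turns = [Turn(**t) for t in raw["turns"]]
    return EvalCase(call_id=raw["call_id"], category=raw["category"], turns=turns)


example = EvalCase(
    call_id="call-0001",
    category="happy_path",
    turns=[
        Turn("caller", "Hi, I need to reschedule my appointment.", 0),
        Turn("agent", "Sure, what day works for you?", 2100),
    ],
)
line = to_jsonl_line(example)
restored = from_jsonl_line(line)
```

Keeping one call per line makes it easy to append the weekly additions and to diff the eval set in version control.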

The replay workflow

For each call in the eval set:

  1. Reconstruct the conversation up to a specific turn.
  2. Run the LLM with prompt version A; record the reply.
  3. Run the LLM with prompt version B; record the reply.
  4. Score both replies on your rubric.
  5. Compare aggregate scores.

For longer calls, you might pick 2–3 turns per call to evaluate (not every turn).
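The replay steps above could be sketched as follows. The `llm` callable is a hypothetical interface (system prompt plus history in, reply out), and `fake_llm` is a stand-in so the sketch runs offline; swap in your own model client. Spreading the evaluated turns evenly across the call is also an assumption, one of several reasonable ways to pick 2–3 turns.

```python
from typing import Callable

# A turn is (speaker, text); a call is a list of turns.
Turn = tuple[str, str]
# Hypothetical LLM interface: (system_prompt, history) -> agent reply.
LLM = Callable[[str, list[Turn]], str]


def agent_turn_indices(call: list[Turn]) -> list[int]:
    """Indices of turns where the agent spoke in the original call."""
    return [i for i, (speaker, _) in enumerate(call) if speaker == "agent"]


def replay_call(call: list[Turn], prompt_a: str, prompt_b: str,
                llm: LLM, max_turns: int = 3) -> list[dict]:
    """Replay selected agent turns under both prompts and collect reply pairs."""
    indices = agent_turn_indices(call)
    # Evaluate at most `max_turns` turns, spread evenly across the call.
    if len(indices) > max_turns:
        step = len(indices) / max_turns
        indices = [indices[int(k * step)] for k in range(max_turns)]
    results = []
    for i in indices:
        history = call[:i]  # conversation up to, but not including, this agent turn
        results.append({
            "turn": i,
            "reply_a": llm(prompt_a, history),
            "reply_b": llm(prompt_b, history),
        })
    return results


def fake_llm(system_prompt: str, history: list[Turn]) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    last = history[-1][1] if history else ""
    return f"[{system_prompt}] reply to: {last}"


call = [
    ("caller", "I want to cancel my order."),
    ("agent", "I can help with that. What's the order number?"),
    ("caller", "It's 1234."),
    ("agent", "Thanks, order 1234 is cancelled."),
]
pairs = replay_call(call, "prompt-A", "prompt-B", fake_llm)
```

Each entry in `pairs` holds the two candidate replies for one turn, ready to be scored on the rubric.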

The rubric

Keep it tight. Five criteria max:

  1. Correctness. Did the agent give correct info / call the right tools? (1–5)
  2. Tone. Did the agent sound appropriate for the brand? (1–5)
  3. Concision. Was the reply appropriately brief for voice? (1–5)
  4. Recovery. If there was an error or unclear input, did the agent handle it well? (1–5)
  5. Escalation. If escalation was needed, did the agent escalate correctly? (1–5)

Score each turn (or each call) on each criterion, then average each criterion across the eval set.
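A minimal sketch of that aggregation is shown below. It assumes each scored turn is a dict keyed by criterion, and that criteria which do not apply to a turn (for example, recovery when nothing went wrong) are recorded as `None` and skipped. Both choices are assumptions; the overall figure here is a plain mean of the per-criterion means.

```python
from statistics import mean

CRITERIA = ("correctness", "tone", "concision", "recovery", "escalation")


def aggregate(scores: list[dict]) -> dict:
    """Mean per criterion plus an overall mean, skipping criteria that do not apply."""
    summary = {}
    for criterion in CRITERIA:
        values = [s[criterion] for s in scores if s.get(criterion) is not None]
        summary[criterion] = mean(values) if values else None
    applicable = [v for v in summary.values() if v is not None]
    summary["overall"] = mean(applicable) if applicable else None
    return summary


scores = [
    {"correctness": 5, "tone": 4, "concision": 4, "recovery": None, "escalation": None},
    {"correctness": 3, "tone": 4, "concision": 5, "recovery": 2, "escalation": None},
]
summary = aggregate(scores)
```

Reporting per-criterion means alongside the overall number keeps a drop in one criterion from being masked by gains in another.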

The decision rule

A prompt change ships if:

  • Average score on the eval set is at least as good as the current prompt.
  • No critical regressions: no individual call where the new prompt scored 2+ points lower than the current.
  • Latency hasn't regressed (changes that add prompt tokens increase time to first token, or TTFT).

If average is up but you have a critical regression on a specific call, fix the regression before shipping.
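The decision rule could be encoded roughly as below. This sketch assumes one score per call for each prompt (for example, the call's mean across rubric criteria) and a single TTFT figure per prompt; the function name, parameters, and the zero-millisecond latency tolerance are hypothetical defaults to adjust.

```python
from statistics import mean


def should_ship(current: dict, candidate: dict,
                ttft_current_ms: float, ttft_candidate_ms: float,
                regression_threshold: float = 2.0,
                latency_tolerance_ms: float = 0.0) -> dict:
    """Apply the three checks: average, per-call regressions, latency.

    `current` and `candidate` map call_id -> score for the same eval set.
    """
    if current.keys() != candidate.keys():
        raise ValueError("both prompts must be scored on the same calls")
    average_ok = mean(candidate.values()) >= mean(current.values())
    # A call is a critical regression if the candidate dropped by the threshold or more.
    regressions = sorted(
        call_id for call_id in current
        if current[call_id] - candidate[call_id] >= regression_threshold
    )
    latency_ok = ttft_candidate_ms <= ttft_current_ms + latency_tolerance_ms
    return {
        "ship": average_ok and not regressions and latency_ok,
        "average_ok": average_ok,
        "regressions": regressions,
        "latency_ok": latency_ok,
    }


current = {"c1": 4.0, "c2": 4.5, "c3": 3.0}
candidate = {"c1": 5.0, "c2": 2.5, "c3": 5.0}
decision = should_ship(current, candidate, ttft_current_ms=420, ttft_candidate_ms=415)
```

In this example the candidate wins on average, but call `c2` drops by 2 points, so `decision["ship"]` is `False` and `c2` is listed for follow-up.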

LLM-as-judge

For scale, use an LLM to score the eval. Prompt:

You are evaluating a voice agent's reply.
Conversation context: [transcript so far]
Agent reply: [the candidate reply]

Score 1-5 on each:
- Correctness: ...
- Tone: ...
- (etc.)

Return JSON: { "correctness": N, "tone": N, ... }

Run the same eval set through both versions; collect scores; compare.

LLM-as-judge is noisier than human grading but scales. Best practice: validate against human grading on 10% of cases.
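Because judge output is free text until parsed, it helps to validate it before trusting the scores. The sketch below parses the judge's JSON, rejects missing or out-of-range values, and computes a simple agreement rate against human grades on a validation sample. The function names and the "within 1 point" tolerance are assumptions; pick whatever agreement measure suits your rubric.

```python
import json

CRITERIA = ("correctness", "tone", "concision", "recovery", "escalation")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    scores = {}
    for criterion in CRITERIA:
        value = data.get(criterion)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad score for {criterion!r}: {value!r}")
        scores[criterion] = value
    return scores


def agreement_rate(judge: list[dict], human: list[dict], tolerance: int = 1) -> float:
    """Share of (case, criterion) pairs where judge and human differ by <= tolerance."""
    pairs = [(j[c], h[c]) for j, h in zip(judge, human) for c in CRITERIA]
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


raw = '{"correctness": 4, "tone": 5, "concision": 3, "recovery": 4, "escalation": 5}'
judge_scores = [parse_judge_output(raw)]
human_scores = [{"correctness": 4, "tone": 3, "concision": 3, "recovery": 5, "escalation": 5}]
rate = agreement_rate(judge_scores, human_scores)
```

A low agreement rate on the human-graded sample is a signal to revise the judge prompt before relying on its scores.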

Live A/B testing

For changes that offline eval can't capture (long-term outcomes, real customer perception), run a live A/B:

  1. Route 10–20% of calls to the new prompt; the rest to the current.
  2. Tag each call with the variant.
  3. Wait 1–2 weeks.
  4. Compare:
    • Resolution rate
    • Escalation rate
    • CSAT
    • Average handle time
    • Cost per call

Decision: ship the variant if it wins on the metrics you care about and doesn't lose on the others.
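One common way to implement the routing and tagging steps is a deterministic hash of the call ID, sketched below. The function name, the `salt` value, and the 15% default share are assumptions for illustration; hashing means the same ID always gets the same variant without storing any state.

```python
import hashlib


def assign_variant(call_id: str, new_prompt_share: float = 0.15,
                   salt: str = "prompt-test-1") -> str:
    """Deterministically map a call to "new" or "current" from a hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{call_id}".encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_prompt_share else "current"


# Tag each call with its variant so outcome metrics can be split later.
assignments = {f"call-{i}": assign_variant(f"call-{i}") for i in range(10_000)}
share_new = sum(v == "new" for v in assignments.values()) / len(assignments)
```

Changing the salt per experiment reshuffles the buckets, so the same calls are not always the ones exposed to new prompts. If repeat callers should see a consistent variant, hashing a caller identifier instead of the call ID is one option.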

What goes wrong

Common A/B testing mistakes:

Eval set drift. Your eval set was built 6 months ago; current customers ask different things; you're optimizing for old patterns.

Cherry-picking. "Resolution went up!" while CSAT went down. Track multiple metrics.

Underpowered tests. A 100-call A/B test won't detect a 5% improvement reliably. Either accept noisy signal or run longer.
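To get a feel for how many calls a test needs, the sketch below uses the standard sample-size approximation for a two-sided two-proportion z-test. The 70% baseline resolution rate, the 5-percentage-point lift, the 5% significance level, and 80% power are all assumed example values, and the formula assumes equal-sized arms (an uneven split needs more total calls).

```python
from math import ceil, sqrt
from statistics import NormalDist


def calls_per_variant(p_current: float, p_new: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate calls needed per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_current + p_new) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_current * (1 - p_current) + p_new * (1 - p_new))
    ) ** 2
    return ceil(numerator / (p_new - p_current) ** 2)


# Hypothetical baseline: 70% resolution rate, hoping to detect a lift to 75%.
needed = calls_per_variant(0.70, 0.75)
```

Under these assumptions `needed` comes out far above 100 calls per variant, which is why small tests produce noisy signal for improvements of this size.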

Confounders. Volume changed, customer mix changed, holiday traffic. Pick a stable period for the test.

Overfitting. Tuning the prompt against the eval set until it scores 95% but fails on production calls. Hold out 20% of the eval set as a true test.
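A holdout split can be as simple as the sketch below, which shuffles call IDs with a fixed seed and reserves 20% that are never used while tuning the prompt. The function name and seed are hypothetical. This is a plain random split; stratifying by call category is a reasonable refinement if some categories are small.

```python
import random


def split_holdout(call_ids, holdout_fraction: float = 0.2, seed: int = 0):
    """Shuffle with a fixed seed and reserve a fraction as the untouched test set."""
    ids = sorted(call_ids)  # sort first so the split does not depend on input order
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]  # (tuning set, holdout set)


call_ids = [f"call-{i:04d}" for i in range(100)]
tuning, holdout = split_holdout(call_ids)
```

Score the holdout set only when a prompt is a serious candidate for shipping; checking it after every tweak turns it into another tuning set.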

When to skip A/B testing

A few cases where it's overkill:

  • Cosmetic prompt changes (typo fixes, renaming variables).
  • Adding a clearly defensive rule (e.g., "never give medical advice").
  • Adding a new tool that the agent will use only when the new feature is invoked.

For everything else, run the eval. The discipline is worth it.

Automation

Three things worth automating:

Eval execution. A script that runs your eval set through any prompt version and outputs scores.

Diff visualization. A view showing per-call score deltas between versions.

Regression alerts. Flag any call where the new prompt scored 2 or more points lower than the current one.

Most platforms include some version of this. If yours doesn't, build it.
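If you do build it, a per-call diff view can start as a plain-text report like the sketch below. It assumes one score per call for each prompt version; the function names, the column layout, and the alert threshold of a 2-point drop are illustrative choices.

```python
def score_deltas(current: dict, candidate: dict) -> list[tuple[str, float]]:
    """Per-call delta (candidate minus current), worst regressions first."""
    deltas = [(cid, candidate[cid] - current[cid]) for cid in current if cid in candidate]
    return sorted(deltas, key=lambda item: item[1])


def format_report(deltas: list[tuple[str, float]], alert_threshold: float = -2.0) -> str:
    """Plain-text table; calls at or below the threshold get an ALERT marker."""
    lines = [f"{'call':<12}{'delta':>7}"]
    for call_id, delta in deltas:
        marker = "  ALERT" if delta <= alert_threshold else ""
        lines.append(f"{call_id:<12}{delta:>+7.1f}{marker}")
    return "\n".join(lines)


current = {"call-01": 4.0, "call-02": 4.5, "call-03": 3.0}
candidate = {"call-01": 4.5, "call-02": 2.0, "call-03": 3.0}
deltas = score_deltas(current, candidate)
report = format_report(deltas)
print(report)
```

Sorting worst-first puts the calls that need a human look at the top of the report.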

FAQ

How big should my eval set be? 100 minimum. 200–500 ideal. Below 100 you can't catch subtle regressions.

How long should a live A/B test run? Until you have at least 500 calls per variant (1–2 weeks for moderate volume). Longer if you're tracking outcome metrics.

Can I A/B test multiple prompts at once? Statistically harder but doable. Most teams stick to A/B (two variants) for simplicity.

Should I always ship the winning variant? Yes — but verify the regression check first. Average wins can hide critical losses.

What's the most common prompt change? Tightening recovery language ("sorry — let me try that again") and adding voice style rules. Both usually win.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
