How to Benchmark a Voice Agent's End-to-End Latency

Tyler Weitzman
March 21, 2026 · 5 min read
Speechify

Vendor-reported latency is a lab number. What matters for your voice agent is measured latency in your production environment, under real network conditions, with your actual content. Benchmarking that number is not hard, but it's easy to do badly, and operators who skip it are flying blind when quality issues surface. This piece covers how to benchmark end-to-end latency properly.

TL;DR

  • Measure time-to-first-word (TTFW) from caller endpoint to agent audio.
  • Use synthetic test harnesses with known timing.
  • Track p50, p95, p99, not averages.
  • Measure per-stage breakdown for debugging.
  • Run continuously; alert on regressions.

The primary metric

Time to First Word (TTFW).

  • Starts when the caller finishes speaking.
  • Ends when the caller hears the first agent audio.
  • This is the primary latency metric for voice agents.

Target: sub-500ms median.

The measurement

End-to-end flow:

  1. Caller finishes speaking (t=0).
  2. VAD detects silence (t=~500ms due to endpoint threshold).
  3. STT finalizes transcript (t=~600ms).
  4. LLM generates first token (t=~800ms).
  5. TTS receives first sentence (t=~850ms).
  6. TTS emits first audio chunk (t=~950ms).
  7. Audio plays to caller (t=~1000ms).

TTFW in this example: ~1000ms.

Optimize each stage to reduce the total.

Per-stage breakdown

Instrument each stage:

  • Endpoint detection latency. Time from caller silence to VAD endpoint.
  • STT finalization. Time from endpoint to final transcript.
  • LLM first token. Time from transcript to first response token.
  • LLM to TTS. Time from first token to first sentence sent.
  • TTS first audio. Time from input to first audio.
  • Audio playback begin. Time from first audio to caller heard.

Add them up; optimize the longest stage first.
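As a minimal sketch of that breakdown, the snippet below turns the cumulative timestamps from the example flow above into per-stage durations and picks out the slowest stage. The stage names are illustrative labels, not names taken from any framework.

```python
# Cumulative timestamps in ms since the caller stopped speaking, copied from
# the example flow above. Stage names are illustrative labels.
EXAMPLE_MARKS_MS = [
    ("caller_end", 0),
    ("vad_endpoint", 500),
    ("stt_final", 600),
    ("llm_first_token", 800),
    ("tts_input", 850),
    ("tts_first_audio", 950),
    ("caller_hears_audio", 1000),
]


def stage_durations(marks):
    """Turn ordered (name, timestamp) marks into per-stage durations."""
    durations = {}
    for (prev_name, prev_ts), (name, ts) in zip(marks, marks[1:]):
        durations[f"{prev_name} -> {name}"] = ts - prev_ts
    return durations


durations = stage_durations(EXAMPLE_MARKS_MS)
ttfw_ms = sum(durations.values())            # stages add up to TTFW
slowest = max(durations, key=durations.get)  # the stage to optimize first
print(f"TTFW={ttfw_ms}ms, slowest stage: {slowest} ({durations[slowest]}ms)")
```

In the example flow, the VAD endpoint threshold alone accounts for half of the total, which is why it shows up as the first thing to tune.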

Synthetic testing

Build a test harness:

  1. Feed pre-recorded audio to the voice agent.
  2. Audio has a known "end" marker (silent moment).
  3. Record audio output.
  4. Measure time from marker to first audio in output.

Reproducible, fast, automated.
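One way to implement step 4 is sketched below. It assumes the caller and agent recordings are lists of float samples in [-1, 1] that share the same clock and start time, and it uses a simple amplitude threshold to find where speech ends and agent audio begins. The function names and the 0.02 threshold are assumptions for illustration; a real harness might use a proper VAD instead.

```python
def last_active_ms(samples, sample_rate, threshold=0.02):
    """Time in ms of the last sample louder than the threshold, or None."""
    for i in range(len(samples) - 1, -1, -1):
        if abs(samples[i]) > threshold:
            return 1000.0 * i / sample_rate
    return None


def first_active_ms(samples, sample_rate, after_ms=0.0, threshold=0.02):
    """Time in ms of the first loud sample at or after after_ms, or None."""
    start = int(after_ms * sample_rate / 1000.0)
    for i in range(start, len(samples)):
        if abs(samples[i]) > threshold:
            return 1000.0 * i / sample_rate
    return None


def measure_ttfw_ms(caller_audio, agent_audio, sample_rate):
    """TTFW: end of caller speech to first agent audio after that point."""
    end_ms = last_active_ms(caller_audio, sample_rate)
    if end_ms is None:
        return None
    start_ms = first_active_ms(agent_audio, sample_rate, after_ms=end_ms)
    if start_ms is None:
        return None
    return start_ms - end_ms


# Toy recordings at 8 kHz: 1 s of caller speech followed by silence, and an
# agent that starts speaking 1.5 s into the recording.
RATE = 8000
caller = [0.5] * RATE + [0.0] * RATE
agent = [0.0] * 12000 + [0.5] * 4000
print(measure_ttfw_ms(caller, agent, RATE))
```

Searching the agent track only after the caller's end marker keeps an earlier greeting or backchannel from being counted as the response.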

Real-world measurement

Synthetic tests are lab; real calls are production:

  • Sample real calls.
  • Manually annotate caller end timing.
  • Measure to first agent audio.
  • Aggregate.

Harder to automate but truest signal.

Distribution, not average

Averages hide tail problems:

  • Average 500ms: looks fine.
  • P95 1.2s: terrible tail experience.

Track:

  • P50: median. Typical experience.
  • P95: 95% of calls are faster than this; the slowest 5% wait longer. Worst tolerable.
  • P99: edge cases. Investigate these.
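A small illustration of why the average misleads, using Python's standard `statistics` module. The sample data is made up for the example: 90 calls at 400ms and 10 calls at 1400ms.

```python
from statistics import mean, quantiles

# Made-up sample: most calls are fast, one in ten is very slow.
ttfw_ms = [400] * 90 + [1400] * 10

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = quantiles(ttfw_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean(ttfw_ms)}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

The mean lands on the 500ms target while one caller in ten waits 1.4 seconds, which only the percentiles reveal.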

Time series

Latency shifts over time:

  • Vendor model updates.
  • Infrastructure changes.
  • Prompt changes.
  • Traffic patterns.

Daily charts. Alert on regressions.

Alerting

Set thresholds:

  • P50 over 600ms: warning.
  • P95 over 1000ms: critical.
  • Sudden jump > 20%: investigate.

Page on-call for critical regressions.
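The thresholds above can be encoded as a small check that runs on each reporting window. This is a sketch only: the function name is hypothetical, and it assumes the "sudden jump" is measured as the relative change in p50 against the previous window, which the thresholds above do not specify.

```python
def evaluate_alerts(p50_ms, p95_ms, baseline_p50_ms):
    """Return a list of (severity, message) tuples for one reporting window."""
    alerts = []
    if p50_ms > 600:
        alerts.append(("warning", f"p50 {p50_ms}ms is over 600ms"))
    if p95_ms > 1000:
        alerts.append(("critical", f"p95 {p95_ms}ms is over 1000ms"))
    # Assumption: "sudden jump" means p50 rose more than 20% versus the
    # previous window's p50.
    if baseline_p50_ms > 0:
        jump = (p50_ms - baseline_p50_ms) / baseline_p50_ms
        if jump > 0.20:
            alerts.append(("investigate", f"p50 rose {jump:.0%} vs baseline"))
    return alerts


for severity, message in evaluate_alerts(650, 900, baseline_p50_ms=500):
    print(severity, message)
```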

Network variability

Real networks have jitter. Measurements should:

  • Run from multiple geographic points.
  • Include various network types (WiFi, cellular).
  • Repeat over time of day.

Your metrics should reflect user diversity.

Cold vs warm

First-call-after-idle latency is often higher. Measure both:

  • Cold start p50 / p95.
  • Warm (within active session) p50 / p95.

Optimize with warmup / keep-alive.
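To report the two populations separately, calls first need to be labeled cold or warm. The sketch below treats a call as cold when it is the first call seen or follows an idle gap longer than a threshold; the 300-second gap and the function name are assumptions, and the right value depends on how long your providers keep connections and models warm.

```python
from statistics import median

IDLE_GAP_S = 300  # assumption: idle time after which the next call is "cold"


def split_cold_warm(calls, idle_gap_s=IDLE_GAP_S):
    """Split (start_time_s, ttfw_ms) records, sorted by start time."""
    cold, warm = [], []
    prev_start = None
    for start_s, ttfw_ms in calls:
        is_cold = prev_start is None or (start_s - prev_start) > idle_gap_s
        (cold if is_cold else warm).append(ttfw_ms)
        prev_start = start_s
    return cold, warm


# Made-up records: two bursts of calls separated by a long idle period.
calls = [(0, 1200), (10, 480), (20, 510), (1000, 1150), (1010, 495)]
cold, warm = split_cold_warm(calls)
print(f"cold p50={median(cold)}ms warm p50={median(warm)}ms")
```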

Vendor benchmark validation

When vendor says "sub-300ms":

  • Measure in your environment.
  • Compare to their claim.
  • Investigate discrepancies.

Often vendor benchmarks are best-case; yours may be 20-50% higher.

Instrumentation

Code-level:

  • Log a timestamp at each stage.
  • Compute latency per call.
  • Aggregate and report.

Frameworks such as Pipecat and LiveKit may expose these timings; custom orchestration requires manual instrumentation.

Dashboards

  • Overview: TTFW p50/p95/p99 trend.
  • By stage: where's latency coming from?
  • By geography: regional variance.
  • By time of day: load effects.
  • Regression alerts: recent changes.

A/B testing latency

When evaluating a change (new TTS vendor, prompt update, model swap):

  • Run both variants in parallel.
  • Measure TTFW for each.
  • Check that the difference is statistically significant.
  • Pick the winner.
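For the significance step, one option that makes no assumptions about the shape of the latency distribution is a permutation test on the difference in medians. The article does not prescribe a specific test, so treat this as one reasonable choice; the function name and the sample data are made up for illustration.

```python
import random
from statistics import median


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in medians."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up TTFW samples in ms for the current setup (A) and the candidate (B).
variant_a = [520, 540, 510, 530, 525, 515, 535, 545, 505, 528]
variant_b = [450, 470, 440, 460, 455, 445, 465, 475, 435, 458]
p_value = permutation_p_value(variant_a, variant_b)
print(f"median A={median(variant_a)} median B={median(variant_b)} p={p_value:.4f}")
```

A small p-value says the gap is unlikely to be noise; it says nothing about quality, so pair it with the quality measurements discussed in the trade-off section.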

The trade-off matrix

Some optimizations trade latency for quality:

  • Smaller LLM: faster, worse quality.
  • Aggressive VAD: faster, cuts off callers.
  • Budget TTS: faster, less natural.

Measure both dimensions.

End-to-end including callers' networks

Full user experience:

  • Include the caller's network RTT to your infra.
  • Budget an extra 30-100ms for it.

Some teams measure this; others measure server-side only.

Benchmark script example

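import asyncio

import numpy as np

# place_call_with_audio, get_audio_end_ts, and get_first_audio_ts are
# placeholders for your own telephony client and audio-analysis helpers.

async def benchmark():
    # Place a call with known audio that ends in silence
    call = await place_call_with_audio("test_utterance_ends_with_silence.wav")

    end_of_caller_speech_timestamp = get_audio_end_ts(call.input_audio)
    first_audio_out_timestamp = get_first_audio_ts(call.output_audio)

    # TTFW: end of caller speech to first agent audio
    return first_audio_out_timestamp - end_of_caller_speech_timestamp

async def main():
    # Run 100 iterations one at a time so test calls do not compete
    results = [await benchmark() for _ in range(100)]
    # Percentiles are in the same units as the helper timestamps
    p50, p95, p99 = np.percentile(results, [50, 95, 99])
    print(f"TTFW p50={p50:.3f} p95={p95:.3f} p99={p99:.3f}")

asyncio.run(main())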

Budget allocation

For a 500ms TTFW target:

  • VAD endpoint: 100-150ms.
  • STT finalize: 50-100ms.
  • LLM first token: 150-250ms.
  • TTS first audio: 100-150ms.
  • Network overhead: 50ms.

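Fit each stage within its budget. The low ends of these ranges sum to 450ms and the high ends to 700ms, so not every stage can sit at the top of its range.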
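A quick way to keep the budget honest is to encode it and compare measured stage latencies against it. The sketch below uses the ranges listed above; the stage keys, the function name, and the measured values are hypothetical.

```python
TARGET_MS = 500

# (low, high) budget per stage in ms, from the list above. Keys are
# illustrative names.
BUDGET_MS = {
    "vad_endpoint": (100, 150),
    "stt_finalize": (50, 100),
    "llm_first_token": (150, 250),
    "tts_first_audio": (100, 150),
    "network": (50, 50),
}


def over_budget(measured_ms, budget=BUDGET_MS):
    """Return {stage: ms over the high end} for stages exceeding budget."""
    return {
        stage: measured_ms[stage] - high
        for stage, (_low, high) in budget.items()
        if measured_ms.get(stage, 0) > high
    }


low_total = sum(low for low, _ in BUDGET_MS.values())
high_total = sum(high for _, high in BUDGET_MS.values())
print(f"budget range: {low_total}-{high_total}ms against a {TARGET_MS}ms target")

# Made-up measurements for one call.
measured = {"vad_endpoint": 500, "stt_finalize": 100, "llm_first_token": 200,
            "tts_first_audio": 100, "network": 50}
print(over_budget(measured))
```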

Common pitfalls

Averaging only. Missing tail problems.

Synthetic-only testing. Lab conditions; real calls differ.

No per-stage breakdown. Can't diagnose regressions.

Ignoring cold starts. First call experience matters.

No alerting. Regressions go unnoticed.

Vendor benchmark trust. Their numbers; your reality.

FAQ

How often should we benchmark? Continuously. Daily reports; real-time alerts.

What's a good TTFW target? Under 500ms median; under 1000ms p95.

Does latency affect conversion? Measurably. Every 100ms above 500ms hurts.

Can we rely on vendor dashboards? For their portion only. You own end-to-end.

What about multi-turn latency? Measure each turn. Later turns should match the first; if they don't, investigate context growth.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
