How to Benchmark a Voice Agent's End-to-End Latency
Vendor-reported latency is a lab number. What matters for your voice agent is measured latency in your production environment, under real network conditions, with your actual content. Benchmarking that number is not hard, but it's easy to do badly, and operators who skip it are flying blind when quality issues surface. This piece covers how to benchmark end-to-end latency properly.
TL;DR
- Measure time-to-first-word (TTFW) from caller endpoint to agent audio.
- Use synthetic test harnesses with known timing.
- Track p50, p95, p99, not averages.
- Measure per-stage breakdown for debugging.
- Run continuously; alert on regressions.
The primary metric
Time to First Word (TTFW).
- Starts when the caller finishes speaking.
- Ends when the caller hears the first agent audio.
- Primary latency metric for voice agents.
Target: sub-500ms median.
The measurement
End-to-end flow:
- Caller finishes speaking (t=0).
- VAD detects silence (t=~500ms due to endpoint threshold).
- STT finalizes transcript (t=~600ms).
- LLM generates first token (t=~800ms).
- TTS receives first sentence (t=~850ms).
- TTS emits first audio chunk (t=~950ms).
- Audio plays to caller (t=~1000ms).
TTFW in this example: ~1000ms.
Optimize each stage to reduce the total.
Per-stage breakdown
Instrument each stage:
- Endpoint detection latency. Time from caller silence to VAD endpoint.
- STT finalization. Time from endpoint to final transcript.
- LLM first token. Time from transcript to first response token.
- LLM to TTS. Time from first token to first sentence sent.
- TTS first audio. Time from input to first audio.
- Audio playback begin. Time from first audio chunk to the caller hearing it.
Add them up; optimize the longest stage first.
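As a rough sketch of that arithmetic, the snippet below turns stage-boundary timestamps into per-stage durations and picks the longest one. The timestamps are the illustrative values from the example flow earlier in this piece, not measurements, and the stage names and the `stage_durations` helper are made up for this example.

```python
# Stage-boundary timestamps in ms since the caller stopped speaking (t=0).
# Illustrative values from the example flow, not real measurements.
STAGE_TIMESTAMPS_MS = [
    ("caller_end", 0),
    ("vad_endpoint", 500),
    ("stt_final", 600),
    ("llm_first_token", 800),
    ("tts_input", 850),
    ("tts_first_audio", 950),
    ("caller_hears_audio", 1000),
]


def stage_durations(timestamps):
    """Return {stage_name: duration_ms} for each consecutive pair of marks."""
    durations = {}
    for (_, start), (name, end) in zip(timestamps, timestamps[1:]):
        durations[name] = end - start
    return durations


durations = stage_durations(STAGE_TIMESTAMPS_MS)
ttfw_ms = sum(durations.values())
slowest_stage = max(durations, key=durations.get)

for name, ms in durations.items():
    print(f"{name:>20}: {ms:4d} ms")
print(f"TTFW: {ttfw_ms} ms, slowest stage: {slowest_stage}")
```

With these example numbers, endpoint detection accounts for half of the total, so it is the first place to look.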
Synthetic testing
Build a test harness:
- Feed pre-recorded audio to the voice agent.
- Audio has a known "end" marker (silent moment).
- Record audio output.
- Measure time from marker to first audio in output.
Reproducible, fast, automated.
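One way to find the marker and the first agent audio is a simple energy threshold over short frames. The sketch below assumes the input and output recordings share a time origin and sample rate, that samples are floats in [-1, 1], and that a 10 ms frame and a fixed RMS threshold are good enough; all of these are assumptions to tune, and the function names are invented for illustration.

```python
import math

FRAME_MS = 10  # analysis window; an assumed value, tune for your audio


def frame_rms(samples, sample_rate, frame_ms=FRAME_MS):
    """Yield (frame_start_ms, rms) for consecutive non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        yield i * 1000 // sample_rate, rms


def speech_end_ms(samples, sample_rate, threshold):
    """End time of the last frame whose RMS exceeds the threshold, or None."""
    end = None
    for start_ms, rms in frame_rms(samples, sample_rate):
        if rms > threshold:
            end = start_ms + FRAME_MS
    return end


def first_audio_ms(samples, sample_rate, threshold):
    """Start time of the first frame whose RMS exceeds the threshold, or None."""
    for start_ms, rms in frame_rms(samples, sample_rate):
        if rms > threshold:
            return start_ms
    return None


# Synthetic check at 8 kHz with constant-level "audio":
# caller is loud for 300 ms, agent becomes loud at 1100 ms.
RATE = 8000
caller = [0.5] * (RATE * 300 // 1000) + [0.0] * (RATE * 1200 // 1000)
agent = [0.0] * (RATE * 1100 // 1000) + [0.5] * (RATE * 400 // 1000)

ttfw_ms = first_audio_ms(agent, RATE, 0.1) - speech_end_ms(caller, RATE, 0.1)
print(f"TTFW: {ttfw_ms} ms")
```

Real recordings contain noise and comfort tones, so a fixed threshold may need replacing with a proper voice activity detector; the timing arithmetic stays the same.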
Real-world measurement
Synthetic tests are lab conditions; real calls are production:
- Sample real calls.
- Manually annotate caller end timing.
- Measure to first agent audio.
- Aggregate.
Harder to automate, but the truest signal.
Distribution, not average
Averages hide tail problems:
- Average 500ms: looks fine.
- P95 1.2s: terrible tail experience.
Track:
- P50: median. Typical experience.
- P95: 5% of calls are slower than this. Worst tolerable.
- P99: edge cases. Investigate these.
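A small worked example shows how the average hides the tail. The data below is invented for illustration (90 fast calls, 10 slow ones), and `percentile` is a minimal nearest-rank helper written for this sketch rather than a library function.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative TTFW samples in ms: 90 fast calls and 10 slow ones.
ttfw_samples = [400] * 90 + [1400] * 10

print("mean:", mean(ttfw_samples))
print("p50: ", percentile(ttfw_samples, 50))
print("p95: ", percentile(ttfw_samples, 95))
print("p99: ", percentile(ttfw_samples, 99))
```

Here the mean is 500 ms, which looks fine, while p95 and p99 are both 1400 ms: one caller in ten waits well over a second.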
Time series
Latency shifts over time:
- Vendor model updates.
- Infrastructure changes.
- Prompt changes.
- Traffic patterns.
Daily charts. Alert on regressions.
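A daily chart needs samples bucketed by day. The sketch below groups samples by UTC date and takes the median; the `(unix_timestamp_seconds, ttfw_ms)` tuple format and the sample values are assumptions for illustration, and the same grouping works for p95 or p99.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median


def daily_p50(samples):
    """Group (unix_ts_seconds, ttfw_ms) samples by UTC date and return {date: median_ms}."""
    by_day = defaultdict(list)
    for ts, ttfw_ms in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        by_day[day].append(ttfw_ms)
    return {day: median(values) for day, values in sorted(by_day.items())}


# Illustrative samples: three calls on one day, three on the next.
samples = [
    (1_700_000_000, 480), (1_700_000_600, 520), (1_700_001_200, 500),
    (1_700_086_400, 610), (1_700_087_000, 650), (1_700_087_600, 630),
]

daily = daily_p50(samples)
for day, p50 in daily.items():
    print(day.isoformat(), p50)
```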
Alerting
Set thresholds:
- P50 over 600ms: warning.
- P95 over 1000ms: critical.
- Sudden jump > 20%: investigate.
Page on-call for critical regressions.
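These thresholds translate directly into a check that runs on each reporting window. The sketch below is one possible shape, assuming the 20% jump is measured on p50 against a baseline such as the previous day's value; the source does not say which metric the jump applies to, and the function name and severity labels are invented for illustration.

```python
P50_WARNING_MS = 600
P95_CRITICAL_MS = 1000
JUMP_RATIO = 0.20  # assumed to apply to p50 versus a baseline window


def evaluate_alerts(p50_ms, p95_ms, baseline_p50_ms):
    """Return a list of (severity, message) tuples for the current window."""
    alerts = []
    if p50_ms > P50_WARNING_MS:
        alerts.append(("warning", f"p50 {p50_ms} ms is over {P50_WARNING_MS} ms"))
    if p95_ms > P95_CRITICAL_MS:
        alerts.append(("critical", f"p95 {p95_ms} ms is over {P95_CRITICAL_MS} ms"))
    if baseline_p50_ms > 0:
        jump = (p50_ms - baseline_p50_ms) / baseline_p50_ms
        if jump > JUMP_RATIO:
            alerts.append(("investigate", f"p50 rose {jump:.0%} over baseline"))
    return alerts


for severity, message in evaluate_alerts(p50_ms=650, p95_ms=1200, baseline_p50_ms=500):
    print(severity, "-", message)
```

Routing the "critical" severity to the pager and the others to a channel or ticket queue keeps on-call noise down.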
Network variability
Real networks have jitter. Measurements should:
- Run from multiple geographic points.
- Include various network types (WiFi, cellular).
- Repeat at different times of day.
Your metrics should reflect user diversity.
Cold vs warm
First-call-after-idle latency is often higher. Measure both:
- Cold start p50 / p95.
- Warm (within active session) p50 / p95.
Optimize with warmup / keep-alive.
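To report the two populations separately, each call has to be labeled cold or warm. The sketch below uses a simple idle-gap rule: a call is cold if it is the first one seen or starts more than a cutoff after the previous call. The 300-second cutoff, the `(start_ts_seconds, ttfw_ms)` format, and the sample values are all assumptions; your platform's real idle behavior should set the rule.

```python
from statistics import median

COLD_IDLE_SECONDS = 300  # assumed cutoff; match it to your platform's idle timeout


def split_cold_warm(calls, cold_idle_seconds=COLD_IDLE_SECONDS):
    """Split (start_ts_seconds, ttfw_ms) calls into (cold, warm) TTFW lists.

    A call counts as cold when it is the first call seen, or when it starts
    more than cold_idle_seconds after the previous call started.
    """
    cold, warm = [], []
    previous_ts = None
    for ts, ttfw_ms in sorted(calls):
        is_cold = previous_ts is None or ts - previous_ts > cold_idle_seconds
        (cold if is_cold else warm).append(ttfw_ms)
        previous_ts = ts
    return cold, warm


# Illustrative calls: a burst, a long idle gap, then another burst.
calls = [(0, 900), (30, 450), (60, 470), (1000, 950), (1020, 460)]
cold, warm = split_cold_warm(calls)
print("cold p50:", median(cold), "warm p50:", median(warm))
```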
Vendor benchmark validation
When a vendor says "sub-300ms":
- Measure in your environment.
- Compare to their claim.
- Investigate discrepancies.
Vendor benchmarks are often best-case; your numbers may be 20-50% higher.
Instrumentation
Code-level:
- Log a timestamp at each stage.
- Compute latency per call.
- Aggregate and report.
Frameworks such as Pipecat and LiveKit may expose per-stage timings; custom orchestration requires manual instrumentation.
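For custom orchestration, manual instrumentation can be as small as a per-turn object that records a monotonic timestamp at each stage. `StageTimer` below is a hypothetical helper written for this sketch; it is not a Pipecat or LiveKit API, and the stage names are placeholders.

```python
import time


class StageTimer:
    """Records a monotonic timestamp per stage for a single call turn."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can supply a fake clock
        self._marks = []     # (stage_name, timestamp_seconds) in arrival order

    def mark(self, stage):
        """Record that a stage boundary was reached now."""
        self._marks.append((stage, self._clock()))

    def latencies_ms(self):
        """Return {stage: ms since the previous mark}; the first mark is the origin."""
        result = {}
        for (_, prev_ts), (stage, ts) in zip(self._marks, self._marks[1:]):
            result[stage] = (ts - prev_ts) * 1000
        return result

    def total_ms(self):
        """Elapsed ms between the first and last mark, or 0.0 with fewer than two marks."""
        if len(self._marks) < 2:
            return 0.0
        return (self._marks[-1][1] - self._marks[0][1]) * 1000


timer = StageTimer()
timer.mark("caller_end")
timer.mark("vad_endpoint")
timer.mark("stt_final")
print(timer.latencies_ms(), timer.total_ms())
```

A monotonic clock is used because wall-clock time can jump during a call, for example on an NTP adjustment. Timestamps taken on different machines still need a shared reference before they can be subtracted.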
Dashboards
- Overview: TTFW p50/p95/p99 trend.
- By stage: where's latency coming from?
- By geography: regional variance.
- By time of day: load effects.
- Regression alerts: recent changes.
A/B testing latency
When evaluating a change (new TTS vendor, prompt update, model swap):
- Run parallel A/B.
- Measure TTFW for each.
- Test for statistical significance.
- Pick winner.
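For the significance step, latency distributions are usually skewed, so a test that does not assume normality is a reasonable default. The sketch below is a permutation test on the difference of medians; the choice of test, the sample data, and the function name are assumptions for illustration, not a prescribed method.

```python
import random
from statistics import median


def permutation_p_value(a, b, iterations=5000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 terms keep the estimate from ever being exactly zero.
    return (hits + 1) / (iterations + 1)


# Illustrative TTFW samples in ms for the current setup (A) and a candidate (B).
variant_a = [520, 540, 510, 560, 530, 550, 525, 545, 515, 535]
variant_b = [450, 470, 440, 480, 460, 455, 465, 445, 475, 435]

p_value = permutation_p_value(variant_a, variant_b)
print(f"median A={median(variant_a)} ms, median B={median(variant_b)} ms, p={p_value:.4f}")
```

A small p-value says the gap is unlikely to be noise; whether the gap is worth any quality cost is a separate decision.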
The trade-off matrix
Some optimizations trade latency for quality:
- Smaller LLM: faster, worse quality.
- Aggressive VAD: faster, cuts off callers.
- Budget TTS: faster, less natural.
Measure both dimensions.
End-to-end including callers' networks
Full user experience:
- Caller's network RTT to your infra.
- Adds 30-100ms on top of server-side measurements.
Some teams measure this; others measure server-side only.
Benchmark script example
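import asyncio
import numpy as np

# place_call_with_audio, get_audio_end_ts and get_first_audio_ts are
# placeholders for your own test harness; they are not defined here.

async def benchmark():
    # Place a call with known audio that ends in silence
    call = await place_call_with_audio("test_utterance_ends_with_silence.wav")
    end_of_caller_speech_timestamp = get_audio_end_ts(call.input_audio)
    first_audio_out_timestamp = get_first_audio_ts(call.output_audio)
    # TTFW: end of caller speech to first agent audio
    ttfw = first_audio_out_timestamp - end_of_caller_speech_timestamp
    return ttfw

async def main():
    # Run 100 iterations sequentially; aggregate as percentiles, not a mean
    results = [await benchmark() for _ in range(100)]
    p50 = np.percentile(results, 50)
    p95 = np.percentile(results, 95)
    print(f"TTFW p50={p50}, p95={p95}")

asyncio.run(main())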
Budget allocation
For a 500ms TTFW target:
- VAD endpoint: 100-150ms.
- STT finalize: 50-100ms.
- LLM first token: 150-250ms.
- TTS first audio: 100-150ms.
- Network overhead: 50ms.
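The low ends of these ranges sum to 450ms and the high ends to 700ms, so the stages cannot all sit at the top of their range. Fit each stage within its budget and check that the total still meets the target.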
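The budget arithmetic is easy to automate. The sketch below sums the ranges from the list above and flags any measured stage that exceeds the top of its range; the stage keys, the `over_budget` helper, and the measured values are invented for illustration.

```python
TARGET_TTFW_MS = 500

# (low_ms, high_ms) per stage, copied from the budget list above.
BUDGET_MS = {
    "vad_endpoint": (100, 150),
    "stt_finalize": (50, 100),
    "llm_first_token": (150, 250),
    "tts_first_audio": (100, 150),
    "network_overhead": (50, 50),
}

best_case_ms = sum(low for low, _ in BUDGET_MS.values())
worst_case_ms = sum(high for _, high in BUDGET_MS.values())


def over_budget(measured_ms):
    """Return {stage: ms over the top of its range} for stages that exceed budget."""
    return {
        stage: ms - BUDGET_MS[stage][1]
        for stage, ms in measured_ms.items()
        if stage in BUDGET_MS and ms > BUDGET_MS[stage][1]
    }


print(f"best case {best_case_ms} ms, worst case {worst_case_ms} ms, target {TARGET_TTFW_MS} ms")

# Hypothetical measurements for one call.
measured = {"vad_endpoint": 140, "stt_finalize": 90, "llm_first_token": 310,
            "tts_first_audio": 120, "network_overhead": 50}
print(over_budget(measured))
```

With every stage at the bottom of its range there is 50 ms of slack against the 500 ms target; with every stage at the top, the total overshoots by 200 ms.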
Common pitfalls
Averaging only. Missing tail problems.
Synthetic-only testing. Lab conditions; real calls differ.
No per-stage breakdown. Can't diagnose regressions.
Ignoring cold starts. First call experience matters.
No alerting. Regressions go unnoticed.
Vendor benchmark trust. Their numbers; your reality.
Related reading
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
- Echo Cancellation in Real-Time Voice AI
- How Sample Rate Affects Voice Agent Quality
- The Engineering Behind Sub-Second Voice Agents
FAQ
How often should we benchmark? Continuously. Daily reports; real-time alerts.
What's a good TTFW target? Under 500ms median; under 1000ms p95.
Does latency affect conversion? Measurably. Every 100ms above 500ms hurts.
Can we rely on vendor dashboards? For their portion only. You own end-to-end.
What about multi-turn latency? Measure each turn. Later turns should take the same time as the first; if they don't, investigate context growth.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.