Latency Engineering for Real-Time Voice Agents
Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional. Getting there requires deliberate engineering across the entire pipeline, from audio capture to TTS playback. This piece covers the practical latency engineering for production voice agents.
TL;DR
- Target: sub-500ms median round-trip; sub-800ms p95.
- Budget breakdown: STT (50-150ms), LLM first token (100-300ms), TTS first audio (100-200ms), plus endpointing and network overhead.
- Stream everything; don't wait for one stage to finish before starting the next.
- Small, fast LLMs + streaming outputs win the latency race.
- Measure p50, p95, p99 — averages hide the problems.
The latency budget
Target breakdown for a sub-500ms round-trip:
- Audio capture + VAD endpointing: 100-200ms.
- STT (final transcript): 50-150ms after endpoint.
- LLM first-token: 100-300ms.
- TTS first-audio: 100-200ms.
- Network round-trips: 30-80ms.
Run back to back, these stages can add up to well over 500ms. Overlapping them through streaming shrinks the total, and with heavy streaming sub-500ms is achievable.
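To see how much work overlap has to do, add up the budget as if every stage ran strictly in sequence. A minimal sketch using the ranges listed above; the dictionary keys and function name are illustrative, not part of any real stack:

```python
# Stage budgets in milliseconds, taken from the breakdown above: (low, high).
BUDGET_MS = {
    "capture_and_endpointing": (100, 200),
    "stt_final": (50, 150),
    "llm_first_token": (100, 300),
    "tts_first_audio": (100, 200),
    "network": (30, 80),
}

def sequential_total(budget):
    """Sum the low and high ends as if no stage overlapped with another."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

low, high = sequential_total(BUDGET_MS)
print(f"sequential round-trip: {low}-{high} ms")  # sequential round-trip: 380-930 ms
```

Only the best case fits under 500ms when nothing overlaps, so streaming has to absorb the rest.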
Where latency actually lives
Audio capture. Minimal — sub-20ms.
VAD endpointing. Determines when the caller has finished speaking. Typically 100-300ms. Aggressive tuning can go lower but risks cutting the caller off mid-sentence.
STT. Modern streaming STT emits partials during speech; final transcript arrives ~50-150ms after endpoint.
LLM inference. Time to first token. Depends on model size and provider.
TTS first audio. Time from text input to first audio chunk. 100-200ms for streaming TTS.
TTS synthesis. Continues during playback. Doesn't block first audio.
Network. RTT per hop. US coast-to-coast ~70ms; intra-region under 20ms.
The streaming principle
Don't wait for each stage to finish:
- STT streams partials → LLM receives and processes as they arrive.
- LLM streams tokens → TTS starts synthesizing first sentences while LLM continues generating.
- TTS streams audio chunks → caller hears beginning while rest is synthesized.
Overlap is the secret.
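One place the overlap shows up in code is the LLM-to-TTS handoff: tokens are buffered only until a sentence is complete, then handed to TTS while the LLM keeps generating. A rough sketch, assuming tokens arrive as plain strings; `fake_llm_stream` is a stand-in for a real streaming client, and the punctuation rule is deliberately naive:

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment without punctuation

def fake_llm_stream() -> Iterator[str]:
    # Stand-in for a streaming LLM response; a real client yields tokens as they arrive.
    yield from ["Sure", ",", " I", " can", " help", ".",
                " What", " is", " your", " order", " number", "?"]

for sentence in sentences_from_tokens(fake_llm_stream()):
    print("send to TTS:", sentence)
```

The first sentence reaches TTS after six tokens instead of twelve. A production splitter would also need to handle decimals, abbreviations, and very long sentences, which this sketch ignores.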
See "streaming LLM outputs to voice: the engineering", "streaming TTS: how to cut first-audio latency", and "streaming STT: how to cut recognition latency".
VAD / endpointing tradeoffs
Aggressive endpointing:
- Pros: faster response.
- Cons: cuts off callers mid-sentence.
Conservative endpointing:
- Pros: rarely cuts caller off.
- Cons: feels slow.
Tune per use case. For casual conversation, aggressive. For information-heavy (dictation, data collection), conservative.
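The tradeoff is easy to see with a toy endpointer that fires after a fixed amount of trailing silence. This is a simplified sketch, assuming a VAD that emits one speech/non-speech flag per 20ms frame; the frame size and thresholds are illustrative, and real endpointers often add semantic or prosodic cues:

```python
from typing import Optional, Sequence

FRAME_MS = 20  # assumed VAD frame size

def endpoint_ms(speech_flags: Sequence[bool], silence_ms: int) -> Optional[int]:
    """Return the time (ms) at which the endpoint fires, or None if it never does.

    The endpoint fires once `silence_ms` of continuous non-speech follows some speech.
    """
    needed = silence_ms // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * FRAME_MS
    return None

# 400 ms of speech, a 200 ms mid-sentence pause, 400 ms more speech, then silence.
frames = [True] * 20 + [False] * 10 + [True] * 20 + [False] * 30

print(endpoint_ms(frames, silence_ms=160))  # aggressive: 560
print(endpoint_ms(frames, silence_ms=400))  # conservative: 1400
```

With a 160ms threshold the endpoint fires at 560ms, inside the caller's pause, so the agent interrupts. With a 400ms threshold it waits for the real end of speech at 1000ms and fires at 1400ms, which costs 400ms of response time.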
See voice activity detection in production voice agents.
LLM latency optimization
Biggest variable. Strategies:
Use smaller models. An 8B-parameter model handles turn-level decisions. Reserve GPT-4o-class models for the hard reasoning moments.
Prompt optimization. Shorter prompts = faster processing.
Streaming outputs. First token in 150-300ms beats full generation in 800ms.
Locally hosted vs. API. Self-hosting removes the API round-trip, but it is only practical at scale.
Region-matched. LLM region near voice AI region.
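To compare models, prompts, or regions on the number that matters, measure time to first token rather than total generation time. A minimal sketch that wraps any token iterator; `slow_fake_stream` and its timings are hypothetical stand-ins for a provider's streaming response:

```python
import time
from typing import Callable, Iterable, Tuple

def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream; return (time_to_first_token_s, total_s, text)."""
    start = clock()
    first = None
    parts = []
    for token in stream:
        if first is None:
            first = clock() - start  # stamp the arrival of the first token only
        parts.append(token)
    total = clock() - start
    return (first if first is not None else total), total, "".join(parts)

def slow_fake_stream():
    # Stand-in for a provider's streaming response (hypothetical timings).
    for token in ["Hello", ",", " how", " can", " I", " help", "?"]:
        time.sleep(0.01)
        yield token

ttft, total, text = measure_stream(slow_fake_stream())
print(f"first token after {ttft * 1000:.0f} ms, full response after {total * 1000:.0f} ms")
```

The `clock` parameter exists so the helper can be tested with a fake clock. In production the default `time.perf_counter` is the sensible choice.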
See why smaller LLMs often win for voice agents.
TTS latency optimization
Streaming TTS mandatory. Non-streaming is DOA for voice agents.
First-audio latency. Time from text input to first audio chunk. Sub-200ms target.
Model choice. Cartesia and Deepgram Aura lead on latency. Simba premium trades slightly higher latency for quality.
Caching. Pre-synthesize common phrases (greetings, goodbyes).
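Phrase caching can be as simple as a dictionary keyed on normalized text. A sketch under stated assumptions: `synthesize` is a hypothetical callable that turns text into audio bytes, and `fake_tts` stands in for a real TTS client:

```python
from typing import Callable, Dict, Iterable

class PhraseCache:
    """Cache of pre-synthesized audio for phrases the agent says often."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        self._synthesize = synthesize  # hypothetical TTS call: text -> audio bytes
        self._audio: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivially different strings share an entry.
        return " ".join(text.lower().split())

    def warm(self, phrases: Iterable[str]) -> None:
        """Synthesize common phrases ahead of time, e.g. at process start."""
        for phrase in phrases:
            self._audio[self._key(phrase)] = self._synthesize(phrase)

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._audio:
            self._audio[key] = self._synthesize(text)  # miss: synthesize on demand
        return self._audio[key]

calls = []
def fake_tts(text: str) -> bytes:
    calls.append(text)
    return text.encode("utf-8")

cache = PhraseCache(fake_tts)
cache.warm(["Thanks for calling!", "Goodbye."])
cache.get("thanks for  calling!")      # served from cache, no synthesis call
cache.get("Your order has shipped.")   # cache miss, synthesized on demand
print(len(calls))  # 3
```

Cached phrases skip TTS entirely, so their first-audio latency is whatever it takes to start playback.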
See streaming TTS: how to cut first-audio latency.
Network architecture
Voice agents span multiple services:
- STT provider (often separate).
- LLM provider (often separate).
- TTS provider (often separate).
- Orchestration layer.
- Telephony.
Each hop adds latency. Minimize by:
- Co-locating services in same cloud region.
- Using the same provider for multiple stages when possible.
- Direct provider-to-provider connections where available.
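A back-of-the-envelope way to compare topologies is to charge one round trip per hop. This is a rough sketch, using the RTT figures quoted earlier in this piece as upper bounds; it ignores connection setup and the fact that streaming connections can hide part of each round trip:

```python
# RTT figures from this article: ~70 ms US coast-to-coast, under 20 ms intra-region.
RTT_MS = {"same_region": 20, "cross_country": 70}

def network_overhead_ms(hops):
    """Sum one round trip per hop between the orchestrator and a provider."""
    return sum(RTT_MS[hop] for hop in hops)

# Hops per turn: orchestrator -> STT, orchestrator -> LLM, orchestrator -> TTS.
co_located = ["same_region", "same_region", "same_region"]
llm_far_away = ["same_region", "cross_country", "same_region"]

print(network_overhead_ms(co_located))    # 60
print(network_overhead_ms(llm_far_away))  # 110
```

One misplaced provider adds 50ms per turn in this model, which is a large slice of a 500ms budget.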
Measuring
Measure what matters:
- Time to first word (TTFW) — time from caller endpoint to first audio.
- P50 / p95 / p99 — not average.
- Per-stage breakdown.
- Over time — trending.
Don't rely on vendor benchmarks. Measure in your environment.
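A small example shows why percentiles matter more than the mean. The samples below are made up for illustration, and the nearest-rank percentile is one of several common definitions:

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical time-to-first-word samples in ms: 90 fast turns, 10 slow ones.
ttfw_ms = [400] * 90 + [1500] * 10

print("mean:", mean(ttfw_ms))            # mean: 510
print("p50:", percentile(ttfw_ms, 50))   # p50: 400
print("p95:", percentile(ttfw_ms, 95))   # p95: 1500
print("p99:", percentile(ttfw_ms, 99))   # p99: 1500
```

The mean of 510ms looks close to target, yet one turn in ten takes 1.5 seconds. Only the p95 and p99 expose that.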
Tools for measurement
- Custom instrumentation in your stack.
- Vendor metrics (Twilio Voice Insights, etc.).
- End-to-end testing harness — simulated calls with known content.
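Custom instrumentation does not need to be elaborate. One possible shape is a context manager that records how long each stage of a turn takes; the class name and stage labels here are illustrative, and the `time.sleep` calls stand in for real work:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TurnTimer:
    """Collects per-stage durations (ms) for one conversational turn."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.stages_ms = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.stages_ms[name] += (self._clock() - start) * 1000

timer = TurnTimer()
with timer.stage("stt"):
    time.sleep(0.01)   # stand-in for waiting on the final transcript
with timer.stage("llm_first_token"):
    time.sleep(0.02)   # stand-in for waiting on the first token
print({name: round(ms) for name, ms in timer.stages_ms.items()})
```

Emitting `stages_ms` once per turn to a metrics system gives the per-stage breakdown needed to see which stage moved when the end-to-end number regresses.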
Common latency killers
Sequential processing. Waiting for STT to finish before starting LLM. Always stream.
Non-streaming LLM. Waiting for full response before TTS. Always stream tokens.
Non-streaming TTS. Waiting for full audio before playback. Always stream.
Cross-region calls. STT in US-East, LLM in US-West. Adds ~70ms per round trip.
Cold starts. First call after idle hits slow path. Warm up.
Chatty prompts. Long system prompts take longer to process.
Unnecessary function calls. LLM calls a function mid-response → adds hundreds of ms.
First-call vs steady-state
First call after idle: typically 200-500ms slower.
- Model loading.
- Connection establishment.
- Cache misses.
Keep-alives and warm pools mitigate.
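A keep-alive can be sketched as a small idle tracker that sends a cheap request when a connection or model has sat unused for too long. Everything here is an assumption for illustration: the `ping` callable, the 30-second threshold, and the fake clock that lets the example run instantly:

```python
import time
from typing import Callable

class KeepWarm:
    """Decides when an idle connection or model should be pinged to stay warm."""

    def __init__(self, ping: Callable[[], None], max_idle_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ping = ping              # hypothetical cheap request, e.g. a tiny completion
        self._max_idle_s = max_idle_s
        self._clock = clock
        self._last_used = clock()

    def mark_used(self) -> None:
        """Call after every real request."""
        self._last_used = self._clock()

    def tick(self) -> bool:
        """Call periodically; pings if idle for too long. Returns True if a ping was sent."""
        if self._clock() - self._last_used >= self._max_idle_s:
            self._ping()
            self._last_used = self._clock()
            return True
        return False

sent = []
now = [0.0]  # fake clock so the example runs instantly
warm = KeepWarm(ping=lambda: sent.append(now[0]), max_idle_s=30, clock=lambda: now[0])

now[0] = 10.0
print(warm.tick())  # False: idle for 10 s, no ping needed
now[0] = 45.0
print(warm.tick())  # True: idle for 45 s, ping sent
```

The right idle threshold depends on how quickly each provider tears down connections or unloads models, which has to be measured per provider.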
Long-call latency
Sometimes latency degrades as call goes on:
- Context window filling up → LLM slower.
- Memory growing → GC pauses.
Monitor. Mitigate with conversation summarization (condense old turns).
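Summarization can be sketched as replacing everything but the last few turns with one summary turn. The `summarize` callable is a placeholder: in practice it would likely be an LLM call made off the hot path, and the summary prefix is illustrative:

```python
from typing import Callable, List

def compact_history(
    turns: List[str],
    keep_last: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the most recent `keep_last` turns with a single summary turn."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    old, recent = turns[:split], turns[split:]
    return ["Summary of earlier conversation: " + summarize(old)] + recent

def fake_summarize(old_turns: List[str]) -> str:
    # Stand-in for a real summarizer.
    return f"{len(old_turns)} earlier turns"

turns = [f"turn {i}" for i in range(1, 9)]
print(compact_history(turns, keep_last=3, summarize=fake_summarize))
# ['Summary of earlier conversation: 5 earlier turns', 'turn 6', 'turn 7', 'turn 8']
```

Eight turns shrink to four, so the prompt the LLM has to process stays roughly constant as the call goes on.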
Quality-latency tradeoff
- Smaller LLM = faster but less capable.
- Aggressive endpointing = faster but can cut off.
- Budget TTS = faster but less natural.
Pick the balance per use case.
The sub-300ms frontier
Leading deployments in 2026 hit sub-300ms:
- 8B-class LLMs with quick first-token.
- Cartesia / Deepgram TTS.
- Aggressive streaming.
- Co-located infrastructure.
Achievable with engineering work.
Production monitoring
- Daily p50/p95/p99 tracking.
- Alert on regressions.
- Per-deployment comparison.
- Drill into outliers.
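A regression alert can start as a comparison of today's p95 against a stored baseline for each deployment. A minimal sketch; the deployment names, numbers, and 10% tolerance are made up for illustration:

```python
def find_regressions(baseline_p95, current_p95, tolerance=0.10):
    """Return deployments whose current p95 exceeds baseline by more than `tolerance`."""
    return sorted(
        name
        for name, ms in current_p95.items()
        if name in baseline_p95 and ms > baseline_p95[name] * (1 + tolerance)
    )

# Hypothetical p95 time-to-first-word per deployment, in ms.
baseline = {"us-east": 620, "eu-west": 700}
today = {"us-east": 640, "eu-west": 910}

print(find_regressions(baseline, today))  # ['eu-west']
```

The flagged deployment is where the per-stage breakdown and outlier drill-down start.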
Common pitfalls
Tracking averages. Average looks fine; p95 is brutal. Callers notice p95.
Vendor-reported latency. Measured in a lab. Your production numbers may be 2x higher.
Ignoring endpointing. STT could be instant; VAD adds 300ms.
Non-streaming somewhere. A single non-streaming stage stalls the whole pipeline.
Not testing cold starts. Works in dev; breaks first prod call.
FAQ
What's "acceptable" latency? Under 800ms — usable. Under 500ms — good. Under 300ms — exceptional.
Does streaming help if my LLM is slow? Yes — first token in 200ms beats full response in 1s.
How do we benchmark end-to-end? Simulated calls with known content; measure first-audio time.
Can we reduce LLM latency? Smaller model, prompt compression, streaming, region-matched.
When does latency hurt conversion? Above 1 second, measurable drop. Above 1.5s, significant.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
