The Engineering Behind Sub-Second Voice Agents
Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms.
Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms. The engineering to get there involves streaming at every layer, co-location of services, aggressive model selection, and careful pipeline orchestration. This piece covers what it takes, where the gains are, and the tradeoffs you make.
TL;DR
- Sub-500ms median is achievable with streaming + co-location + fast models.
- Sub-300ms is the new frontier, reserved for top-tier deployments.
- The budget: STT 50-150ms, LLM 100-400ms, TTS 100-250ms, overheads.
- Stream everything; co-locate where possible; pick fast models for routine turns.
- Measure p50, p95, p99 separately.
The target
- 1000ms: "works but feels slow."
- 800ms: "acceptable."
- 500ms: "feels conversational."
- 300ms: "indistinguishable from human."
Most 2026 deployments target 500-700ms median.
The layers to optimize
1. Audio capture and VAD. Client-side; minimal optimization once working.
2. Endpointing. How fast agent decides caller is done. Biggest single lever.
3. STT. Streaming; final transcript ~50-150ms after endpoint.
4. LLM. Biggest variable. Model size, prompt length, infrastructure matter.
5. TTS. Streaming; first audio ~100-250ms.
6. Network. Co-location matters; cross-region adds tens of ms.
Endpointing — the often-overlooked win
Default endpointing is often conservative:
- 800ms silence = end of utterance.
Tune:
- 500ms for routine conversation.
- 400ms for decisive, fast-paced use cases.
- 700-900ms when callers need to think.
Every 100ms shaved off endpointing = 100ms faster response.
See voice activity detection in production voice agents.
STT optimizations
- Streaming. Partials available; finalize fast.
- Fast models. Deepgram Nova-3, Cartesia.
- Domain vocabulary biasing. Not just for WER — also shortens recognition time by pruning search space.
- Reasonable endpointing. STT's endpoint detection can be slower than client-side VAD.
LLM optimizations
Biggest latency contributor. Strategies:
Small fast model for routine. 8B-class model handles 80%+ of turns. Fast.
Frontier model for complex. Escalate to GPT-4o / Claude for hard reasoning moments only.
Prompt compression. Shorter prompts = faster processing. Trim boilerplate.
Streaming output. First token 150-300ms; TTS starts early.
Co-located inference. Self-hosted or provider in same region. Saves 50-80ms.
See why smaller LLMs often win for voice agents.
TTS optimizations
- Streaming TTS. Mandatory.
- Fast models. Cartesia, Deepgram Aura.
- First-audio latency under 200ms. Primary metric.
- Cache common phrases. Greetings, goodbyes.
- Sentence-boundary sends. Stream LLM tokens into TTS at sentence ends.
See streaming TTS: how to cut first-audio latency.
Network architecture
- Co-locate services. All in same cloud region.
- Direct provider connections. Where available.
- Edge presence. For international calls.
- Persistent connections. Keep WebSockets open; avoid reconnect.
Co-location math
Cross-country US: ~70ms RTT. Intra-region (e.g., us-east-1): under 20ms. Same AZ: under 5ms.
Every service hop adds. Co-locating STT + LLM + TTS saves 100-200ms of accumulated round-trips.
The 300ms frontier
Achievable by:
- 8B LLM co-hosted.
- Cartesia TTS streaming.
- Aggressive endpointing (500ms).
- Deepgram Nova streaming.
- Same-region infrastructure.
- Pre-generated common phrases.
Requires specific engineering investment but credible.
Monitoring
- TTFW (time to first word). End-to-end from caller finish to agent audio.
- Per-stage latency. STT, LLM, TTS separately.
- P50, p95, p99. Distribution matters.
- Trend. Is it regressing over time?
Common sources of regression
Prompt growth. Add one more paragraph → 50ms slower.
Model change. Vendor swaps model; latency changes.
Infrastructure migration. Region change; cross-region calls.
Cold-start patterns. Traffic patterns shift; more cold starts.
Monitor continuously.
The barge-in challenge
Interruption (caller barges in during agent's speech) requires:
- Fast VAD detection of caller speech.
- Fast stop of TTS playback.
- Fast pivot to processing new input.
Sub-200ms from caller start to TTS stop is the target.
See turn-taking and barge-in: the mechanics of natural conversation.
Cost tradeoffs
Lower latency often costs more:
- Fast models: sometimes more expensive.
- Co-located infra: operational overhead.
- Premium TTS: higher per-minute.
- Multiple fallback paths: redundancy cost.
For consumer-facing deployments, latency pays for itself in conversion.
When to stop optimizing
Sub-500ms reliably achieved:
- Marginal gains get expensive.
- User impact diminishes (500ms is already good).
- Focus might be better on conversation quality.
First-call performance
First call after idle often slower:
- Model loading.
- Connection establishment.
- Cache misses.
Keep-alive / warmup mitigates.
Long-call degradation
After 10+ minutes, some deployments slow:
- Context window filling.
- Memory pressure.
- GC pauses.
Summarize conversation periodically.
Common pitfalls
Not measuring p95/p99. Average is fine; tail is brutal.
Vendor-reported benchmarks. Measure yourself.
One stage non-streaming. Any non-streaming link kills the pipeline.
Large prompts. Every token matters.
Cold paths in production. Edge cases hit slow paths.
FAQ
Can we hit sub-300ms reliably? Yes with effort. Requires engineering investment.
What's the latency floor? Physics + compute. Around 150-200ms minimum for current architectures.
Does streaming matter if my pipeline is fast? Yes. Compounds on top.
What about users on slow networks? Degrades gracefully; transport overhead increases.
How often should we re-benchmark? Monthly baseline; alert on regressions real-time.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
More from Tyler Weitzman
View all →Open-Source vs Proprietary Voice Agent Stacks
The open-source voice AI stack in 2026 is genuinely good. Whisper and its derivatives handle STT. Open-weight LLMs like Llama 3/4, Qwen, Mistral handle the reasoning. Open-source TTS (XTTS, StyleTTS, Orpheus-class) handles output.
Build vs Buy: When to Build Your Own Voice Agent
Build-vs-buy for voice agents in 2026 is a different conversation than it was two years ago. Then, the open-source stack was rough and most serious deployments ended up building.
Voice Agents for Developer Support
Developer support is a strange category. Developers don't generally want to call anyone. They want Stack Overflow, they want clear docs, they want an LLM that can read their code.
Related reading
Streaming Audio Over WebRTC for Voice Agents
WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport — lower latency than webhooks, built-in encryption, native NAT traversal, cross-platform.
Echo Cancellation in Real-Time Voice AI
Echo in voice agent calls sounds like this: agent starts speaking, caller's speaker plays agent's voice, caller's microphone picks up agent's voice, the audio flows back to the agent, agent's STT transcribes its own speech, agent gets confused, conversation breaks down.
Latency Engineering for Real-Time Voice Agents
Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional.
Voice AI, twice a month.
Get the best of the SIMBA resources hub — new articles, trend notes, and operator guides. No spam.
