Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms. The engineering to get there involves streaming at every layer, co-location of services, aggressive model selection, and careful pipeline orchestration. This piece covers what it takes, where the gains are, and the tradeoffs you make.

TL;DR

Sub-500ms median is achievable with streaming + co-location + fast models.
Sub-300ms is the new frontier, reserved for top-tier deployments.
The budget: STT 50-150ms, LLM 100-400ms, TTS 100-250ms, overheads.
Stream everything; co-locate where possible; pick fast models for routine turns.
Measure p50, p95, p99 separately.

The target

1000ms: "works but feels slow."
800ms: "acceptable."
500ms: "feels conversational."
300ms: "indistinguishable from human."

Most 2026 deployments target 500-700ms median.

The layers to optimize

1. Audio capture and VAD. Client-side; minimal optimization once working.

2. Endpointing. How fast agent decides caller is done. Biggest single lever.

3. STT. Streaming; final transcript ~50-150ms after endpoint.

4. LLM. Biggest variable. Model size, prompt length, infrastructure matter.

5. TTS. Streaming; first audio ~100-250ms.

6. Network. Co-location matters; cross-region adds tens of ms.

Endpointing — the often-overlooked win

Default endpointing is often conservative:

800ms silence = end of utterance.

Tune:

500ms for routine conversation.
400ms for decisive, fast-paced use cases.
700-900ms when callers need to think.

Every 100ms shaved off endpointing = 100ms faster response.

See voice activity detection in production voice agents.

STT optimizations

Streaming. Partials available; finalize fast.
Fast models. Deepgram Nova-3, Cartesia.
Domain vocabulary biasing. Not just for WER — also shortens recognition time by pruning search space.
Reasonable endpointing. STT's endpoint detection can be slower than client-side VAD.

LLM optimizations

Biggest latency contributor. Strategies:

Small fast model for routine. 8B-class model handles 80%+ of turns. Fast.

Frontier model for complex. Escalate to GPT-4o / Claude for hard reasoning moments only.

Prompt compression. Shorter prompts = faster processing. Trim boilerplate.

Streaming output. First token 150-300ms; TTS starts early.

Co-located inference. Self-hosted or provider in same region. Saves 50-80ms.

See why smaller LLMs often win for voice agents.

TTS optimizations

Streaming TTS. Mandatory.
Fast models. Cartesia, Deepgram Aura.
First-audio latency under 200ms. Primary metric.
Cache common phrases. Greetings, goodbyes.
Sentence-boundary sends. Stream LLM tokens into TTS at sentence ends.

See streaming TTS: how to cut first-audio latency.

Network architecture

Co-locate services. All in same cloud region.
Direct provider connections. Where available.
Edge presence. For international calls.
Persistent connections. Keep WebSockets open; avoid reconnect.

Co-location math

Cross-country US: ~70ms RTT. Intra-region (e.g., us-east-1): under 20ms. Same AZ: under 5ms.

Every service hop adds. Co-locating STT + LLM + TTS saves 100-200ms of accumulated round-trips.

The 300ms frontier

Achievable by:

8B LLM co-hosted.
Cartesia TTS streaming.
Aggressive endpointing (500ms).
Deepgram Nova streaming.
Same-region infrastructure.
Pre-generated common phrases.

Requires specific engineering investment but credible.

Monitoring

TTFW (time to first word). End-to-end from caller finish to agent audio.
Per-stage latency. STT, LLM, TTS separately.
P50, p95, p99. Distribution matters.
Trend. Is it regressing over time?

Common sources of regression

Prompt growth. Add one more paragraph → 50ms slower.

Model change. Vendor swaps model; latency changes.

Infrastructure migration. Region change; cross-region calls.

Cold-start patterns. Traffic patterns shift; more cold starts.

Monitor continuously.

The barge-in challenge

Interruption (caller barges in during agent's speech) requires:

Fast VAD detection of caller speech.
Fast stop of TTS playback.
Fast pivot to processing new input.

Sub-200ms from caller start to TTS stop is the target.

See turn-taking and barge-in: the mechanics of natural conversation.

Cost tradeoffs

Lower latency often costs more:

Fast models: sometimes more expensive.
Co-located infra: operational overhead.
Premium TTS: higher per-minute.
Multiple fallback paths: redundancy cost.

For consumer-facing deployments, latency pays for itself in conversion.

When to stop optimizing

Sub-500ms reliably achieved:

Marginal gains get expensive.
User impact diminishes (500ms is already good).
Focus might be better on conversation quality.

First-call performance

First call after idle often slower:

Model loading.
Connection establishment.
Cache misses.

Keep-alive / warmup mitigates.

Long-call degradation

After 10+ minutes, some deployments slow:

Context window filling.
Memory pressure.
GC pauses.

Summarize conversation periodically.

Common pitfalls

Not measuring p95/p99. Average is fine; tail is brutal.

Vendor-reported benchmarks. Measure yourself.

One stage non-streaming. Any non-streaming link kills the pipeline.

Large prompts. Every token matters.

Cold paths in production. Edge cases hit slow paths.

FAQ

Can we hit sub-300ms reliably? Yes with effort. Requires engineering investment.

What's the latency floor? Physics + compute. Around 150-200ms minimum for current architectures.

Does streaming matter if my pipeline is fast? Yes. Compounds on top.

What about users on slow networks? Degrades gracefully; transport overhead increases.

How often should we re-benchmark? Monthly baseline; alert on regressions real-time.

The Engineering Behind Sub-Second Voice Agents

TL;DR

The target

The layers to optimize

Endpointing — the often-overlooked win

STT optimizations

LLM optimizations

TTS optimizations

Network architecture

Co-location math

The 300ms frontier

Monitoring

Common sources of regression

The barge-in challenge

Cost tradeoffs

When to stop optimizing

First-call performance

Long-call degradation

Common pitfalls

FAQ

More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support

Related reading

Streaming Audio Over WebRTC for Voice Agents

Echo Cancellation in Real-Time Voice AI

Latency Engineering for Real-Time Voice Agents

Voice AI, twice a month.