Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional. Getting there requires deliberate engineering across the entire pipeline, from audio capture to TTS playback. This piece covers the practical latency engineering for production voice agents.

TL;DR

Target: sub-500ms median round-trip; sub-800ms p95.
Budget breakdown: STT (50-150ms), LLM (100-400ms), TTS (100-250ms), network overhead.
Stream everything; don't wait for one stage to finish before starting the next.
Small, fast LLMs + streaming outputs win the latency race.
Measure p50, p95, p99 — averages hide the problems.

The latency budget

Target breakdown for a sub-500ms round-trip:

Audio capture + VAD endpointing: 100-200ms.
STT (final transcript): 50-150ms after endpoint.
LLM first-token: 100-300ms.
TTS first-audio: 100-200ms.
Network round-trips: 30-80ms.

Overlapping stages reduce total. With heavy streaming: sub-500ms is achievable.

Where latency actually lives

Audio capture. Minimal — sub-20ms.

VAD endpointing. Determines when caller finished speaking. Typically 100-300ms. Aggressive tuning can go lower but risks cutting off caller mid-sentence.

STT. Modern streaming STT emits partials during speech; final transcript arrives ~50-150ms after endpoint.

LLM inference. Time to first token. Depends on model size and provider.

TTS first audio. Time from text input to first audio chunk. 100-200ms for streaming TTS.

TTS synthesis. Continues during playback. Doesn't block first audio.

Network. RTT per hop. US coast-to-coast ~70ms; intra-region under 20ms.

The streaming principle

Don't wait for each stage to finish:

STT streams partials → LLM receives and processes as they arrive.
LLM streams tokens → TTS starts synthesizing first sentences while LLM continues generating.
TTS streams audio chunks → caller hears beginning while rest is synthesized.

Overlap is the secret.

See streaming LLM outputs to voice: the engineering, streaming TTS: how to cut first-audio latency, streaming STT: how to cut recognition latency.

VAD / endpointing tradeoffs

Aggressive endpointing:

Pros: faster response.
Cons: cuts off callers mid-sentence.

Conservative endpointing:

Pros: rarely cuts caller off.
Cons: feels slow.

Tune per use case. For casual conversation, aggressive. For information-heavy (dictation, data collection), conservative.

See voice activity detection in production voice agents.

LLM latency optimization

Biggest variable. Strategies:

Use smaller models. 8B parameters for turn-level decisions. GPT-4o class only for hard reasoning moments.

Prompt optimization. Shorter prompts = faster processing.

Streaming outputs. First token in 150-300ms beats full generation in 800ms.

Locally-hosted vs API. Local reduces API round-trip. Only practical at scale.

Region-matched. LLM region near voice AI region.

See why smaller LLMs often win for voice agents.

TTS latency optimization

Streaming TTS mandatory. Non-streaming is DOA for voice agents.

First-audio latency. Time from text input to first audio chunk. Sub-200ms target.

Model choice. Cartesia and Deepgram Aura lead on latency. Simba premium for quality; slightly higher latency.

Caching. Pre-synthesize common phrases (greetings, goodbyes).

See streaming TTS: how to cut first-audio latency.

Network architecture

Voice agents span multiple services:

STT provider (often separate).
LLM provider (often separate).
TTS provider (often separate).
Orchestration layer.
Telephony.

Each hop adds latency. Minimize by:

Co-locating services in same cloud region.
Using the same provider for multiple stages when possible.
Direct provider-to-provider connections where available.

Measuring

Measure what matters:

Time to first word (TTFW) — time from caller endpoint to first audio.
P50 / p95 / p99 — not average.
Per-stage breakdown.
Over time — trending.

Don't rely on vendor benchmarks. Measure in your environment.

Tools for measurement

Custom instrumentation in your stack.
Vendor metrics (Twilio Voice Insights, etc.).
End-to-end testing harness — simulated calls with known content.

Common latency killers

Sequential processing. Waiting for STT to finish before starting LLM. Always stream.

Non-streaming LLM. Waiting for full response before TTS. Always stream tokens.

Non-streaming TTS. Waiting for full audio before playback. Always stream.

Cross-region calls. STT in US-East, LLM in US-West. Add ~70ms.

Cold starts. First call after idle hits slow path. Warm up.

Chatty prompts. Long system prompts take longer to process.

Unnecessary function calls. LLM calls a function mid-response → adds hundreds of ms.

First-call vs steady-state

First call after idle: typically 200-500ms slower.

Model loading.
Connection establishment.
Cache misses.

Keep-alives and warm pools mitigate.

Long-call latency

Sometimes latency degrades as call goes on:

Context window filling up → LLM slower.
Memory growing → GC pauses.

Monitor. Mitigate with conversation summarization (condense old turns).

Quality-latency tradeoff

Smaller LLM = faster but less capable.
Aggressive endpointing = faster but can cut off.
Budget TTS = faster but less natural.

Pick balances per use case.

The sub-300ms frontier

Leading deployments in 2026 hit sub-300ms:

8B-class LLMs with quick first-token.
Cartesia / Deepgram TTS.
Aggressive streaming.
Co-located infrastructure.

Achievable with engineering work.

Production monitoring

Daily p50/p95/p99 tracking.
Alert on regressions.
Per-deployment comparison.
Drill into outliers.

Common pitfalls

Tracking averages. Average looks fine; p95 is brutal. Callers notice p95.

Vendor-reported latency. Measured in lab. Yours may be 2x.

Ignoring endpointing. STT could be instant; VAD adds 300ms.

Non-streaming somewhere. Any stage non-streaming kills the whole pipeline.

Not testing cold starts. Works in dev; breaks first prod call.

FAQ

What's "acceptable" latency? Under 800ms — usable. Under 500ms — good. Under 300ms — exceptional.

Does streaming help if my LLM is slow? Yes — first token in 200ms beats full response in 1s.

How do we benchmark end-to-end? Simulated calls with known content; measure first-audio time.

Can we reduce LLM latency? Smaller model, prompt compression, streaming, region-matched.

When does latency hurt conversion? Above 1 second, measurable drop. Above 1.5s, significant.

Latency Engineering for Real-Time Voice Agents

TL;DR

The latency budget

Where latency actually lives

The streaming principle

VAD / endpointing tradeoffs

LLM latency optimization

TTS latency optimization

Network architecture

Measuring

Tools for measurement

Common latency killers

First-call vs steady-state

Long-call latency

Quality-latency tradeoff

The sub-300ms frontier

Production monitoring

Common pitfalls

FAQ

More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support

Related reading

Streaming Audio Over WebRTC for Voice Agents

Echo Cancellation in Real-Time Voice AI

The Engineering Behind Sub-Second Voice Agents

Voice AI, twice a month.