🔊 Speech Technology

Latency Engineering for Real-Time Voice Agents

Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional.

Tyler Weitzman
Tyler Weitzman
March 11, 2026 · 5 min read
Speechify

Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional. Getting there requires deliberate engineering across the entire pipeline, from audio capture to TTS playback. This piece covers the practical latency engineering for production voice agents.

TL;DR

  • Target: sub-500ms median round-trip; sub-800ms p95.
  • Budget breakdown: STT (50-150ms), LLM (100-400ms), TTS (100-250ms), network overhead.
  • Stream everything; don't wait for one stage to finish before starting the next.
  • Small, fast LLMs + streaming outputs win the latency race.
  • Measure p50, p95, p99 — averages hide the problems.

The latency budget

Target breakdown for a sub-500ms round-trip:

  • Audio capture + VAD endpointing: 100-200ms.
  • STT (final transcript): 50-150ms after endpoint.
  • LLM first-token: 100-300ms.
  • TTS first-audio: 100-200ms.
  • Network round-trips: 30-80ms.

Overlapping stages reduce total. With heavy streaming: sub-500ms is achievable.

Where latency actually lives

Audio capture. Minimal — sub-20ms.

VAD endpointing. Determines when caller finished speaking. Typically 100-300ms. Aggressive tuning can go lower but risks cutting off caller mid-sentence.

STT. Modern streaming STT emits partials during speech; final transcript arrives ~50-150ms after endpoint.

LLM inference. Time to first token. Depends on model size and provider.

TTS first audio. Time from text input to first audio chunk. 100-200ms for streaming TTS.

TTS synthesis. Continues during playback. Doesn't block first audio.

Network. RTT per hop. US coast-to-coast ~70ms; intra-region under 20ms.

The streaming principle

Don't wait for each stage to finish:

  • STT streams partials → LLM receives and processes as they arrive.
  • LLM streams tokens → TTS starts synthesizing first sentences while LLM continues generating.
  • TTS streams audio chunks → caller hears beginning while rest is synthesized.

Overlap is the secret.

See streaming LLM outputs to voice: the engineering, streaming TTS: how to cut first-audio latency, streaming STT: how to cut recognition latency.

VAD / endpointing tradeoffs

Aggressive endpointing:

  • Pros: faster response.
  • Cons: cuts off callers mid-sentence.

Conservative endpointing:

  • Pros: rarely cuts caller off.
  • Cons: feels slow.

Tune per use case. For casual conversation, aggressive. For information-heavy (dictation, data collection), conservative.

See voice activity detection in production voice agents.

LLM latency optimization

Biggest variable. Strategies:

Use smaller models. 8B parameters for turn-level decisions. GPT-4o class only for hard reasoning moments.

Prompt optimization. Shorter prompts = faster processing.

Streaming outputs. First token in 150-300ms beats full generation in 800ms.

Locally-hosted vs API. Local reduces API round-trip. Only practical at scale.

Region-matched. LLM region near voice AI region.

See why smaller LLMs often win for voice agents.

TTS latency optimization

Streaming TTS mandatory. Non-streaming is DOA for voice agents.

First-audio latency. Time from text input to first audio chunk. Sub-200ms target.

Model choice. Cartesia and Deepgram Aura lead on latency. Simba premium for quality; slightly higher latency.

Caching. Pre-synthesize common phrases (greetings, goodbyes).

See streaming TTS: how to cut first-audio latency.

Network architecture

Voice agents span multiple services:

  • STT provider (often separate).
  • LLM provider (often separate).
  • TTS provider (often separate).
  • Orchestration layer.
  • Telephony.

Each hop adds latency. Minimize by:

  • Co-locating services in same cloud region.
  • Using the same provider for multiple stages when possible.
  • Direct provider-to-provider connections where available.

Measuring

Measure what matters:

  • Time to first word (TTFW) — time from caller endpoint to first audio.
  • P50 / p95 / p99 — not average.
  • Per-stage breakdown.
  • Over time — trending.

Don't rely on vendor benchmarks. Measure in your environment.

Tools for measurement

  • Custom instrumentation in your stack.
  • Vendor metrics (Twilio Voice Insights, etc.).
  • End-to-end testing harness — simulated calls with known content.

Common latency killers

Sequential processing. Waiting for STT to finish before starting LLM. Always stream.

Non-streaming LLM. Waiting for full response before TTS. Always stream tokens.

Non-streaming TTS. Waiting for full audio before playback. Always stream.

Cross-region calls. STT in US-East, LLM in US-West. Add ~70ms.

Cold starts. First call after idle hits slow path. Warm up.

Chatty prompts. Long system prompts take longer to process.

Unnecessary function calls. LLM calls a function mid-response → adds hundreds of ms.

First-call vs steady-state

First call after idle: typically 200-500ms slower.

  • Model loading.
  • Connection establishment.
  • Cache misses.

Keep-alives and warm pools mitigate.

Long-call latency

Sometimes latency degrades as call goes on:

  • Context window filling up → LLM slower.
  • Memory growing → GC pauses.

Monitor. Mitigate with conversation summarization (condense old turns).

Quality-latency tradeoff

  • Smaller LLM = faster but less capable.
  • Aggressive endpointing = faster but can cut off.
  • Budget TTS = faster but less natural.

Pick balances per use case.

The sub-300ms frontier

Leading deployments in 2026 hit sub-300ms:

  • 8B-class LLMs with quick first-token.
  • Cartesia / Deepgram TTS.
  • Aggressive streaming.
  • Co-located infrastructure.

Achievable with engineering work.

Production monitoring

  • Daily p50/p95/p99 tracking.
  • Alert on regressions.
  • Per-deployment comparison.
  • Drill into outliers.

Common pitfalls

Tracking averages. Average looks fine; p95 is brutal. Callers notice p95.

Vendor-reported latency. Measured in lab. Yours may be 2x.

Ignoring endpointing. STT could be instant; VAD adds 300ms.

Non-streaming somewhere. Any stage non-streaming kills the whole pipeline.

Not testing cold starts. Works in dev; breaks first prod call.

FAQ

What's "acceptable" latency? Under 800ms — usable. Under 500ms — good. Under 300ms — exceptional.

Does streaming help if my LLM is slow? Yes — first token in 200ms beats full response in 1s.

How do we benchmark end-to-end? Simulated calls with known content; measure first-audio time.

Can we reduce LLM latency? Smaller model, prompt compression, streaming, region-matched.

When does latency hurt conversion? Above 1 second, measurable drop. Above 1.5s, significant.

Tyler Weitzman
Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.

More from Tyler Weitzman

View all →

Related reading

Voice AI, twice a month.

Get the best of the SIMBA resources hub — new articles, trend notes, and operator guides. No spam.