🔊 Speech Technology

The Engineering Behind Sub-Second Voice Agents

Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms.

Tyler Weitzman
Tyler Weitzman
March 15, 2026 · 4 min read
Speechify

Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms. The engineering to get there involves streaming at every layer, co-location of services, aggressive model selection, and careful pipeline orchestration. This piece covers what it takes, where the gains are, and the tradeoffs you make.

TL;DR

  • Sub-500ms median is achievable with streaming + co-location + fast models.
  • Sub-300ms is the new frontier, reserved for top-tier deployments.
  • The budget: STT 50-150ms, LLM 100-400ms, TTS 100-250ms, overheads.
  • Stream everything; co-locate where possible; pick fast models for routine turns.
  • Measure p50, p95, p99 separately.

The target

  • 1000ms: "works but feels slow."
  • 800ms: "acceptable."
  • 500ms: "feels conversational."
  • 300ms: "indistinguishable from human."

Most 2026 deployments target 500-700ms median.

The layers to optimize

1. Audio capture and VAD. Client-side; minimal optimization once working.

2. Endpointing. How fast agent decides caller is done. Biggest single lever.

3. STT. Streaming; final transcript ~50-150ms after endpoint.

4. LLM. Biggest variable. Model size, prompt length, infrastructure matter.

5. TTS. Streaming; first audio ~100-250ms.

6. Network. Co-location matters; cross-region adds tens of ms.

Endpointing — the often-overlooked win

Default endpointing is often conservative:

  • 800ms silence = end of utterance.

Tune:

  • 500ms for routine conversation.
  • 400ms for decisive, fast-paced use cases.
  • 700-900ms when callers need to think.

Every 100ms shaved off endpointing = 100ms faster response.

See voice activity detection in production voice agents.

STT optimizations

  • Streaming. Partials available; finalize fast.
  • Fast models. Deepgram Nova-3, Cartesia.
  • Domain vocabulary biasing. Not just for WER — also shortens recognition time by pruning search space.
  • Reasonable endpointing. STT's endpoint detection can be slower than client-side VAD.

LLM optimizations

Biggest latency contributor. Strategies:

Small fast model for routine. 8B-class model handles 80%+ of turns. Fast.

Frontier model for complex. Escalate to GPT-4o / Claude for hard reasoning moments only.

Prompt compression. Shorter prompts = faster processing. Trim boilerplate.

Streaming output. First token 150-300ms; TTS starts early.

Co-located inference. Self-hosted or provider in same region. Saves 50-80ms.

See why smaller LLMs often win for voice agents.

TTS optimizations

  • Streaming TTS. Mandatory.
  • Fast models. Cartesia, Deepgram Aura.
  • First-audio latency under 200ms. Primary metric.
  • Cache common phrases. Greetings, goodbyes.
  • Sentence-boundary sends. Stream LLM tokens into TTS at sentence ends.

See streaming TTS: how to cut first-audio latency.

Network architecture

  • Co-locate services. All in same cloud region.
  • Direct provider connections. Where available.
  • Edge presence. For international calls.
  • Persistent connections. Keep WebSockets open; avoid reconnect.

Co-location math

Cross-country US: ~70ms RTT. Intra-region (e.g., us-east-1): under 20ms. Same AZ: under 5ms.

Every service hop adds. Co-locating STT + LLM + TTS saves 100-200ms of accumulated round-trips.

The 300ms frontier

Achievable by:

  • 8B LLM co-hosted.
  • Cartesia TTS streaming.
  • Aggressive endpointing (500ms).
  • Deepgram Nova streaming.
  • Same-region infrastructure.
  • Pre-generated common phrases.

Requires specific engineering investment but credible.

Monitoring

  • TTFW (time to first word). End-to-end from caller finish to agent audio.
  • Per-stage latency. STT, LLM, TTS separately.
  • P50, p95, p99. Distribution matters.
  • Trend. Is it regressing over time?

Common sources of regression

Prompt growth. Add one more paragraph → 50ms slower.

Model change. Vendor swaps model; latency changes.

Infrastructure migration. Region change; cross-region calls.

Cold-start patterns. Traffic patterns shift; more cold starts.

Monitor continuously.

The barge-in challenge

Interruption (caller barges in during agent's speech) requires:

  • Fast VAD detection of caller speech.
  • Fast stop of TTS playback.
  • Fast pivot to processing new input.

Sub-200ms from caller start to TTS stop is the target.

See turn-taking and barge-in: the mechanics of natural conversation.

Cost tradeoffs

Lower latency often costs more:

  • Fast models: sometimes more expensive.
  • Co-located infra: operational overhead.
  • Premium TTS: higher per-minute.
  • Multiple fallback paths: redundancy cost.

For consumer-facing deployments, latency pays for itself in conversion.

When to stop optimizing

Sub-500ms reliably achieved:

  • Marginal gains get expensive.
  • User impact diminishes (500ms is already good).
  • Focus might be better on conversation quality.

First-call performance

First call after idle often slower:

  • Model loading.
  • Connection establishment.
  • Cache misses.

Keep-alive / warmup mitigates.

Long-call degradation

After 10+ minutes, some deployments slow:

  • Context window filling.
  • Memory pressure.
  • GC pauses.

Summarize conversation periodically.

Common pitfalls

Not measuring p95/p99. Average is fine; tail is brutal.

Vendor-reported benchmarks. Measure yourself.

One stage non-streaming. Any non-streaming link kills the pipeline.

Large prompts. Every token matters.

Cold paths in production. Edge cases hit slow paths.

FAQ

Can we hit sub-300ms reliably? Yes with effort. Requires engineering investment.

What's the latency floor? Physics + compute. Around 150-200ms minimum for current architectures.

Does streaming matter if my pipeline is fast? Yes. Compounds on top.

What about users on slow networks? Degrades gracefully; transport overhead increases.

How often should we re-benchmark? Monthly baseline; alert on regressions real-time.

Tyler Weitzman
Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.

More from Tyler Weitzman

View all →

Related reading

Voice AI, twice a month.

Get the best of the SIMBA resources hub — new articles, trend notes, and operator guides. No spam.