First-audio latency — the time from when the TTS receives text to when the caller hears the first sound — is one of the biggest levers in voice agent latency optimization. Non-streaming TTS generates the full audio before playing any of it; for a 5-second response, that's a 1-2 second delay before the caller hears anything. Streaming TTS starts playing audio as the first chunk is synthesized, cutting first-audio latency to 100-200ms. This piece covers the mechanics, tradeoffs, and implementation patterns.

TL;DR

Streaming TTS = generating and playing audio chunks as they're synthesized.
First-audio latency target: under 200ms.
Major providers (Cartesia, Simba, Deepgram, OpenAI) all support streaming.
Integration pattern: chunk-by-chunk audio streaming via WebSocket or HTTP/2.
Tradeoff: streaming adds complexity; quality may be slightly lower than batch.

Non-streaming vs streaming

Non-streaming:

Text input.
Full audio generated.
Audio returned.
Playback begins.
First-audio latency: seconds for long responses.

Streaming:

Text input (possibly also streaming from LLM).
First audio chunk synthesized (50-100ms).
First chunk begins playing.
Subsequent chunks generated and appended during playback.
First-audio latency: under 200ms.

Streaming is mandatory for voice agents. Non-streaming feels broken.

First-audio latency target

Under 100ms: exceptional.
100-200ms: good, standard for modern TTS.
200-400ms: acceptable.
Over 400ms: noticeable delay.

How streaming works

TTS models have shifted from autoregressive-per-sample to chunk-based:

Generate 100-500ms audio chunks.
Send each chunk over as soon as ready.
Receiver buffers minimally before playing.
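
To give a sense of scale, the chunk durations above can be turned into byte sizes. This is a minimal sketch assuming uncompressed 16-bit mono PCM; the sample rates are illustrative and not tied to any particular provider.

```python
def pcm_chunk_bytes(duration_ms: int, sample_rate_hz: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of an uncompressed PCM chunk of the given duration."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample * channels


# Illustrative rates: 8 kHz (telephony) and 24 kHz (an assumed wideband output).
for rate in (8000, 24000):
    for ms in (100, 500):
        print(f"{ms} ms @ {rate} Hz = {pcm_chunk_bytes(ms, rate)} bytes")
```

Smaller chunks reach the caller sooner but mean more messages per second of audio, which is the tradeoff behind the 100-500ms range.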

Provider support

Simba: streaming WebSocket API, ~150-250ms first-audio.

Cartesia: streaming-first design, ~80-150ms typical.

OpenAI (via Realtime API): streaming, ~150-300ms.

Deepgram Aura: streaming, ~100-200ms.

Google Cloud TTS: streaming available, latency varies.

Azure: streaming supported, enterprise-focused.

Integration pattern

Voice Agent backend:
  Receive text input (possibly streaming from LLM).
  Open WebSocket to TTS provider.
  Send text.
  Receive audio chunks.
  Forward chunks to caller's audio stream.
  Close WebSocket when done.

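Or with HTTP streaming (HTTP/1.1 chunked or HTTP/2):

POST /v1/tts (streaming)

Response headers:
Content-Type: audio/wav
Transfer-Encoding: chunked (HTTP/1.1 only; HTTP/2 has no chunked encoding and streams the body as DATA frames)

[receive chunks as they arrive]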
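
The backend steps above can be sketched as a small forwarding loop. This is an illustration only: `fake_tts_stream` is a hypothetical stand-in for a provider connection (no real provider API is shown), and `send_to_caller` is an assumed callback into the telephony or WebRTC leg.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def fake_tts_stream(text: str) -> AsyncIterator[bytes]:
    """Hypothetical provider stream: yields one placeholder chunk per word."""
    for word in text.split():
        await asyncio.sleep(0.01)   # pretend synthesis time per chunk
        yield word.encode()         # placeholder bytes, not real audio


async def speak(text: str,
                tts_stream: Callable[[str], AsyncIterator[bytes]],
                send_to_caller: Callable[[bytes], Awaitable[None]]) -> int:
    """Forward each chunk to the caller as soon as it arrives; return the count."""
    count = 0
    async for chunk in tts_stream(text):
        await send_to_caller(chunk)  # never wait for the full utterance
        count += 1
    return count


async def demo() -> list[bytes]:
    received: list[bytes] = []

    async def send_to_caller(chunk: bytes) -> None:
        received.append(chunk)

    await speak("Happy to help.", fake_tts_stream, send_to_caller)
    return received
```

Calling `asyncio.run(demo())` collects the chunks in arrival order. In a real deployment the stream function would wrap the provider's WebSocket or HTTP response.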

Handling LLM + TTS streaming

Best pattern:

LLM streams tokens.
Accumulate tokens into sentences.
Each complete sentence → send to TTS.
TTS streams audio back.
Audio plays immediately.

This means callers hear the first sentence of the LLM's response while the LLM is still generating the rest.

Example:

LLM starts generating at 100ms.
First complete sentence by 300ms.
First TTS chunk at 450ms.
Caller hears speech at 500ms.
LLM still generating; TTS still synthesizing.
By end, fully played out.
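
The arithmetic behind this example, as a rough sketch. The 150ms TTS first-chunk time and 50ms playback delay are read off the timeline above; the 1500ms figure for the full LLM response is a hypothetical value chosen only for comparison.

```python
def first_audio_ms(text_ready_ms: int,
                   tts_first_chunk_ms: int = 150,
                   playback_ms: int = 50) -> int:
    """Time at which the caller first hears audio, given when text reaches TTS."""
    return text_ready_ms + tts_first_chunk_ms + playback_ms


sentence_streaming = first_audio_ms(300)   # first sentence ready at 300 ms
wait_for_full_llm = first_audio_ms(1500)   # hypothetical: full response at 1500 ms

print(f"sentence-by-sentence: {sentence_streaming} ms")
print(f"wait for full output: {wait_for_full_llm} ms")
```

The TTS and playback costs are the same in both cases; the difference comes entirely from when the text is handed to TTS.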

Sentence boundary detection

To stream into TTS sentence-by-sentence:

Accumulate LLM output.
Detect sentence boundaries (period, question mark, exclamation).
Handle abbreviations ("Dr.", "Mr.") so you don't break on them.
Send each sentence to TTS immediately.

Better than waiting for the full response.
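
One way the steps above might look in code. This is a minimal sketch, not a production segmenter: the abbreviation list is illustrative and far from complete, and a boundary is only accepted once whitespace follows the punctuation, so a decimal like "3.5" or a period that is still the last character in the buffer does not trigger a send.

```python
import re
from typing import Iterable, Iterator

# Illustrative only; a real list would be longer and language-specific.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}
_BOUNDARY = re.compile(r"[.!?](?=\s)")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        search_from = 0
        while True:
            match = _BOUNDARY.search(buffer, search_from)
            if match is None:
                break
            candidate = buffer[:match.end()]
            if candidate.split()[-1].lower() in ABBREVIATIONS:
                search_from = match.end()   # "Dr." is not a sentence end
                continue
            yield candidate.strip()
            buffer = buffer[match.end():]
            search_from = 0
    if buffer.strip():
        yield buffer.strip()                # flush whatever remains at the end


tokens = ["Dr", ".", " Patel", " is", " in", ".", " Can", " I", " help", "?"]
print(list(sentences_from_tokens(tokens)))
```

Waiting for the following whitespace costs one extra token of delay per sentence, which is usually a small price for not splitting on "Dr." or "3.5".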

Prefetch / speculation

Some advanced setups:

Predict the LLM's first words based on partial input.
Pre-start TTS on predicted text.
If prediction matches, skip latency.
If not, throw away.

Tricky; not widely deployed yet.
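
The keep-or-discard decision can be sketched as below. Everything here is hypothetical: how the prediction is made and how the speculative audio is produced are outside the sketch, and the whitespace and case normalization is just one plausible matching rule.

```python
def resolve_speculation(predicted: str, actual_prefix: str,
                        speculative_audio: list[bytes]) -> tuple[list[bytes], bool]:
    """Keep pre-synthesized audio only if the LLM's real opening matches the guess."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    if norm(actual_prefix).startswith(norm(predicted)):
        return speculative_audio, True    # hit: audio is ready with no TTS wait
    return [], False                      # miss: throw the audio away


audio = [b"chunk-1", b"chunk-2"]          # placeholder bytes, not real audio
print(resolve_speculation("Happy to help.", "Happy to help. Let me check.", audio))
print(resolve_speculation("Happy to help.", "Sorry, I can't do that.", audio))
```

A miss costs the synthesis spend for the discarded audio, which is part of why this is hard to justify in practice.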

Buffering

Playback needs some buffer to avoid audio glitches:

Very small buffer (sub-100ms) = lowest latency, risk of underrun.
Larger buffer (200-500ms) = smoother, higher latency.

Typical: 50-150ms buffer.
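
The tradeoff can be seen in a small simulation. This is a sketch under simplifying assumptions: fixed-length chunks, arrival times given in order, and playback that stalls until the late chunk arrives. The arrival times are made up for illustration.

```python
import math


def simulate_playback(arrival_ms: list[float], chunk_ms: float,
                      prebuffer_ms: float) -> tuple[float, int]:
    """Return (playback start time, underrun count) for a given pre-buffer size."""
    # Number of chunks that must be buffered before playback starts.
    needed = min(len(arrival_ms), max(1, math.ceil(prebuffer_ms / chunk_ms)))
    start = arrival_ms[needed - 1]
    clock = start
    underruns = 0
    for arrival in arrival_ms:
        if arrival > clock:       # chunk not there when needed: audible gap
            underruns += 1
            clock = arrival
        clock += chunk_ms
    return start, underruns


arrivals = [0, 100, 260, 300]     # third chunk is late
print(simulate_playback(arrivals, chunk_ms=100, prebuffer_ms=0))
print(simulate_playback(arrivals, chunk_ms=100, prebuffer_ms=200))
```

With no pre-buffer, playback starts immediately but glitches on the late chunk; with a 200ms pre-buffer it starts 100ms later and plays through cleanly.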

Error handling

Streaming has more failure modes:

WebSocket disconnects mid-stream.
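Chunks arrive out of order (TCP and WebSocket deliver in order on a single connection, so this mainly affects UDP-based transports or audio split across connections).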
Synthesis errors partway.

Fallback strategies:

Detect disconnect, reconnect.
Complete with non-streaming if streaming fails.
Cache partial output.
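
A sketch of the retry-then-fallback idea. `stream_tts` and `batch_tts` are hypothetical callables standing in for whatever client is in use, and `ConnectionError` is an assumed failure signal. The sketch only retries when no audio has reached the caller yet; resuming mid-utterance depends on the provider and is left out.

```python
from typing import Callable, Iterator


def synthesize_with_fallback(text: str,
                             stream_tts: Callable[[str], Iterator[bytes]],
                             batch_tts: Callable[[str], bytes],
                             max_attempts: int = 2) -> Iterator[bytes]:
    """Try streaming; retry on early disconnect; fall back to one batch request."""
    for _ in range(max_attempts):
        sent_any = False
        try:
            for chunk in stream_tts(text):
                sent_any = True
                yield chunk
            return                        # stream finished normally
        except ConnectionError:
            if sent_any:
                # Audio already played; restarting would repeat speech.
                raise
    yield batch_tts(text)                 # last resort: non-streaming


def dead_stream(text: str) -> Iterator[bytes]:
    """Hypothetical stream that always fails before producing audio."""
    raise ConnectionError("provider unreachable")
    yield b""                             # unreachable; makes this a generator


print(list(synthesize_with_fallback("hi", dead_stream, lambda t: b"full-audio")))
```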

Quality considerations

Non-streaming TTS sometimes generates slightly better audio because it has full context:

Intonation at sentence ends.
Pacing decisions.
Emphasis.

The gap in 2026 is narrow. Most listeners can't tell the difference.

Cost

Streaming vs non-streaming cost:

Usually same per-minute.
Streaming may add slightly (infrastructure overhead).
Per-minute pricing dominates.

Sample flow

[LLM starts generating]
Time 0ms: LLM begins.
Time 200ms: LLM emits "Happy to help."
Time 250ms: Send "Happy to help." to TTS.
Time 350ms: First audio chunk arrives.
Time 350ms: Playback begins on caller's side.
Time 380ms: LLM emits "Let me look that up."
Time 420ms: Send "Let me look that up." to TTS.
...

Caller hears: "Happy to help. Let me look that up."
starting about 350ms after the LLM began generating (speech recognition and end-of-turn detection add to this from the caller's point of view).

Implementation gotchas

Partial sentence sends. Don't send "Happy to hel" to TTS — wait for sentence boundary.

Over-eager sends. "Dr. Patel" — don't break on period after "Dr."

Race conditions. If caller interrupts mid-playback, cancel in-flight TTS.

Audio format consistency. TTS output format must match downstream audio pipeline (sample rate, encoding).
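
For the format gotcha, a quick check at startup can catch mismatches before they become glitches. This sketch assumes the TTS output is WAV-wrapped and that the pipeline expects 8 kHz, 16-bit mono; both are assumptions for illustration. Raw PCM streams carry no header, so there the format has to come from the request parameters instead.

```python
import io
import wave


def wav_format(wav_bytes: bytes) -> tuple[int, int, int]:
    """Return (sample rate in Hz, bytes per sample, channels) from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getframerate(), wf.getsampwidth(), wf.getnchannels()


def check_format(wav_bytes: bytes,
                 expected: tuple[int, int, int] = (8000, 2, 1)) -> None:
    """Raise if the audio does not match what the downstream pipeline expects."""
    actual = wav_format(wav_bytes)
    if actual != expected:
        raise ValueError(f"TTS output is {actual}, pipeline expects {expected}")


def make_silent_wav(sample_rate_hz: int, n_frames: int = 160) -> bytes:
    """Build a tiny silent 16-bit mono WAV, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate_hz)
        wf.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


check_format(make_silent_wav(8000))          # passes silently
print(wav_format(make_silent_wav(24000)))    # would fail check_format
```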

Interruption handling

Caller barges in while TTS is playing:

Detect interruption (VAD sees caller speech).
Cancel remaining TTS synthesis.
Stop playback.
Process new input.

Critical for natural conversation.
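
The cancel-on-barge-in steps can be sketched with asyncio task cancellation. This is an illustration under assumptions: `play_response` stands in for synthesis plus playback, and the `barge_in` event is assumed to be set by whatever VAD is in use.

```python
import asyncio


async def play_response(chunks: list[str], played: list[str]) -> None:
    """Stand-in for synthesizing and playing chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.02)    # pretend per-chunk synthesis + playback
        played.append(chunk)


async def handle_turn(chunks: list[str], barge_in: asyncio.Event) -> list[str]:
    """Play until finished or until the caller interrupts; return what was played."""
    played: list[str] = []
    playback = asyncio.create_task(play_response(chunks, played))
    interrupt = asyncio.create_task(barge_in.wait())
    await asyncio.wait({playback, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()            # stops in-flight TTS, or the idle waiter
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played


async def demo() -> list[str]:
    barge_in = asyncio.Event()
    # Simulate the VAD firing 50 ms into a 200 ms response.
    asyncio.get_running_loop().call_later(0.05, barge_in.set)
    return await handle_turn(list("abcdefghij"), barge_in)
```

Running `asyncio.run(demo())` returns only the chunks played before the interruption. A real pipeline would also need to flush audio already queued at the caller's end.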

See turn-taking and barge-in: the mechanics of natural conversation.

Testing

Local test: measure TTS latency from local requests.
End-to-end test: simulated calls with metrics.
Load test: concurrent streams at scale.

Measure first-audio latency, not just total generation time.
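
A small timing harness makes the distinction concrete. This is a sketch: `fake_stream` is a hypothetical stand-in for a provider's chunk iterator, and the timer starts when iteration begins, which approximates the moment the request is sent.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[bytes]) -> tuple[float, float]:
    """Return (first-chunk latency, total time) in milliseconds."""
    start = time.perf_counter()
    first = None
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no audio")
    return first * 1000, total * 1000


def fake_stream(n: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stream: n chunks, each taking delay_s to 'synthesize'."""
    for _ in range(n):
        time.sleep(delay_s)
        yield b"\x00" * 320


first_ms, total_ms = measure_stream(fake_stream())
print(f"first audio: {first_ms:.0f} ms, total: {total_ms:.0f} ms")
```

Track percentiles across many runs rather than a single sample; tail latency is what callers notice.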

Common pitfalls

Not streaming. Some SDKs and endpoints default to batch. Always enable streaming explicitly.

Waiting for full LLM output. Negates the win.

Mismatched formats. Audio encoding doesn't match pipeline. Glitches.

Large buffer. 500ms buffer = 500ms baseline latency. Minimize.

No interruption handling. Caller can't barge in; feels robotic.

FAQ

Can we switch TTS vendors at runtime? Possible but adds complexity. Usually pick one per deployment.

What about caching responses? Cache common phrases (greetings, goodbyes). Substantial latency win.

Does streaming support voice cloning? Most vendors yes. Some custom voices don't support streaming.

What about audio effects (background music, etc.)? Layer on top of TTS output. Typically done in audio mixer.

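How do we measure first-audio latency? Timestamp when TTS receives input; timestamp when the first audio chunk arrives; subtract. That gives the TTS-side number; what the caller actually hears also includes network transit and the playback buffer.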
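
The phrase caching mentioned in the FAQ above might look like the following. This is a minimal in-memory sketch: `synthesize` is a hypothetical callable for whatever TTS client is in use, and keying on voice plus whitespace-normalized text is one reasonable choice, not the only one.

```python
from typing import Callable


class PhraseCache:
    """Cache synthesized audio for fixed phrases such as greetings."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._audio: dict[tuple[str, str], bytes] = {}

    def get(self, voice: str, text: str) -> bytes:
        key = (voice, " ".join(text.split()))    # collapse stray whitespace
        if key not in self._audio:
            self._audio[key] = self._synthesize(voice, text)  # pay TTS cost once
        return self._audio[key]


calls: list[str] = []


def fake_synthesize(voice: str, text: str) -> bytes:
    """Hypothetical TTS call; records each invocation so hits can be counted."""
    calls.append(text)
    return f"{voice}:{text}".encode()            # placeholder, not real audio


cache = PhraseCache(fake_synthesize)
cache.get("voice-a", "Thanks for calling!")
cache.get("voice-a", "Thanks  for calling!")     # same phrase, served from cache
print(len(calls))
```

Cached audio has to be regenerated if the voice, model version, or output format changes.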

Streaming TTS: How to Cut First-Audio Latency
