Streaming TTS: How to Cut First-Audio Latency
First-audio latency — the time from when the TTS receives text to when the caller hears the first sound — is one of the biggest levers in voice agent latency optimization. Non-streaming TTS generates the full audio before playing any of it; for a 5-second response, that's a 1-2 second delay before the caller hears anything. Streaming TTS starts playing audio as the first chunk is synthesized, cutting first-audio latency to 100-200ms. This piece covers the mechanics, tradeoffs, and implementation patterns.
TL;DR
- Streaming TTS = generating and playing audio chunks as they're synthesized.
- First-audio latency target: under 200ms.
- Major providers (Cartesia, Simba, Deepgram, OpenAI) all support streaming.
- Integration pattern: chunk-by-chunk audio streaming via WebSocket or HTTP/2.
- Tradeoff: streaming adds complexity; quality may be slightly lower than batch.
Non-streaming vs streaming
Non-streaming:
- Text input.
- Full audio generated.
- Audio returned.
- Playback begins.
- First-audio latency: seconds for long responses.
Streaming:
- Text input (possibly also streaming from LLM).
- First audio chunk synthesized (50-100ms).
- First chunk begins playing.
- Subsequent chunks generated and appended during playback.
- First-audio latency: under 200ms.
Streaming is mandatory for voice agents. Non-streaming feels broken.
First-audio latency target
- Under 100ms: exceptional.
- 100-200ms: good, standard for modern TTS.
- 200-400ms: acceptable.
- Over 400ms: noticeable delay.
How streaming works
TTS models have shifted from autoregressive-per-sample to chunk-based:
- Generate 100-500ms audio chunks.
- Send each chunk over as soon as ready.
- Receiver buffers minimally before playing.
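To make those chunk durations concrete, here is a rough size calculation. It assumes uncompressed 16-bit mono PCM, and the sample rates are illustrative assumptions rather than any provider's documented output; compressed formats will be smaller.

```python
def pcm_chunk_bytes(duration_ms: int, sample_rate_hz: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of an uncompressed PCM chunk of the given duration."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample * channels


# 100 ms chunks at an 8 kHz telephony rate vs. a 24 kHz wideband rate
print(pcm_chunk_bytes(100, 8_000))    # 1600
print(pcm_chunk_bytes(100, 24_000))   # 4800
# 500 ms chunk at 24 kHz
print(pcm_chunk_bytes(500, 24_000))   # 24000
```

Even the largest chunk here is a few tens of kilobytes, so at these durations the latency cost is mostly the time it takes to synthesize the chunk rather than to transfer it.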
Provider support
Simba: streaming WebSocket API, ~150-250ms first-audio.
Cartesia: streaming-first design, ~80-150ms typical.
OpenAI (via Realtime API): streaming, ~150-300ms.
Deepgram Aura: streaming, ~100-200ms.
Google Cloud TTS: streaming available, latency varies.
Azure: streaming supported, enterprise-focused.
Integration pattern
Voice agent backend:
- Receive text input (possibly streaming from LLM).
- Open WebSocket to TTS provider.
- Send text.
- Receive audio chunks.
- Forward chunks to caller's audio stream.
- Close WebSocket when done.
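Or with HTTP/2:
POST /v1/tts (streaming)
Response Content-Type: audio/wav
[read audio chunks from the response body as they arrive]
Note that Transfer-Encoding: chunked is an HTTP/1.1 mechanism. HTTP/2 streams the response body in DATA frames and does not use that header.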
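The forwarding step is the same whichever transport you pick. Below is a minimal sketch of it, assuming the provider connection is wrapped in an async generator. The names relay_tts, fake_synthesize, and send_to_caller are hypothetical and not part of any provider SDK; the fake generator stands in for the real WebSocket or HTTP/2 stream.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


async def relay_tts(
    text: str,
    synthesize: Callable[[str], AsyncIterator[bytes]],
    send_to_caller: Callable[[bytes], Awaitable[None]],
) -> int:
    """Forward each audio chunk to the caller as soon as it arrives."""
    total_bytes = 0
    async for chunk in synthesize(text):
        await send_to_caller(chunk)  # no waiting for the full utterance
        total_bytes += len(chunk)
    return total_bytes


async def fake_synthesize(text: str) -> AsyncIterator[bytes]:
    """Stand-in for a provider stream: one 20-byte chunk per word."""
    for _ in text.split():
        await asyncio.sleep(0.01)  # pretend synthesis time
        yield b"\x00" * 20


async def demo() -> List[bytes]:
    received: List[bytes] = []

    async def send_to_caller(chunk: bytes) -> None:
        received.append(chunk)  # a real agent would write to the call's audio stream

    await relay_tts("Happy to help.", fake_synthesize, send_to_caller)
    return received


if __name__ == "__main__":
    print(len(asyncio.run(demo())), "chunks forwarded")
```

In a real deployment, synthesize would open the provider connection, send the text, and yield audio frames as they come back.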
Handling LLM + TTS streaming
Best pattern:
- LLM streams tokens.
- Accumulate tokens into sentences.
- Each complete sentence → send to TTS.
- TTS streams audio back.
- Audio plays immediately.
This means callers hear the first sentence of the LLM's response while the LLM is still generating the rest.
Example:
- LLM starts generating at 100ms.
- First complete sentence by 300ms.
- First TTS chunk at 450ms.
- Caller hears speech at 500ms.
- LLM still generating; TTS still synthesizing.
- By end, fully played out.
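A sketch of that overlap, using asyncio with stand-in LLM and TTS generators. Everything here is illustrative: the token list, the delays, and the function names are assumptions. The point it shows is that the producer keeps pulling LLM tokens while the consumer is already synthesizing and playing the first sentence.

```python
import asyncio
from typing import AsyncIterator, List


async def llm_tokens() -> AsyncIterator[str]:
    """Stand-in LLM: emits one token every 10 ms."""
    for tok in ["Happy", " to", " help.", " Let", " me", " look", " that", " up."]:
        await asyncio.sleep(0.01)
        yield tok


async def tts(sentence: str) -> AsyncIterator[bytes]:
    """Stand-in TTS: one fake audio chunk per word, 5 ms apart."""
    for word in sentence.split():
        await asyncio.sleep(0.005)
        yield word.encode()


async def produce_sentences(queue: asyncio.Queue, events: List[str]) -> None:
    """Accumulate tokens and enqueue each sentence as soon as it completes."""
    buf = ""
    async for tok in llm_tokens():
        buf += tok
        if buf.endswith((".", "?", "!")):  # naive boundary check for the sketch
            sentence = buf.strip()
            events.append(f"ready:{sentence}")
            await queue.put(sentence)
            buf = ""
    if buf.strip():
        await queue.put(buf.strip())
    await queue.put(None)  # sentinel: LLM is done


async def speak(queue: asyncio.Queue, events: List[str]) -> None:
    """Send each queued sentence to TTS and 'play' chunks as they arrive."""
    while True:
        sentence = await queue.get()
        if sentence is None:
            break
        async for chunk in tts(sentence):
            events.append(f"play:{chunk.decode()}")


async def run_pipeline() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    events: List[str] = []
    producer = asyncio.create_task(produce_sentences(queue, events))
    await speak(queue, events)
    await producer
    return events


if __name__ == "__main__":
    for event in asyncio.run(run_pipeline()):
        print(event)
```

In the printed event log, playback of the first sentence appears before the second sentence is marked ready.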
Sentence boundary detection
To stream into TTS sentence-by-sentence:
- Accumulate LLM output.
- Detect sentence boundaries (period, question mark, exclamation).
- Handle abbreviations ("Dr.", "Mr.") so you don't break on them.
- Send each sentence to TTS immediately.
Better than waiting for the full response.
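One way to sketch the accumulator is shown below. The abbreviation set is a small illustrative sample, not a complete list, and SentenceBuffer is a hypothetical name. This version only treats punctuation as a boundary once whitespace follows it, so it needs one extra token before emitting a sentence; in exchange it does not split on things like "3.14".

```python
import re
from typing import List

# Illustrative sample only; a production list would be longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "vs."}

# Candidate boundary: ., ? or ! that is followed by whitespace.
_BOUNDARY = re.compile(r"[.?!](?=\s)")


class SentenceBuffer:
    """Accumulates streamed tokens and returns completed sentences."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, token: str) -> List[str]:
        self._buf += token
        sentences: List[str] = []
        start = 0  # start of the text not yet emitted
        for match in _BOUNDARY.finditer(self._buf):
            candidate = self._buf[start:match.end()]
            last_word = candidate.split()[-1].lower()
            if last_word in ABBREVIATIONS:
                continue  # "Dr." is not the end of a sentence
            sentences.append(candidate.strip())
            start = match.end()
        self._buf = self._buf[start:]
        return sentences

    def flush(self) -> List[str]:
        """Call when the LLM stream ends to emit whatever is left."""
        rest = self._buf.strip()
        self._buf = ""
        return [rest] if rest else []


if __name__ == "__main__":
    buf = SentenceBuffer()
    out: List[str] = []
    for tok in ["Dr.", " Patel", " is", " in.", " Next", "?"]:
        out += buf.feed(tok)
    out += buf.flush()
    print(out)  # ['Dr. Patel is in.', 'Next?']
```

Each string returned by feed or flush can be sent to TTS immediately.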
Prefetch / speculation
Some advanced setups:
- Predict the LLM's first words based on partial input.
- Pre-start TTS on predicted text.
- If prediction matches, skip latency.
- If not, throw away.
Tricky; not widely deployed yet.
Buffering
Playback needs some buffer to avoid audio glitches:
- Very small buffer (sub-100ms) = lowest latency, risk of underrun.
- Larger buffer (200-500ms) = smoother, higher latency.
Typical: 50-150ms buffer.
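The tradeoff can be explored with a small simulation. This is a simplified model, assuming fixed-length chunks and a prebuffer expressed as a delay after the first chunk arrives; the arrival times are made-up numbers for illustration.

```python
from typing import List


def count_underruns(arrivals_ms: List[float], chunk_ms: float,
                    prebuffer_ms: float) -> int:
    """Count chunks that arrive after playback needs them.

    Playback starts prebuffer_ms after the first chunk arrives. When a
    chunk is late, playback stalls until it arrives, which pushes the
    deadline of every later chunk back by the same amount.
    """
    playback_start = arrivals_ms[0] + prebuffer_ms
    stall_ms = 0.0
    underruns = 0
    for i, arrival in enumerate(arrivals_ms):
        needed_at = playback_start + i * chunk_ms + stall_ms
        if arrival > needed_at:
            underruns += 1
            stall_ms += arrival - needed_at
    return underruns


if __name__ == "__main__":
    # Third chunk is 60 ms late relative to a steady 100 ms cadence.
    arrivals = [0, 100, 260, 300, 400]
    print(count_underruns(arrivals, chunk_ms=100, prebuffer_ms=0))    # 1
    print(count_underruns(arrivals, chunk_ms=100, prebuffer_ms=100))  # 0
```

Here a 100ms prebuffer absorbs the one late chunk, at the price of 100ms added to first-audio latency.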
Error handling
Streaming has more failure modes:
- WebSocket disconnects mid-stream.
- Chunks arrive out of order (not an issue within a single TCP or WebSocket connection, which preserves order; possible with UDP-based transports).
- Synthesis errors partway.
Fallback strategies:
- Detect disconnect, reconnect.
- Complete with non-streaming if streaming fails.
- Cache partial output.
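A minimal sketch of the retry-then-fallback idea follows. It assumes a dropped connection surfaces as ConnectionError and that the provider also offers a non-streaming call; speak_with_fallback, broken_stream, and batch_tts are hypothetical names. It only falls back when no audio has reached the caller yet, since replaying from the start after partial playback would repeat speech.

```python
import asyncio
from typing import List, Tuple


async def speak_with_fallback(text, stream_tts, batch_tts, play,
                              max_retries: int = 1) -> str:
    """Try streaming; retry on disconnect; fall back to batch synthesis."""
    for _ in range(max_retries + 1):
        sent_any = False
        try:
            async for chunk in stream_tts(text):
                await play(chunk)
                sent_any = True
            return "streamed"
        except ConnectionError:
            if sent_any:
                # The caller already heard part of the audio. Let the
                # calling code decide how to resume instead of repeating it.
                raise
    await play(await batch_tts(text))
    return "batch"


async def broken_stream(text: str):
    """Stand-in stream that fails before producing any audio."""
    raise ConnectionError("socket closed before first chunk")
    yield b""  # unreachable; makes this function an async generator


async def batch_tts(text: str) -> bytes:
    """Stand-in for a non-streaming synthesis call."""
    return b"<full audio for: " + text.encode() + b">"


async def demo() -> Tuple[str, List[bytes]]:
    played: List[bytes] = []

    async def play(chunk: bytes) -> None:
        played.append(chunk)

    mode = await speak_with_fallback("Happy to help.", broken_stream,
                                     batch_tts, play)
    return mode, played


if __name__ == "__main__":
    print(asyncio.run(demo())[0])
```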
Quality considerations
Non-streaming TTS sometimes generates slightly better audio because it has full context:
- Intonation at sentence ends.
- Pacing decisions.
- Emphasis.
The gap in 2026 is narrow. Most listeners can't tell the difference.
Cost
Streaming vs non-streaming cost:
- Usually same per-minute.
- Streaming may add slightly (infrastructure overhead).
- Per-minute pricing dominates.
Sample flow
[LLM starts generating]
Time 0ms: LLM begins.
Time 200ms: LLM emits "Happy to help."
Time 250ms: Send "Happy to help." to TTS.
Time 350ms: First audio chunk arrives.
Time 350ms: Playback begins on caller's side.
Time 380ms: LLM emits "Let me look that up."
Time 420ms: Send "Let me look that up." to TTS.
...
Caller hears: "Happy to help. Let me look that up."
starting 350ms after the LLM began generating. Speech recognition and end-of-turn detection time come before Time 0.
Implementation gotchas
- Partial sentence sends: don't send "Happy to hel" to TTS; wait for a sentence boundary.
- Over-eager sends: in "Dr. Patel", don't break on the period after "Dr."
- Race conditions: if the caller interrupts mid-playback, cancel in-flight TTS.
- Audio format consistency: TTS output format must match the downstream audio pipeline (sample rate, encoding).
Interruption handling
Caller barges in while TTS is playing:
- Detect interruption (VAD sees caller speech).
- Cancel remaining TTS synthesis.
- Stop playback.
- Process new input.
Critical for natural conversation.
See turn-taking and barge-in: the mechanics of natural conversation.
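The cancel step can be sketched with an asyncio task. The VAD is replaced here by a fixed sleep and the playback loop is a stand-in, so the names and timings are assumptions. A real agent would also close the provider stream and flush any audio already queued for playback.

```python
import asyncio
from typing import List


async def play_tts(chunks: List[str], played: List[str]) -> None:
    """Stand-in for synthesizing and playing chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.02)  # pretend each chunk takes 20 ms
        played.append(chunk)


async def demo_barge_in() -> List[str]:
    played: List[str] = []
    playback = asyncio.create_task(
        play_tts(["a", "b", "c", "d", "e"], played)
    )

    await asyncio.sleep(0.05)  # stand-in for VAD detecting caller speech
    playback.cancel()          # cancel remaining synthesis and playback
    try:
        await playback
    except asyncio.CancelledError:
        pass                   # expected: the task was cancelled on purpose

    # The agent would now process the caller's new input.
    return played


if __name__ == "__main__":
    print("chunks played before barge-in:", asyncio.run(demo_barge_in()))
```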
Testing
- Local test: measure TTS latency from local requests.
- End-to-end test: simulated calls with metrics.
- Load test: concurrent streams at scale.
Measure first-audio latency, not just total generation time.
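A small harness for that measurement, assuming the TTS stream is exposed as an async iterator of audio chunks. fake_tts_stream is a stand-in; swap in a real provider stream to get meaningful numbers. The timer starts just before the first chunk is requested, which is when the stand-in begins work.

```python
import asyncio
import time
from typing import AsyncIterator, Dict, Optional


async def measure_first_audio(stream: AsyncIterator[bytes]) -> Dict[str, Optional[float]]:
    """Return time to first chunk and total time, both in milliseconds."""
    start = time.perf_counter()
    first_ms: Optional[float] = None
    chunk_count = 0
    async for _ in stream:
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000
        chunk_count += 1
    total_ms = (time.perf_counter() - start) * 1000
    return {"first_audio_ms": first_ms, "total_ms": total_ms,
            "chunks": chunk_count}


async def fake_tts_stream() -> AsyncIterator[bytes]:
    """Stand-in stream: three chunks, 10 ms apart."""
    for _ in range(3):
        await asyncio.sleep(0.01)
        yield b"\x00" * 320


if __name__ == "__main__":
    print(asyncio.run(measure_first_audio(fake_tts_stream())))
```

Reporting both numbers side by side makes it obvious when a configuration change improves total generation time without improving first audio.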
Common pitfalls
- Not streaming: default config uses batch. Always enable streaming explicitly.
- Waiting for full LLM output: negates the win.
- Mismatched formats: audio encoding doesn't match the pipeline, causing glitches.
- Large buffer: a 500ms buffer means 500ms of baseline latency. Minimize it.
- No interruption handling: the caller can't barge in, and the agent feels robotic.
Related reading
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- How to Benchmark a Voice Agent's End-to-End Latency
- Streaming Audio Over WebRTC for Voice Agents
- Comparing Neural TTS Architectures
FAQ
Can we switch TTS vendors at runtime? Possible but adds complexity. Usually pick one per deployment.
What about caching responses? Cache common phrases (greetings, goodbyes). Substantial latency win.
Does streaming support voice cloning? Most vendors yes. Some custom voices don't support streaming.
What about audio effects (background music, etc.)? Layer on top of TTS output. Typically done in audio mixer.
How do we measure first-audio latency? Timestamp when TTS receives input; timestamp when first audio chunk arrives; subtract.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
