Streaming LLM Outputs to Voice: The Engineering
Streaming the LLM's output to TTS as it generates is the difference between a snappy voice agent and a sluggish one. The basic idea is simple: don't wait for the model to finish thinking before you start speaking. The implementation has more edge cases than you'd expect: sentence boundaries, mid-stream cancellation, latency vs. quality trade-offs. This is the engineering layer.
TL;DR
- Stream LLM tokens as they generate; chunk into sentence-or-phrase boundaries; pass to TTS.
- Total perceived latency drops from "wait for full reply" (1-2 seconds) to "first audio in 200-400ms."
- The hard parts: handling mid-stream cancellation, mid-sentence pauses, and non-text outputs (function calls).
- Modern voice agent platforms handle this for you, but understanding it helps when something breaks.
The naive way
Without streaming:
- Send prompt to LLM, wait for full reply.
- Take full reply text, send to TTS.
- Wait for full audio file.
- Stream audio to caller.
Total latency: 1.5-3 seconds depending on reply length.
The streaming way
With streaming:
- Send prompt to LLM with stream: true.
- As tokens arrive, accumulate into a buffer.
- When the buffer contains a complete phrase (sentence, clause), send that phrase to TTS.
- As TTS produces audio chunks, stream them to the caller.
- Continue streaming until the LLM finishes.
Total latency: 200-500ms to first audio.
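To see where the difference comes from, here is a rough latency budget for both pipelines. The stage names and every millisecond value are illustrative assumptions, not measurements; substitute numbers from your own traces.

```typescript
// Illustrative latency budget. Every number below is an assumed placeholder.
interface StageTimings {
  llmFirstTokenMs: number;  // time until the LLM emits its first token
  llmFirstPhraseMs: number; // extra time until the first full phrase is buffered
  llmFullReplyMs: number;   // time until the LLM emits its last token
  ttsFirstChunkMs: number;  // time for TTS to return its first audio chunk
  ttsFullAudioMs: number;   // time for TTS to synthesize the whole reply
}

// Naive pipeline: each stage waits for the previous one to finish completely.
function naiveFirstAudioMs(t: StageTimings): number {
  return t.llmFullReplyMs + t.ttsFullAudioMs;
}

// Streaming pipeline: only the first phrase and the first audio chunk
// sit on the critical path.
function streamingFirstAudioMs(t: StageTimings): number {
  return t.llmFirstTokenMs + t.llmFirstPhraseMs + t.ttsFirstChunkMs;
}

const exampleTimings: StageTimings = {
  llmFirstTokenMs: 150,
  llmFirstPhraseMs: 100,
  llmFullReplyMs: 1200,
  ttsFirstChunkMs: 100,
  ttsFullAudioMs: 800,
};

console.log(naiveFirstAudioMs(exampleTimings));     // 2000
console.log(streamingFirstAudioMs(exampleTimings)); // 350
```

The point of the arithmetic: in the streaming case, reply length drops out of the time-to-first-audio entirely.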
The chunking decision
You have to decide when to send a phrase to TTS. Three strategies:
Sentence-by-sentence. Wait for a ., ?, or !. Send the sentence to TTS.
- Pros: natural prosody from TTS.
- Cons: long sentences delay first audio.
Clause-by-clause. Send on commas as well as sentence-end. More aggressive.
- Pros: lower latency.
- Cons: TTS prosody can be choppy on partial clauses.
Chunk size. Send every N tokens regardless of punctuation.
- Pros: predictable latency.
- Cons: TTS pronunciation suffers on awkward boundaries.
The right choice depends on your TTS provider. Simba Flash and Cartesia Sonic both handle clause-by-clause well. Older TTS systems prefer full sentences.
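A minimal sketch of the three strategies, assuming tokens arrive as plain strings. `PhraseChunker`, `ChunkMode`, and `chunkAll` are names made up for this example, and the boundary rule (punctuation followed by whitespace) is a simplification: it will still split on abbreviations like "Dr. Smith".

```typescript
type ChunkMode = "sentence" | "clause" | "fixed";

class PhraseChunker {
  private buffer = "";
  private tokenCount = 0;
  private readonly mode: ChunkMode;
  private readonly tokensPerChunk: number;

  constructor(mode: ChunkMode, tokensPerChunk: number = 8) {
    this.mode = mode;
    this.tokensPerChunk = tokensPerChunk;
  }

  // Feed one streamed token; returns zero or more phrases ready for TTS.
  push(token: string): string[] {
    this.buffer += token;
    this.tokenCount += 1;
    if (this.mode === "fixed") {
      // Fixed mode ignores punctuation and emits every N tokens.
      return this.tokenCount < this.tokensPerChunk ? [] : this.flush();
    }
    // Require whitespace after the punctuation so "3.5" is not split.
    const boundary = this.mode === "sentence" ? /[.!?]\s+/ : /[.!?,;:]\s+/;
    const ready: string[] = [];
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      ready.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return ready;
  }

  // Call when the LLM stream ends to emit whatever is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    this.tokenCount = 0;
    return rest ? [rest] : [];
  }
}

// Run the same token stream through a given strategy.
const sampleTokens = ["Sure", ",", " I", " can", " help", ".", " What", " time", "?"];

function chunkAll(mode: ChunkMode): string[] {
  const chunker = new PhraseChunker(mode, 4);
  const out: string[] = [];
  for (const token of sampleTokens) out.push(...chunker.push(token));
  out.push(...chunker.flush());
  return out;
}

console.log(chunkAll("sentence")); // ["Sure, I can help.", "What time?"]
console.log(chunkAll("clause"));   // ["Sure,", "I can help.", "What time?"]
```

Clause mode gets "Sure," to TTS several tokens earlier than sentence mode does, which is the latency win and also the source of the choppier prosody.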
Mid-stream cancellation
The hardest case. The user barges in while the LLM is still streaming AND TTS is still synthesizing AND audio is still being sent to the caller.
You have to:
- Cancel the LLM stream (close the SSE/WebSocket connection).
- Cancel any in-flight TTS chunks.
- Flush the audio buffered at the telephony provider.
- Update conversation state to reflect the partial reply.
Each step has its own race condition. The most common bug: the LLM stream is cancelled but tokens that were already in flight still arrive and get processed. Defense: track a generation ID; ignore tokens from a cancelled generation.
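The generation-ID defense can be sketched in a few lines. `GenerationGuard` and `onToken` are hypothetical names for this example; in a real pipeline the same check would also sit in front of TTS chunks and outbound audio.

```typescript
class GenerationGuard {
  private currentId = 0;
  private active = false;

  // Start a new reply; returns the ID that its tokens must carry.
  begin(): number {
    this.currentId += 1;
    this.active = true;
    return this.currentId;
  }

  // Barge-in: mark the current generation as dead.
  cancel(): void {
    this.active = false;
  }

  // True only for tokens that belong to the live generation.
  accepts(generationId: number): boolean {
    return this.active && generationId === this.currentId;
  }
}

const guard = new GenerationGuard();
const spoken: string[] = [];

function onToken(generationId: number, text: string): void {
  if (!guard.accepts(generationId)) return; // late token from a cancelled reply
  spoken.push(text);
}

const firstReply = guard.begin();
onToken(firstReply, "Your order");
guard.cancel();                    // caller interrupts
onToken(firstReply, " ships on");  // was already in flight: dropped
const secondReply = guard.begin();
onToken(secondReply, "Sorry, go ahead.");
onToken(firstReply, " Tuesday");   // even later straggler: still dropped

console.log(spoken); // ["Your order", "Sorry, go ahead."]
```

What ends up in `spoken` for the first reply is also what you would record as the partial reply in conversation state.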
For more on barge-in, see how voice agents handle interruptions gracefully.
Function calls in a streaming context
When the LLM emits a function call instead of text, streaming gets weird. The function call arrives as a JSON structure, not as natural language.
Pattern:
- Detect that the streamed token is part of a function call (specific markers depending on the provider).
- Buffer the full function call before executing.
- Execute the function (potentially slow, so say "let me check on that" first).
- Send the function result back to the LLM as a new message.
- Stream the LLM's next reply.
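The detect-and-buffer steps might look like the sketch below. The `StreamEvent` shape is hypothetical: each provider marks function-call fragments in its own way, so treat this as a normalized form you would map the provider's events into.

```typescript
// Hypothetical normalized stream events; real providers use their own shapes.
type StreamEvent =
  | { kind: "text"; text: string }
  | { kind: "call_delta"; name?: string; argsFragment: string }
  | { kind: "done" };

interface PendingCall {
  name: string;
  args: unknown;
}

class StreamRouter {
  private callName = "";
  private callArgs = "";
  private inCall = false;

  // Returns text to speak, a completed call to execute, or nothing yet.
  handle(event: StreamEvent): { speak?: string; call?: PendingCall } {
    if (event.kind === "text") return { speak: event.text };
    if (event.kind === "call_delta") {
      this.inCall = true;
      if (event.name) this.callName = event.name;
      // Fragments are not valid JSON on their own, so only accumulate here.
      this.callArgs += event.argsFragment;
      return {};
    }
    // kind === "done": only now is the argument JSON complete enough to parse.
    if (!this.inCall) return {};
    const call: PendingCall = { name: this.callName, args: JSON.parse(this.callArgs) };
    this.inCall = false;
    this.callName = "";
    this.callArgs = "";
    return { call };
  }
}

const router = new StreamRouter();
const sampleEvents: StreamEvent[] = [
  { kind: "call_delta", name: "lookup_order", argsFragment: '{"order' },
  { kind: "call_delta", argsFragment: '_id": "A123"}' },
  { kind: "done" },
];
const routed = sampleEvents.map((event) => router.handle(event));

console.log(routed[2].call); // { name: "lookup_order", args: { order_id: "A123" } }
```

Text events go straight to the phrase chunker; call fragments never reach TTS.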
In voice, the function-call latency is the single biggest source of perceived "slow" responses. Mitigations:
- Cap function timeouts (1.5-3 seconds).
- Use the "let me check" bridge.
- Cache function results within a call where possible.
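All three mitigations fit in one small wrapper. This is a sketch under assumptions: `ToolRunner`, `Tool`, and `withTimeout` are names invented here, tools are modeled as functions from a string of arguments to a string result, and a timeout or error is reported as `null` for the caller to handle.

```typescript
type Tool = (args: string) => Promise<string>;

// Resolve with null if the work takes longer than `ms` or throws.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      () => { clearTimeout(timer); resolve(null); }
    );
  });
}

// One instance per phone call, so cached results never leak across callers.
class ToolRunner {
  private readonly cache = new Map<string, string>();
  private readonly timeoutMs: number;
  private readonly say: (text: string) => void;

  constructor(timeoutMs: number, say: (text: string) => void) {
    this.timeoutMs = timeoutMs;
    this.say = say;
  }

  async run(name: string, args: string, tool: Tool): Promise<string | null> {
    const key = name + ":" + args;
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached; // repeat lookup within this call

    this.say("Let me check on that.");       // bridge phrase covers the wait
    const result = await withTimeout(tool(args), this.timeoutMs);
    if (result !== null) this.cache.set(key, result);
    return result;
  }
}
```

A `null` result is the cue to have the agent apologize or offer a callback rather than leave the caller in silence. Timeouts are deliberately not cached, so a later retry can still succeed.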
The mid-sentence pause problem
A subtle issue: the LLM finishes the first sentence quickly, TTS starts playing it, but the LLM's next sentence isn't ready yet. The caller hears a weird mid-thought pause.
Solutions:
Buffer one sentence ahead. Don't start playing the first sentence until the second is at least starting to arrive. Adds 200ms of latency but eliminates the awkward pause.
Use longer sentences in the prompt. "Use sentences of at least 8 words" buys time for the next sentence to be ready.
Prepend a hedge to slow generations. "Let me think about that" before the actual reply.
The right trade-off depends on your use case.
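The first option, buffering one sentence ahead, can be sketched as a small gate between the chunker and TTS. `LookaheadGate` is a hypothetical name, the 200ms cap mirrors the figure above, and timestamps are passed in explicitly so the logic stays easy to test.

```typescript
class LookaheadGate {
  private held: string | null = null;
  private heldSinceMs = 0;
  private readonly maxHoldMs: number;

  constructor(maxHoldMs: number = 200) {
    this.maxHoldMs = maxHoldMs;
  }

  // A complete sentence arrived. Anything previously held is now safe to play.
  onSentence(sentence: string, nowMs: number): string[] {
    const released = this.take();
    this.held = sentence;
    this.heldSinceMs = nowMs;
    return released;
  }

  // The first token of the next sentence arrived: release the held one.
  onNextSentenceStarted(): string[] {
    return this.take();
  }

  // Call periodically; releases the held sentence once the cap is reached.
  onTick(nowMs: number): string[] {
    if (this.held !== null && nowMs - this.heldSinceMs >= this.maxHoldMs) {
      return this.take();
    }
    return [];
  }

  // LLM stream ended: nothing more is coming, so play what is left.
  onEnd(): string[] {
    return this.take();
  }

  private take(): string[] {
    if (this.held === null) return [];
    const sentence = this.held;
    this.held = null;
    return [sentence];
  }
}

const gate = new LookaheadGate(200);
const played: string[] = [];
played.push(...gate.onSentence("Sure.", 0));   // held, nothing played yet
played.push(...gate.onTick(100));              // still inside the hold window
played.push(...gate.onNextSentenceStarted());  // next sentence is arriving
played.push(...gate.onSentence("Your order shipped.", 150));
played.push(...gate.onTick(400));              // held 250ms, over the cap
played.push(...gate.onEnd());

console.log(played); // ["Sure.", "Your order shipped."]
```

The cap matters: without it, a stalled LLM would leave a finished sentence unplayed indefinitely.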
Streaming and partial outputs
A pattern worth knowing: stream the LLM, render a partial transcript on the dashboard in real time. Lets operators monitor calls without waiting for the call to end.
Implementation: forward the LLM's tokens to a WebSocket per call ID; the dashboard subscribes.
Most platforms include this; it's nice-to-have but not critical.
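If you do build it yourself, the core is a per-call publish/subscribe map. The sketch below keeps everything in memory; `TranscriptHub` is a hypothetical name, and in a real deployment each listener would wrap a WebSocket send to one dashboard client.

```typescript
type TranscriptListener = (text: string) => void;

class TranscriptHub {
  private readonly listeners = new Map<string, Set<TranscriptListener>>();

  // Register a listener for one call; returns an unsubscribe function.
  subscribe(callId: string, listener: TranscriptListener): () => void {
    let set = this.listeners.get(callId);
    if (!set) {
      set = new Set<TranscriptListener>();
      this.listeners.set(callId, set);
    }
    set.add(listener);
    const owned = set;
    return () => { owned.delete(listener); };
  }

  // Forward one LLM token to every dashboard watching this call.
  publish(callId: string, token: string): void {
    const set = this.listeners.get(callId);
    if (!set) return;
    set.forEach((listener) => listener(token));
  }
}

const hub = new TranscriptHub();
const dashboard: string[] = [];
const unsubscribe = hub.subscribe("call-42", (text) => { dashboard.push(text); });

hub.publish("call-42", "Hello");
hub.publish("call-7", "ignored");  // different call, no subscribers
hub.publish("call-42", " there");
unsubscribe();
hub.publish("call-42", "!");       // after unsubscribe: not delivered

console.log(dashboard.join("")); // "Hello there"
```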
What providers expect
Streaming support varies:
LLM providers. All major hosted LLMs support streaming via SSE or WebSocket. Use it.
TTS providers. Simba Flash, Cartesia Sonic, OpenAI TTS all support streaming text input and streaming audio output. Older providers may not.
Telephony. Twilio, Plivo, etc. all support streaming media APIs. Slightly fiddly to set up but doable.
If any layer of your stack doesn't stream, you're paying a significant latency cost.
A reference implementation pattern
In TypeScript-ish pseudocode:
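    // `llm` and `tts` stand in for your provider SDK clients (assumed interfaces).
    async function streamReply(prompt: string, onAudio: (chunk: Buffer) => void) {
      const llmStream = await llm.complete(prompt, { stream: true });
      const ttsStream = tts.openStream();

      // Forward audio as soon as TTS produces it, while the LLM is still streaming.
      const audioPump = (async () => {
        for await (const audioChunk of ttsStream.audio) {
          onAudio(audioChunk);
        }
      })();

      let buffer = "";
      for await (const token of llmStream) {
        buffer += token.text;
        // Send every complete sentence currently in the buffer.
        let sentenceEnd = buffer.match(/[.!?]\s+/);
        while (sentenceEnd) {
          // Cut after the punctuation and all of the whitespace that follows it.
          const end = sentenceEnd.index! + sentenceEnd[0].length;
          ttsStream.send(buffer.slice(0, end));
          buffer = buffer.slice(end);
          sentenceEnd = buffer.match(/[.!?]\s+/);
        }
      }
      // Send whatever is left, then close the TTS input (`end()` is an assumed method).
      if (buffer.trim()) ttsStream.send(buffer);
      ttsStream.end();
      await audioPump;
    }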
Real implementations are more complex (function calls, cancellation, error handling) but this is the shape.
Related reading
- Why Smaller LLMs Often Win for Voice Agents
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- How LLMs Decide What to Say Next in a Voice Conversation
FAQ
Is streaming always better? For live, synchronous voice agents, yes. For async work (voicemail handlers, batch outbound), streaming doesn't help.
Can I use a non-streaming LLM with streaming TTS? You can, but you give up most of the latency benefit. The full LLM reply has to be ready before TTS starts.
What about reasoning models that "think" silently before generating? The thinking phase adds latency. For voice, prefer non-reasoning models or use the "let me think" bridge.
Does streaming affect quality? Slightly: TTS prosody on chunk boundaries isn't always perfect. Modern providers are very good; the quality hit is usually invisible.
Can I see the streamed transcript live? Yes. Most platforms expose a live transcript via WebSocket. Useful for ops and debugging.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.