Streaming the LLM's output to TTS as it generates is the difference between a snappy voice agent and a sluggish one. The basic idea is simple: don't wait for the model to finish thinking before you start speaking. The implementation has more edge cases than you'd expect: sentence boundaries, mid-stream cancellation, and latency-versus-quality trade-offs. This article covers that engineering layer.

TL;DR

Stream LLM tokens as they generate; chunk into sentence-or-phrase boundaries; pass to TTS.
Total perceived latency drops from "wait for full reply" (1.5–3 seconds) to "first audio in 200–500ms."
The hard parts: handling mid-stream cancellation, mid-sentence pauses, and non-text outputs (function calls).
Modern voice agent platforms handle this for you — but understanding it helps when something breaks.

The naive way

Without streaming:

Send prompt to LLM, wait for full reply.
Take full reply text, send to TTS.
Wait for full audio file.
Stream audio to caller.

Total latency: 1.5–3 seconds depending on reply length.

The streaming way

With streaming:

Send prompt to LLM with stream: true.
As tokens arrive, accumulate into a buffer.
When the buffer contains a complete phrase (sentence, clause), send that phrase to TTS.
As TTS produces audio chunks, stream them to the caller.
Continue streaming until the LLM finishes.

Total latency: 200–500ms to first audio.
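
To see where the difference comes from, it helps to write down what sits on the critical path before the first audio byte in each pipeline. The sketch below is a back-of-the-envelope model only: every timing in `example` is an illustrative assumption, not a measurement from any provider, and the field names are invented for this example.

```typescript
// All timings are illustrative assumptions, not provider measurements.
interface StageTimings {
  llmFirstTokenMs: number;  // time until the LLM emits its first token
  llmPerTokenMs: number;    // time per generated token after that
  replyTokens: number;      // total tokens in the reply
  firstChunkTokens: number; // tokens in the first sentence or phrase
  ttsFirstAudioMs: number;  // streaming TTS: text in to first audio out
  ttsFullRenderMs: number;  // non-streaming TTS: time to render the whole reply
}

// Naive: the whole reply is generated, then the whole audio file is rendered.
function naiveLatencyMs(t: StageTimings): number {
  return t.llmFirstTokenMs + t.llmPerTokenMs * t.replyTokens + t.ttsFullRenderMs;
}

// Streaming: only the first chunk of text has to exist before TTS starts.
function streamingLatencyMs(t: StageTimings): number {
  return t.llmFirstTokenMs + t.llmPerTokenMs * t.firstChunkTokens + t.ttsFirstAudioMs;
}

const example: StageTimings = {
  llmFirstTokenMs: 150,
  llmPerTokenMs: 15,
  replyTokens: 60,
  firstChunkTokens: 8,
  ttsFirstAudioMs: 100,
  ttsFullRenderMs: 700,
};

console.log(naiveLatencyMs(example));     // 1750
console.log(streamingLatencyMs(example)); // 370
```

The naive figure grows with reply length; the streaming figure depends only on the length of the first chunk.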

The chunking decision

You have to decide when to send a phrase to TTS. Three strategies:

Sentence-by-sentence. Wait for a ., ?, or !. Send the sentence to TTS.

Pros: natural prosody from TTS.
Cons: long sentences delay first audio.

Clause-by-clause. Send on commas as well as sentence-end. More aggressive.

Pros: lower latency.
Cons: TTS prosody can be choppy on partial clauses.

Fixed chunk size. Send every N tokens regardless of punctuation.

Pros: predictable latency.
Cons: TTS pronunciation suffers on awkward boundaries.

The right choice depends on your TTS provider. Simba Flash and Cartesia Sonic both handle clause-by-clause well. Older TTS systems prefer full sentences.
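
The three strategies can be compared side by side with a small chunker. This is a minimal sketch, assuming tokens arrive as plain strings; the class name `PhraseChunker` and its methods are invented for illustration. The punctuation regexes are deliberately simple and will split on abbreviations such as "e.g. ", so treat them as a starting point.

```typescript
type Strategy = "sentence" | "clause" | "fixed";

class PhraseChunker {
  private buffer = "";
  private tokenCount = 0;
  private strategy: Strategy;
  private fixedSize: number;

  constructor(strategy: Strategy, fixedSize = 8) {
    this.strategy = strategy;
    this.fixedSize = fixedSize;
  }

  // Feed one token; returns zero or more chunks ready to send to TTS.
  push(token: string): string[] {
    this.buffer += token;
    if (this.strategy === "fixed") {
      this.tokenCount += 1;
      return this.tokenCount >= this.fixedSize ? this.flush() : [];
    }
    // A boundary is punctuation followed by whitespace, so "3.14" is not split.
    const boundary = this.strategy === "sentence" ? /[.!?]\s+/ : /[.!?,;:]\s+/;
    const out: string[] = [];
    let m: RegExpExecArray | null;
    while ((m = boundary.exec(this.buffer)) !== null) {
      const end = m.index + m[0].length;
      out.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(end);
    }
    return out;
  }

  // Call when the LLM stream ends to emit whatever text is left.
  flush(): string[] {
    const text = this.buffer.trim();
    this.buffer = "";
    this.tokenCount = 0;
    return text ? [text] : [];
  }
}

const tokens = ["Hi", " there", ".", " How", " are", " you", ",", " Sam", "?"];
const chunker = new PhraseChunker("sentence");
const chunks: string[] = [];
for (const t of tokens) chunks.push(...chunker.push(t));
chunks.push(...chunker.flush());
console.log(chunks); // [ 'Hi there.', 'How are you, Sam?' ]
```

With `"clause"` the same tokens yield three chunks ("Hi there.", "How are you,", "Sam?"), and with `"fixed"` and a size of 3 the second boundary lands mid-phrase, which is the awkward-boundary problem described above.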

Mid-stream cancellation

The hardest case. The user barges in while the LLM is still streaming AND TTS is still synthesizing AND audio is still being sent to the caller.

You have to:

Cancel the LLM stream (close the SSE/WebSocket connection).
Cancel any in-flight TTS chunks.
Flush the audio buffered at the telephony provider.
Update conversation state to reflect the partial reply.
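
The four steps above can be sketched as a single barge-in handler. The `TurnHandles` interface is hypothetical: real LLM, TTS, and telephony SDKs expose these operations under their own names, and the `interrupted` flag on the message is an assumption about how you might record a partial reply.

```typescript
// Hypothetical handles; map each one onto whatever your SDKs actually provide.
interface TurnHandles {
  abortLlm(): void;            // e.g. close the SSE/WebSocket connection
  cancelTts(): void;           // drop queued and in-flight synthesis
  clearTelephonyAudio(): void; // flush audio already buffered at the provider
}

interface Message {
  role: "user" | "assistant";
  content: string;
  interrupted?: boolean;
}

// `spokenText` is the part of the reply the caller actually heard.
function handleBargeIn(handles: TurnHandles, history: Message[], spokenText: string): void {
  handles.abortLlm();
  handles.cancelTts();
  handles.clearTelephonyAudio();

  // Record only what was heard, so the next LLM turn does not assume
  // the caller received the full reply.
  const heard = spokenText.trim();
  if (heard) {
    history.push({ role: "assistant", content: heard, interrupted: true });
  }
}
```

Storing the heard portion rather than the full generated text keeps the conversation state aligned with the caller's experience.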

Each step has its own race condition. The most common bug: the LLM stream is cancelled but tokens that were already in flight still arrive and get processed. Defense: track a generation ID; ignore tokens from a cancelled generation.
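
A minimal sketch of the generation-ID defense follows. The class and function names are invented for illustration; the only idea taken from the text above is that every token carries the ID of the reply it belongs to and stale IDs are dropped.

```typescript
class GenerationGuard {
  private current = 0;

  // Call when a new reply starts; returns the ID to attach to its tokens.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  // Call on barge-in: every token tagged with an older ID becomes stale.
  cancel(): void {
    this.current += 1;
  }

  isCurrent(generationId: number): boolean {
    return generationId === this.current;
  }
}

const guard = new GenerationGuard();
const spoken: string[] = [];

function onToken(generationId: number, text: string): void {
  if (!guard.isCurrent(generationId)) return; // late token from a cancelled reply
  spoken.push(text);
}

const first = guard.begin();
onToken(first, "Your order");
guard.cancel();              // the user barges in
onToken(first, " ships on"); // already in flight, arrives late, ignored
const second = guard.begin();
onToken(second, "Sure,");
console.log(spoken); // [ 'Your order', 'Sure,' ]
```

The same check applies to TTS audio chunks: tag each chunk with the generation that produced it and drop it if the ID is stale.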

For more on barge-in, see how voice agents handle interruptions gracefully.

Function calls in a streaming context

When the LLM emits a function call instead of text, streaming gets weird. The function call arrives as a JSON structure, not as natural language.

Pattern:

Detect that the streamed token is part of a function call (specific markers depending on the provider).
Buffer the full function call before executing.
Execute the function (potentially slow — say "let me check on that" first).
Send the function result back to the LLM as a new message.
Stream the LLM's next reply.
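
The detect-and-buffer steps might look like the sketch below. The `Delta` shape is an assumption made for this example; each provider has its own streaming format for function calls, so the field names here will not match any real SDK exactly.

```typescript
// Assumed delta shape: either speakable text or a fragment of a function call.
type Delta =
  | { kind: "text"; text: string }
  | { kind: "tool_call"; callId: string; name?: string; argsFragment: string };

interface ToolCall {
  name: string;
  args: unknown;
}

class ToolCallBuffer {
  private names: Record<string, string> = {};
  private fragments: Record<string, string> = {};

  // Returns text to forward to TTS, or null if the delta was part of a call.
  accept(delta: Delta): string | null {
    if (delta.kind === "text") return delta.text;
    if (delta.name !== undefined) this.names[delta.callId] = delta.name;
    this.fragments[delta.callId] = (this.fragments[delta.callId] ?? "") + delta.argsFragment;
    return null;
  }

  // Call once the stream ends; only then is the argument JSON complete.
  finish(): ToolCall[] {
    return Object.keys(this.fragments).map((callId) => ({
      name: this.names[callId] ?? "unknown",
      args: JSON.parse(this.fragments[callId]),
    }));
  }
}

const toolBuffer = new ToolCallBuffer();
const deltas: Delta[] = [
  { kind: "text", text: "Let me check on that. " },
  { kind: "tool_call", callId: "call_1", name: "lookup_order", argsFragment: '{"order' },
  { kind: "tool_call", callId: "call_1", argsFragment: 'Id": 42}' },
];
const speakable = deltas.map((d) => toolBuffer.accept(d));
const pendingCalls = toolBuffer.finish();
console.log(speakable);    // [ 'Let me check on that. ', null, null ]
console.log(pendingCalls); // [ { name: 'lookup_order', args: { orderId: 42 } } ]
```

Parsing only in `finish()` matters because a partial fragment such as `{"order` is not valid JSON and would throw if parsed early.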

In voice, the function-call latency is the single biggest source of perceived "slow" responses. Mitigations:

Cap function timeouts (1.5–3 seconds).
Use the "let me check" bridge.
Cache function results within a call where possible.
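
Two of these mitigations, the timeout cap and the per-call cache, can be sketched in a few lines. The helper names `withTimeout` and `CallCache` are invented for this example. Note that the timeout only stops waiting; it does not cancel the underlying request, which would need provider-specific support.

```typescript
// Resolves with `fallback` if `work` has not settled within `ms` milliseconds.
// The underlying work keeps running; only the wait is capped.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// One instance per phone call, discarded when the call ends.
class CallCache {
  private results: Record<string, Promise<unknown> | undefined> = {};

  // Keyed on function name plus serialized arguments.
  // Caveat: a failed or timed-out result is cached too in this sketch.
  get<T>(name: string, args: unknown, run: () => Promise<T>): Promise<T> {
    const key = name + ":" + JSON.stringify(args);
    let hit = this.results[key];
    if (hit === undefined) {
      hit = run();
      this.results[key] = hit;
    }
    return hit as Promise<T>;
  }
}
```

A typical use would wrap the real lookup in both, for example `cache.get("lookup_order", { orderId }, () => withTimeout(lookup(orderId), 2000, fallbackReply))`, where `lookup` and `fallbackReply` stand in for your own function and fallback value.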

The mid-sentence pause problem

A subtle issue: the LLM finishes the first sentence quickly, TTS starts playing it, but the LLM's next sentence isn't ready yet. The caller hears a weird mid-thought pause.

Solutions:

Buffer one sentence ahead. Don't start playing the first sentence until the second is at least starting to arrive. Adds 200ms of latency but eliminates the awkward pause.

Use longer sentences in the prompt. "Use sentences of at least 8 words" — buys time for the next sentence to be ready.

Prepend a hedge to slow generations. "Let me think about that" before the actual reply.

The right trade-off depends on your use case.
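
The first option, buffering one sentence ahead, might be sketched as a small gate between the chunker and the TTS player. The class and method names are assumptions made for this example, and a production version would likely also cap the hold time with a timer so a stalled LLM cannot block audio indefinitely.

```typescript
// Holds each completed sentence until the next one has started arriving,
// or until the LLM stream ends.
class LookaheadGate {
  private held: string[] = [];

  // A sentence is complete but should not be played yet.
  onSentenceComplete(sentence: string): void {
    this.held.push(sentence);
  }

  // Any token of the following sentence arrived: release what was held.
  onNextSentenceToken(): string[] {
    return this.release();
  }

  // Nothing more is coming, so release everything that is left.
  onStreamEnd(): string[] {
    return this.release();
  }

  private release(): string[] {
    const out = this.held;
    this.held = [];
    return out;
  }
}

const gate = new LookaheadGate();
gate.onSentenceComplete("Your order shipped yesterday.");
const afterNextToken = gate.onNextSentenceToken(); // next sentence has begun
gate.onSentenceComplete("It should arrive on Friday.");
const atEnd = gate.onStreamEnd();
console.log(afterNextToken); // [ 'Your order shipped yesterday.' ]
console.log(atEnd);          // [ 'It should arrive on Friday.' ]
```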

Streaming and partial outputs

A pattern worth knowing: stream the LLM, render a partial transcript on the dashboard in real time. Lets operators monitor calls without waiting for the call to end.

Implementation: forward the LLM's tokens to a WebSocket per call ID; the dashboard subscribes.
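
The fan-out logic is small. The sketch below uses plain callbacks as an in-memory stand-in for WebSocket connections, so `TranscriptHub` and its methods are illustrative names; in a real deployment each listener would wrap a socket's send call.

```typescript
type Listener = (text: string) => void;

// In-memory stand-in for "one WebSocket channel per call ID".
class TranscriptHub {
  private listeners: Record<string, Listener[] | undefined> = {};

  // Returns an unsubscribe function for when the dashboard disconnects.
  subscribe(callId: string, listener: Listener): () => void {
    const list = this.listeners[callId] ?? [];
    list.push(listener);
    this.listeners[callId] = list;
    return () => {
      this.listeners[callId] = (this.listeners[callId] ?? []).filter((l) => l !== listener);
    };
  }

  // Called from the token loop with each piece of LLM text.
  publish(callId: string, text: string): void {
    for (const listener of this.listeners[callId] ?? []) listener(text);
  }
}

const hub = new TranscriptHub();
const seen: string[] = [];
const unsubscribe = hub.subscribe("call-123", (text) => { seen.push(text); });
hub.publish("call-123", "Hello");
hub.publish("call-999", "other call"); // different call ID, not delivered
unsubscribe();
hub.publish("call-123", " again");     // no subscribers left
console.log(seen); // [ 'Hello' ]
```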

Most platforms include this; it's nice-to-have but not critical.

What providers expect

Streaming support varies:

LLM providers. All major hosted LLMs support streaming via SSE or WebSocket. Use it.

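TTS providers. Simba Flash, Cartesia Sonic, and OpenAI TTS all support streaming audio output. Whether a given endpoint also accepts incremental text input varies, so check the provider's documentation. Older providers may not stream at all.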

Telephony. Twilio, Plivo, etc. all support streaming media APIs. Slightly fiddly to set up but doable.

If any layer of your stack doesn't stream, you're paying a significant latency cost.

A reference implementation pattern

In TypeScript-ish pseudocode:

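// `llm` and `tts` are placeholder clients; swap in your provider SDKs.
async function streamReply(prompt: string, onAudio: (chunk: Buffer) => void) {
  const llmStream = await llm.complete(prompt, { stream: true });
  const ttsStream = tts.openStream();

  // Start forwarding audio right away, concurrently with token consumption.
  // Draining audio only after the token loop would delay all playback
  // until the LLM has finished.
  const audioPump = (async () => {
    for await (const audioChunk of ttsStream.audio) {
      onAudio(audioChunk);
    }
  })();

  let buffer = "";
  for await (const token of llmStream) {
    buffer += token.text;

    // Flush every complete sentence currently in the buffer.
    let match: RegExpMatchArray | null;
    while ((match = buffer.match(/[.!?]\s+/))) {
      // Use the real match length; the whitespace run may be longer than one character.
      const end = match.index! + match[0].length;
      ttsStream.send(buffer.slice(0, end));
      buffer = buffer.slice(end);
    }
  }
  if (buffer.trim()) ttsStream.send(buffer);
  ttsStream.end(); // assumed method: signals that no more text is coming

  await audioPump;
}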

Real implementations are more complex (function calls, cancellation, error handling) but this is the shape.

FAQ

Is streaming always better? For synchronous (live) voice agents, yes. For asynchronous workloads (voicemail handlers, batch outbound), streaming doesn't help.

Can I use a non-streaming LLM with streaming TTS? You can, but you give up most of the latency benefit. The full LLM reply has to be ready before TTS starts.

What about reasoning models that "think" silently before generating? The thinking phase adds latency. For voice, prefer non-reasoning models or use the "let me think" bridge.

Does streaming affect quality? Slightly — TTS prosody on chunk boundaries isn't always perfect. Modern providers are very good; the quality hit is usually invisible.

Can I see the streamed transcript live? Yes — most platforms expose a live transcript via WebSocket. Useful for ops and debugging.

Streaming LLM Outputs to Voice: The Engineering


More from Tyler Weitzman
