How LLMs Decide What to Say Next in a Voice Conversation
Step inside the LLM's "head" for a moment and look at how it picks what to say on each turn of a voice call. The answer is less mysterious than the term "AI" suggests and more interesting than "next-token prediction" implies.
Step inside the LLM's "head" for a moment and look at how it picks what to say on each turn of a voice call. The answer is less mysterious than the term "AI" suggests and more interesting than "next-token prediction" implies. Understanding it helps you write better prompts, debug weird replies, and predict where the model will go off the rails.
TL;DR
- The LLM picks tokens one at a time, predicting the most likely next token based on everything before.
- "Everything before" includes the system prompt, conversation history, function definitions, and any retrieved context.
- For function calls vs free text, the model chooses based on prompt structure and tool descriptions.
- For voice specifically: temperature near zero, tight prompts, and clear instructions matter more than for chat.
The basic mechanism
LLMs generate text token by token. On each step:
- Look at all preceding tokens (the entire prompt + everything generated so far).
- Compute a probability distribution over the next possible token.
- Sample from that distribution (or pick the highest-probability token if temperature is 0).
- Append the chosen token; repeat until a stop condition.
For voice agents, this happens 50โ200 times per turn (one per token of the reply).
What the LLM is "thinking about"
It's not thinking about what's true. It's predicting what text would come next given the patterns it learned during training.
This explains a lot:
- Why hallucinations happen (the most "plausible" next token isn't always factually correct)
- Why tight prompts work (they constrain the high-probability paths)
- Why examples in the prompt help (they shift the probability distribution toward your patterns)
- Why temperature matters (lower = more deterministic; higher = more variety)
Function call vs text decision
When the LLM has tool definitions in its prompt, it can choose to emit text OR a structured function call. The decision is made the same way: probability over tokens.
The model picks function call when:
- The user's input pattern-matches a tool's description.
- The system prompt instructed it to call functions for this kind of input.
- The conversation context indicates an action is needed.
It picks text when:
- The user's input is conversational.
- No tool fits.
- The system prompt instructed it to clarify or escalate.
You influence this primarily through tool descriptions and prompt rules.
How temperature affects voice replies
Temperature 0: always picks the highest-probability token. Replies are predictable; same input โ same output.
Temperature 0.7 (default for chat): introduces variety; same input might produce slightly different outputs.
For voice agents, temperature near 0 (0.0โ0.2) is usually right. Reasons:
- Reliability: you want the agent to behave consistently across calls.
- Function calls: low temperature improves function-call reliability.
- Brand voice: variability sounds inconsistent, not creative.
Higher temperature is fine for chat use cases where conversational variety is valued. For phone support, lower is better.
Why the same input can produce different replies
Even with low temperature, replies vary because:
- The conversation context is different (preceding turns differ).
- Retrieved context (RAG) returns different chunks each time.
- Function call results introduce variation.
- Sampling at temperature > 0 has inherent randomness.
If your agent is producing inconsistent replies on the same input with the same context at temperature 0, that's a bug in your system, not in the model.
What the LLM doesn't see
A few things the model doesn't have access to:
- Audio quality (it sees the transcript, not the audio).
- The user's tone of voice (unless you explicitly inject sentiment).
- The user's identity (unless you put it in the prompt or via a function call).
- Prior calls (unless your memory layer surfaces them).
- The current time (unless you inject it).
If you want the agent to consider any of these, put them in the prompt.
Why voice reasoning works differently
A few observations specific to voice:
Tight responses. The model has to learn to be brief because voice doesn't tolerate long responses. Your prompt should explicitly enforce this.
Function-first thinking. For voice, you almost always want the model to call a function before answering โ otherwise it answers from its general knowledge instead of looking up your specific data.
Recovery instincts. When the model isn't sure, it should say "let me check" rather than guess. This isn't natural; you have to train it via prompt.
For more on voice-specific prompting, see prompt engineering for voice (vs text) agents.
Debugging weird replies
When the agent says something unexpected:
- Look at the full prompt that was sent on that turn (system + history + retrieved context).
- Identify what in the prompt could have led to this reply.
- Either fix the prompt or accept that the model has variability you can't fully control.
Most "weird reply" mysteries are explainable from the prompt. Sometimes it's a missing rule; sometimes it's a conflict between two rules; sometimes it's just bad luck with sampling.
What's next
Three trends in how LLMs decide what to say:
Reasoning-augmented generation. Models that "think" before generating, exploring multiple possibilities. Better quality on hard turns; latency cost.
Tool-driven generation. Models that integrate tool outputs into their reasoning more deeply, not just at function-call boundaries.
Voice-aware models. Future models tuned specifically on conversation transcripts will likely be better at voice rhythm and turn-taking.
Related reading
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- Why Context Windows Matter Less Than You Think for Voice
- Multi-Agent Architectures for Customer Service
FAQ
Why does my agent sometimes ignore a clear rule in the prompt? LLMs are probabilistic. Clear rules raise the probability of compliance; they don't guarantee it. Tighten the wording, add an example.
Should I use chain-of-thought prompting in voice? Mostly no โ chain-of-thought adds latency. For voice, structured prompts beat thinking-aloud prompts.
Why does the same model perform differently on different prompts? The prompt shapes the probability distribution. A small change can shift the distribution significantly.
Are LLMs deterministic at temperature 0? Approximately. Floating-point arithmetic and batch effects can cause minor variation even at temp 0.
Can the model "decide" to escalate?
Yes โ if your prompt defines escalation criteria and you provide a transfer_to_human function, the model will pick it under the right conditions.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems โ text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
More from Tyler Weitzman
View all โOpen-Source vs Proprietary Voice Agent Stacks
The open-source voice AI stack in 2026 is genuinely good. Whisper and its derivatives handle STT. Open-weight LLMs like Llama 3/4, Qwen, Mistral handle the reasoning. Open-source TTS (XTTS, StyleTTS, Orpheus-class) handles output.
Build vs Buy: When to Build Your Own Voice Agent
Build-vs-buy for voice agents in 2026 is a different conversation than it was two years ago. Then, the open-source stack was rough and most serious deployments ended up building.
Voice Agents for Developer Support
Developer support is a strange category. Developers don't generally want to call anyone. They want Stack Overflow, they want clear docs, they want an LLM that can read their code.
Related reading
Designing Voice Agents That Ask Better Questions
A voice agent that asks bad questions wastes the caller's time and produces bad data. Good questions feel natural and capture what you need in fewer turns.
Open-Source vs Closed-Source LLMs for Voice Agents
The open-source LLM ecosystem caught up to closed models faster than anyone expected. Llama 3.3, Mistral, Qwen โ all good enough for most voice agent use cases.
Why Context Windows Matter Less Than You Think for Voice
LLM marketing has been all about context window expansion โ 128K, 200K, 1M, 2M tokens. For voice agents, this race mostly doesn't matter. Voice conversations rarely exceed 5,000 tokens of meaningful context.
Voice AI, twice a month.
Get the best of the SIMBA resources hub โ new articles, trend notes, and operator guides. No spam.
