Step inside the LLM's "head" for a moment and look at how it picks what to say on each turn of a voice call. The answer is less mysterious than the term "AI" suggests and more interesting than "next-token prediction" implies. Understanding it helps you write better prompts, debug weird replies, and predict where the model will go off the rails.

TL;DR

The LLM picks tokens one at a time, predicting the most likely next token based on everything before.
"Everything before" includes the system prompt, conversation history, function definitions, and any retrieved context.
For function calls vs free text, the model chooses based on prompt structure and tool descriptions.
For voice specifically: temperature near zero, tight prompts, and clear instructions matter more than for chat.

The basic mechanism

LLMs generate text token by token. On each step:

Look at all preceding tokens (the entire prompt + everything generated so far).
Compute a probability distribution over the next possible token.
Sample from that distribution (or pick the highest-probability token if temperature is 0).
Append the chosen token; repeat until a stop condition.

For voice agents, this happens 50–200 times per turn (one per token of the reply).

What the LLM is "thinking about"

It's not thinking about what's true. It's predicting what text would come next given the patterns it learned during training.

This explains a lot:

Why hallucinations happen (the most "plausible" next token isn't always factually correct)
Why tight prompts work (they constrain the high-probability paths)
Why examples in the prompt help (they shift the probability distribution toward your patterns)
Why temperature matters (lower = more deterministic; higher = more variety)

Function call vs text decision

When the LLM has tool definitions in its prompt, it can choose to emit text OR a structured function call. The decision is made the same way: probability over tokens.

The model picks function call when:

The user's input pattern-matches a tool's description.
The system prompt instructed it to call functions for this kind of input.
The conversation context indicates an action is needed.

It picks text when:

The user's input is conversational.
No tool fits.
The system prompt instructed it to clarify or escalate.

You influence this primarily through tool descriptions and prompt rules.

How temperature affects voice replies

Temperature 0: always picks the highest-probability token. Replies are predictable; same input → same output.

Temperature 0.7 (default for chat): introduces variety; same input might produce slightly different outputs.

For voice agents, temperature near 0 (0.0–0.2) is usually right. Reasons:

Reliability: you want the agent to behave consistently across calls.
Function calls: low temperature improves function-call reliability.
Brand voice: variability sounds inconsistent, not creative.

Higher temperature is fine for chat use cases where conversational variety is valued. For phone support, lower is better.

Why the same input can produce different replies

Even with low temperature, replies vary because:

The conversation context is different (preceding turns differ).
Retrieved context (RAG) returns different chunks each time.
Function call results introduce variation.
Sampling at temperature > 0 has inherent randomness.

If your agent is producing inconsistent replies on the same input with the same context at temperature 0, that's a bug in your system, not in the model.

What the LLM doesn't see

A few things the model doesn't have access to:

Audio quality (it sees the transcript, not the audio).
The user's tone of voice (unless you explicitly inject sentiment).
The user's identity (unless you put it in the prompt or via a function call).
Prior calls (unless your memory layer surfaces them).
The current time (unless you inject it).

If you want the agent to consider any of these, put them in the prompt.

Why voice reasoning works differently

A few observations specific to voice:

Tight responses. The model has to learn to be brief because voice doesn't tolerate long responses. Your prompt should explicitly enforce this.

Function-first thinking. For voice, you almost always want the model to call a function before answering — otherwise it answers from its general knowledge instead of looking up your specific data.

Recovery instincts. When the model isn't sure, it should say "let me check" rather than guess. This isn't natural; you have to train it via prompt.

For more on voice-specific prompting, see prompt engineering for voice (vs text) agents.

Debugging weird replies

When the agent says something unexpected:

Look at the full prompt that was sent on that turn (system + history + retrieved context).
Identify what in the prompt could have led to this reply.
Either fix the prompt or accept that the model has variability you can't fully control.

Most "weird reply" mysteries are explainable from the prompt. Sometimes it's a missing rule; sometimes it's a conflict between two rules; sometimes it's just bad luck with sampling.

What's next

Three trends in how LLMs decide what to say:

Reasoning-augmented generation. Models that "think" before generating, exploring multiple possibilities. Better quality on hard turns; latency cost.

Tool-driven generation. Models that integrate tool outputs into their reasoning more deeply, not just at function-call boundaries.

Voice-aware models. Future models tuned specifically on conversation transcripts will likely be better at voice rhythm and turn-taking.

FAQ

Why does my agent sometimes ignore a clear rule in the prompt? LLMs are probabilistic. Clear rules raise the probability of compliance; they don't guarantee it. Tighten the wording, add an example.

Should I use chain-of-thought prompting in voice? Mostly no — chain-of-thought adds latency. For voice, structured prompts beat thinking-aloud prompts.

Why does the same model perform differently on different prompts? The prompt shapes the probability distribution. A small change can shift the distribution significantly.

Are LLMs deterministic at temperature 0? Approximately. Floating-point arithmetic and batch effects can cause minor variation even at temp 0.

Can the model "decide" to escalate? Yes — if your prompt defines escalation criteria and you provide a transfer_to_human function, the model will pick it under the right conditions.

How LLMs Decide What to Say Next in a Voice Conversation

TL;DR

The basic mechanism

What the LLM is "thinking about"

Function call vs text decision

How temperature affects voice replies

Why the same input can produce different replies

What the LLM doesn't see

Why voice reasoning works differently

Debugging weird replies

What's next

FAQ

More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support

Related reading

Designing Voice Agents That Ask Better Questions

Open-Source vs Closed-Source LLMs for Voice Agents

Why Context Windows Matter Less Than You Think for Voice

Voice AI, twice a month.