Prompt Engineering for Voice (vs Text) Agents
If you've written prompts for chatbots, you have a head start on voice agents — but only halfway. The fundamentals of clear instructions and tool definitions carry over. The style guide, the latency considerations, and the failure-mode handling are very different. This is the delta.
TL;DR
- Voice prompts are shorter and explicitly forbid formatting.
- Voice prompts include pacing instructions (sentence length, pauses, when to bridge with chitchat).
- Voice prompts have to handle interruptions, restarts, and audio quality issues that don't exist in chat.
- The same model gets different prompts for voice vs chat; don't reuse without adjustment.
What stays the same
Both voice and chat agents need:
- Clear identity and role
- Defined goals
- Tool/function definitions
- Hard rules (what not to do)
- Escalation criteria
These transfer between channels with minimal change.
What's different about voice prompts
Six categories of difference:
1. Forbid visual formatting
Chat agents can use bullets, headers, code blocks. Voice agents can't. Add explicit rules:
Never use bullet points, numbered lists, or markdown formatting.
Speak in conversational sentences only. If you need to convey
multiple items, use natural conversational structure ("first",
"then", "also").
Without this, the agent will sometimes read aloud "bullet point 1, bullet point 2" — embarrassing.
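As a safety net alongside the prompt rule, agent responses can be scanned for formatting before they reach text-to-speech, or while replaying calls. A minimal sketch in Python; the name `find_formatting` and the pattern list are illustrative assumptions, not part of any particular voice stack:

```python
import re

# Hypothetical patterns for markup that should never reach text-to-speech.
FORMATTING_PATTERNS = {
    "bullet": re.compile(r"^\s*[-*•]\s+", re.MULTILINE),
    "numbered_list": re.compile(r"^\s*\d+[.)]\s+", re.MULTILINE),
    "header": re.compile(r"^\s*#{1,6}\s+", re.MULTILINE),
    "bold": re.compile(r"(\*\*|__)[^*_\n]+(\*\*|__)"),
    "code": re.compile(r"`[^`\n]+`|```"),
}


def find_formatting(response: str) -> list[str]:
    """Return the names of the formatting patterns found in a response."""
    return [name for name, pattern in FORMATTING_PATTERNS.items()
            if pattern.search(response)]
```

A non-empty result on a replayed call suggests the formatting rule is not holding and the prompt wording needs another pass.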
2. Constrain sentence length
Long sentences land badly in voice. The listener loses track.
Use short sentences. One main clause per sentence ideally.
If you need to convey complex info, break it into 2-3 short
sentences with brief pauses, not one long sentence with
multiple clauses.
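To check whether this rule is holding, replayed responses can be screened for overlong sentences. A rough sketch; the 18-word threshold and the naive sentence splitter are assumptions to tune against real calls:

```python
import re

MAX_WORDS = 18  # assumed threshold; tune by listening to real calls


def long_sentences(response: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return the sentences in a response that exceed the word limit."""
    # Naive split after ., ? or ! followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", response.strip())
    return [s for s in sentences if len(s.split()) > max_words]
```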
3. Specify pacing for slow operations
Voice has a real-time clock that chat doesn't. When the agent's about to do something slow, it should bridge:
When you call a function that may take more than 1.5 seconds,
first say something brief to the caller ("let me check on that"
or "one moment, looking that up"). This keeps the conversation
alive while the function runs.
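The prompt rule relies on the model knowing in advance which calls are slow. An orchestration-side fallback, which is a different technique from the prompt rule, is to speak a bridge line only when a call actually runs past the threshold. A sketch using `asyncio`; `call_with_bridge` and the `speak` callback are hypothetical names standing in for whatever your pipeline exposes:

```python
import asyncio
from typing import Any, Awaitable, Callable

BRIDGE_AFTER_SECONDS = 1.5  # threshold taken from the prompt rule above


async def call_with_bridge(
    tool_call: Awaitable[Any],
    speak: Callable[[str], None],
    bridge_after: float = BRIDGE_AFTER_SECONDS,
    bridge_line: str = "One moment, looking that up.",
) -> Any:
    """Await a tool call; speak a bridge line if it runs past the threshold."""
    task = asyncio.ensure_future(tool_call)
    # asyncio.wait does not cancel the task when the timeout expires.
    done, _pending = await asyncio.wait({task}, timeout=bridge_after)
    if not done:
        speak(bridge_line)  # hypothetical hook into the TTS pipeline
    return await task
```

Fast calls return silently, so the caller only hears the filler when there is a real gap to cover.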
4. Number and date pronunciation
Tell the agent how to say numbers and dates aloud:
When confirming a phone number, account number, or PIN to the
caller, say each digit individually with brief pauses, like
"that's one, nine, seven, six". When confirming a date, say it
in natural form ("Tuesday the fifteenth at three PM"), not as
a slash-formatted date.
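One way to reinforce this rule is to hand the model values that are already in spoken form, for example inside tool results. A sketch under stated assumptions: the helper names are hypothetical, and `speak_appointment` only handles on-the-hour slots in English:

```python
from datetime import datetime

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
HOURS = ["twelve", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten", "eleven"]
ORDINALS = [
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth", "twenty-first", "twenty-second",
    "twenty-third", "twenty-fourth", "twenty-fifth", "twenty-sixth",
    "twenty-seventh", "twenty-eighth", "twenty-ninth", "thirtieth",
    "thirty-first",
]


def speak_digits(number: str) -> str:
    """Spell out each ASCII digit, separated by commas to suggest pauses."""
    return ", ".join(DIGITS[int(ch)] for ch in number if ch in "0123456789")


def speak_appointment(when: datetime) -> str:
    """Render an on-the-hour appointment time in natural spoken form."""
    if when.minute != 0:
        raise ValueError("this sketch only handles on-the-hour slots")
    period = "AM" if when.hour < 12 else "PM"
    return (f"{WEEKDAYS[when.weekday()]} the {ORDINALS[when.day - 1]} "
            f"at {HOURS[when.hour % 12]} {period}")
```

For example, `speak_digits("1976")` gives "one, nine, seven, six", and `speak_appointment(datetime(2025, 7, 15, 15, 0))` gives "Tuesday the fifteenth at three PM".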
5. Recovery and repair patterns
Voice has more misunderstandings than chat. Pre-write the recovery moves:
If the caller corrects you (says "no" or "actually I meant..."),
acknowledge briefly ("apologies — let me update that") and
update your understanding. Don't argue with the correction.
If the caller says something you don't understand, ask one
clarifying question. If you still can't understand on the
second try, escalate to a human.
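The "one clarifying question, then escalate" rule can also be enforced outside the prompt with a small counter. A sketch; `RepairTracker` is a hypothetical name, and how a turn is judged as understood (speech-recognition confidence, an intent classifier, or similar) is left as an assumption:

```python
from dataclasses import dataclass

MAX_CLARIFICATIONS = 1  # one clarifying question, then escalate


@dataclass
class RepairTracker:
    """Counts consecutive turns the agent failed to understand."""
    misses: int = 0

    def on_turn(self, understood: bool) -> str:
        """Return the next move: 'continue', 'clarify', or 'escalate'."""
        if understood:
            self.misses = 0  # a successful turn resets the count
            return "continue"
        self.misses += 1
        return "clarify" if self.misses <= MAX_CLARIFICATIONS else "escalate"
```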
6. Handling silence
Chat doesn't have "the user went silent." Voice does. Tell the agent how to handle it:
If the caller hasn't spoken in 5+ seconds, ask if they're still
there ("are you still with me?"). After 15 seconds of no
response, end the call gracefully ("looks like we got
disconnected — feel free to call back").
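In many stacks the silence timer lives in the orchestration layer rather than in the model, so the same thresholds can be mirrored in code. A sketch that assumes the 15 seconds are counted from the check-in question (the rule above could also be read as 15 seconds in total); `silence_action` is a hypothetical name:

```python
from typing import Optional

CHECK_IN_AFTER = 5.0   # seconds of silence before asking "are you still with me?"
END_CALL_AFTER = 15.0  # seconds without a reply after the check-in (assumed reading)


def silence_action(silent_for: float,
                   since_check_in: Optional[float] = None) -> str:
    """Return 'wait', 'check_in', or 'end_call' for the current silence."""
    if since_check_in is not None:
        # The check-in was already spoken; wait for a reply, then end the call.
        return "end_call" if since_check_in >= END_CALL_AFTER else "wait"
    return "check_in" if silent_for >= CHECK_IN_AFTER else "wait"
```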
What gets shorter
Voice prompts are usually 30–50% shorter than equivalent chat prompts. Reasons:
- With no visual formatting to describe, the wording can be tighter.
- Every prompt token adds measurable time-to-first-token (TTFT) latency.
- Voice prompts are more focused (one bounded use case vs general chat).
A typical voice agent system prompt: 800–1500 tokens. A typical chat agent: 2000–4000.
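A quick budget check can flag prompts that drift outside this range. The sketch below uses a crude estimate of about four characters per token, which is an assumed rule of thumb for English text; use your model's actual tokenizer for real numbers:

```python
VOICE_BUDGET = (800, 1500)  # token range quoted above for voice system prompts


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return len(text) // 4


def budget_status(prompt: str, budget: tuple[int, int] = VOICE_BUDGET) -> str:
    """Return 'under', 'within', or 'over' relative to the token budget."""
    low, high = budget
    tokens = estimate_tokens(prompt)
    if tokens < low:
        return "under"
    if tokens > high:
        return "over"
    return "within"
```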
What gets longer
A few sections that grow specifically for voice:
Voice style guide. Explicit rules about pacing, sentence length, and forbidden formatting. 200–400 tokens.
Function-call latency hints. Telling the agent which functions are slow and to bridge. 100–200 tokens.
Recovery patterns. Pre-written correction handling. 100–300 tokens.
A complete sample structure
For a voice agent:
[Identity — 2 sentences]
You are Maya, the receptionist at Cornerstone Dental Group.
[Goal — 1 sentence]
Your job is to book new-patient appointments and reschedule
existing ones.
[Voice style — 6-10 rules]
- Speak in short sentences. One main clause each.
- No bullets, no formatting, no headers.
- Confirm appointment times in natural spoken form; read phone numbers back digit by digit.
- When calling a function that may be slow, say "one moment"
first.
- (etc.)
[Tools — 3-5 functions, each with name + description]
get_available_slots(date_range)
book_appointment(slot_time, caller_id, reason)
lookup_caller_by_phone(phone)
transfer_to_human(reason)
[Hard rules — 4-8 things never to do]
- Never quote prices.
- Never agree to a refund.
- Never give medical advice.
- (etc.)
[Recovery patterns — 2-4 examples]
If the caller corrects you, acknowledge briefly and update.
If you can't understand after 2 tries, escalate.
[Escalation — when and how]
Transfer to a human if the caller is upset, asks for a manager,
or asks about anything outside scheduling.
[Greeting — the first line]
"Hi, this is Maya from Cornerstone Dental. How can I help?"
Total: ~1000 tokens. Adjust for your use case.
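Keeping each section as a separate string makes it easier to edit one part without touching the rest. A sketch of assembling the structure above; `build_voice_prompt` and the bracketed section labels are illustrative choices, not a required format:

```python
SECTION_ORDER = ["Identity", "Goal", "Voice style", "Tools",
                 "Hard rules", "Recovery patterns", "Escalation", "Greeting"]


def build_voice_prompt(sections: dict[str, str]) -> str:
    """Join the sections in a fixed order, failing loudly if one is missing."""
    missing = [name for name in SECTION_ORDER
               if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    return "\n\n".join(f"[{name}]\n{sections[name].strip()}"
                       for name in SECTION_ORDER)
```

Failing on a missing section is a deliberate choice here: a voice prompt that silently ships without its hard rules or escalation criteria is harder to notice than a build error.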
Iteration discipline
The system prompt is the most-iterated artifact in any voice agent. Some tactics:
Version it. Track which prompt was live for which calls.
A/B test. Run two versions in parallel; compare on your eval set. See how to A/B test voice agent prompts.
Don't tune blindly. When you change the prompt, replay 20 historical calls through both versions and compare.
Keep a changelog. "Added rule about reading account numbers digit-by-digit because of recurring complaints."
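The versioning, A/B, and changelog tactics above can share one small piece of plumbing. A sketch with hypothetical helper names: a content hash serves as the version ID, and a hash of the call ID gives each call a stable variant:

```python
import hashlib


def prompt_version(prompt_text: str) -> str:
    """Derive a short, stable version ID from the prompt's content."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def assign_variant(call_id: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a call ID to one prompt variant."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    return variants[digest[0] % len(variants)]


def changelog_entry(prompt_text: str, reason: str) -> dict[str, str]:
    """Pair a prompt version with the reason for the change."""
    return {"version": prompt_version(prompt_text), "reason": reason}
```

Logging the version ID with every call is what makes it possible to say later which prompt was live for which calls.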
Related reading
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- How LLMs Decide What to Say Next in a Voice Conversation
- Why Context Windows Matter Less Than You Think for Voice
FAQ
Should I write the same prompt for voice and chat? No. Start with the chat version, but rewrite the style guide and add the voice-specific recovery patterns.
How long should a voice prompt be? 800–1500 tokens for most production agents. Longer is wasted; shorter usually means you're missing rules.
Should the prompt include examples? A few well-chosen examples (1–3) help. Don't include 50 — that's what fine-tuning is for.
How do I know if a rule is working? Replay calls through the prompt and check whether the rule's outcome shows up. If you can't tell, the rule is too vague.
Can the LLM change tone mid-conversation? Yes — your prompt can include conditional rules ("if the caller seems frustrated, slow down and acknowledge their frustration before continuing").

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
