Designing System Prompts for Multi-Turn Voice Conversations
The system prompt is the single most-iterated artifact in any production voice agent. It's where most of the agent's personality, rules, and reliability live. Most teams underinvest here, treating the prompt as a "set it and forget it" string. The teams shipping good voice agents treat prompts as production code, with versioning, testing, and discipline.
TL;DR
- A good voice agent prompt has six sections: identity, goal, tools, voice style, hard rules, escalation.
- 800-1500 tokens total. Longer prompts add to time to first token (TTFT); shorter ones usually mean missing rules.
- Iterate via A/B test against an eval set, not by intuition.
- The biggest mistakes: vague rules, conflicting rules, missing voice style guide.
The six sections
Walking through them in order:
Identity (50-100 tokens)
Who the agent is. Be specific.
You are Maya, the receptionist at Cornerstone Dental Group, a
12-dentist practice in Boston with locations in Cambridge,
Newton, and downtown.
Include enough context for the agent to answer "where are you located?" without a function call.
Goal (1-3 sentences)
The single most important sentence in the prompt. What is this call for?
Your job is to book new-patient appointments, reschedule
existing ones, and answer simple questions about insurance
acceptance and office locations. For anything else, escalate
to a human.
Without this, the agent tries to do everything and does nothing well.
Tools (50-200 tokens, depending on count)
List every function the agent can call, with a clear description of each. The descriptions matter as much as the names: they're what the LLM uses to decide when to call each function.
Tools you can call:
get_available_slots(date_range)
  Returns a list of open appointment slots in the given date range.
  Call this when the caller wants to book or reschedule and you
  need to know what's available.
book_appointment(slot_time, caller_name, reason)
  Books the appointment. Returns a confirmation number. Call this
  only after confirming the slot, name, and reason with the caller.
lookup_caller_by_phone(phone_number)
  Returns the caller's existing patient record. Call this near the
  start of the call to recognize repeat callers.
transfer_to_human(reason)
  Transfers the call to a human receptionist. Call this if the
  caller is frustrated, asks for a manager, or asks about anything
  outside scheduling, insurance acceptance, and office locations.
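One way to keep this section in sync with the code is to generate it from structured definitions. The following is a minimal sketch, not a prescribed approach: the `Tool` dataclass and `render_tools_section` helper are hypothetical names invented for illustration, and only two of the tools from the example above are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical container for one function the agent can call."""
    name: str
    params: tuple[str, ...]
    description: str  # what it returns and when to call it


def render_tools_section(tools: list[Tool]) -> str:
    """Render the plain-text tools block that goes into the system prompt."""
    lines = ["Tools you can call:"]
    for tool in tools:
        lines.append(f"{tool.name}({', '.join(tool.params)})")
        lines.append(tool.description)
    return "\n".join(lines)


TOOLS = [
    Tool(
        "get_available_slots",
        ("date_range",),
        "Returns a list of open appointment slots in the given date range. "
        "Call this when the caller wants to book or reschedule.",
    ),
    Tool(
        "book_appointment",
        ("slot_time", "caller_name", "reason"),
        "Books the appointment. Returns a confirmation number. "
        "Call this only after confirming the slot, name, and reason.",
    ),
]

TOOLS_SECTION = render_tools_section(TOOLS)
```

Generating the text this way means a renamed parameter changes the prompt and the code together, rather than drifting apart.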
Voice style guide (200-400 tokens)
The rules that make the agent sound human. Six to ten rules:
Speaking style:
- Use short sentences. One main clause per sentence.
- Never use bullet points, numbered lists, or formatting.
- When confirming a date or phone number, say each part slowly
and pause briefly between digits/numbers.
- When you're about to do something that takes more than 1.5
seconds (function calls), say "let me check on that" or
"one moment" first.
- Never read more than three options aloud in a row. If you
have more, summarize ("I have several morning slots open.
Would you like me to list them?").
- Don't start every reply with "Sure" or "Of course". Vary
your acknowledgments naturally.
- If the caller corrects you, acknowledge briefly ("apologies,
let me update that") and continue.
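Some of these style rules can be spot-checked automatically against transcripts of what the agent actually said. The sketch below is illustrative only: `lint_voice_reply` and `stock_opener_share` are hypothetical helpers, and the regular expressions are deliberately simple heuristics rather than a complete check.

```python
import re

# A line that starts with "-", "*", "•" or "1." / "1)" looks like a written list.
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)
# Characters that usually indicate markdown-style formatting.
MARKUP = re.compile(r"[*_#`]")
STOCK_OPENERS = ("sure", "of course")


def lint_voice_reply(reply: str) -> list[str]:
    """Return the formatting problems found in a single agent reply."""
    problems = []
    if LIST_MARKER.search(reply):
        problems.append("list formatting")
    if MARKUP.search(reply):
        problems.append("markup characters")
    return problems


def stock_opener_share(replies: list[str]) -> float:
    """Fraction of replies that open with 'Sure' or 'Of course'."""
    if not replies:
        return 0.0
    hits = sum(1 for r in replies if r.strip().lower().startswith(STOCK_OPENERS))
    return hits / len(replies)
```

A high opener share across a call suggests the "vary your acknowledgments" rule is not landing; a single "Sure" is not a violation on its own.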
Hard rules (100-300 tokens)
Things to never do, things to always do.
Hard rules:
- Never quote prices. If asked, say "for pricing questions
I'll need to transfer you to our office staff."
- Never give medical advice. If asked, say "I'm not qualified
to advise on that. You should speak with the dentist."
- Never claim to be human. If asked directly, say "I'm an
AI assistant for Cornerstone Dental, happy to help with
scheduling."
- Always confirm appointment details by reading them back
before booking.
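Hard rules in the prompt can be backed by a check in the application layer, so a slip by the model does not reach the caller. This is a sketch under stated assumptions: `enforce_no_prices` is a hypothetical post-processing step, and the pattern only catches obvious numeric prices, not amounts spelled out in words.

```python
import re

# Matches "$120", "$ 95.50" and "120 dollars". Deliberately narrow: it will
# miss spelled-out amounts such as "one hundred twenty dollars".
PRICE_PATTERN = re.compile(r"\$\s?\d|\b\d+(?:\.\d+)?\s?dollars?\b", re.IGNORECASE)

# Scripted response taken from the "Never quote prices" rule above.
PRICE_FALLBACK = "For pricing questions I'll need to transfer you to our office staff."


def enforce_no_prices(reply: str) -> str:
    """Return the scripted fallback if the draft reply appears to quote a price."""
    return PRICE_FALLBACK if PRICE_PATTERN.search(reply) else reply
```

A guard like this is a backstop for the prompt rule, not a replacement for it.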
Escalation (50-150 tokens)
When and how to hand off.
Escalation:
Transfer to a human via transfer_to_human if:
- The caller is upset or asks for a manager.
- The caller asks about clinical details (symptoms, treatment
options).
- The caller asks about insurance disputes or billing.
- You can't understand the caller after two clarification
attempts.
When transferring, briefly tell the caller what is happening
("Thanks. I'm connecting you to one of our office staff
who can help with that. One moment.") then call the function.
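The "two clarification attempts" criterion is easier to rely on if the application counts attempts itself instead of leaving the count to the model. A minimal sketch, assuming a hypothetical `CallState` object kept by your orchestration layer and assuming that the attempts must be consecutive (the rule above does not say either way):

```python
from dataclasses import dataclass

MAX_CLARIFICATIONS = 2  # from the escalation rule above


@dataclass
class CallState:
    """Hypothetical per-call state tracked outside the LLM."""
    clarification_attempts: int = 0

    def record_turn(self, understood: bool) -> None:
        """Reset the counter on an understood turn, otherwise count a failed attempt."""
        if understood:
            self.clarification_attempts = 0
        else:
            self.clarification_attempts += 1

    def should_escalate(self) -> bool:
        """True once the caller has not been understood twice in a row."""
        return self.clarification_attempts >= MAX_CLARIFICATIONS
```

When `should_escalate()` returns true, the orchestration layer can prompt the agent to call `transfer_to_human` or trigger the transfer directly.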
Total length
Sum: roughly 600-1400 tokens for a typical agent. Add 100-300 if you have many functions or a complex use case.
If your prompt is over 2000 tokens, you're probably bloating it. Read through and cut.
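A length check can run automatically whenever the prompt is assembled. The sketch below assumes the six sections are stored as separate strings and uses a crude estimate of about four characters per token; that ratio is an assumption, real tokenizers vary, and `build_prompt`, `estimate_tokens`, and `is_over_budget` are hypothetical helper names.

```python
SECTION_ORDER = ["identity", "goal", "tools", "voice_style", "hard_rules", "escalation"]
MAX_PROMPT_TOKENS = 2000  # the "probably bloated" threshold mentioned above


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def build_prompt(sections: dict[str, str]) -> str:
    """Join the six sections in order, failing loudly if one is missing or empty."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    return "\n\n".join(sections[name].strip() for name in SECTION_ORDER)


def is_over_budget(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> bool:
    """True if the estimated token count exceeds the limit."""
    return estimate_tokens(prompt) > limit
```

For a real budget, swap `estimate_tokens` for the tokenizer that matches your model.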
Versioning
Treat prompts as code:
- Store them in your repo (or your platform's version control).
- Version every change.
- Document the reason for each change in the commit message.
- A/B test changes before shipping.
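Alongside version control, it can help to attach a short content hash to each prompt so that a bad call can be traced to the exact text that produced it. A sketch, assuming a hypothetical `PromptVersion` record; the field names are illustrative, and this complements your repository history rather than replacing it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical record describing one revision of the system prompt."""
    version: str  # a label you choose, for example a release tag
    text: str     # the full prompt text
    reason: str   # why the change was made, mirroring the commit message

    @property
    def digest(self) -> str:
        """First 12 hex characters of the SHA-256 hash of the prompt text."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]
```

Logging `digest` with every call makes it possible to tell which prompt revision was live, even when two revisions share a label by mistake.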
Iteration patterns
A few practical patterns for iterating:
The "what went wrong" log. When you observe a bad call, write down what the agent should have done. Use that as the basis for a new rule.
The "tighten or loosen" check. Each rule should have a clear effect. If you can't articulate the effect, the rule probably doesn't help.
The "delete first" instinct. Before adding a new rule, see if you can rephrase an existing one.
The "examples earn their tokens" rule. Each example in the prompt costs tokens on every turn. Only include examples that demonstrate something the description couldn't.
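To compare two prompt variants against an eval set rather than by intuition, a small harness is enough to start. This is a sketch under assumptions: the `Agent` signature (system prompt and caller utterance in, reply text out) is hypothetical, and each eval case is simply an utterance paired with a pass/fail check on the reply.

```python
from typing import Callable

# An eval case: a caller utterance plus a predicate over the agent's reply.
EvalCase = tuple[str, Callable[[str], bool]]
# Hypothetical agent signature: (system_prompt, caller_utterance) -> reply text.
Agent = Callable[[str, str], str]


def pass_rate(agent: Agent, prompt: str, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check passes under the given prompt."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for utterance, check in cases if check(agent(prompt, utterance)))
    return passed / len(cases)


def compare_prompts(
    agent: Agent, prompt_a: str, prompt_b: str, cases: list[EvalCase]
) -> dict[str, float]:
    """Pass rate for each variant on the same eval set."""
    return {
        "A": pass_rate(agent, prompt_a, cases),
        "B": pass_rate(agent, prompt_b, cases),
    }
```

Because LLM output varies from run to run, a real comparison needs many cases and repeated runs before a difference between A and B means much.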
Common mistakes
Patterns I see repeatedly:
Vague rules. "Be polite." That doesn't help, because the agent doesn't know what polite means in your context. Better: "Use the caller's name once they share it. Acknowledge their issue before redirecting."
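Conflicting rules. "Be empathetic" + "Don't make commitments." In some scenarios these conflict. Resolve the conflict explicitly, for example: "Acknowledge the caller's frustration, but never promise an outcome you can't deliver yourself."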
Untested assumptions. "The agent will figure it out from context." Maybe. Test.
Forgetting voice style. Most prompt failures come from missing voice-specific style guidance. The agent reads aloud "bullet point one" because nothing told it not to.
Missing escalation criteria. If the agent doesn't know when to escalate, it'll either escalate too much or too little. Both bad.
For deeper iteration practices, see how to A/B test voice agent prompts.
Related reading
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- How LLMs Decide What to Say Next in a Voice Conversation
- Why Context Windows Matter Less Than You Think for Voice
FAQ
How long should my prompt be? 800-1500 tokens for most production agents. Aim for the lower end.
Should the prompt include examples? 1-3 well-chosen examples for each major behavior. Don't include 50.
Can I use markdown formatting in the prompt? The model can read it. Whether you should depends on the model; some follow plain prose better.
Should I write the prompt in second person ("you") or third person? Second person is standard. More direct.
How often should I update my prompt? Whenever you observe a recurring pattern that the prompt should handle. Most production agents see 2-10 changes per month after launch.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
