The Hidden Complexity of Numbers in Voice Agents
Numbers are the most underestimated source of pain in voice AI. Phone numbers, account numbers, dates, prices, addresses — all of them have edge cases that turn a clean conversation into a back-and-forth of "no, one nine seven, not nineteen seven." The fix isn't a better LLM; it's understanding why numbers are hard and engineering around it.
TL;DR
- STT systems mis-transcribe numbers more often than words because there's no language-model context to disambiguate.
- TTS systems mis-pronounce numbers because the right way to say "1976" depends on whether it's a year, a price, an address, or a phone number.
- The fix on input: custom STT vocabularies plus DTMF for high-precision capture.
- The fix on output: explicit pronunciation hints in the prompt.
Why numbers are hard for STT
Spoken language relies on context to resolve ambiguity. Take "their" vs "there" vs "they're": humans figure it out from surrounding meaning.
Numbers don't have that. "Two zero one five" could be:
- The year 2015
- An address (2015 Main St)
- A part of a phone number
- An account ID
STT systems guess based on training data biases and surrounding words. They're often wrong on the first turn, requiring a confirm-back.
Common failure patterns:
- "Eighteen" vs "eighty"
- "Fifteen" vs "fifty"
- Phone numbers with rhythm breaks ("five five five — pause — one two three four")
- Long sequences ("my account number is one nine seven six four three two zero")
- Numbers with letters ("apartment 4B")
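One way to act on the "-teen" vs "-ty" pattern is to flag those words in the transcript and ask a targeted follow-up. The sketch below is illustrative only: the names `CONFUSABLE`, `confusable_words`, and `clarifying_question` are made up for this example, and it assumes an English transcript arrives as plain text.

```python
import re

# "-teen" words and the "-ty" words they are commonly confused with.
_TEEN_TO_TY = {
    "thirteen": "thirty", "fourteen": "forty", "fifteen": "fifty",
    "sixteen": "sixty", "seventeen": "seventy", "eighteen": "eighty",
    "nineteen": "ninety",
}
# Make the lookup symmetric so "eighty" also maps back to "eighteen".
CONFUSABLE = {**_TEEN_TO_TY, **{ty: teen for teen, ty in _TEEN_TO_TY.items()}}


def confusable_words(transcript: str) -> list[str]:
    """Return the teen/ty words in a transcript, in order of appearance."""
    words = re.findall(r"[a-z]+", transcript.lower())
    return [w for w in words if w in CONFUSABLE]


def clarifying_question(word: str) -> str:
    """Build a follow-up that names both candidates explicitly."""
    return f"Did you say {word} or {CONFUSABLE[word]}?"


if __name__ == "__main__":
    for w in confusable_words("I'd like to transfer eighty dollars"):
        print(clarifying_question(w))
```

Asking about the specific word tends to be shorter than re-collecting the whole number, though whether it is worth the extra turn depends on how costly an error is in your flow.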
Why numbers are hard for TTS
Same ambiguity, in reverse. The text "1976" could be pronounced:
- "Nineteen seventy-six" (year)
- "One thousand nine hundred seventy-six" (count)
- "One nine seven six" (digits, like an account number)
- "Nineteen-seventy-six" (street address)
Most TTS systems guess based on surrounding text. Sometimes wrong. The classic example: an address being read as a year, or a phone number being pronounced as a single number.
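To make the ambiguity concrete, here is a minimal sketch that renders the same text three ways depending on a `kind` label supplied by the caller. The function names and the `kind` values are assumptions for illustration, the year logic is deliberately simplified, and only values from 0 to 9999 are handled.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n: int) -> str:
    """Spell out 0..9999 as a count."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n: int) -> str:
    """Spell out a four-digit year (simplified English convention)."""
    high, low = divmod(n, 100)
    if high % 10 == 0:  # 1000s and 2000s: read as a count
        return cardinal(n)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def render(text: str, kind: str) -> str:
    """Render a numeric string for speech according to its meaning."""
    if kind == "digits":
        return " ".join(ONES[int(ch)] for ch in text)
    if kind == "year":
        return as_year(int(text))
    if kind == "count":
        return cardinal(int(text))
    raise ValueError(f"unknown kind: {kind}")


if __name__ == "__main__":
    for kind in ("year", "count", "digits"):
        print(kind, "->", render("1976", kind))
```

The point of the `kind` argument is that your application usually knows what a number means even when the TTS engine does not, so the decision is better made before the text leaves your code.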
What to do on input
Mitigations in order of impact:
Custom STT vocabulary. Bias your STT toward expected number formats. If you know account numbers are always 10 digits, tell the model. Cuts errors significantly.
Confirm-back. For high-stakes numbers (account numbers, dates, dollar amounts), always read back. "So that's account number one nine seven six four, correct?"
DTMF capture. For credit cards, account numbers, PINs — let the caller punch them in. STT will never beat keypad input on these.
Spelled-out alphabet. For names: "Can you spell that for me?" Then walk through letter by letter. Not for numbers, but the same pattern applies.
Multi-pass transcription. Some platforms can return both the streaming partial and a higher-quality batch transcription. Use the batch for critical numbers.
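Vocabulary biasing is configured differently on every STT provider, so it is not shown here. What can be sketched generically is the application-side check that pairs with it: normalize the transcript to digits and reject anything that does not match the expected length, which is the cue to trigger a confirm-back or fall back to DTMF. The helper names and the 10-digit default are assumptions taken from the example above, not a real API.

```python
import re

# Spoken forms mapped to digits. "oh" is included because callers
# often say it for zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def spoken_to_digits(transcript: str) -> str:
    """Extract digits from a transcript, ignoring non-number words."""
    out = []
    for token in re.findall(r"[a-z]+|\d", transcript.lower()):
        if token.isdigit():
            out.append(token)
        elif token in DIGIT_WORDS:
            out.append(DIGIT_WORDS[token])
    return "".join(out)


def capture_account_number(transcript: str, expected_len: int = 10):
    """Return the digit string, or None if the length is wrong."""
    digits = spoken_to_digits(transcript)
    return digits if len(digits) == expected_len else None


if __name__ == "__main__":
    heard = "my account number is one nine seven six four three two zero"
    print(spoken_to_digits(heard))
    print(capture_account_number(heard))  # wrong length for a 10-digit ID
```

This sketch treats every "one" or "oh" in the utterance as a digit, so it only makes sense on turns where the agent has just asked for a number.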
What to do on output
For TTS, the right approach depends on your platform:
SSML hints. Most TTS engines support Speech Synthesis Markup Language. Use <say-as interpret-as="characters">1976</say-as> to force digit-by-digit reading.
Prompt-level instructions. Tell the LLM what format to use. "When confirming a phone number, say each digit individually with brief pauses."
Pre-formatting. Convert numbers to a phonetic format in your code before sending to TTS. "1976" → "one nine seven six" if you want it read as digits.
Test on your specific numbers. TTS quality on numbers varies wildly by provider. Test with your actual data before locking in a vendor.
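The SSML hint mentioned above can be generated in code rather than written by hand. This is a minimal sketch: `say_as` and `speak` are hypothetical helper names, and which `interpret-as` values an engine honors varies by provider, so check yours before relying on `characters`.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str = "characters") -> str:
    """Wrap text in an SSML say-as element, escaping XML metacharacters."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def speak(*parts: str) -> str:
    """Join already-escaped SSML fragments under a single speak root."""
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(speak(escape("Your account number is "), say_as("1976")))
```

Escaping matters because the spoken text may contain characters such as `&` or `<` that would otherwise produce invalid markup.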
Specific number gotchas
A few patterns worth special attention:
Phone numbers. Standard format: pause every 3–4 digits. Bad: read as one long sequence. The fix: insert SSML pauses or pre-format with hyphens.
Currency. "$1,234.56" should be "one thousand two hundred thirty-four dollars and fifty-six cents." Most TTS handles this OK, but international currency formats (€, £, ¥) sometimes don't.
Dates. "01/02/2026" is January 2 in the US, February 1 in most of the world. Resolve the format from the caller's locale first, then speak the month by name ("January second, two thousand twenty-six" for the US reading) so the result is unambiguous.
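As a sketch of the date handling, the snippet below parses a slash-separated date under an explicit region and emits text with the month spelled out. The `region` labels and the `FORMATS` table are assumptions for this example; how the engine then reads "2" (as "two" or "second") still depends on the TTS provider.

```python
from datetime import datetime

# Explicit month names avoid locale-dependent formatting.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical region labels mapped to strptime patterns.
FORMATS = {"US": "%m/%d/%Y", "INTL": "%d/%m/%Y"}


def speakable_date(text: str, region: str) -> str:
    """Parse a numeric date for a known region and name the month."""
    parsed = datetime.strptime(text, FORMATS[region])
    return f"{MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"


if __name__ == "__main__":
    print(speakable_date("01/02/2026", "US"))
    print(speakable_date("01/02/2026", "INTL"))
```

The same input string produces two different spoken dates, which is exactly why the region has to be decided before the text reaches the TTS engine.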
Times. "3:30 PM" should be "three thirty PM" not "three colon three zero PM." Most TTS handles this; verify on your provider.
Decimals. "3.14" should be "three point one four" — and most TTS gets this right. But "3.0" is sometimes read as "three" instead of "three point zero," which loses information.
Negative numbers. "-5" should be "negative five." Some TTS reads it as "minus five" or "five." Test.
Percentages. "50%" → "fifty percent." Usually fine.
Sports scores, historical years. Context-dependent and often wrong. Test for your domain.
The credit card pattern
A specific case worth its own callout. Capturing credit cards over voice has three issues: STT accuracy, regulatory compliance, and customer trust.
The right pattern:
- Trigger DTMF capture: the agent prompts the caller to enter the card number on the keypad.
- The DTMF stream goes to a payment processor that masks the digits before they reach your logs.
- The agent never "sees" the card number directly.
- Confirm the last 4 digits back to the caller.
Voice STT for credit cards is technically possible but operationally a bad idea — high error rate plus PCI exposure.
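The flow above can be sketched in a few functions. This is a simplified illustration with hypothetical names: in a real deployment the collection and masking steps would run inside the payment processor's environment, and only the masked value would ever reach the agent or its logs. The card number used is a widely published test number.

```python
def collect_dtmf(keys, terminator: str = "#", max_digits: int = 19) -> str:
    """Accumulate keypad digits until the terminator or the length cap."""
    digits = []
    for key in keys:
        if key == terminator:
            break
        if key.isdigit():
            digits.append(key)
        if len(digits) == max_digits:
            break
    return "".join(digits)


def mask_pan(pan: str) -> str:
    """Replace all but the last four digits with asterisks."""
    return "*" * max(len(pan) - 4, 0) + pan[-4:]


def confirm_last4(masked: str) -> str:
    """Build the agent's read-back from the masked value only."""
    return "I have a card ending in " + " ".join(masked[-4:]) + ". Is that correct?"


if __name__ == "__main__":
    entered = collect_dtmf("4111111111111111#")  # processor side
    masked = mask_pan(entered)                   # what the agent receives
    print(masked)
    print(confirm_last4(masked))
```

Note that `confirm_last4` takes the masked string as input, which keeps the agent-side code from depending on the full number at all.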
How to test
A useful exercise for any new agent: record yourself reading 50 representative numbers (account IDs, phone numbers, dates, amounts) into the agent. Note which ones the agent mis-transcribed. Add those patterns to your custom vocabulary or confirm-back rules.
Do the same on output: have the agent read 50 representative numbers back. Note the ones it mispronounced. Add SSML hints or prompt instructions.
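A small scoring script makes this exercise repeatable. The sketch assumes you have collected pairs of (what was said, what the agent transcribed) by hand; `normalize` and `score` are hypothetical helpers, and the normalization rule (drop punctuation and case) is one reasonable choice among several.

```python
def normalize(value: str) -> str:
    """Compare on letters and digits only, ignoring case and punctuation."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def score(cases):
    """Return (error_rate, failures) for (expected, transcribed) pairs."""
    failures = [(exp, got) for exp, got in cases
                if normalize(exp) != normalize(got)]
    rate = len(failures) / len(cases) if cases else 0.0
    return rate, failures


if __name__ == "__main__":
    cases = [
        ("555-123-4567", "5551234567"),
        ("18", "80"),
        ("2015", "2015"),
        ("4B", "4b"),
    ]
    rate, failures = score(cases)
    print(f"error rate: {rate:.0%}")
    for expected, got in failures:
        print(f"expected {expected!r}, got {got!r}")
```

Rerunning the same set after each vocabulary or prompt change shows whether a fix actually helped or just moved the errors elsewhere.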
You'll cut your number-related errors by 50% in an afternoon.
Related reading
- The Anatomy of a Voice Agent Pipeline
- How a Conversational Voice Agent Actually Works (Under the Hood)
- How Voice Agents Handle Accents and Dialects
- How Voice Agents Recover from Misunderstandings
- Why Voice Agents Sound More Human Every Year
FAQ
Why doesn't the LLM fix the STT errors? The LLM only sees the transcript after STT runs. If the caller's "1976" was transcribed as "nineteen seventy-six," the text looks perfectly plausible, so the LLM has no signal that it should question it.
Can I just use DTMF for everything? For high-value captures, yes. For natural conversation, no — making customers punch in everything kills the experience.
What about international phone numbers? Custom vocabulary that knows your region's number format helps. Multi-region deployments often need per-region biasing.
Is GPT-4o's audio mode better at numbers? Marginally — end-to-end audio models have somewhat better number handling because they don't lose information at the STT step. The gain is real but modest.
How important is this for chat agents? Mostly not — text input doesn't have STT errors. Numbers in TTS still matter if the chat agent has voice mode.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
