🎙️ Voice AI Fundamentals

The Hidden Complexity of Numbers in Voice Agents

Tyler Weitzman
January 13, 2026 · 6 min read
Speechify

Numbers are the most underestimated source of pain in voice AI. Phone numbers, account numbers, dates, prices, addresses — all of them have edge cases that turn a clean conversation into a back-and-forth of "no, one nine seven, not nineteen seven." The fix isn't a better LLM; it's understanding why numbers are hard and engineering around it.

TL;DR

  • STT systems mis-transcribe numbers more often than words because the surrounding context does little to disambiguate them.
  • TTS systems mis-pronounce numbers because the right way to say "1976" depends on whether it's a year, a price, an address, or a phone number.
  • The fix on input: custom STT vocabularies plus DTMF for high-precision capture.
  • The fix on output: explicit pronunciation hints in the prompt.

Why numbers are hard for STT

Spoken language uses context to disambiguate ambiguous words. "Their" vs "there" vs "they're" — humans figure it out from surrounding meaning.

Numbers don't have that. "Two zero one five" could be:

  • The year 2015
  • An address (2015 Main St)
  • A part of a phone number
  • An account ID

STT systems guess based on training data biases and surrounding words. They're often wrong on the first turn, requiring a confirm-back.

Common failure patterns:

  • "Eighteen" vs "eighty"
  • "Fifteen" vs "fifty"
  • Phone numbers with rhythm breaks ("five five five — pause — one two three four")
  • Long sequences ("my account number is one nine seven six four three two zero")
  • Numbers with letters ("apartment 4B")

Why numbers are hard for TTS

Same ambiguity, in reverse. The text "1976" could be pronounced:

  • "Nineteen seventy-six" (year)
  • "One thousand nine hundred seventy-six" (count)
  • "One nine seven six" (digits, like an account number)
  • "Nineteen-seventy-six" (street address)

Most TTS systems guess based on surrounding text. Sometimes wrong. The classic example: an address being read as a year, or a phone number being pronounced as a single number.

What to do on input

Mitigations in order of impact:

Custom STT vocabulary. Bias your STT toward expected number formats. If you know account numbers are always 10 digits, tell the model. Cuts errors significantly.
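
Vocabulary biasing itself is configured through vendor-specific settings that are not shown here. The same knowledge of the expected format can also be enforced after transcription. The following is a minimal sketch, assuming the caller reads the number digit by digit and that account numbers are always 10 digits; the names `spoken_digits_to_string` and `is_valid_account_number` are hypothetical.

```python
from typing import Optional

# Spoken forms of single digits; "oh" is a common spoken form of zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def spoken_digits_to_string(transcript: str) -> Optional[str]:
    """Convert a digit-by-digit transcript to a digit string.

    Returns None if any token is not a single spoken digit or a numeral.
    """
    digits = []
    for token in transcript.lower().replace("-", " ").split():
        if token.isdigit():
            digits.append(token)
        elif token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        else:
            return None
    return "".join(digits)

def is_valid_account_number(transcript: str, expected_length: int = 10) -> bool:
    """Accept the transcript only if it yields exactly the expected digit count."""
    digits = spoken_digits_to_string(transcript)
    return digits is not None and len(digits) == expected_length
```

A transcript that fails this check is a natural trigger for a re-prompt or a confirm-back rather than a silent guess.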

Confirm-back. For high-stakes numbers (account numbers, dates, dollar amounts), always read back. "So that's account number one nine seven six four, correct?"
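
A confirm-back can be generated from the captured value rather than left to the LLM's phrasing. This is a rough sketch; the field names in `HIGH_STAKES` and both function names are assumptions made for illustration.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Hypothetical field names for values that should always be read back.
HIGH_STAKES = {"account_number", "date", "dollar_amount"}

def needs_confirmation(field: str) -> bool:
    """Decide whether a captured field must be read back to the caller."""
    return field in HIGH_STAKES

def confirm_back_prompt(label: str, digits: str) -> str:
    """Build a read-back sentence that spells the value digit by digit."""
    spoken = " ".join(DIGIT_NAMES[int(d)] for d in digits)
    return f"So that's {label} {spoken}, correct?"
```

For example, `confirm_back_prompt("account number", "19764")` produces the read-back sentence quoted above.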

DTMF capture. For credit cards, account numbers, PINs — let the caller punch them in. STT will never beat keypad input on these.
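
How keypresses arrive depends on the telephony platform. As a sketch under that assumption, the collector below takes one key at a time from some hypothetical event callback and stops at the expected length or at a terminator key. The class name `DtmfCollector` and the "#" terminator are illustrative choices, not any platform's API.

```python
class DtmfCollector:
    """Accumulate keypad digits until the expected length or a terminator."""

    def __init__(self, expected_length: int, terminator: str = "#") -> None:
        self.expected_length = expected_length
        self.terminator = terminator
        self._digits = []
        self.done = False

    def press(self, key: str) -> None:
        """Handle one keypress; keys other than 0-9 and the terminator are ignored."""
        if self.done:
            return
        if key == self.terminator:
            self.done = True
        elif len(key) == 1 and key in "0123456789":
            self._digits.append(key)
            if len(self._digits) == self.expected_length:
                self.done = True

    @property
    def value(self) -> str:
        return "".join(self._digits)

    def is_complete(self) -> bool:
        """True only when capture ended with exactly the expected digit count."""
        return self.done and len(self._digits) == self.expected_length
```

If the caller ends early with the terminator, `is_complete()` stays false and the agent can re-prompt instead of accepting a short number.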

Spelled-out alphabet. For names: "Can you spell that for me?" Then walk through letter by letter. Not for numbers, but the same pattern applies.

Multi-pass transcription. Some platforms can return both the streaming partial and a higher-quality batch transcription. Use the batch for critical numbers.
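
One way to use the two passes is to take the digits from the batch transcript and treat any disagreement with the streaming partial as a reason to confirm. The sketch below assumes both transcripts are available as plain strings with numbers written as numerals; `pick_number` is a hypothetical helper.

```python
import re
from typing import Tuple

def extract_digits(text: str) -> str:
    """Keep only the digit characters of a transcript."""
    return re.sub(r"\D", "", text)

def pick_number(streaming_partial: str, batch_final: str) -> Tuple[str, bool]:
    """Return the batch digits and whether the streaming pass agreed with them."""
    partial = extract_digits(streaming_partial)
    final = extract_digits(batch_final)
    return final, partial == final
```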

What to do on output

For TTS, the right approach depends on your platform:

SSML hints. Most TTS engines support Speech Synthesis Markup Language. Use <say-as interpret-as="characters">1976</say-as> to force digit-by-digit reading.
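
A small helper keeps the markup well-formed when values are interpolated. This is a sketch only: the `interpret-as` values an engine accepts vary by provider, so check your provider's documentation before relying on one. The helper names `say_as` and `speak` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

def say_as(text: str, interpret_as: str = "characters") -> str:
    """Wrap text in an SSML say-as element, escaping XML special characters."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"

def speak(*parts: str) -> str:
    """Join already-escaped SSML fragments inside a speak root element."""
    return "<speak>" + "".join(parts) + "</speak>"

ssml = speak(escape("Your code is "), say_as("1976"))
```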

Prompt-level instructions. Tell the LLM what format to use. "When confirming a phone number, say each digit individually with brief pauses."

Pre-formatting. Convert numbers to a phonetic format in your code before sending to TTS. "1976" → "one nine seven six" if you want it read as digits.
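
Pre-formatting only works if the code knows what the number represents. The sketch below renders a numeric string of up to four digits as digits, as a year, or as a count, depending on a `kind` argument supplied by the caller. The function names and the three kinds are assumptions for illustration, and the year rule assumes a four-digit year and covers only common English patterns.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n: int) -> str:
    """Spell out 0..9999 as a count."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def as_year(n: int) -> str:
    """Spell out a four-digit year (1000..9999)."""
    high, low = divmod(n, 100)
    if high % 10 == 0:          # 2026 -> "two thousand twenty-six"
        return cardinal(n)
    if low == 0:                # 1900 -> "nineteen hundred"
        return two_digits(high) + " hundred"
    if low < 10:                # 1905 -> "nineteen oh five"
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

def as_digits(text: str) -> str:
    """Spell out each digit character separately."""
    return " ".join(ONES[int(ch)] for ch in text)

def preformat(text: str, kind: str) -> str:
    """Render a numeric string for TTS according to what it represents."""
    if kind == "digits":
        return as_digits(text)
    if kind == "year":
        return as_year(int(text))
    if kind == "count":
        return cardinal(int(text))
    raise ValueError(f"unknown kind: {kind}")
```

Passing `kind` explicitly moves the guess out of the TTS engine and into code that knows whether the field is an account number, a year, or a quantity.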

Test on your specific numbers. TTS quality on numbers varies wildly by provider. Test with your actual data before locking in a vendor.

Specific number gotchas

A few patterns worth special attention:

Phone numbers. Standard format: pause every 3–4 digits. Bad: read as one long sequence. The fix: insert SSML pauses or pre-format with hyphens.
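
The pause insertion can be sketched as follows, assuming a 10-digit number grouped 3-3-4 and an engine that honors the SSML `break` and `say-as` elements. The function name, the default grouping, and the 300 ms pause are illustrative assumptions.

```python
from typing import Sequence

def phone_to_ssml(number: str, groups: Sequence[int] = (3, 3, 4),
                  pause_ms: int = 300) -> str:
    """Split a phone number into digit groups separated by SSML pauses."""
    digits = "".join(ch for ch in number if ch in "0123456789")
    if len(digits) != sum(groups):
        raise ValueError(f"expected {sum(groups)} digits, got {len(digits)}")
    chunks, start = [], 0
    for size in groups:
        chunks.append(digits[start:start + size])
        start += size
    pause = f'<break time="{pause_ms}ms"/>'
    # Each group is read digit by digit; the break supplies the rhythm.
    return pause.join(
        f'<say-as interpret-as="characters">{chunk}</say-as>' for chunk in chunks
    )
```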

Currency. "$1,234.56" should be "one thousand two hundred thirty-four dollars and fifty-six cents." Most TTS handles this OK, but international currency formats (€, £, ¥) sometimes don't.

Dates. "01/02/2026" is January 2 in the US and February 1 in most of the world. Resolve which one is meant in code, then speak the month by name, for example "January second, two thousand twenty-six," so the caller hears an unambiguous date.
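
A sketch of resolving the ambiguity before the text reaches TTS, assuming the application knows which regional convention the source data uses. The region keys "US" and "INTL" and the function name are hypothetical.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Assumed conventions: month-first for US data, day-first elsewhere.
FORMATS = {"US": "%m/%d/%Y", "INTL": "%d/%m/%Y"}

def speakable_date(text: str, region: str) -> str:
    """Parse a numeric date under a known convention and name the month."""
    parsed = datetime.strptime(text, FORMATS[region])
    return f"{MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"
```

The same input yields "January 2, 2026" under the US convention and "February 1, 2026" under the day-first one, which is why the convention has to come from the data source rather than from the TTS engine.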

Times. "3:30 PM" should be "three thirty PM" not "three colon three zero PM." Most TTS handles this; verify on your provider.

Decimals. "3.14" should be "three point one four" — and most TTS gets this right. But "3.0" is sometimes read as "three" instead of "three point zero," which loses information.

Negative numbers. "-5" should be "negative five." Some TTS reads it as "minus five" or "five." Test.

Percentages. "50%" → "fifty percent." Usually fine.

Sports scores, historical years. Context-dependent and often wrong. Test for your domain.

The credit card pattern

A specific case worth its own callout. Capturing credit cards over voice has three issues: STT accuracy, regulatory compliance, and customer trust.

The right pattern:

  1. Trigger DTMF capture: the agent prompts the caller to enter the card number on the keypad.
  2. The DTMF stream goes to a payment processor that masks the digits before they reach your logs.
  3. The agent never "sees" the card number directly.
  4. Confirm the last 4 digits back to the caller.

Voice STT for credit cards is technically possible but operationally a bad idea — high error rate plus PCI exposure.
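
The masking and read-back steps can be sketched as below. In the pattern above the masking happens on the payment-processor side, so the agent would only ever receive the masked form. The function names and the prompt wording are assumptions.

```python
def mask_card(card_number: str) -> str:
    """Replace all but the last four digits with asterisks."""
    digits = "".join(ch for ch in card_number if ch in "0123456789")
    if len(digits) < 4:
        raise ValueError("card number must contain at least four digits")
    return "*" * (len(digits) - 4) + digits[-4:]

def last_four_prompt(masked: str) -> str:
    """Build the confirmation sentence from an already-masked card number."""
    last4 = " ".join(masked[-4:])
    return f"I have a card ending in {last4}, is that right?"
```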

How to test

A useful exercise for any new agent: record yourself reading 50 representative numbers (account IDs, phone numbers, dates, amounts) into the agent. Note which ones the agent mis-transcribed. Add those patterns to your custom vocabulary or confirm-back rules.

Do the same on output: have the agent read 50 representative numbers back. Note the ones it mispronounced. Add SSML hints or prompt instructions.
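
The bookkeeping for either exercise can be as simple as comparing what was expected against what was observed. A minimal sketch, assuming both lists are already normalized to the same string format; `error_report` is a hypothetical helper.

```python
from typing import List, Tuple

def error_report(expected: List[str],
                 observed: List[str]) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the error rate and the (expected, observed) pairs that differ."""
    if len(expected) != len(observed):
        raise ValueError("lists must be the same length")
    misses = [(e, o) for e, o in zip(expected, observed) if e != o]
    rate = len(misses) / len(expected) if expected else 0.0
    return rate, misses

rate, misses = error_report(["18", "50", "2015", "4B"],
                            ["80", "50", "2015", "4B"])
```

The list of misses is the useful part: each pair points at a pattern to add to the custom vocabulary, the confirm-back rules, or the SSML hints.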

You'll cut your number-related errors by 50% in an afternoon.

FAQ

Why doesn't the LLM fix the STT errors? The LLM only sees the transcript after STT runs. If the caller said "eighteen" and STT wrote "eighty," the transcript still reads as a plausible number, so the LLM has no reason to question it.

Can I just use DTMF for everything? For high-value captures, yes. For natural conversation, no — making customers punch in everything kills the experience.

What about international phone numbers? Custom vocabulary that knows your region's number format helps. Multi-region deployments often need per-region biasing.

Is GPT-4o's audio mode better at numbers? Marginally — end-to-end audio models have somewhat better number handling because they don't lose information at the STT step. The gain is real but modest.

How important is this for chat agents? Mostly not — text input doesn't have STT errors. Numbers in TTS still matter if the chat agent has voice mode.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
