How TTS Models Handle Numbers, Dates, and Acronyms

Tyler Weitzman
March 13, 2026 · 5 min read
Speechify

Numbers, dates, and acronyms are the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" dollars or just numbers? Is "Dr." "Doctor" or "Drive"? Production voice agents handle these correctly thousands of times a day, but it takes specific engineering, both in the TTS model and in how you preprocess text.

TL;DR

  • Numbers, dates, and acronyms require text normalization before TTS.
  • Modern TTS handles most cases automatically but fails on edge cases.
  • Use SSML tags for explicit control.
  • Domain-specific pronunciation dictionaries matter.
  • Test with your actual content, not generic samples.

The problem space

TTS input can have:

  • Integers: "42"
  • Decimals: "3.14159"
  • Currency: "$47.50", "€30"
  • Dates: "03/12/2026", "March 12, 2026"
  • Times: "14:30", "2:30 PM"
  • Phone numbers: "555-1234", "+1-555-123-4567"
  • Percentages: "25%"
  • Acronyms (read as a word): "NASA"
  • Initialisms (read as letters): "FBI", "API", "HTTP"
  • Abbreviations: "Dr.", "Mr.", "Inc.", "St."

Each requires different pronunciation.

Text normalization

Before TTS, normalize:

  • "03/12/2026" → "March 12, 2026" (assuming US month-first order)
  • "$47.50" → "forty-seven dollars and fifty cents"
  • "25%" → "twenty-five percent"
  • "Dr. Smith" → "Doctor Smith"

Some TTS engines do this automatically; others require pre-processing.
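The rewrites listed above can be sketched with a few regular expressions. This is a minimal illustration, not a production normalizer: the function names (`number_to_words`, `normalize`) are hypothetical, amounts are limited to 0..999, and the "Dr." rule assumes a title followed by a capitalized name, so it would misread a street suffix such as "Oak Dr. Suite 5".

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in 0..999 (enough for this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _currency(match: re.Match) -> str:
    dollars, cents = int(match.group(1)), int(match.group(2) or 0)
    spoken = number_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")
    if cents:
        spoken += " and " + number_to_words(cents) + (" cent" if cents == 1 else " cents")
    return spoken


def _percent(match: re.Match) -> str:
    return number_to_words(int(match.group(1))) + " percent"


def normalize(text: str) -> str:
    # "$47.50" or "$47"; amounts with more than three digits are left alone.
    text = re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?(?!\d)", _currency, text)
    # Whole-number percentages only; "3.5%" is left for the TTS engine.
    text = re.sub(r"(?<![\d.])(\d{1,3})%", _percent, text)
    # Treat "Dr." as a title only when a capitalized word follows.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    return text


print(normalize("Dr. Smith charges $47.50, a 25% discount."))
# Doctor Smith charges forty-seven dollars and fifty cents, a twenty-five percent discount.
```

Anything the patterns do not recognize passes through unchanged, so the engine's own normalization still gets a chance at it.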

How modern TTS handles it

High-end TTS (Simba, Cartesia, OpenAI) handles most cases:

  • Decimals: "3.14" → "three point one four."
  • Currency: "$47.50" → "forty-seven dollars and fifty cents."
  • Phone numbers: "555-1234" → often "five five five, one two three four."
  • Dates: mostly correct.
  • Acronyms: correctly distinguished (usually).

But edge cases fail:

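  • Ambiguous dates: "3/12" could be March 12, December 3, or "3 of 12."
  • Roman numerals: "Louis XIV" should be read "Louis the Fourteenth," not "X I V."
  • Industry-specific: "100 mg/dL" should be read as a unit ("milligrams per deciliter"), which general-purpose engines may not do.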
  • Phone-number formats that vary.

SSML: explicit control

SSML (Speech Synthesis Markup Language) lets you specify pronunciation:

<speak>
  Call me at <say-as interpret-as="telephone">5551234</say-as> 
  on <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
</speak>

Common SSML tags:

  • <say-as interpret-as="telephone">
  • <say-as interpret-as="date">
  • <say-as interpret-as="currency">
  • <say-as interpret-as="characters"> (spell out letters)
  • <say-as interpret-as="ordinal">
  • <phoneme alphabet="ipa" ph="..."> (explicit phonemes)

Support varies by TTS vendor.
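When SSML is generated from user data, the text has to be XML-escaped or a stray "&" will break the document. The following is a small sketch of a builder for the tags listed above; the helper names (`plain`, `say_as`, `speak`) are hypothetical, and which `interpret-as` and `format` values are accepted still depends on the vendor.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def plain(text: str) -> str:
    """Escape literal text so characters like & and < are valid inside SSML."""
    return escape(text)


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element with safely quoted attributes."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def speak(*fragments: str) -> str:
    """Join already-escaped fragments into one <speak> document."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(
    plain("Call AT&T at "),
    say_as("5551234", "telephone"),
    plain(" on "),
    say_as("03/12/2026", "date", "mdy"),
    plain("."),
)
print(ssml)
```

Keeping escaping inside the helpers means callers never concatenate raw strings into markup.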

Acronyms vs initialisms

  • Initialism: read as letters. "FBI" → "F-B-I."
  • Acronym: read as word. "NASA" → "Nassa."

Most modern TTS engines ship a built-in dictionary, but it will not cover every term. Add custom entries for:

  • Your company name (especially if acronym).
  • Product names.
  • Industry-specific terms.
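If the engine has no custom dictionary feature, a text-level lexicon applied before synthesis is one workaround. The sketch below assumes two hypothetical tables, `SPOKEN_FORMS` for terms read as words and `SPELL_OUT` for terms read letter by letter; the entries are examples only.

```python
import re

# Terms pronounced as a word, mapped to a respelling the engine reads well.
SPOKEN_FORMS = {"NASA": "Nassa"}
# Terms that should be read letter by letter.
SPELL_OUT = {"FBI", "API", "HTTP"}


def apply_lexicon(text: str) -> str:
    """Rewrite known all-caps terms; unknown ones pass through unchanged."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        if token in SPOKEN_FORMS:
            return SPOKEN_FORMS[token]
        if token in SPELL_OUT:
            return " ".join(token)  # "FBI" -> "F B I"
        return token

    return re.sub(r"\b[A-Z]{2,}\b", replace, text)


print(apply_lexicon("The FBI called NASA about the API."))
# The F B I called Nassa about the A P I.
```

Leaving unknown tokens untouched is deliberate: the engine's built-in dictionary still handles anything the lexicon does not list.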

Phone numbers

Common formats:

  • "555-1234"
  • "(555) 123-4567"
  • "+1 555 123 4567"
  • "1-800-555-1234"

TTS should pause between groups. Test your formats.

Best practice:

  • Always pass in a consistent format.
  • Use SSML <say-as interpret-as="telephone"> for reliability.
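Where SSML is not available, the formats above can be reduced to one spoken form in plain text. This sketch assumes North American numbers (7, 10, or 11 digits with a leading 1) and reads "0" as "zero"; the function name `phone_to_spoken` is hypothetical.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def phone_to_spoken(raw: str) -> str:
    """Strip punctuation, split into groups, and read each digit separately."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        groups = [digits[0], digits[1:4], digits[4:7], digits[7:]]
    elif len(digits) == 10:
        groups = [digits[0:3], digits[3:6], digits[6:]]
    elif len(digits) == 7:
        groups = [digits[0:3], digits[3:]]
    else:
        raise ValueError(f"unsupported phone number: {raw!r}")
    # Commas between groups give the engine a place to pause.
    return ", ".join(" ".join(DIGIT_WORDS[int(d)] for d in group)
                     for group in groups)


print(phone_to_spoken("555-1234"))
# five five five, one two three four
print(phone_to_spoken("+1 555 123 4567"))
# one, five five five, one two three, four five six seven
```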

Dates

Cultural variation:

  • US: MM/DD/YYYY.
  • Most of world: DD/MM/YYYY.

Ambiguous: "04/05/2026" = April 5 (US) or May 4 (EU).

Convert to unambiguous form before TTS:

  • "April 5, 2026" (explicit).
  • Or use SSML with date format specifier.
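The conversion to an explicit form only works if the caller states which convention the input uses. A minimal sketch, assuming slash-separated numeric dates and English month names; the function name `date_to_spoken` and the `mdy`/`dmy` keys are illustrative choices.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# The caller must say which convention applies; the string alone cannot tell.
FORMATS = {"mdy": "%m/%d/%Y", "dmy": "%d/%m/%Y"}


def date_to_spoken(raw: str, order: str) -> str:
    """Parse a numeric date under an explicit field order and spell the month."""
    parsed = datetime.strptime(raw, FORMATS[order])
    return f"{MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"


print(date_to_spoken("04/05/2026", "mdy"))  # April 5, 2026
print(date_to_spoken("04/05/2026", "dmy"))  # May 4, 2026
```

The month list is hard-coded so the output does not depend on the process locale.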

Currency

"$47.50" could be "forty-seven fifty" or "forty-seven dollars and fifty cents."

Modern TTS usually handles this correctly. For non-USD:

  • "£100" → "one hundred pounds."
  • "€30" → "thirty euros."
  • "¥1000" → "one thousand yen."

Test your currency formats.
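If an engine stumbles on a particular symbol, one low-risk fix is to move the unit after the number and let the engine read the digits itself. The sketch below assumes whole amounts only and a small, illustrative symbol table (`CURRENCY_UNITS` is a hypothetical name); decimals and thousands separators are left untouched.

```python
import re

# (singular, plural) spoken unit per symbol; an illustrative subset.
CURRENCY_UNITS = {
    "$": ("dollar", "dollars"),
    "£": ("pound", "pounds"),
    "€": ("euro", "euros"),
    "¥": ("yen", "yen"),
}

# A symbol followed by digits, not continuing into ".50" or ",000".
_WHOLE_AMOUNT = re.compile(r"([$£€¥])(\d+)(?!\d|[.,]\d)")


def expand_currency_symbols(text: str) -> str:
    """Rewrite "£100" as "100 pounds" so the unit is spoken after the number."""
    def replace(match: re.Match) -> str:
        singular, plural = CURRENCY_UNITS[match.group(1)]
        amount = match.group(2)
        return f"{amount} {singular if int(amount) == 1 else plural}"

    return _WHOLE_AMOUNT.sub(replace, text)


print(expand_currency_symbols("Pay £100, €30, or ¥1000."))
# Pay 100 pounds, 30 euros, or 1000 yen.
```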

Times

  • "14:30" → "fourteen thirty" or "two-thirty PM."
  • "9:00 AM" → "nine AM."

TTS usually handles these. For natural speech, convert 24-hour times to 12-hour AM/PM form before synthesis.
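The 24-hour to AM/PM conversion is a few lines of arithmetic. A minimal sketch, assuming "HH:MM" input; the function name `to_12_hour` is hypothetical, and the result is still numeric text that the engine reads out.

```python
import re


def to_12_hour(raw: str) -> str:
    """Convert "14:30" to "2:30 PM"; on-the-hour times drop the minutes."""
    match = re.fullmatch(r"(\d{1,2}):(\d{2})", raw)
    if match is None:
        raise ValueError(f"expected HH:MM, got {raw!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        raise ValueError(f"out-of-range time: {raw!r}")
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12  # 0 and 12 both display as 12
    if minute == 0:
        return f"{hour12} {suffix}"
    return f"{hour12}:{minute:02d} {suffix}"


print(to_12_hour("14:30"))  # 2:30 PM
print(to_12_hour("09:00"))  # 9 AM
print(to_12_hour("00:15"))  # 12:15 AM
```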

Percentages

"25%" → "twenty-five percent." Usually correct.

Decimals in percentages: "3.5%" → "three point five percent." Usually correct.

Fractions

  • "1/2" → "one half" or "one slash two."
  • "3/4 cup" → "three quarters of a cup."

Ambiguous; convert to words for safety.
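One cautious way to convert fractions is a small whitelist, so that only single-digit fractions you expect are rewritten and anything resembling a date is skipped. In this sketch the `FRACTIONS` table and the function name are hypothetical, and a bare "3/4" meaning March 4 would still be converted, so it only suits content where that reading cannot occur.

```python
import re

# Only fractions listed here are rewritten; extend for your content.
FRACTIONS = {
    "1/2": "one half",
    "1/3": "one third",
    "2/3": "two thirds",
    "1/4": "one quarter",
    "3/4": "three quarters",
}

# A single-digit fraction that is not part of a longer number or date.
_FRACTION = re.compile(r"(?<![\d/])(\d/\d)(?![\d/])")


def fractions_to_words(text: str) -> str:
    """Replace whitelisted fractions with words; leave everything else alone."""
    return _FRACTION.sub(lambda m: FRACTIONS.get(m.group(1), m.group(1)), text)


print(fractions_to_words("Add 1/2 of the mix."))   # Add one half of the mix.
print(fractions_to_words("Due on 3/4/2026."))      # Due on 3/4/2026.
```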

Scientific notation

  • "1.5e10" → "one point five times ten to the tenth."

Rare in conversational context. If needed, preprocess.

Domain vocabulary

Industry-specific pronunciation:

  • Medical: drug names, procedures.
  • Legal: case names, Latin terms.
  • Financial: ticker symbols, company names.

Most TTS engines allow custom pronunciation dictionaries. Add your domain terms.

Testing

Build a test set of edge cases:

  • Various date formats.
  • Currency amounts.
  • Phone numbers.
  • Acronyms.
  • Domain terms.

Run TTS on each; listen; fix with SSML or normalization as needed.
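The text-normalization half of that test set can be automated, leaving listening for the audio itself. A sketch of a table-driven check follows; `EDGE_CASES`, `find_regressions`, and the deliberately incomplete `toy_normalize` are all hypothetical stand-ins for your own cases and normalizer.

```python
from typing import Callable, List, Tuple

# (raw input, expected normalized text); grow this from real transcripts.
EDGE_CASES: List[Tuple[str, str]] = [
    ("25%", "twenty-five percent"),
    ("Dr. Smith", "Doctor Smith"),
    ("$47.50", "forty-seven dollars and fifty cents"),
]


def find_regressions(
    normalize: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (raw, expected, actual) for every case the normalizer gets wrong."""
    failures = []
    for raw, expected in cases:
        actual = normalize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


def toy_normalize(text: str) -> str:
    # Stand-in normalizer that does not handle currency yet.
    return text.replace("Dr.", "Doctor").replace("25%", "twenty-five percent")


for raw, expected, actual in find_regressions(toy_normalize, EDGE_CASES):
    print(f"FAIL {raw!r}: expected {expected!r}, got {actual!r}")
```

Running this in CI catches normalization regressions before anyone has to listen to audio.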

Preprocessing pipeline

Raw text:
"Your appointment is 3/12/2026 at 2:30 PM. Cost: $47.50."

Normalized:
"Your appointment is March 12, 2026 at 2:30 PM. Cost: forty-seven dollars and fifty cents."

Or with SSML:
<speak>
  Your appointment is <say-as interpret-as="date">2026-03-12</say-as>
  at <say-as interpret-as="time">14:30</say-as>.
  Cost: <say-as interpret-as="currency">$47.50</say-as>.
</speak>

Pipeline step: normalize → synthesize.

Phoneme tuning

For stubborn pronunciations:

<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>

Explicit phoneme specification. Works for unusual names and terms.

See phoneme-level tuning for voice agents.

Caching

Common phrases can be pre-synthesized:

  • Welcome greetings.
  • Closing phrases.
  • Menu options.

Skip live synthesis for these and serve the cached or pre-recorded audio instead.
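A phrase cache can be as simple as a dictionary keyed by voice and text. In the sketch below, `PhraseCache` is a hypothetical class and `synthesize` stands in for whatever TTS call you use; it is assumed to take text and a voice name and return audio bytes. A real deployment would likely persist the audio rather than hold it in memory.

```python
import hashlib
from typing import Callable, Dict


class PhraseCache:
    """Synthesize each (voice, text) pair once and reuse the audio afterwards."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # The voice is part of the key so the same phrase in another voice misses.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        if key not in self._audio:
            self._audio[key] = self._synthesize(text, voice)
        return self._audio[key]


def fake_synthesize(text: str, voice: str) -> bytes:
    # Placeholder for a real TTS request.
    return f"{voice}:{text}".encode("utf-8")


cache = PhraseCache(fake_synthesize)
greeting = cache.get("Thanks for calling.", "voice-a")  # synthesized
greeting_again = cache.get("Thanks for calling.", "voice-a")  # served from cache
```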

Common pitfalls

Assuming TTS handles everything. Ambiguous inputs produce ambiguous output. Normalize.

Wrong locale. US date format in EU locale. Confusing.

No SSML for tricky inputs. Silently wrong pronunciations.

Untested edge cases. "Your SSN ending in 1234": does TTS read "one two three four" or "one thousand two hundred thirty-four"?

Domain blind spots. Medical terms mispronounced. Legal names wrong.

FAQ

Does normalization add latency? Minimal. Simple regex-based normalization typically takes microseconds.

Can we use an LLM to normalize? Some teams do. It adds latency for a marginal quality improvement.

What about audio pronunciation of codes (order numbers, confirmation codes)? Use SSML interpret-as="characters" for letter-by-letter.

How does TTS handle names? Usually OK for common names. Uncommon names need phoneme or dictionary support.

What about multilingual numbers? "One hundred" vs "cien": normalize into the same language as the TTS voice.
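For the confirmation-code question above, a plain-text fallback is useful when the engine does not honor interpret-as="characters". This sketch spells a code one character at a time; the function name `spell_code` is hypothetical, and separators such as hyphens are simply dropped.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_code(code: str) -> str:
    """Read a code character by character, with commas as pause points."""
    spoken = []
    for ch in code:
        if ch in DIGIT_WORDS:
            spoken.append(DIGIT_WORDS[ch])
        elif ch.isalpha():
            spoken.append(ch.upper())
        # Hyphens, spaces, and other separators are skipped.
    return ", ".join(spoken)


print(spell_code("AB-1234"))  # A, B, one, two, three, four
```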

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
