Numbers, dates, and acronyms are among the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" read as dollars and cents or as bare numbers? Is "Dr." "Doctor" or "Drive"? Production voice agents handle these correctly thousands of times a day, but it takes specific engineering, both in the TTS model and in how you preprocess text.

TL;DR

Numbers, dates, and acronyms require text normalization before TTS.
Modern TTS handles most cases automatically but fails on edge cases.
Use SSML tags for explicit control.
Domain-specific pronunciation dictionaries matter.
Test with your actual content, not generic samples.

The problem space

TTS input can have:

Integers: "42"
Decimals: "3.14159"
Currency: "$47.50", "€30"
Dates: "03/12/2026", "March 12, 2026"
Times: "14:30", "2:30 PM"
Phone numbers: "555-1234", "+1-555-123-4567"
Percentages: "25%"
Acronyms (read as a word): "NASA"
Initialisms (read as letters): "FBI", "API", "HTTP"
Abbreviations: "Dr.", "Mr.", "Inc.", "St."

Each requires different pronunciation.

Text normalization

Before TTS, normalize:

"03/12/2026" → "March 12, 2026" (assuming US month-first order)
"$47.50" → "forty-seven dollars and fifty cents"
"25%" → "twenty-five percent"
"Dr. Smith" → "Doctor Smith"

Some TTS engines do this automatically; others require pre-processing.
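
As a minimal sketch of what a pre-processing rule can look like, the snippet below expands a few abbreviations using the word that follows them as context. The function name `expand_abbreviations`, the `SIMPLE` table, and the "capitalized word follows" heuristic are all assumptions made for illustration; a production normalizer would need more context than this.

```python
import re

# Hypothetical table of abbreviations that have a single spoken form.
SIMPLE = {"Mr.": "Mister", "Inc.": "Incorporated"}

def expand_abbreviations(text: str) -> str:
    # "Dr." directly before a capitalized word is treated as a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # Any remaining "Dr." (for example after a street name) becomes "Drive".
    text = re.sub(r"\bDr\.", "Drive", text)
    # Same heuristic for "St.": "Saint" before a name, otherwise "Street".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    text = re.sub(r"\bSt\.", "Street", text)
    for abbreviation, spoken in SIMPLE.items():
        text = text.replace(abbreviation, spoken)
    return text

print(expand_abbreviations("Dr. Smith lives at 12 Oak Dr."))
```

The heuristic is deliberately simple and will misfire when a street abbreviation is followed by a new sentence ("Oak Dr. The office..."), which is exactly the kind of case worth putting in a test set.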

How modern TTS handles it

High-end TTS (Simba, Cartesia, OpenAI) handles most cases:

Decimals: "3.14" → "three point one four."
Currency: "$47.50" → "forty-seven dollars and fifty cents."
Phone numbers: "555-1234" → often "five five five, one two three four."
Dates: mostly correct.
Acronyms: correctly distinguished (usually).

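But edge cases fail:

Ambiguous dates: "3/12" could be March 12, December 3, or "3 of 12."
Roman numerals: "Louis XIV" should be "Louis the Fourteenth," but may come out as "Louis fourteen" or "X I V."
Industry-specific units: "100 mg/dL" should be "one hundred milligrams per deciliter," which the engine may not know.
Phone numbers: formats vary, so digit grouping and pauses come out inconsistently.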

SSML: explicit control

SSML (Speech Synthesis Markup Language) lets you specify pronunciation:

<speak>
  Call me at <say-as interpret-as="telephone">5551234</say-as> 
  on <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
</speak>

Common SSML tags:

<say-as interpret-as="telephone">
<say-as interpret-as="date">
<say-as interpret-as="currency">
<say-as interpret-as="characters"> (spell out letters)
<say-as interpret-as="ordinal">
<phoneme alphabet="ipa" ph="..."> (explicit phonemes)

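The interpret-as values are not fixed by the core SSML specification, so support varies by TTS vendor. Check your engine's documentation.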
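
When SSML is assembled in code, the text inside each tag needs XML escaping, or an input like "AT&T" will produce invalid markup. The sketch below uses only the Python standard library; the helper names `say_as` and `speak` are hypothetical, and whether a given `interpret-as` value is honored still depends on your vendor.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element, escaping content and attributes."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"

def speak(*parts: str) -> str:
    """Join already-escaped fragments into a <speak> document."""
    return "<speak>" + "".join(parts) + "</speak>"

ssml = speak(
    escape("Call me at "),
    say_as("5551234", "telephone"),
    escape(" on "),
    say_as("03/12/2026", "date", "mdy"),
    ".",
)
print(ssml)
```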

Acronyms vs initialisms

Initialism: read as letters. "FBI" → "F-B-I."
Acronym: read as word. "NASA" → "Nassa."

Most modern TTS engines ship with a built-in dictionary, but it won't cover everything. Add custom entries for:

Your company name (especially if acronym).
Product names.
Industry-specific terms.
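
If your engine has no custom dictionary feature, a text-level lexicon is one possible fallback. This is a sketch under assumptions: the `LEXICON` entries and the `apply_lexicon` name are made up for illustration, and spacing out letters or respelling a word is a workaround rather than a substitute for a vendor pronunciation dictionary.

```python
import re

# Hypothetical lexicon: "spell" entries are read letter by letter,
# "word" entries are replaced with a respelling the engine reads as one word.
LEXICON = {
    "FBI": ("spell", None),
    "API": ("spell", None),
    "NASA": ("word", "Nassa"),
}

def apply_lexicon(text: str) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(0)
        entry = LEXICON.get(token)
        if entry is None:
            return token  # unknown token: leave it to the engine's own dictionary
        kind, respelling = entry
        if kind == "spell":
            return " ".join(token)  # "FBI" -> "F B I"
        return respelling

    # Only all-caps tokens of two or more letters are candidates.
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)

print(apply_lexicon("The FBI called NASA about HTTP."))
```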

Phone numbers

Common formats:

"555-1234"
"(555) 123-4567"
"+1 555 123 4567"
"1-800-555-1234"

TTS should pause between groups. Test your formats.

Best practice:

Always pass in a consistent format.
Use SSML <say-as interpret-as="telephone"> for reliability.
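
One way to get a consistent format is to strip everything but digits and regroup them yourself. The sketch below assumes North American numbers only (7, 10, or 11 digits with a leading 1); the `speak_phone` name and the comma-as-pause convention are assumptions, and other numbering plans would need their own grouping rules.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def speak_phone(raw: str) -> str:
    """Turn a North American phone number into digit words with pauses."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, "+"
    if len(digits) == 7:
        groups = [digits[:3], digits[3:]]
    elif len(digits) == 10:
        groups = [digits[:3], digits[3:6], digits[6:]]
    elif len(digits) == 11 and digits[0] == "1":
        groups = [digits[0], digits[1:4], digits[4:7], digits[7:]]
    else:
        raise ValueError(f"unsupported phone number: {raw!r}")
    # Commas between groups give most engines a short pause.
    return ", ".join(" ".join(DIGIT_WORDS[int(d)] for d in group) for group in groups)

print(speak_phone("(555) 123-4567"))
```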

Dates

Cultural variation:

US: MM/DD/YYYY.
Most of world: DD/MM/YYYY.

Ambiguous: "04/05/2026" = April 5 (US) or May 4 (EU).

Convert to an unambiguous form before TTS:

"April 5, 2026" (explicit).
Or use SSML with an explicit date format attribute, such as format="mdy".
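
A sketch of the conversion, assuming you know which convention the source system uses. The `FORMATS` table, the `region` parameter, and the `speak_date` name are hypothetical; month names are hard-coded so the output does not depend on the process locale.

```python
from datetime import datetime

# The caller must state the convention; it cannot be inferred from the string.
FORMATS = {"US": "%m/%d/%Y", "EU": "%d/%m/%Y"}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def speak_date(raw: str, region: str) -> str:
    """Parse a numeric date under an explicit convention and spell the month."""
    parsed = datetime.strptime(raw, FORMATS[region])  # raises ValueError if invalid
    return f"{MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"

print(speak_date("04/05/2026", "US"))
print(speak_date("04/05/2026", "EU"))
```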

Currency

"$47.50" could be "forty-seven fifty" or "forty-seven dollars and fifty cents."

Modern TTS usually handles this correctly. For non-USD:

"£100" → "one hundred pounds."
"€30" → "thirty euros."
"¥1000" → "one thousand yen."

Test your currency formats.
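
If you prefer not to rely on the engine, currency can be spelled out before synthesis. The following is a minimal sketch: `number_to_words` only covers 0 to 999,999, the `CURRENCIES` table is an illustrative assumption limited to the four symbols above, and any fractional part on a yen amount is ignored.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    if n < 1_000_000:
        thousands, rest = divmod(n, 1000)
        return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")
    raise ValueError("number too large for this sketch")

# symbol -> (major singular, major plural, minor singular, minor plural)
CURRENCIES = {
    "$": ("dollar", "dollars", "cent", "cents"),
    "£": ("pound", "pounds", "penny", "pence"),
    "€": ("euro", "euros", "cent", "cents"),
    "¥": ("yen", "yen", None, None),
}

def speak_currency(raw: str) -> str:
    match = re.fullmatch(r"([$£€¥])(\d+)(?:\.(\d{2}))?", raw)
    if match is None:
        raise ValueError(f"unsupported currency amount: {raw!r}")
    symbol, major, minor = match.groups()
    one, many, minor_one, minor_many = CURRENCIES[symbol]
    major_n = int(major)
    spoken = f"{number_to_words(major_n)} {one if major_n == 1 else many}"
    if minor and int(minor) and minor_one:
        minor_n = int(minor)
        spoken += f" and {number_to_words(minor_n)} {minor_one if minor_n == 1 else minor_many}"
    return spoken

print(speak_currency("$47.50"))
```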

Times

"14:30" → "fourteen thirty" or "two thirty PM."
"9:00 AM" → "nine AM."

TTS usually handles these. For natural speech, convert 24-hour times to 12-hour AM/PM before synthesis.
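
The 24-hour to AM/PM conversion is a few lines with the standard library. This sketch assumes input in "HH:MM" form; the `speak_time` name is hypothetical, and the output is still text for the TTS engine to read, not fully spelled-out words.

```python
from datetime import datetime

def speak_time(raw: str) -> str:
    """Convert a 24-hour "HH:MM" string to a 12-hour AM/PM string."""
    parsed = datetime.strptime(raw, "%H:%M")
    hour12 = parsed.hour % 12 or 12  # 0 and 12 both map to 12
    suffix = "AM" if parsed.hour < 12 else "PM"
    if parsed.minute == 0:
        return f"{hour12} {suffix}"  # "9 AM" reads more naturally than "9:00 AM"
    return f"{hour12}:{parsed.minute:02d} {suffix}"

print(speak_time("14:30"))
```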

Percentages

"25%" → "twenty-five percent." Usually correct.

Decimals in percentages: "3.5%" → "three point five percent." Usually correct.

Fractions

"1/2" → "one half" or "one slash two."
"3/4 cup" → "three-quarters of a cup."

Ambiguous; convert to words for safety.
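
A sketch of converting common fractions to words while leaving dates alone. The `DENOMINATORS` table and `speak_fractions` name are assumptions for illustration; only single-digit fractions with listed denominators are rewritten, and anything else is passed through unchanged.

```python
import re

NUMERATORS = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]
# denominator -> (singular, plural); extend as needed
DENOMINATORS = {
    2: ("half", "halves"),
    3: ("third", "thirds"),
    4: ("quarter", "quarters"),
    5: ("fifth", "fifths"),
    8: ("eighth", "eighths"),
}

def speak_fractions(text: str) -> str:
    def replace(match: re.Match) -> str:
        numerator, denominator = int(match.group(1)), int(match.group(2))
        if numerator == 0 or denominator not in DENOMINATORS:
            return match.group(0)  # not a fraction we know how to read
        singular, plural = DENOMINATORS[denominator]
        return f"{NUMERATORS[numerator]} {singular if numerator == 1 else plural}"

    # The lookarounds reject digits or slashes on either side,
    # so a date such as 3/12/2026 is not mistaken for a fraction.
    return re.sub(r"(?<![\d/])(\d)/(\d)(?![\d/])", replace, text)

print(speak_fractions("Mix 3/4 of the flour"))
```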

Scientific notation

"1.5e10" → "one point five times ten to the tenth."

Rare in conversational context. If needed, preprocess.

Domain vocabulary

Industry-specific pronunciation:

Medical: drug names, procedures.
Legal: case names, Latin terms.
Financial: ticker symbols, company names.

Most TTS engines allow custom pronunciation dictionaries. Add your domain terms.

Testing

Build a test set of edge cases:

Various date formats.
Currency amounts.
Phone numbers.
Acronyms.
Domain terms.

Run TTS on each; listen; fix with SSML or normalization as needed.
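
The text stage of that loop can be automated; the listening stage cannot. Below is a sketch of a small regression harness. The `normalize` function is a stand-in with two toy rules, and `CASES` and `run_cases` are hypothetical names; the last case is included to show what a reported failure looks like.

```python
import re

def normalize(text: str) -> str:
    """Stand-in normalizer; swap in your own implementation."""
    text = re.sub(r"(\d+(?:\.\d+)?)%", r"\1 percent", text)
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    return text

# (raw input, expected normalized text)
CASES = [
    ("25%", "25 percent"),
    ("3.5% APR", "3.5 percent APR"),
    ("Dr. Smith", "Doctor Smith"),
    ("100 mg/dL", "100 milligrams per deciliter"),  # not handled by the stand-in
]

def run_cases(fn, cases):
    """Return (raw, expected, actual) for every case that does not match."""
    failures = []
    for raw, expected in cases:
        actual = fn(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures

for raw, expected, actual in run_cases(normalize, CASES):
    print(f"FAIL {raw!r}: expected {expected!r}, got {actual!r}")
```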

Preprocessing pipeline

Raw text:
"Your appointment is 3/12/2026 at 2:30 PM. Cost: $47.50."

Normalized:
"Your appointment is March 12, 2026 at 2:30 PM. Cost: forty-seven dollars and fifty cents."

Or with SSML:
<speak>
  Your appointment is <say-as interpret-as="date" format="ymd">2026-03-12</say-as>
  at <say-as interpret-as="time" format="hms24">14:30</say-as>.
  Cost: <say-as interpret-as="currency">$47.50</say-as>.
</speak>

Pipeline order: normalize, then synthesize.

Phoneme tuning

For stubborn pronunciations:

<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>

Explicit phoneme specification. Works for unusual names and terms.

See phoneme-level tuning for voice agents.

Caching

Common phrases can be pre-synthesized:

Welcome greetings.
Closing phrases.
Menu options.

Skip live synthesis for these and play the stored audio instead.
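
A minimal in-memory sketch of that idea, keyed on voice plus text so the same phrase in a different voice is synthesized separately. The `AudioCache` class is hypothetical, and `fake_synthesize` stands in for whatever call your TTS vendor provides; a real deployment would more likely persist the audio to disk or object storage.

```python
import hashlib
from typing import Callable, Dict

class AudioCache:
    """Cache synthesized audio by (voice, text) so repeated phrases skip TTS."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}

    def get(self, text: str, voice: str) -> bytes:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        key = hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]

# Stand-in for a real TTS call; records how often it is invoked.
calls = []
def fake_synthesize(text: str, voice: str) -> bytes:
    calls.append((text, voice))
    return text.encode("utf-8")

cache = AudioCache(fake_synthesize)
cache.get("Welcome to the clinic.", "voice-a")
cache.get("Welcome to the clinic.", "voice-a")  # served from the cache
print(len(calls))
```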

Common pitfalls

Assuming TTS handles everything. Ambiguous inputs produce ambiguous output. Normalize.

Wrong locale. A US-format date read by an EU-locale voice swaps the day and month.

No SSML for tricky inputs. Silently wrong pronunciations.

Untested edge cases. "Your SSN ending in 1234": does TTS read "one two three four" or "one thousand two hundred thirty-four"?

Domain blind spots. Medical terms mispronounced. Legal names wrong.

FAQ

Does normalization add latency? Minimal. Simple regex-based normalization typically runs in microseconds.

Can we use an LLM to normalize? Some teams do. It adds latency for a marginal quality improvement.

What about pronouncing codes (order numbers, confirmation codes)? Use SSML interpret-as="characters" to read them character by character.

How does TTS handle names? Usually OK for common names. Uncommon names need phoneme or dictionary support.

What about multilingual numbers? "100" is "one hundred" in English and "cien" in Spanish, so normalize in the same language as the TTS voice.

Related reading

Comparing Neural TTS Architectures

Phoneme-Level Tuning for Voice Agents

Why Some Voices Sound Robotic Even in 2026
