How TTS Models Handle Numbers, Dates, and Acronyms
Numbers, dates, and acronyms are the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" dollars or just numbers? Is "Dr." "Doctor" or "Drive"? Production voice agents handle these correctly thousands of times a day, but it takes specific engineering, both in the TTS model and in how you preprocess text.
TL;DR
- Numbers, dates, and acronyms require text normalization before TTS.
- Modern TTS handles most cases automatically but fails on edge cases.
- Use SSML tags for explicit control.
- Domain-specific pronunciation dictionaries matter.
- Test with your actual content, not generic samples.
The problem space
TTS input can have:
- Integers: "42"
- Decimals: "3.14159"
- Currency: "$47.50", "€30"
- Dates: "03/12/2026", "March 12, 2026"
- Times: "14:30", "2:30 PM"
- Phone numbers: "555-1234", "+1-555-123-4567"
- Percentages: "25%"
- Acronyms (read as a word): "NASA"
- Initialisms (read as letters): "FBI", "API", "HTTP"
- Abbreviations: "Dr.", "Mr.", "Inc.", "St."
Each requires different pronunciation.
Text normalization
Before TTS, normalize:
- "03/12/2026" → "March 12, 2026"
- "$47.50" → "forty-seven dollars and fifty cents"
- "25%" → "twenty-five percent"
- "Dr. Smith" → "Doctor Smith"
Some TTS engines do this automatically; others require pre-processing.
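For engines that need pre-processing, the four rewrites above can be sketched with a handful of regular expressions. This is a minimal illustration, not a production normalizer: the function names `number_to_words` and `normalize` are hypothetical, dates are assumed to be US-style MM/DD/YYYY, amounts are limited to three-digit dollar values, and the "Dr." rule is a simple heuristic that will misread "Drive" before a capitalized word.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _date(m: re.Match) -> str:
    # Assumption: US ordering (month first). Leave impossible months alone.
    month, day, year = int(m.group(1)), int(m.group(2)), m.group(3)
    if not 1 <= month <= 12:
        return m.group(0)
    return f"{MONTHS[month - 1]} {day}, {year}"


def _dollars(m: re.Match) -> str:
    dollars, cents = int(m.group(1)), int(m.group(2))
    return f"{number_to_words(dollars)} dollars and {number_to_words(cents)} cents"


def normalize(text: str) -> str:
    """Rewrite dates, dollar amounts, whole percentages, and 'Dr.' as words."""
    text = re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", _date, text)
    text = re.sub(r"\$(\d{1,3})\.(\d{2})\b", _dollars, text)
    # Whole-number percentages only; "3.5%" is left for the TTS engine.
    text = re.sub(r"(?<![\d.])(\d{1,3})%",
                  lambda m: number_to_words(int(m.group(1))) + " percent", text)
    # Heuristic: "Dr." followed by a capitalized word is treated as a title.
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)
    return text
```

Anything the patterns do not recognize passes through unchanged, so the TTS engine's own normalization still gets a chance at it.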
How modern TTS handles it
High-end TTS (Simba, Cartesia, OpenAI) handles most cases:
- Decimals: "3.14" → "three point one four."
- Currency: "$47.50" → "forty-seven dollars and fifty cents."
- Phone numbers: "555-1234" → often "five five five, one two three four."
- Dates: mostly correct.
- Acronyms: correctly distinguished (usually).
But edge cases fail:
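- Ambiguous dates: "3/12" could be March 12, December 3, or "3 of 12."
- Roman numerals: "Louis XIV" may come out as "Louis fourteen" or "X I V" instead of "Louis the Fourteenth."
- Industry-specific units: "100 mg/dL" has no obvious spoken form unless the engine knows the unit.
- Phone numbers: formats vary, so grouping and pauses are inconsistent.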
SSML: explicit control
SSML (Speech Synthesis Markup Language) lets you specify pronunciation:
<speak>
Call me at <say-as interpret-as="telephone">5551234</say-as>
on <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
</speak>
Common SSML tags:
- <say-as interpret-as="telephone">
- <say-as interpret-as="date">
- <say-as interpret-as="currency">
- <say-as interpret-as="characters"> (spell out letters)
- <say-as interpret-as="ordinal">
- <phoneme alphabet="ipa" ph="..."> (explicit phonemes)
Support varies by TTS vendor.
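If you build SSML in application code, it helps to generate the tags through a small helper so that user-supplied text is always XML-escaped. The sketch below uses only the Python standard library; the helper names `say_as` and `speak` are hypothetical, and which `interpret-as` and `format` values actually work depends on your vendor.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element, escaping the text and attribute values."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def speak(*parts: str) -> str:
    """Join already-escaped fragments inside a <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Call me at "),
    say_as("5551234", "telephone"),
    escape(" on "),
    say_as("03/12/2026", "date", "mdy"),
    escape("."),
)
```

Escaping matters more than it looks: an unescaped "&" or "<" in a company or product name is enough to make some engines reject the whole request.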
Acronyms vs initialisms
- Initialism: read as letters. "FBI" → "F-B-I."
- Acronym: read as word. "NASA" → "Nassa."
Most modern TTS engines ship with a built-in dictionary, but it will not cover every term. Common custom additions:
- Your company name (especially if acronym).
- Product names.
- Industry-specific terms.
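When the engine has no custom-dictionary feature, one workaround is a text-level lexicon applied before synthesis. The sketch below is an assumption-heavy illustration: the `LEXICON` entries and the respellings ("F B I", "Nassa") are examples only, matching is case-sensitive, and you would tune the respellings by ear for your voice.

```python
import re

# Hypothetical lexicon: term -> respelling that the TTS engine reads correctly.
LEXICON = {
    "FBI": "F B I",    # initialism, read letter by letter
    "API": "A P I",
    "HTTP": "H T T P",
    "NASA": "Nassa",   # acronym, read as a word
}


def apply_lexicon(text: str, lexicon: dict = LEXICON) -> str:
    """Replace whole-word lexicon terms with their respellings."""
    if not lexicon:
        return text
    # Longest terms first so a longer entry wins over one of its prefixes.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: lexicon[m.group(1)], text)
```

Whole-word matching keeps "APIs" or "NASAL" from being rewritten by accident; plural and possessive forms need their own entries.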
Phone numbers
Common formats:
- "555-1234"
- "(555) 123-4567"
- "+1 555 123 4567"
- "1-800-555-1234"
TTS should pause between groups. Test your formats.
Best practice:
- Always pass in a consistent format.
- Use SSML <say-as interpret-as="telephone"> for reliability.
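Where SSML is not available, a fallback is to spell the number out yourself, digit by digit, with a comma between groups so the voice pauses. This is a minimal sketch: `speak_phone` is a hypothetical helper, it keeps whatever grouping the input already has, and it says "zero" rather than "oh".

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def speak_phone(raw: str) -> str:
    """Read a phone number digit by digit, pausing between digit groups."""
    groups = re.findall(r"\d+", raw)
    if not groups:
        raise ValueError(f"no digits found in {raw!r}")
    spoken = [" ".join(DIGIT_WORDS[int(d)] for d in group) for group in groups]
    prefix = "plus " if raw.strip().startswith("+") else ""
    return prefix + ", ".join(spoken)
```

For example, `speak_phone("(555) 123-4567")` yields "five five five, one two three, four five six seven", which most voices read with a short pause at each comma.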
Dates
Cultural variation:
- US: MM/DD/YYYY.
- Most of world: DD/MM/YYYY.
Ambiguous: "04/05/2026" = April 5 (US) or May 4 (EU).
Convert to unambiguous form before TTS:
- "April 5, 2026" (explicit).
- Or use SSML with date format specifier.
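The conversion to an unambiguous form is straightforward once you know which locale produced the string. The sketch below assumes you track that locale yourself; the `FORMATS` table and the `explicit_date` name are illustrative, and only two locales are shown.

```python
from datetime import datetime

# Assumed mapping from locale tag to numeric date layout.
FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def explicit_date(raw: str, locale: str) -> str:
    """Turn a numeric date into 'Month D, YYYY' using the source locale's order."""
    parsed = datetime.strptime(raw, FORMATS[locale])
    return f"{MONTHS[parsed.month - 1]} {parsed.day}, {parsed.year}"
```

The same input gives different answers depending on the locale: `explicit_date("04/05/2026", "en-US")` returns "April 5, 2026", while `explicit_date("04/05/2026", "en-GB")` returns "May 4, 2026". Invalid dates raise `ValueError` from `strptime`, which is usually what you want before the text reaches the voice.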
Currency
"$47.50" could be "forty-seven fifty" or "forty-seven dollars and fifty cents."
Modern TTS usually handles correctly. For non-USD:
- "£100" → "one hundred pounds."
- "€30" → "thirty euros."
- "¥1000" → "one thousand yen."
Test your currency formats.
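If a voice stumbles on a currency symbol, one option is to replace the symbol with the unit name and leave the digits for the engine to read. The sketch below makes that assumption explicitly: it does not spell out the numbers, the `CURRENCIES` table covers only four symbols, and `expand_currency` is a hypothetical helper.

```python
import re

# symbol -> (singular, plural, minor singular, minor plural); None = no minor unit
CURRENCIES = {
    "$": ("dollar", "dollars", "cent", "cents"),
    "£": ("pound", "pounds", "penny", "pence"),
    "€": ("euro", "euros", "cent", "cents"),
    "¥": ("yen", "yen", None, None),
}


def _expand(m: re.Match) -> str:
    symbol, whole, frac = m.group(1), int(m.group(2)), m.group(3)
    one, many, minor_one, minor_many = CURRENCIES[symbol]
    if frac is not None and minor_one is None:
        return m.group(0)  # unexpected decimals: leave the text untouched
    out = f"{whole} {one if whole == 1 else many}"
    if frac is not None and int(frac) > 0:
        minor = int(frac)
        out += f" and {minor} {minor_one if minor == 1 else minor_many}"
    return out


def expand_currency(text: str) -> str:
    """Replace '$47.50'-style amounts with '47 dollars and 50 cents'."""
    # The lookahead rejects amounts with extra digits, such as "$47.505".
    return re.sub(r"([$£€¥])(\d+)(?:\.(\d{2}))?(?!\.?\d)", _expand, text)
```

Amounts with thousands separators ("$1,250.00") are not handled here and would need an extra pattern.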
Times
- "14:30" → "fourteen thirty" or "two-thirty PM."
- "9:00 AM" → "nine AM."
TTS usually handles these. For natural speech, convert 24-hour times to AM/PM before synthesis.
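The 24-hour to AM/PM conversion is a few lines of arithmetic. In this sketch `to_12_hour` is a hypothetical helper that takes a single "HH:MM" string you already know to be 24-hour time; it does not try to detect times inside free text.

```python
def to_12_hour(hhmm: str) -> str:
    """Convert a 24-hour 'HH:MM' string to a speakable 12-hour form."""
    hour_text, minute_text = hhmm.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid 24-hour time: {hhmm!r}")
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12  # 0 and 12 both map to 12
    if minute == 0:
        return f"{hour12} {suffix}"
    return f"{hour12}:{minute:02d} {suffix}"
```

With this, "14:30" becomes "2:30 PM" and "09:00" becomes "9 AM", which matches how a person would say the time aloud.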
Percentages
"25%" → "twenty-five percent." Usually correct.
Decimals in percentages: "3.5%" → "three point five percent." Usually correct.
Fractions
- "1/2" → "one half" or "one slash two."
- "3/4 cup" → "three quarters of a cup."
Ambiguous; convert to words for safety.
Scientific notation
- "1.5e10" → "one point five times ten to the tenth."
Rare in conversational context. If needed, preprocess.
Domain vocabulary
Industry-specific pronunciation:
- Medical: drug names, procedures.
- Legal: case names, Latin terms.
- Financial: ticker symbols, company names.
Most TTS engines allow custom pronunciation dictionaries. Add your domain terms to them.
Testing
Build a test set of edge cases:
- Various date formats.
- Currency amounts.
- Phone numbers.
- Acronyms.
- Domain terms.
Run TTS on each; listen; fix with SSML or normalization as needed.
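Listening is the final check, but the text-level half of the test set can be automated so regressions in your normalizer show up before anyone puts on headphones. The sketch below is one way to structure it; `run_cases` and the toy normalizer are hypothetical stand-ins for your own pipeline.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]           # (raw input, expected normalized text)
Failure = Tuple[str, str, str]   # (raw input, expected, actual)


def run_cases(normalize: Callable[[str], str], cases: List[Case]) -> List[Failure]:
    """Run every case through the normalizer and collect the mismatches."""
    failures = []
    for raw, expected in cases:
        actual = normalize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


def toy_normalize(text: str) -> str:
    """Stand-in normalizer used only to demonstrate the harness."""
    return text.replace("Dr.", "Doctor").replace("%", " percent")


EDGE_CASES = [
    ("Dr. Smith", "Doctor Smith"),
    ("25%", "25 percent"),
    ("St. Mary", "Saint Mary"),  # not handled by the toy normalizer
]

failures = run_cases(toy_normalize, EDGE_CASES)
```

Each failure records the input, the expected text, and what the normalizer actually produced, which is enough to decide whether the fix belongs in normalization, SSML, or the pronunciation dictionary.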
Preprocessing pipeline
Raw text:
"Your appointment is 3/12/2026 at 2:30 PM. Cost: $47.50."
Normalized:
"Your appointment is March 12, 2026 at 2:30 PM. Cost: forty-seven dollars and fifty cents."
Or with SSML:
<speak>
Your appointment is <say-as interpret-as="date">2026-03-12</say-as>
at <say-as interpret-as="time">14:30</say-as>.
Cost: <say-as interpret-as="currency">$47.50</say-as>.
</speak>
Pipeline step: normalize → synthesize.
Phoneme tuning
For stubborn pronunciations:
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
Explicit phoneme specification. Works for unusual names and terms.
See phoneme-level tuning for voice agents.
Caching
Common phrases can be pre-synthesized:
- Welcome greetings.
- Closing phrases.
- Menu options.
Skip live TTS for these and play the pre-synthesized or pre-recorded audio in those slots.
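A simple in-memory cache keyed on the voice and the exact text is enough to illustrate the idea. This is a sketch under assumptions: `AudioCache` is a hypothetical class, the `synthesize` callable stands in for whatever your TTS client exposes, and a production version would add eviction and persistent storage.

```python
import hashlib
from typing import Callable, Dict


class AudioCache:
    """Cache synthesized audio so repeated phrases skip the TTS call."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize  # assumed signature: (text, voice) -> audio bytes
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str) -> bytes:
        """Return cached audio, synthesizing only on the first request."""
        key = self._key(text, voice)
        if key not in self._store:
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]
```

The voice is part of the key because the same greeting in a different voice is different audio. Cache the text after normalization, so that trivially different inputs that normalize to the same phrase share one entry.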
Common pitfalls
Assuming TTS handles everything. Ambiguous inputs produce ambiguous output. Normalize.
Wrong locale. US date format in EU locale. Confusing.
No SSML for tricky inputs. Silently wrong pronunciations.
Untested edge cases. "Your SSN ending in 1234": does TTS read "one two three four" or "one thousand two hundred thirty-four"?
Domain blind spots. Medical terms mispronounced. Legal names wrong.
Related reading
- Text-to-Speech in 2026: The State of the Art
- Comparing Neural TTS Architectures
- Why Some Voices Sound Robotic Even in 2026
- Latency Engineering for Real-Time Voice Agents
FAQ
Does normalization add latency? Minimal: simple regex-based normalization takes microseconds.
Can we use an LLM to normalize? Some teams do. It adds latency for a marginal quality improvement.
What about audio pronunciation of codes (order numbers, confirmation codes)? Use SSML interpret-as="characters" for letter-by-letter reading.
How does TTS handle names? Usually OK for common names. Uncommon names need phoneme or dictionary support.
What about multilingual numbers? "One hundred" vs "cien": spell numbers out in the language of the TTS voice.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.