Phoneme-Level Tuning for Voice Agents
Most voice agent quality work happens at the text level — prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language. For stubborn pronunciation issues, for domain-specific terms, for languages with complex phonetics, phoneme-level tuning is the most reliable fix. This piece covers when it helps, how to do it, and why it's often underused.
TL;DR
- Phonemes are the atomic sound units: /k/, /æ/, /t/ in "cat."
- Phoneme specification overrides TTS's default pronunciation.
- Use for: names, domain terms, unusual words, specific accent variations.
- Standard notation: IPA (International Phonetic Alphabet) or SAMPA.
- Most TTS vendors support this via the SSML <phoneme> tag.
Phoneme basics
Every word decomposes into phonemes:
- cat: /kæt/
- happy: /ˈhæpi/
- NovaCorp: /ˈnoʊvəˌkɔrp/
Many TTS pipelines map phonemes to audio internally. By default the path is text → automatic phonemization → audio. When the automatic step gets a word wrong, override it with explicit phonemes.
IPA vs SAMPA
IPA (International Phonetic Alphabet). Standard. Unicode symbols. Used by linguists.
SAMPA. ASCII-only variant. More practical for config files.
X-SAMPA. Extended SAMPA, more complete.
Most TTS vendors support IPA. Some also accept X-SAMPA; check which alphabet names your vendor's phoneme tag allows.
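If a dictionary is kept in ASCII for config-file convenience but the vendor expects IPA, a small conversion step can bridge the two. The following is a minimal sketch, not a full converter: the table covers only a handful of single-character X-SAMPA symbols common in English, and the function name is an assumption made for illustration. Multi-character X-SAMPA symbols are not handled.

```python
# Partial X-SAMPA -> IPA table. It lists only single-character symbols that
# differ between the two alphabets; extend it before relying on it.
XSAMPA_TO_IPA = {
    '"': "ˈ",  # primary stress
    "%": "ˌ",  # secondary stress
    ":": "ː",  # length mark
    "{": "æ",
    "@": "ə",
    "A": "ɑ",
    "E": "ɛ",
    "I": "ɪ",
    "O": "ɔ",
    "Q": "ɒ",
    "U": "ʊ",
    "V": "ʌ",
    "D": "ð",
    "T": "θ",
    "S": "ʃ",
    "Z": "ʒ",
    "N": "ŋ",
}


def xsampa_to_ipa(xsampa: str) -> str:
    """Convert one character at a time; unmapped characters pass through."""
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in xsampa)


# The NovaCorp example written in X-SAMPA.
print(xsampa_to_ipa('"noUv@%kOrp'))
```

Letters such as n, v, k, and p are identical in both alphabets, which is why pass-through is a reasonable default for this subset.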
When to use phoneme tuning
Proper names. "Dr. Bathinda" pronounced wrong by default; phoneme override fixes.
Product names. Brand-specific pronunciation.
Acronyms with non-obvious pronunciations. "SQL" as "sequel" vs "S-Q-L."
Medical / legal terms. Specialized pronunciation.
Non-standard words. Slang, regional terms, invented words.
Foreign words with English spelling. "Hors d'oeuvre", "quinoa."
SSML phoneme tag
Standard approach:
<speak>
Welcome to <phoneme alphabet="ipa" ph="ˈnoʊvəˌkɔrp">NovaCorp</phoneme>.
</speak>
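When the SSML is assembled in code rather than written by hand, the text and attribute values need XML escaping or a name containing "&" will break the document. Here is a minimal sketch using only the Python standard library; the helper names phoneme_tag and speak are assumptions for illustration, not part of any vendor SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap `text` in an SSML phoneme element, escaping text and attributes."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


def speak(*parts: str) -> str:
    """Join already-escaped fragments inside a <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Welcome to "),
    phoneme_tag("NovaCorp", "ˈnoʊvəˌkɔrp"),
    escape("."),
)
print(ssml)
```

The resulting string is what would be sent to the TTS request in place of plain text; how the request is made depends on the vendor.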
Vendor support:
- Simba: partial.
- Amazon Polly: yes.
- Google Cloud TTS: yes.
- Azure: yes.
- Cartesia: partial.
Pronunciation dictionary approach
Instead of per-use SSML, build a dictionary:
{
"NovaCorp": "ˈnoʊvəˌkɔrp",
"Bathinda": "bəˈθɪndə",
"amoxicillin": "əˌmɒksəˈsɪlɪn"
}
Applied pre-TTS. One place to maintain.
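One way to apply such a dictionary before the TTS call is a single regex pass that wraps each known term in a phoneme tag. This is a sketch under stated assumptions: matching is case-sensitive and whole-word only, the output alphabet is IPA, and the function name apply_pronunciations is hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

PRONUNCIATIONS = {
    "NovaCorp": "ˈnoʊvəˌkɔrp",
    "Bathinda": "bəˈθɪndə",
    "amoxicillin": "əˌmɒksəˈsɪlɪn",
}


def apply_pronunciations(text: str, lexicon: dict[str, str]) -> str:
    """Return SSML in which every dictionary term carries a phoneme override."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Longest terms first so a longer entry wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))  # untouched text in between
        term = match.group(1)
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[term])}>'
            f"{escape(term)}</phoneme>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))  # trailing text after the last match
    return "<speak>" + "".join(out) + "</speak>"


print(apply_pronunciations("Welcome to NovaCorp.", PRONUNCIATIONS))
```

Whole-word matching keeps an entry like "NovaCorp" from firing inside a longer word; whether case-insensitive matching is wanted depends on the dictionary.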
Workflow
- Identify mispronunciations. Sample calls, listen.
- Look up or derive phonemes. Dictionary, IPA chart.
- Test. TTS with phoneme override.
- Add to dictionary.
- Redeploy.
Testing phonemes
Iterative:
- Input text + phoneme.
- Generate audio.
- Listen.
- Adjust phonemes until correct.
Getting IPA right sometimes needs a native speaker's ear.
Accent variations
Phonemes can encode accent:
- American "cat": /kæt/.
- British "cat": /kat/.
- Subtle but audible.
For specific accents, phoneme-level control is the cleanest path.
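Accent-specific overrides can be stored as per-locale variants of the same word. The sketch below assumes a hypothetical nested dictionary keyed by word and then by locale tag, with a fallback locale when no variant exists; the structure and names are illustrative only.

```python
from typing import Optional

# Hypothetical layout: word -> locale tag -> IPA string.
ACCENT_PRONUNCIATIONS = {
    "cat": {"en-US": "kæt", "en-GB": "kat"},
}


def phonemes_for(word: str, locale: str, default_locale: str = "en-US") -> Optional[str]:
    """Pick the variant for `locale`, falling back to `default_locale`."""
    variants = ACCENT_PRONUNCIATIONS.get(word.lower())
    if variants is None:
        return None  # no override: let the TTS engine phonemize the word
    return variants.get(locale, variants.get(default_locale))


print(phonemes_for("cat", "en-GB"))
print(phonemes_for("cat", "en-AU"))  # no en-AU entry, so the default is used
```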
Limitations
- Not every TTS engine supports phoneme input. Check with your vendor.
- Tonal languages (Mandarin): phoneme tuning more complex.
- Emotional prosody: not encoded in phonemes.
- Dialectal variations: may need full model retraining.
Phonemization rules
Automatic text-to-phoneme is imperfect:
- English: ~95% accurate on common words, drops on rare.
- Spanish: more phonetic, easier.
- Languages with complex orthography: French, Danish, English — harder.
Auto phonemization fails most on names and rare words — exactly where manual override helps.
Domain dictionaries by vertical
Medical: Drug names, procedures, anatomy. Thousands of domain-specific pronunciations.
Legal: Latin terms, case names, statutes.
Technical: Product names, technologies, acronyms.
Financial: Ticker symbols, company names, instrument types.
Build domain dictionaries once; reuse across deployments.
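Reusable domain dictionaries can be layered so that a deployment overrides shared entries without copying them. A minimal sketch using collections.ChainMap from the standard library follows; the dictionary names and the overriding pronunciation are placeholders for illustration.

```python
from collections import ChainMap

# Shared dictionaries, maintained once and reused.
BASE = {"NovaCorp": "ˈnoʊvəˌkɔrp"}
MEDICAL = {"amoxicillin": "əˌmɒksəˈsɪlɪn"}

# Per-deployment entries. The amoxicillin variant here is an illustrative
# override, not a recommendation.
DEPLOYMENT_OVERRIDES = {
    "Bathinda": "bəˈθɪndə",
    "amoxicillin": "əˌmɑksəˈsɪlɪn",
}

# Lookups search left to right, so deployment entries win over shared ones.
lexicon = ChainMap(DEPLOYMENT_OVERRIDES, MEDICAL, BASE)

print(lexicon["amoxicillin"])  # taken from DEPLOYMENT_OVERRIDES
print(lexicon["NovaCorp"])     # falls through to BASE
```

Because ChainMap keeps references to the underlying dictionaries, a fix made in a shared dictionary is visible to every deployment that layers on top of it.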
Automatic phoneme derivation
Tools:
- CMU Pronouncing Dictionary. For common English.
- Phonetizer / G2P (grapheme-to-phoneme) models. Predict phonemes from spelling.
- Online IPA generators. Quick lookups.
For the first pass, automated tools help. Human verification for critical terms.
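The CMU Pronouncing Dictionary stores pronunciations in ARPABET with stress digits (for example "K AE1 T"), so a first pass usually needs a conversion to IPA. The sketch below maps ARPABET phones to IPA symbols and deliberately drops stress marks, since placing them correctly requires syllable boundaries. The function name is an assumption, and the output should be treated as a starting point for human review.

```python
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

# Unstressed AH and ER are conventionally written as reduced vowels.
UNSTRESSED = {"AH": "ə", "ER": "ɚ"}


def arpabet_to_ipa(phones: str) -> str:
    """Convert a space-separated ARPABET string to IPA, ignoring stress marks."""
    out = []
    for phone in phones.split():
        stress = phone[-1] if phone[-1].isdigit() else ""
        base = phone.rstrip("012")
        if stress == "0" and base in UNSTRESSED:
            out.append(UNSTRESSED[base])
        else:
            out.append(ARPABET_TO_IPA[base])  # KeyError flags an unknown phone
    return "".join(out)


print(arpabet_to_ipa("K AE1 T"))
print(arpabet_to_ipa("HH AE1 P IY0"))
```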
Performance considerations
- Phoneme specification doesn't add meaningful latency.
- Dictionary lookup is fast.
- SSML parsing is real-time.
Negligible performance cost; the gain is in quality.
When not to bother
- Common words (TTS gets right).
- Demo / informal use.
- Short-lived content.
- Vendor doesn't support phonemes (limited options then).
The phoneme quality ceiling
Even with perfect phonemes, TTS may:
- Vary speed inappropriately.
- Miss subtle emphasis.
- Apply wrong intonation.
Phonemes fix pronunciation, not prosody. Layer SSML for prosody separately.
Common pitfalls
Ignoring pronunciation issues. Callers hear "nova-korp" instead of "noh-vah-korp." Brand damage.
Wrong phoneme alphabet. Vendor expects IPA; you provide SAMPA. No-op.
Over-engineering. Add phonemes for every word. Unnecessary.
No dictionary maintenance. Phonemes set once; never reviewed.
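Native vs native-inspired. A pronunciation approximated by a non-native ear can differ from how native speakers say the word, and accents shift over time. Re-review periodically.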
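The wrong-alphabet pitfall can be caught before deployment with a simple lint over the dictionary. The following is a heuristic sketch, not an IPA validator: it relies on the fact that IPA does not use uppercase ASCII letters, digits, or symbols such as @ and %, all of which are common in X-SAMPA. The function name and the exact character set are assumptions.

```python
import string

# ASCII characters that appear in X-SAMPA but should not appear in an IPA string.
SUSPICIOUS = set(string.ascii_uppercase) | set(string.digits) | set('"%@{}\\_&?')


def find_non_ipa_chars(ph: str) -> list[str]:
    """Return the sorted set of characters that suggest `ph` is not IPA."""
    return sorted({ch for ch in ph if ch in SUSPICIOUS})


entries = {
    "NovaCorp": "ˈnoʊvəˌkɔrp",     # IPA
    "NovaCorp-bad": '"noUv@%kOrp',  # X-SAMPA stored by mistake
}
for term, ph in entries.items():
    flagged = find_non_ipa_chars(ph)
    if flagged:
        print(f"{term}: suspicious characters {flagged}")
```

Running a check like this during dictionary review keeps a silent no-op from reaching callers.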
Related reading
- Text-to-Speech in 2026: The State of the Art
- Comparing Neural TTS Architectures
- Why Some Voices Sound Robotic Even in 2026
- How TTS Models Handle Numbers, Dates, and Acronyms
- Latency Engineering for Real-Time Voice Agents
FAQ
Is phoneme tuning vendor-specific? Mostly. IPA is standard; support varies.
Can we use emoji or special characters? No. The phoneme string accepts only symbols from the declared alphabet, such as IPA.
What about singing/prosodic extremes? Beyond phonemes. Typically outside voice agent scope.
Do phonemes work for all languages? Most. Tonal languages more complex.
How often should we update phonemes? Quarterly review; immediate fix for reported issues.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
