Most voice agent quality work happens at the text level: prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language. For stubborn pronunciation issues, domain-specific terms, and languages with complex phonetics, phoneme-level tuning is usually the most reliable fix. This piece covers when it helps, how to do it, and why it's often underused.

TL;DR

Phonemes are the atomic sound units: /k/, /æ/, /t/ in "cat."
Phoneme specification overrides TTS's default pronunciation.
Use for: names, domain terms, unusual words, specific accent variations.
Standard notation: IPA (International Phonetic Alphabet) or SAMPA.
Most TTS vendors support this via the SSML <phoneme> tag.

Phonemes basics

Every word decomposes into phonemes:

cat: /kæt/
happy: /ˈhæpi/
NovaCorp: /ˈnoʊvəˌkɔrp/

Most TTS systems convert text to phonemes internally before generating audio: text → automatic phonemization → audio. When automatic phonemization gets a word wrong, override it with explicit phonemes.

IPA vs SAMPA

IPA (International Phonetic Alphabet). Standard. Unicode symbols. Used by linguists.

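SAMPA. ASCII-only encoding of IPA symbols, originally defined per language. More practical for config files.

X-SAMPA. Extended SAMPA that covers the full IPA in one language-independent table.

Most TTS vendors support IPA. Some also accept X-SAMPA or a vendor-specific alphabet.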
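If the dictionary is kept in X-SAMPA for ASCII-friendly config files but the vendor expects IPA, a small conversion step can bridge the two. The following is a minimal sketch, not a complete converter: the table covers only the symbols used in this article's examples, and the name xsampa_to_ipa is hypothetical.

```python
# Partial X-SAMPA -> IPA table. It covers only the symbols needed for the
# examples in this article; a real table would cover the full inventory.
XSAMPA_TO_IPA = {
    '"': "ˈ",  # primary stress
    "%": "ˌ",  # secondary stress
    "{": "æ",
    "@": "ə",
    "I": "ɪ",
    "U": "ʊ",
    "O": "ɔ",
    "T": "θ",
    "Q": "ɒ",
}


def xsampa_to_ipa(xsampa: str) -> str:
    """Convert an X-SAMPA string to IPA one character at a time.

    Characters missing from the table pass through unchanged, which is
    correct for letters such as k, t, n that are identical in both systems.
    Multi-character X-SAMPA symbols are NOT handled by this sketch.
    """
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in xsampa)


print(xsampa_to_ipa("k{t"))           # kæt
print(xsampa_to_ipa('"noUv@%kOrp'))   # ˈnoʊvəˌkɔrp
```

Character-by-character mapping is enough here because every symbol in the examples is a single ASCII character; anything beyond that needs a tokenizer that matches the longest symbol first.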

When to use phoneme tuning

Proper names. "Dr. Bathinda" is pronounced wrong by default; a phoneme override fixes it.

Product names. Brand-specific pronunciation.

Acronyms with more than one accepted pronunciation. "SQL" as "sequel" vs "S-Q-L."

Medical / legal terms. Specialized pronunciation.

Non-standard words. Slang, regional terms, invented words.

Loanwords that keep their foreign spelling. "Hors d'oeuvre", "quinoa."

SSML phoneme tag

Standard approach:

<speak>
  Welcome to <phoneme alphabet="ipa" ph="ˈnoʊvəˌkɔrp">NovaCorp</phoneme>.
</speak>
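When SSML is assembled in code rather than written by hand, the text and the attribute values need XML escaping. A minimal sketch using only the Python standard library; the helper names phoneme_tag and speak are hypothetical, not part of any vendor SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap text in an SSML <phoneme> element.

    quoteattr() adds the surrounding quotes and escapes the value, so a
    phoneme string that itself contains a double quote stays well-formed.
    """
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


def speak(*parts: str) -> str:
    """Join already-escaped SSML fragments inside a <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Welcome to "),
    phoneme_tag("NovaCorp", "ˈnoʊvəˌkɔrp"),
    escape("."),
)
print(ssml)
```

The resulting string is what gets sent to the vendor's synthesis endpoint; how that request is made depends on the vendor and is left out here.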

Vendor support:

Simba: partial.
Amazon Polly: yes.
Google Cloud TTS: yes.
Azure: yes.
Cartesia: partial.

Pronunciation dictionary approach

Instead of per-use SSML, build a dictionary:

{
  "NovaCorp": "ˈnoʊvəˌkɔrp",
  "Bathinda": "bəˈθɪndə",
  "amoxicillin": "əˌmɒksəˈsɪlɪn"
}

Applied pre-TTS. One place to maintain.
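One way to apply such a dictionary before TTS is a single regex pass that wraps each known term in a phoneme tag. This is a sketch under assumptions: the function name apply_dictionary is hypothetical, matching is case-insensitive on whole words, and the vendor is assumed to accept IPA.

```python
import re
from xml.sax.saxutils import escape, quoteattr

PRONUNCIATIONS = {
    "NovaCorp": "ˈnoʊvəˌkɔrp",
    "Bathinda": "bəˈθɪndə",
    "amoxicillin": "əˌmɒksəˈsɪlɪn",
}


def apply_dictionary(text: str, pronunciations: dict) -> str:
    """Return SSML with every dictionary term wrapped in <phoneme>."""
    if not pronunciations:
        return "<speak>" + escape(text) + "</speak>"

    by_lower = {term.lower(): ipa for term, ipa in pronunciations.items()}
    # Longest terms first so a longer term wins over a shorter prefix.
    terms = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    out, pos = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then emit the tag.
        out.append(escape(text[pos:match.start()]))
        ipa = by_lower[match.group(0).lower()]
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "<speak>" + "".join(out) + "</speak>"


print(apply_dictionary("Dr. Bathinda prescribed amoxicillin.", PRONUNCIATIONS))
```

Escaping the plain text piece by piece, rather than escaping the whole string afterwards, keeps the generated tags intact.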

Workflow

Identify mispronunciations. Sample calls, listen.
Look up or derive phonemes. Dictionary, IPA chart.
Test. TTS with phoneme override.
Add to dictionary.
Redeploy.

Testing phonemes

Iterative:

Input text + phoneme.
Generate audio.
Listen.
Adjust phonemes until correct.

Getting IPA right sometimes needs a native speaker's ear.
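The listen-and-adjust loop goes faster when several candidate transcriptions are rendered side by side. The sketch below only builds the SSML for each candidate; sending it to a TTS engine is vendor-specific and omitted. The function name candidate_ssml, the {word} placeholder convention, and the second candidate transcription are all illustrative assumptions.

```python
from xml.sax.saxutils import escape, quoteattr


def candidate_ssml(word: str, carrier: str, candidates: list) -> dict:
    """Build one SSML document per candidate transcription.

    carrier is a sentence containing the literal placeholder {word}, so
    each candidate is heard in context rather than in isolation.
    Returns a dict mapping a label such as "Bathinda_v1" to SSML.
    """
    before, _, after = carrier.partition("{word}")
    result = {}
    for index, ipa in enumerate(candidates, start=1):
        tag = (
            f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
            f"{escape(word)}</phoneme>"
        )
        result[f"{word}_v{index}"] = (
            f"<speak>{escape(before)}{tag}{escape(after)}</speak>"
        )
    return result


variants = candidate_ssml(
    "Bathinda",
    "Paging Dr. {word} now.",
    ["bəˈθɪndə", "bəˈtɪndə"],  # second entry is a made-up alternative
)
for label, ssml in variants.items():
    print(label, ssml)
```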

Accent variations

Phonemes can encode accent:

American "cat": /kæt/.
British "cat": /kat/.
Subtle but audible.

For specific accents, phoneme-level control is the cleanest path.
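Accent variants fit naturally into a dictionary keyed by locale. A minimal sketch, assuming locale codes like en-US and en-GB and a hypothetical lookup function that falls back to a default locale when no variant exists:

```python
# word -> {locale: IPA}. The two entries mirror the "cat" example above.
ACCENT_PRONUNCIATIONS = {
    "cat": {"en-US": "kæt", "en-GB": "kat"},
}


def lookup(word, locale, table=ACCENT_PRONUNCIATIONS, default_locale="en-US"):
    """Return the IPA for word in the given locale.

    Falls back to default_locale when the locale has no variant, and
    returns None when the word is not in the table at all.
    """
    variants = table.get(word.lower())
    if variants is None:
        return None
    return variants.get(locale, variants.get(default_locale))


print(lookup("cat", "en-GB"))  # kat
print(lookup("cat", "en-AU"))  # kæt (fallback to en-US)
print(lookup("dog", "en-GB"))  # None
```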

Limitations

Not all TTS engines support phoneme input. Check your vendor.
Tonal languages (Mandarin): phoneme tuning is more complex.
Emotional prosody: not encoded in phonemes.
Broad dialectal variation: may need full model retraining.

Phonemization rules

Automatic text-to-phoneme is imperfect:

English: ~95% accurate on common words, drops on rare words.
Spanish: spelling maps closely to sound, so it is easier.
Languages with irregular spelling-to-sound rules (French, Danish, English): harder.

Auto phonemization fails most on names and rare words — exactly where manual override helps.
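Since failures cluster on names and rare words, a cheap first filter is to flag any word that is missing from a known lexicon. A minimal sketch: flag_for_review is a hypothetical name, and known_words stands in for whatever lexicon is available (for example the headwords of a pronunciation dictionary).

```python
import re


def flag_for_review(text: str, known_words: set) -> list:
    """Return words absent from known_words, in order of first appearance.

    Comparison is case-insensitive; each word is reported once, in the
    casing it first appeared with.
    """
    flagged, seen = [], set()
    for token in re.findall(r"[A-Za-z][A-Za-z'-]*", text):
        key = token.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            flagged.append(token)
    return flagged


KNOWN = {"please", "call", "dr", "about", "your", "refill"}
transcript = "Please call Dr. Bathinda about your amoxicillin refill."
print(flag_for_review(transcript, KNOWN))  # ['Bathinda', 'amoxicillin']
```

A flagged word is only a candidate. Someone still has to listen to the audio to confirm the engine actually mispronounces it.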

Domain dictionaries by vertical

Medical: Drug names, procedures, anatomy. Thousands of domain-specific pronunciations.

Legal: Latin terms, case names, statutes.

Technical: Product names, technologies, acronyms.

Financial: Ticker symbols, company names, instrument types.

Build domain dictionaries once; reuse across deployments.

Automatic phoneme derivation

Tools:

CMU Pronouncing Dictionary. For common North American English. Entries use ARPABET, so convert to IPA before use.
Phonetizer / G2P (grapheme-to-phoneme) models. Predict phonemes from spelling.
Online IPA generators. Quick lookups.

For the first pass, automated tools help. Human verification for critical terms.
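CMU Pronouncing Dictionary entries are written in ARPABET with stress digits (for example K AE1 T), so a first pass usually includes a conversion to IPA. The sketch below is an approximation: it places the stress mark directly before the stressed vowel, whereas IPA convention places it at the start of the stressed syllable, so the output still needs a manual check. The function name arpabet_to_ipa is hypothetical.

```python
# ARPABET (as used by the CMU Pronouncing Dictionary) -> IPA.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}
STRESS_MARKS = {"1": "ˈ", "2": "ˌ"}  # stress digit 0 means unstressed
UNSTRESSED = {"AH": "ə", "ER": "ɚ"}  # reduced forms when stress is 0


def arpabet_to_ipa(pronunciation: str) -> str:
    """Convert a space-separated ARPABET string to an IPA string.

    Simplification: the stress mark is emitted right before the stressed
    vowel, not at the syllable onset. Unknown symbols raise KeyError.
    """
    out = []
    for symbol in pronunciation.split():
        stress = symbol[-1] if symbol[-1].isdigit() else ""
        base = symbol[:-1] if stress else symbol
        if stress == "0" and base in UNSTRESSED:
            ipa = UNSTRESSED[base]
        else:
            ipa = ARPABET_TO_IPA[base]
        out.append(STRESS_MARKS.get(stress, "") + ipa)
    return "".join(out)


print(arpabet_to_ipa("K AE1 T"))       # kˈæt
print(arpabet_to_ipa("HH AE1 P IY0"))  # hˈæpi
```

Moving each stress mark to the syllable onset (ˈkæt, ˈhæpi) requires syllabification, which is part of why human verification remains in the loop.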

Performance considerations

Phoneme specification doesn't add meaningful latency.
Dictionary lookup is fast.
SSML parsing is real-time.

Negligible performance cost; almost pure quality win.

When not to bother

Common words (TTS already gets them right).
Demo / informal use.
Short-lived content.
Vendor doesn't support phonemes (limited options then).

The phoneme quality ceiling

Even with perfect phonemes, TTS may:

Vary speed inappropriately.
Miss subtle emphasis.
Apply wrong intonation.

Phonemes fix pronunciation, not prosody. Layer SSML for prosody separately.
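Layering means nesting: the phoneme tag fixes the sounds, and a surrounding prosody tag adjusts rate or pitch. A minimal sketch with hypothetical helper names; the rate and pitch values come from standard SSML, but support for this nesting varies by vendor and should be checked.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(text: str, ipa: str) -> str:
    """SSML <phoneme> element with an IPA transcription."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def prosody(inner_ssml: str, rate: str = None, pitch: str = None) -> str:
    """Wrap an already-built SSML fragment in <prosody>.

    Only the attributes that are set are emitted. With neither set, the
    fragment is returned unchanged.
    """
    attrs = ""
    if rate is not None:
        attrs += f" rate={quoteattr(rate)}"
    if pitch is not None:
        attrs += f" pitch={quoteattr(pitch)}"
    if not attrs:
        return inner_ssml
    return f"<prosody{attrs}>{inner_ssml}</prosody>"


brand = prosody(phoneme("NovaCorp", "ˈnoʊvəˌkɔrp"), rate="slow")
print(f"<speak>Welcome to {brand}.</speak>")
```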

Common pitfalls

Ignoring pronunciation issues. Callers hear "nova-korp" instead of "noh-vuh-korp." Brand damage.

Wrong phoneme alphabet. Vendor expects IPA; you provide SAMPA. The override is typically ignored or rendered wrong.

Over-engineering. Adding phonemes for every word is unnecessary.

No dictionary maintenance. Phonemes set once; never reviewed.

Native vs native-inspired. A transcription that only approximates the native pronunciation can still sound off, and accents shift over time. Re-review periodically.
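The wrong-alphabet pitfall can be caught before deployment with a simple lint over the dictionary. The check below is a heuristic sketch, not a full IPA validator: it relies on the fact that IPA strings do not normally contain ASCII capitals, digits, or the punctuation X-SAMPA uses for symbols. The function names are hypothetical.

```python
import string

# ASCII characters that X-SAMPA uses as phonetic symbols but that should
# not appear in an IPA string.
XSAMPA_TELLTALES = (
    set('@{}"%&\\') | set(string.ascii_uppercase) | set(string.digits)
)


def looks_like_xsampa(ph: str) -> bool:
    """True if the string contains any character typical of X-SAMPA."""
    return any(ch in XSAMPA_TELLTALES for ch in ph)


def suspicious_entries(pronunciations: dict) -> list:
    """Return the terms whose transcription looks like X-SAMPA, not IPA."""
    return [term for term, ph in pronunciations.items() if looks_like_xsampa(ph)]


entries = {
    "NovaCorp": "ˈnoʊvəˌkɔrp",  # IPA
    "cat": "k{t",                # X-SAMPA slipped in by mistake
}
print(suspicious_entries(entries))  # ['cat']
```

A transcription made only of lowercase ASCII letters (such as "kat") is valid in both systems, so this check cannot tell those apart.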

FAQ

Is phoneme tuning vendor-specific? Partly. IPA and the SSML <phoneme> tag are standard; support varies by vendor.

Can we use emoji or special characters? No. Phoneme strings accept only symbols from the declared alphabet, such as IPA.

What about singing/prosodic extremes? Beyond phonemes. Typically outside voice agent scope.

Do phonemes work for all languages? Most. Tonal languages are more complex.

How often should we update phonemes? Quarterly review; immediate fix for reported issues.
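A quarterly review is easier to keep up when each entry records when it was last checked. A minimal sketch, assuming a hypothetical last_reviewed field on each entry and a 90-day interval standing in for "quarterly":

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumption: roughly one quarter


def stale_entries(entries: dict, today: date) -> list:
    """Return terms whose last review is older than REVIEW_INTERVAL."""
    return sorted(
        term
        for term, entry in entries.items()
        if today - entry["last_reviewed"] > REVIEW_INTERVAL
    )


dictionary = {
    "NovaCorp": {"ipa": "ˈnoʊvəˌkɔrp", "last_reviewed": date(2025, 1, 10)},
    "Bathinda": {"ipa": "bəˈθɪndə", "last_reviewed": date(2025, 5, 1)},
}
print(stale_entries(dictionary, today=date(2025, 6, 1)))  # ['NovaCorp']
```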

Phoneme-Level Tuning for Voice Agents
