Phoneme-Level Tuning for Voice Agents

Tyler Weitzman
March 19, 2026 · 4 min read
Speechify

Most voice agent quality work happens at the text level — prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language. For stubborn pronunciation issues, for domain-specific terms, for languages with complex phonetics, phoneme-level tuning is the most reliable fix. This piece covers when it helps, how to do it, and why it's often underused.

TL;DR

  • Phonemes are the atomic sound units: /k/, /æ/, /t/ in "cat."
  • Phoneme specification overrides TTS's default pronunciation.
  • Use for: names, domain terms, unusual words, specific accent variations.
  • Standard notation: IPA (International Phonetic Alphabet) or SAMPA.
  • Most TTS vendors support phoneme overrides via the SSML <phoneme> tag.

Phoneme basics

Every word decomposes into phonemes:

  • cat: /kæt/
  • happy: /ˈhæpi/
  • NovaCorp: /ˈnoʊvəˌkɔrp/

TTS models map phonemes to audio internally. By default the pipeline is text → automatic phonemization → audio. When the automatic step gets a word wrong, override it with explicit phonemes.

IPA vs SAMPA

IPA (International Phonetic Alphabet). The standard. Unicode symbols. Used by linguists.

SAMPA. ASCII-only encoding of IPA symbols, defined per language. More practical for config files.

X-SAMPA. Extended SAMPA that covers the full IPA in a single scheme.

Most TTS vendors support IPA. Some also accept X-SAMPA.
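To make the relationship between the notations concrete, here is a minimal sketch of an X-SAMPA to IPA converter. The table is deliberately partial (only the symbols needed for the examples in this piece), and the function name `xsampa_to_ipa` is made up for illustration. Multi-character X-SAMPA symbols are not handled.

```python
# Partial X-SAMPA -> IPA table. Illustrative only: it covers the symbols
# used in this article's examples, not the full X-SAMPA inventory.
XSAMPA_TO_IPA = {
    '"': "ˈ",  # primary stress
    "%": "ˌ",  # secondary stress
    ":": "ː",  # length mark
    "{": "æ",
    "@": "ə",
    "O": "ɔ",
    "U": "ʊ",
    "I": "ɪ",
    "T": "θ",
    "S": "ʃ",
    "N": "ŋ",
}


def xsampa_to_ipa(xsampa: str) -> str:
    """Convert an X-SAMPA string to IPA one character at a time.

    Characters missing from the table (plain letters such as k, t, p)
    pass through unchanged. Multi-character X-SAMPA symbols are not
    supported by this sketch.
    """
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in xsampa)


print(xsampa_to_ipa("k{t"))
print(xsampa_to_ipa('"noUv@%kOrp'))
```

The ASCII form is easier to type and diff in a config file; the IPA form is what most vendors expect on the wire.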

When to use phoneme tuning

Proper names. "Dr. Bathinda" is mispronounced by default; a phoneme override fixes it.

Product names. Brand-specific pronunciation.

Acronyms with a non-default pronunciation. "SQL" as "sequel" vs "S-Q-L."

Medical / legal terms. Specialized pronunciation.

Non-standard words. Slang, regional terms, invented words.

Foreign words with English spelling. "Hors d'oeuvre", "quinoa."

SSML phoneme tag

Standard approach:

<speak>
  Welcome to <phoneme alphabet="ipa" ph="ˈnoʊvəˌkɔrp">NovaCorp</phoneme>.
</speak>
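If the tag is generated in code rather than typed by hand, escaping matters: IPA is plain Unicode, but X-SAMPA uses characters like `"` that break a naively quoted attribute. A minimal sketch using only the Python standard library follows; `phoneme_tag` is a hypothetical helper name, not part of any vendor SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Build an SSML <phoneme> element with safely quoted attributes."""
    # quoteattr picks the quote style and escapes special characters;
    # escape protects &, < and > in the spoken fallback text.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


ssml = f"<speak>Welcome to {phoneme_tag('NovaCorp', 'ˈnoʊvəˌkɔrp')}.</speak>"

# Parsing the result is a cheap well-formedness check before sending it on.
ET.fromstring(ssml)
print(ssml)
```

The alphabet values a vendor accepts vary, so treat `alphabet="ipa"` as the safe default and check the vendor's documentation before passing anything else.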

Vendor support:

  • Simba: partial.
  • Amazon Polly: yes.
  • Google Cloud TTS: yes.
  • Azure: yes.
  • Cartesia: partial.

Pronunciation dictionary approach

Instead of per-use SSML, build a dictionary:

{
  "NovaCorp": "ˈnoʊvəˌkɔrp",
  "Bathinda": "bəˈθɪndə",
  "amoxicillin": "əˌmɒksəˈsɪlɪn"
}

Apply it as a pre-processing step before TTS. One place to maintain.
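One way the pre-TTS step could look, as a sketch: scan the outgoing text for dictionary terms and wrap each match in a `<phoneme>` tag. The function name `apply_dictionary`, the whole-word and case-insensitive matching, and the longest-term-first ordering are all assumptions of this example, not a vendor feature.

```python
import re
from xml.sax.saxutils import escape, quoteattr

PRONUNCIATIONS = {
    "NovaCorp": "ˈnoʊvəˌkɔrp",
    "Bathinda": "bəˈθɪndə",
    "amoxicillin": "əˌmɒksəˈsɪlɪn",
}


def apply_dictionary(text: str, pronunciations: dict) -> str:
    """Wrap every dictionary term found in `text` in an SSML <phoneme> tag."""
    if not pronunciations:
        return f"<speak>{escape(text)}</speak>"

    # Longest terms first so a longer entry wins over a shorter overlapping one.
    terms = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): ph for term, ph in pronunciations.items()}

    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        ph = lookup[match.group(0).lower()]
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(ph)}>'
            f"{escape(match.group(0))}</phoneme>"
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(apply_dictionary("Welcome to NovaCorp.", PRONUNCIATIONS))
```

Whole-word matching keeps "NovaCorp" from firing inside a longer token, which is usually what you want for brand names.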

Workflow

  1. Identify mispronunciations. Sample calls, listen.
  2. Look up or derive phonemes. Dictionary, IPA chart.
  3. Test. TTS with phoneme override.
  4. Add to dictionary.
  5. Redeploy.

Testing phonemes

Iterative:

  • Input text + phoneme.
  • Generate audio.
  • Listen.
  • Adjust phonemes until correct.
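The loop above is easier to run when several candidate transcriptions are rendered in one pass and compared by ear. A rough sketch follows; `synthesize` stands in for whatever vendor call turns SSML into audio bytes (it is a placeholder parameter, not a real API), and the carrier phrase and `.wav` extension are assumptions.

```python
import re
from pathlib import Path
from typing import Callable, List, Sequence, Tuple
from xml.sax.saxutils import escape, quoteattr


def render_candidates(
    word: str,
    candidates: Sequence[str],
    synthesize: Callable[[str], bytes],  # placeholder for a vendor TTS call
    out_dir: str,
    before: str = "Welcome to ",
    after: str = ".",
    ext: str = ".wav",
) -> List[Tuple[Path, str]]:
    """Render one audio file per candidate IPA string for side-by-side listening."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = re.sub(r"\W+", "_", word)  # keep file names filesystem-safe

    results = []
    for i, ph in enumerate(candidates, start=1):
        ssml = (
            f"<speak>{escape(before)}"
            f'<phoneme alphabet="ipa" ph={quoteattr(ph)}>{escape(word)}</phoneme>'
            f"{escape(after)}</speak>"
        )
        path = out / f"{stem}_{i:02d}{ext}"
        path.write_bytes(synthesize(ssml))
        results.append((path, ph))
    return results
```

Embedding the word in a carrier phrase matters: a term heard in isolation often sounds different from the same term mid-sentence.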

Getting IPA right sometimes needs a native speaker's ear.

Accent variations

Phonemes can encode accent:

  • American "cat": /kæt/.
  • British "cat": /kat/.
  • Subtle but audible.

For specific accents, phoneme-level control is the cleanest path.
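In a dictionary, accent variants can be keyed by locale. A minimal sketch, assuming BCP 47 style locale tags and a fallback to a default locale; the table layout and the name `pronunciation_for` are illustrative choices.

```python
from typing import Dict, Optional

# Per-word variants keyed by locale, using the transcriptions from above.
ACCENT_PRONUNCIATIONS: Dict[str, Dict[str, str]] = {
    "cat": {"en-US": "kæt", "en-GB": "kat"},
}


def pronunciation_for(
    word: str,
    locale: str,
    table: Dict[str, Dict[str, str]],
    default_locale: str = "en-US",
) -> Optional[str]:
    """Return the variant for `locale`, else the default locale's, else None."""
    variants = table.get(word, {})
    return variants.get(locale) or variants.get(default_locale)


print(pronunciation_for("cat", "en-GB", ACCENT_PRONUNCIATIONS))
print(pronunciation_for("cat", "en-AU", ACCENT_PRONUNCIATIONS))
```

Returning `None` for unknown words lets the caller fall through to the engine's automatic phonemization.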

Limitations

  • Not every TTS engine supports phoneme input. Check your vendor.
  • Tonal languages (Mandarin): phoneme tuning more complex.
  • Emotional prosody: not encoded in phonemes.
  • Dialectal variations: may need full model retraining.

Phonemization rules

Automatic text-to-phoneme is imperfect:

  • English: ~95% accurate on common words, drops on rare.
  • Spanish: more phonetic, easier.
  • Languages with complex orthography: French, Danish, English — harder.

Auto phonemization fails most on names and rare words — exactly where manual override helps.

Domain dictionaries by vertical

Medical: Drug names, procedures, anatomy. Thousands of domain-specific pronunciations.

Legal: Latin terms, case names, statutes.

Technical: Product names, technologies, acronyms.

Financial: Ticker symbols, company names, instrument types.

Build domain dictionaries once; reuse across deployments.
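Reuse gets simpler when dictionaries are layered rather than copied. A sketch using the standard library's `ChainMap`, where earlier layers take precedence; the layer names and the deployment-level override are made-up examples.

```python
from collections import ChainMap

# Shared layers, built once and reused.
BASE = {"NovaCorp": "ˈnoʊvəˌkɔrp"}
MEDICAL = {"amoxicillin": "əˌmɒksəˈsɪlɪn"}

# Hypothetical per-deployment override, e.g. for a US-accented voice.
US_CLINIC_OVERRIDES = {"amoxicillin": "əˌmɑksəˈsɪlɪn"}


def build_lookup(*layers: dict) -> ChainMap:
    """Combine dictionaries; the first layer that has a term wins."""
    return ChainMap(*layers)


lookup = build_lookup(US_CLINIC_OVERRIDES, MEDICAL, BASE)
print(lookup["amoxicillin"])
print(lookup["NovaCorp"])
```

Because `ChainMap` holds references instead of copies, a fix to the shared medical layer reaches every deployment that includes it.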

Automatic phoneme derivation

Tools:

  • CMU Pronouncing Dictionary. For common English. Entries use ARPABET, so convert them to IPA before use.
  • Phonetizer / G2P (grapheme-to-phoneme) models. Predict phonemes from spelling.
  • Online IPA generators. Quick lookups.

For the first pass, automated tools help. Human verification for critical terms.
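The CMU Pronouncing Dictionary writes pronunciations in ARPABET (for example `HH AE1 P IY0` for "happy"), so a first pass usually needs a conversion to IPA. The sketch below is a simplification: it places each stress mark at the start of the consonant run before the stressed vowel, which is a rough heuristic rather than real syllabification. The function name is made up for this example.

```python
ARPABET_TO_IPA = {
    # Vowels (stress digit stripped before lookup)
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    # Consonants
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}
UNSTRESSED = {"AH": "ə", "ER": "ɚ"}   # reduced forms when the stress digit is 0
STRESS_MARKS = {"1": "ˈ", "2": "ˌ"}   # primary, secondary


def arpabet_to_ipa(phones: str) -> str:
    """Convert a space-separated ARPABET string to an IPA string."""
    symbols = []
    last_vowel = -1  # index in `symbols` of the most recent vowel
    for phone in phones.split():
        if phone[-1].isdigit():  # vowels carry a stress digit
            base, stress = phone[:-1], phone[-1]
            ipa = ARPABET_TO_IPA[base]
            if stress == "0":
                ipa = UNSTRESSED.get(base, ipa)
            mark = STRESS_MARKS.get(stress)
            if mark:
                # Heuristic: mark goes before the consonants preceding this vowel.
                symbols.insert(last_vowel + 1, mark)
            symbols.append(ipa)
            last_vowel = len(symbols) - 1
        else:
            symbols.append(ARPABET_TO_IPA[phone])
    return "".join(symbols)


print(arpabet_to_ipa("HH AE1 P IY0"))
# Hand-written ARPABET for a brand name; it would not be in the dictionary.
print(arpabet_to_ipa("N OW1 V AH0 K AO2 R P"))
```

The heuristic over-assigns consonants to the stressed syllable in words with medial clusters, which is one more reason to listen before committing an entry.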

Performance considerations

  • Phoneme specification doesn't add meaningful latency.
  • Dictionary lookup is fast.
  • SSML parsing is real-time.

No performance cost; pure quality win.

When not to bother

  • Common words (TTS gets right).
  • Demo / informal use.
  • Short-lived content.
  • Vendor doesn't support phonemes (limited options then).

The phoneme quality ceiling

Even with perfect phonemes, TTS may:

  • Vary speed inappropriately.
  • Miss subtle emphasis.
  • Apply wrong intonation.

Phonemes fix pronunciation, not prosody. Layer SSML for prosody separately.
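As a sketch of that layering, the phoneme override can sit inside a standard SSML `<prosody>` element: the inner tag fixes the sounds, the outer one adjusts delivery. The helper names are hypothetical, and whether a given vendor honors prosody around a phoneme override is something to verify by listening.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Pronunciation layer: which sounds to produce."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


def prosody(inner_ssml: str, rate: Optional[str] = None, pitch: Optional[str] = None) -> str:
    """Prosody layer: how to deliver already-built SSML (passed through unescaped)."""
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in (("rate", rate), ("pitch", pitch))
        if value
    )
    return f"<prosody{attrs}>{inner_ssml}</prosody>"


ssml = (
    "<speak>Welcome to "
    + prosody(phoneme("NovaCorp", "ˈnoʊvəˌkɔrp"), rate="slow")
    + ".</speak>"
)
ET.fromstring(ssml)  # well-formedness check
print(ssml)
```

Keeping the two helpers separate mirrors the point above: a pronunciation fix and a delivery tweak can be changed independently.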

Common pitfalls

Ignoring pronunciation issues. Callers hear "nova-korp" instead of "noh-vuh-korp." Brand damage.

Wrong phoneme alphabet. Vendor expects IPA; you provide SAMPA. The override is typically ignored or rejected.

Over-engineering. Add phonemes for every word. Unnecessary.

No dictionary maintenance. Phonemes set once; never reviewed.

Native vs. native-inspired pronunciations. Accents and preferred pronunciations shift over time. Re-review periodically.
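The wrong-alphabet pitfall can be caught before deployment with a simple lint over the dictionary. The check below is a heuristic sketch: it flags ASCII characters that X-SAMPA relies on but standard IPA does not use (uppercase letters and a few punctuation marks). The character set and function name are assumptions of this example.

```python
import string

# ASCII characters common in (X-)SAMPA that standard IPA does not use.
# IPA small capitals such as ɪ and ʊ are separate Unicode code points,
# so a plain uppercase ASCII letter in an "ipa" entry is a red flag.
XSAMPA_ONLY = set(string.ascii_uppercase) | set('{}@"%\\_`')


def suspicious_ipa_chars(ph: str) -> list:
    """Return the sorted characters in `ph` that suggest it is not IPA."""
    return sorted({ch for ch in ph if ch in XSAMPA_ONLY})


def lint_dictionary(entries: dict) -> dict:
    """Map each suspicious term to the characters that triggered the warning."""
    return {
        term: bad
        for term, ph in entries.items()
        if (bad := suspicious_ipa_chars(ph))
    }


print(lint_dictionary({
    "NovaCorp": "ˈnoʊvəˌkɔrp",   # IPA: passes
    "cat": "k{t",                # X-SAMPA by mistake: flagged
}))
```

Run as part of dictionary review, a check like this turns a silent mispronunciation into a visible warning.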

FAQ

Is phoneme tuning vendor-specific? Mostly. IPA is standard; support varies.

Can we use emoji or special characters? No — phonemes are specific IPA symbols.

What about singing/prosodic extremes? Beyond phonemes. Typically outside voice agent scope.

Do phonemes work for all languages? Most. Tonal languages are more complex.

How often should we update phonemes? Quarterly review; immediate fix for reported issues.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
