🎙️ Voice AI Fundamentals

How Voice Agents Handle Accents and Dialects

Tyler Weitzman
January 13, 2026 · 5 min read
Speechify

Voice AI is great at standard American English. It's pretty good at standard British, Australian, and Indian English. For everything else, quality varies widely. If your customers speak with strong regional accents, code-switch between languages, or use dialects underrepresented in training data, you have to engineer for it. Here is what actually works.

TL;DR

  • Modern STT handles most major English accents well: Word Error Rate (WER) is typically at or below about 10% on Indian, British, Australian, southern US, and AAVE speech.
  • The hard cases are heavy regional accents (Scottish, Cajun, deep AAVE) and code-switching across languages.
  • TTS accent quality varies by provider and language pairing, so test before committing.
  • The single most-underused fix: custom STT vocabularies tuned for your audience.

What "accent" means for STT

When we say "the STT handles accents well," we mean that the word error rate (WER) on accented speech is close to the WER on the model's training distribution. For most modern English STT systems, typical figures are:

  • General American: ~3–5% WER
  • British (RP): ~4–6%
  • Indian English: ~6–9%
  • Australian: ~5–7%
  • AAVE: ~7–10%
  • Heavy regional (Scottish, Cajun, certain Caribbean): ~12–20%

Below 10% is workable. Above 15% starts to break the conversation.
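
As a rough illustration of where those percentages come from: WER is the number of word substitutions, deletions, and insertions needed to turn the STT output into the reference transcript, divided by the number of words in the reference. The Python sketch below assumes simple lowercase, whitespace-split tokenization (real evaluations usually also normalize punctuation and numbers), and both function names are hypothetical. The "borderline" label for the 10% to 15% band is an assumption added here; the two thresholds are the ones stated above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level edit distance, computed one row at a time.
    # prev[j] holds the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def conversation_impact(wer: float) -> str:
    """Map a WER to the rough bands used in this article."""
    if wer < 0.10:
        return "workable"
    if wer > 0.15:
        return "breaks conversation"
    return "borderline"  # assumed label; the article does not name this band


# One deleted word ("the") and one substitution ("status" -> "states")
# against a six-word reference.
wer = word_error_rate(
    "please send me the order status",
    "please send me order states",
)
print(f"{wer:.1%}", conversation_impact(wer))  # 33.3% breaks conversation
```

A single short utterance like this swings wildly; the bands only become meaningful when WER is measured over a reasonable volume of audio.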

Where things fall apart

Three patterns where even good STT struggles:

Code-switching. Speakers who mix two languages mid-sentence ("envíame el order status please") confuse most monolingual STT systems. The fix is multilingual STT (which has its own accuracy hits).

Heavy accents from low-resource languages and regions. Think of a first-generation Hmong speaker in Minneapolis, or a Krio speaker in Sierra Leone. STT models were trained on too little of this audio.

Speech patterns typical of a region. Some accents have characteristic fillers and sentence tags ("isn't it?" at the end of sentences in Indian English) that confuse endpointers.

What to do about it

Mitigations in order of impact:

Custom vocabulary tuned for your audience. If your customer base is heavily Indian English, bias the STT toward Indian English place names, common phrases, and number patterns. This yields a real WER reduction on the terms your callers actually use.
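
A minimal Python sketch of one way to assemble such a vocabulary, assuming you already have text your audience produces (past transcripts, tickets, address records) and a set of everyday words to exclude. The function name, the sample tickets, and the tiny `common` set are all hypothetical; a real exclusion list would be a full general-English word list.

```python
import re
from collections import Counter


def build_boost_list(domain_texts, common_words, min_count=2, max_terms=100):
    """Return frequent domain terms that are not everyday vocabulary."""
    counts = Counter()
    for text in domain_texts:
        # Tokens are runs of letters/digits, optionally joined by ' or -.
        for token in re.findall(r"[^\W_]+(?:['-][^\W_]+)*", text.lower()):
            if token not in common_words and not token.isdigit():
                counts[token] += 1

    frequent = [(term, n) for term, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically so the result is stable.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in frequent[:max_terms]]


# Hypothetical sample data for an Indian English customer base.
tickets = [
    "Deliver to Koramangala, Bengaluru by Friday",
    "Customer in Bengaluru paid two lakh via UPI",
    "Refund of one lakh to UPI account, Koramangala branch",
]
common = {
    "deliver", "to", "by", "friday", "customer", "in", "paid", "two",
    "via", "refund", "of", "one", "account", "branch",
}

boost_terms = build_boost_list(tickets, common)
print(boost_terms)  # ['bengaluru', 'koramangala', 'lakh', 'upi']
```

The resulting list is what you would hand to your STT provider's vocabulary-biasing feature. The name and shape of that feature differ by provider, so check the provider's documentation rather than assuming a parameter name.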

Multilingual STT. If your audience code-switches, use a multilingual model. Whisper, Deepgram, and AssemblyAI all have multilingual options. Per-language WER rises slightly, but code-switching cases improve dramatically.

A more patient endpointer for slower-paced speakers. Some accents have longer pauses mid-sentence, so a flat 600 ms endpointer threshold can be too aggressive and cut callers off before they finish. Tune by region.
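
One way to tune by region, sketched in Python under stated assumptions: you have measured mid-sentence pause lengths (in milliseconds) from real calls for each region, and your endpointer exposes a single silence threshold. The function name, the 95% coverage target, the 50 ms margin, and the 1,500 ms ceiling are hypothetical starting points, not recommendations from any provider; only the 600 ms default comes from the text above. The region labels and pause values are made up for illustration.

```python
import math

DEFAULT_THRESHOLD_MS = 600  # the flat threshold discussed above


def tuned_endpoint_threshold(
    mid_sentence_pauses_ms,
    coverage=0.95,      # assumed: share of observed pauses we refuse to cut off
    margin_ms=50,       # assumed: safety margin on top of the chosen pause
    floor_ms=DEFAULT_THRESHOLD_MS,
    ceiling_ms=1500,    # assumed: cap so the agent does not feel unresponsive
):
    """Pick a silence threshold that outlasts most observed mid-sentence pauses."""
    if not mid_sentence_pauses_ms:
        return floor_ms
    ordered = sorted(mid_sentence_pauses_ms)
    # Nearest-rank percentile: the smallest observed pause such that at least
    # `coverage` of all pauses are at or below it.
    rank = max(1, math.ceil(coverage * len(ordered)))
    candidate = ordered[rank - 1] + margin_ms
    return int(min(max(candidate, floor_ms), ceiling_ms))


# Hypothetical measurements; replace with pauses from your own call audio.
pauses_by_region = {
    "region_a": [100, 200, 300],
    "region_b": [200, 300, 350, 400, 450, 500, 550, 700, 800, 900],
}
thresholds = {
    region: tuned_endpoint_threshold(pauses)
    for region, pauses in pauses_by_region.items()
}
print(thresholds)  # {'region_a': 600, 'region_b': 950}
```

The floor keeps fast talkers on the existing default, while the ceiling reflects the trade-off that a longer threshold adds latency to every turn.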

Provider testing. STT providers vary a lot on edge accents. Test with real audio from your audience. Don't trust marketing benchmarks.
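
A small Python sketch of how such a comparison might be tabulated, assuming each test call has already been transcribed by each candidate provider and scored against a human reference (error count and reference word count per call). The provider names, accent labels, and numbers are placeholders, not measurements.

```python
from collections import defaultdict


def wer_by_provider_and_accent(results):
    """Aggregate per-call error counts into one WER per (provider, accent).

    Uses total errors / total reference words, so long calls weigh more
    than short ones, rather than averaging per-call WERs.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [errors, reference words]
    for r in results:
        key = (r["provider"], r["accent"])
        totals[key][0] += r["errors"]
        totals[key][1] += r["ref_words"]
    return {
        key: errors / words
        for key, (errors, words) in totals.items()
        if words > 0
    }


def best_provider_per_accent(wer_table):
    """Return {accent: (provider, wer)} for the lowest-WER provider per accent."""
    best = {}
    for (provider, accent), wer in wer_table.items():
        if accent not in best or wer < best[accent][1]:
            best[accent] = (provider, wer)
    return best


# Placeholder results: two hypothetical providers, two accent groups.
results = [
    {"provider": "provider_a", "accent": "scottish", "errors": 12, "ref_words": 100},
    {"provider": "provider_a", "accent": "scottish", "errors": 8, "ref_words": 100},
    {"provider": "provider_b", "accent": "scottish", "errors": 30, "ref_words": 100},
    {"provider": "provider_b", "accent": "scottish", "errors": 10, "ref_words": 100},
    {"provider": "provider_a", "accent": "general_american", "errors": 5, "ref_words": 100},
    {"provider": "provider_b", "accent": "general_american", "errors": 3, "ref_words": 100},
]

table = wer_by_provider_and_accent(results)
for accent, (provider, wer) in sorted(best_provider_per_accent(table).items()):
    print(f"{accent}: {provider} ({wer:.0%})")
```

In this made-up data the provider that wins on General American loses on the harder accent, which is the pattern the per-accent breakdown is meant to surface.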

What to do on output (TTS)

If your callers are accustomed to a specific accent, your agent should match. A few approaches:

Pick a TTS voice in the matching accent. Most TTS providers offer voices in major accents. Simba has Indian, British, Australian, and South African English voices, for example.

Voice cloning from a brand voice actor. If you have a brand voice in the target accent, clone it. Most natural option.

Test pronunciation on edge cases. A British TTS voice might mispronounce American place names; an Indian English TTS voice might struggle with Spanish loan words. Listen to real samples before you commit.

Multilingual specifically

Most "multilingual support" claims are about TTS and STT availability, not full agent quality. Real multilingual support requires:

Multilingual STT that handles your target languages (most major providers do well on top 10–20 languages).

Multilingual TTS with native-speaker quality voices in each language.

Multilingual LLM that responds fluently in each language. All major hosted LLMs do well on top 20 languages; quality drops in lower-resource languages.

Per-language prompts. Direct translation of your English prompt usually doesn't work, because register, style, and politeness conventions differ by language. Write a separate prompt per language with help from a native speaker.
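
The four requirements above can be sketched as one configuration per language plus a small router, shown here in Python. Everything named in the block is hypothetical: the `AgentConfig` fields, the placeholder model and voice identifiers, the 0.7 confidence cutoff, and the prompts (which a native speaker should write for real use). Language detection itself is assumed to come from your STT provider or a separate detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    language: str
    stt_model: str      # placeholder identifier, not a real model name
    tts_voice: str      # placeholder identifier, not a real voice name
    system_prompt: str  # written per language, not machine-translated


AGENTS = {
    "en": AgentConfig(
        "en",
        "stt-en-placeholder",
        "voice-en-placeholder",
        "You are a concise, friendly support agent.",
    ),
    "es": AgentConfig(
        "es",
        "stt-multilingual-placeholder",
        "voice-es-placeholder",
        # Illustrative only: note the explicit politeness register (usted).
        "Eres un agente de soporte amable. Trata al cliente de usted.",
    ),
}


def route(detected_language, confidence, default="en", min_confidence=0.7):
    """Pick the agent for a detected language tag such as 'es' or 'es-MX'.

    Falls back to the default agent when detection is unsure or the
    language has no dedicated configuration.
    """
    code = detected_language.lower().split("-")[0]
    if confidence >= min_confidence and code in AGENTS:
        return AGENTS[code]
    return AGENTS[default]


print(route("es-MX", 0.92).tts_voice)  # voice-es-placeholder
print(route("fr", 0.99).language)      # en (no French agent configured)
```

Keeping the prompt inside the per-language record makes it harder to ship a language with a translated English prompt by accident.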

For more, see multilingual TTS: choosing a voice model.

What to test before launch

Before deploying to an accented audience:

  • Collect 50 sample calls from real customers in the target accent and measure WER, manually if needed.
  • Test the agent's pronunciation of common names and places in your audience's region.
  • Check the endpointer behavior: does it cut off mid-sentence for slower-paced speakers?
  • Verify the LLM's responses don't accidentally adopt American idioms that won't land.

Common bad patterns

A few mistakes I've seen repeatedly:

Ignoring the issue and hoping for the best. Default settings are tuned for General American. Without help they will do noticeably worse on Indian English, and worse still on heavier regional accents.

Assuming "multilingual" means "production-ready in language X." It usually means "can transcribe X with degraded accuracy."

Forcing a US accent on a non-US audience. Your callers feel less comfortable; CSAT drops.

Not testing on real customer audio. Lab benchmarks don't capture noisy phone audio with regional accents.

What's improving

The next two years should see:

  • Better low-resource language support (Khmer, Swahili, Tagalog, etc.)
  • More accent options in TTS without quality penalty
  • Better code-switching handling in major STT systems
  • More region-specific endpointer tuning

But the gap between "default English" and "your specific dialect" will persist for a long time. Engineering around it remains a real piece of work.

FAQ

Will my agent work for callers from Scotland? Probably not great out of the box. Test with real audio; expect to tune. Strong Scottish accents are still hard.

What about callers who switch between English and Spanish mid-call? Use a multilingual STT model (e.g., Whisper multilingual, Deepgram Nova multilingual). Expect slightly higher WER on each language but better overall handling.

Should I have separate agents per language? For most multilingual deployments, yes: use different prompts per language, with a router that detects the language at the start of the call.

Does TTS sound less natural in non-English languages? Quality varies by provider and language. Simba Multilingual v2 is very good across 30+ languages. Other providers have stronger ranges in different languages.

How important is this for B2B vs B2C? B2C tends to have wider accent diversity; B2B tends to be more standardized. Both deserve attention if your customer base spans regions.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that serves millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
