How Voice Agents Handle Accents and Dialects
Voice AI is great at standard American English. It's pretty good at standard British, Australian, and Indian English. It's variably good at everything else. If your customers speak with strong regional accents, code-switch between languages, or use dialects underrepresented in training data, you have to engineer for it. This is what actually works.
TL;DR
- Modern STT handles most major English accents well: Word Error Rate (WER) of roughly 10% or lower on Indian, British, Australian, southern US, and AAVE speech.
- The hard cases are heavy regional accents (Scottish, Cajun, deep AAVE) and code-switching across languages.
- TTS accent quality varies by provider and language pairing, so test before committing.
- The single most-underused fix: custom STT vocabularies tuned for your audience.
What "accent" means for STT
When we say "the STT handles accents well," we mean: the WER on accented speech is close to the WER on the model's training distribution. For most modern English STT systems, this means:
- General American: ~3–5% WER
- British (RP): ~4–6%
- Indian English: ~6–9%
- Australian: ~5–7%
- AAVE: ~7–10%
- Heavy regional (Scottish, Cajun, certain Caribbean): 12–20%
Below 10% is workable. Above 15% starts to break the conversation.
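To make those numbers concrete, here is a minimal sketch of how WER is usually computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function names and the "borderline" label for the 10–15% gap are my own for illustration, and the normalization is deliberately naive (lowercase and whitespace split only); real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_band(wer: float) -> str:
    """Map a WER onto the rough bands used in this article."""
    if wer < 0.10:
        return "workable"
    if wer > 0.15:
        return "conversation-breaking"
    return "borderline"  # assumed label for the 10-15% gap


wer = word_error_rate(
    "send me the order status please",
    "send me the older status",
)
print(f"{wer:.2f}", wer_band(wer))
```

Note that a single short utterance can swing WER wildly (two errors in six words is already 33%), so measure over a batch of calls, not one sentence.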
Where things fall apart
Three patterns where even good STT struggles:
Code-switching. Speakers who mix two languages mid-sentence ("envíame el order status please") confuse most monolingual STT systems. The fix is multilingual STT (which has its own accuracy hits).
Heavy regional accents in low-resource regions. A first-generation Hmong speaker in Minneapolis. A Krio speaker in Sierra Leone. STT was trained on too little of this audio.
Disfluencies typical of a region. Some accents have characteristic filler and tag patterns (for example, "isn't it?" at the end of sentences in Indian English) that confuse endpointers.
What to do about it
Mitigations in order of impact:
Custom vocabulary tuned for your audience. If your customer base is heavily Indian English, bias the STT toward Indian English place names, common phrases, and number patterns. Real WER reduction.
Multilingual STT. If your audience code-switches, use a multilingual model. Whisper, Deepgram, and AssemblyAI all have multilingual options. Per-language accuracy drops slightly (WER goes up a little), but code-switching cases improve dramatically.
Slower endpointer for accent-prone audiences. Some accents have longer pauses mid-sentence. A flat 600ms endpointer threshold can be too aggressive. Tune by region.
Provider testing. STT providers vary a lot on edge accents. Test with real audio from your audience. Don't trust marketing benchmarks.
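The custom-vocabulary idea at the top of that list can be pictured without tying it to any one provider. If your STT offers phrase boosting, use that. The sketch below shows the same idea applied after the fact: rescoring an n-best list so that hypotheses containing your audience's terms win ties. The `Hypothesis` type, the score scale, and the `boost_per_hit` value are all assumptions for illustration, not any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    text: str
    score: float  # hypothetical: higher means the STT is more confident


def rescore_with_vocabulary(
    hypotheses: list[Hypothesis],
    boost_terms: set[str],
    boost_per_hit: float = 0.1,  # illustrative value; tune on real audio
) -> Hypothesis:
    """Return the hypothesis with the best score after boosting known terms."""
    terms = {term.lower() for term in boost_terms}

    def boosted(hypothesis: Hypothesis) -> float:
        hits = sum(1 for word in hypothesis.text.lower().split() if word in terms)
        return hypothesis.score + boost_per_hit * hits

    return max(hypotheses, key=boosted)


# Made-up n-best list for a caller naming an Indian city.
nbest = [
    Hypothesis("i am calling from coin better", 0.62),
    Hypothesis("i am calling from coimbatore", 0.58),
]
best = rescore_with_vocabulary(nbest, {"Coimbatore", "Bengaluru", "lakh"})
print(best.text)
```

The boost is a blunt instrument: set it too high and the agent starts hearing your place names everywhere, so check WER on a held-out set before and after.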
What to do on output (TTS)
If your callers are accustomed to a specific accent, your agent should match. A few approaches:
Pick a TTS voice in the matching accent. Most TTS providers offer voices in major accents. Simba has Indian, British, Australian, and South African English voices, for example.
Voice cloning from a brand voice actor. If you have a brand voice in the target accent, clone it. Most natural option.
Test prosody on edge cases. A British TTS might mispronounce American place names; an Indian TTS might struggle with Spanish loan words. Test.
Multilingual specifically
Most "multilingual support" claims are about TTS and STT availability, not full agent quality. Real multilingual support requires:
Multilingual STT that handles your target languages (most major providers do well on top 10–20 languages).
Multilingual TTS with native-speaker quality voices in each language.
Multilingual LLM that responds fluently in each language. All major hosted LLMs do well on top 20 languages; quality drops in lower-resource languages.
Per-language prompts. Direct translation of your English prompt usually doesn't work: register, style, and politeness conventions differ by language. Write a separate prompt per language with help from a native speaker.
For more, see multilingual TTS: choosing a voice model.
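The per-language requirements above can be wired together as a small routing table: one entry per language holding its own prompt and voice, with a fallback when the detected language isn't supported. This is a sketch under assumptions. The `LanguageConfig` type, the voice identifiers, and the prompt text are placeholders I made up, and the language tag is assumed to come from whatever language detection your STT or a separate classifier provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    system_prompt: str  # written per language, not machine-translated
    tts_voice: str      # hypothetical voice identifier


# Placeholder content; real prompts should be reviewed by a native speaker.
CONFIGS = {
    "en": LanguageConfig(
        system_prompt="You are a support agent. Be concise and friendly.",
        tts_voice="en-voice-placeholder",
    ),
    "es": LanguageConfig(
        system_prompt="Eres un agente de soporte. Trata al cliente de usted y sé breve.",
        tts_voice="es-voice-placeholder",
    ),
}
DEFAULT_LANGUAGE = "en"


def route(detected_language: str) -> LanguageConfig:
    """Pick the config for a detected language tag such as 'es' or 'es-MX'."""
    base = detected_language.lower().split("-")[0]
    return CONFIGS.get(base, CONFIGS[DEFAULT_LANGUAGE])


print(route("es-MX").tts_voice)
print(route("km").tts_voice)  # unsupported language falls back to the default
```

Falling back silently to English is a design choice, not a given. For some audiences, handing off to a human is the better default.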
What to test before launch
Before deploying to an accented audience:
- Collect 50 sample calls from real customers in the target accent and measure WER, transcribing by hand if needed.
- Test the agent's pronunciation of common names and places in your audience's region.
- Check the endpointer behavior: does it cut off mid-sentence for slower-paced speakers?
- Verify the LLM's responses don't accidentally adopt American idioms that won't land.
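The endpointer check in that list is easy to quantify once you have pause timings from your sample calls. Here is a minimal sketch, assuming you can extract the mid-sentence pause lengths for each utterance (for example, from word timestamps); the function name and the sample numbers are made up for illustration.

```python
def cutoff_rate(mid_sentence_pauses_ms: list[list[int]], threshold_ms: int) -> float:
    """Fraction of utterances with at least one pause at or above the threshold.

    Each inner list holds the mid-sentence pause lengths (in ms) for one
    utterance. Any pause that reaches the threshold would end the turn early.
    """
    if not mid_sentence_pauses_ms:
        raise ValueError("need at least one utterance")
    cut = sum(
        1
        for pauses in mid_sentence_pauses_ms
        if any(pause >= threshold_ms for pause in pauses)
    )
    return cut / len(mid_sentence_pauses_ms)


# Made-up pause data for five utterances from slower-paced speakers.
utterances = [[200, 350], [700], [], [450, 650], [300]]
for threshold in (600, 800, 1000):
    print(threshold, cutoff_rate(utterances, threshold))
```

A higher threshold cuts off fewer callers but adds that much silence before every agent reply, so pick the lowest value that keeps the cut-off rate acceptable for your audience.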
Common bad patterns
A few mistakes I've seen repeatedly:
Ignoring the issue and hoping for the best. Default settings are tuned for General American. They will underperform on Indian English without help.
Assuming "multilingual" means "production-ready in language X." It usually means "can transcribe X with degraded accuracy."
Forcing a US accent on a non-US audience. Your callers feel less comfortable; CSAT drops.
Not testing on real customer audio. Lab benchmarks don't capture noisy phone audio with regional accents.
What's improving
The next two years should see:
- Better low-resource language support (Khmer, Swahili, Tagalog, etc.)
- More accent options in TTS without quality penalty
- Better code-switching handling in major STT systems
- More region-specific endpointer tuning
But the gap between "default English" and "your specific dialect" will persist for a long time. Engineering around it remains a real piece of work.
Related reading
- The Anatomy of a Voice Agent Pipeline
- How a Conversational Voice Agent Actually Works (Under the Hood)
- The Hidden Complexity of Numbers in Voice Agents
- How Voice Agents Recover from Misunderstandings
- What Is a Voice Agent? A 2026 Primer
FAQ
Will my agent work for callers from Scotland? Probably not great out of the box. Test with real audio; expect to tune. Strong Scottish accents are still hard.
What about callers who switch between English and Spanish mid-call? Use a multilingual STT model (e.g., Whisper multilingual, Deepgram Nova multilingual). Expect slightly higher WER on each language but better overall handling.
Should I have separate agents per language? For most multilingual deployments, yes: different prompts per language with a router that detects language at the start.
Does TTS sound less natural in non-English languages? Quality varies by provider and language. Simba Multilingual v2 is very good across 30+ languages. Other providers have stronger ranges in different languages.
How important is this for B2B vs B2C? B2C tends to have wider accent diversity; B2B tends to be more standardized. Both deserve attention if your customer base spans regions.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
