Multilingual TTS: Choosing a Voice Model

Tyler Weitzman
March 14, 2026 · 4 min read
Speechify

Multilingual text-to-speech in 2026 is good but uneven. English is excellent. Spanish, French, German, Mandarin, and Japanese are strong. Beyond the top 10 languages, quality drops noticeably. Choosing a TTS voice model for a multilingual voice agent means balancing quality, coverage, consistency, and latency, with specific tradeoffs per language and per vendor. This piece covers how to pick.

TL;DR

  • Pick based on the languages you need, quality in each, latency, and cost.
  • Top vendors for multilingual: Google, Simba, Azure, Cartesia.
  • Consider single-voice-multi-language vs per-language voices.
  • Regional accents matter: Mexican Spanish vs. Castilian Spanish.
  • Test with native speakers in each language.

The multilingual landscape

Languages supported well (2026):

  • Tier 1 (excellent): English, Spanish, French, German, Italian, Portuguese, Mandarin, Japanese, Korean.
  • Tier 2 (good): Dutch, Swedish, Russian, Arabic, Hindi, Turkish, Polish.
  • Tier 3 (decent): Vietnamese, Thai, Indonesian, Hebrew, Filipino.
  • Tier 4 (basic): many smaller languages.

Quality in each tier largely tracks how much training data is available for the language.

Vendor comparison

Google Cloud TTS. Widest language coverage. WaveNet and Neural2 voices. Industry-leading for rare languages.

Simba. High quality, but limited multilingual coverage (focused on major languages). Supports voice cloning across languages.

Azure AI Speech. Enterprise-focused. Strong on European languages. Neural voices.

OpenAI (Realtime/Audio). Good multilingual quality in the Realtime API; fewer voice options.

Cartesia. Growing multilingual support; low latency.

Amazon Polly. Good baseline; large language coverage.

Single voice vs per-language

Single voice, multi-language (zero-shot): One voice identity speaks any supported language. Offered by Simba and some emerging systems.

Pros: consistent brand. Cons: accent may be off in non-native language.

Per-language voices: A separate voice per language, each chosen for high quality in that language.

Pros: native sound per language. Cons: brand inconsistency.

For voice agents serving multiple languages, most deployments use per-language voices optimized for each.

Regional accents

Spanish:

  • Latin American (Mexico, Colombia, etc.).
  • Spain Spanish (Castilian).
  • Caribbean Spanish.

Chinese:

  • Mandarin (default).
  • Cantonese.

English:

  • US.
  • UK.
  • Australian.
  • Indian English.

Pick the accent appropriate for your audience.
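One way to encode per-language and per-region voice choices is a lookup that falls back from region to language to a default. The sketch below is illustrative only: the voice IDs, the `VOICES` table, and `pick_voice` are hypothetical names, not any vendor's API.

```python
# Hypothetical voice IDs; real IDs depend on the vendor you choose.
VOICES = {
    "es-MX": "voice-es-mx-01",
    "es-ES": "voice-es-es-01",
    "es": "voice-es-mx-01",   # assumed default Spanish for this audience
    "en-US": "voice-en-us-01",
    "en-GB": "voice-en-gb-01",
    "en": "voice-en-us-01",
}


def pick_voice(locale: str, default: str = "en-US") -> str:
    """Return the voice for a locale, falling back region -> language -> default."""
    if locale in VOICES:
        return VOICES[locale]
    language = locale.split("-")[0]
    if language in VOICES:
        return VOICES[language]
    return VOICES[default]
```

With this table, a Colombian caller (`es-CO`) would get the Latin American default rather than the Castilian voice, which is the kind of regional decision worth making explicitly.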

Quality benchmarking

Benchmark per language:

  • Native speaker listening tests.
  • Blind A/B against a human recording.
  • Domain-specific content.

Don't rely on vendor claims; test yourself.
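A blind A/B test can be summarized per language as the share of listeners who preferred the synthetic sample. The following is a minimal sketch; the vote labels (`"tts"`, `"human"`, `"tie"`), the half-credit treatment of ties, and the sample votes are all assumptions made for illustration.

```python
from collections import Counter


def preference_rate(votes: list[str], candidate: str = "tts") -> float:
    """Share of blind A/B votes that preferred `candidate`; ties count as half."""
    if not votes:
        raise ValueError("no votes")
    counts = Counter(votes)
    return (counts[candidate] + 0.5 * counts["tie"]) / len(votes)


# Made-up votes from native-speaker listeners, one list per language.
votes_by_language = {
    "es-MX": ["tts", "human", "tie", "human"],
    "de-DE": ["tts", "tts", "human", "tie"],
}
rates = {lang: preference_rate(v) for lang, v in votes_by_language.items()}
```

A rate near 0.5 suggests listeners could not reliably tell the voice from the human recording; a rate well below that for one language is a signal to try a different voice or vendor there.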

Latency by language

  • English: typically the lowest latency, since it is the most optimized.
  • Other Tier 1 languages: slightly higher (10-30 ms more).
  • Less-common languages: variable.

For a multilingual deployment, factor per-language latency into the choice.
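To compare latency across languages, it helps to look at percentiles rather than averages, since tail latency is what callers notice. This sketch uses a nearest-rank percentile; the measurements are made up, and how you collect time-to-first-audio depends on your stack.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative time-to-first-audio measurements in milliseconds (made up).
ttfa_ms = {
    "en-US": [180, 190, 200, 210, 400],
    "ja-JP": [200, 215, 225, 240, 520],
}
report = {
    lang: {"p50": percentile(s, 50), "p95": percentile(s, 95)}
    for lang, s in ttfa_ms.items()
}
```

Running the same script per vendor and per language gives a like-for-like table to weigh against the quality results.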

Cost by language

Usually consistent across languages per vendor. Some vendors charge a premium for rare languages.

Multilingual auto-detection

If the agent supports multiple languages:

  • The STT layer detects the language from the caller's first utterance.
  • The agent switches to that language's TTS voice.
  • The switch is seamless from the caller's perspective.

This requires multilingual STT plus voice-switching logic.
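The switching logic can be quite small. The sketch below assumes a hypothetical STT layer that reports a language code and a confidence score; `Detection`, `SUPPORTED`, and the 0.8 threshold are illustrative choices, not part of any specific vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    language: str      # e.g. "es"
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical STT layer


SUPPORTED = {"en", "es", "fr"}  # assumed set of languages the agent supports
MIN_CONFIDENCE = 0.8            # assumed threshold; tune on real calls


def choose_session_language(detection: Detection, current: str = "en") -> str:
    """Switch only on a clear signal for a supported language; else keep current."""
    if detection.language in SUPPORTED and detection.confidence >= MIN_CONFIDENCE:
        return detection.language
    return current
```

Keeping the current language on a weak or unsupported detection matches the guidance in the FAQ to default to English and switch only on a clear signal.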

Code-switching

Callers sometimes mix languages ("Can I get my bill enviado al email?"). Handling:

  • STT detects the primary language.
  • TTS responds in that primary language.
  • Don't try to code-switch in the response.

This keeps things simple.
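If the STT layer exposes per-word language tags (an assumption; many do not), picking the primary language can be a simple majority count. `primary_language` and the hand-labeled tags below are hypothetical and only illustrate the idea.

```python
from collections import Counter


def primary_language(token_langs: list[str], fallback: str = "en") -> str:
    """Return the most frequent language tag; ties and empty input use fallback."""
    top = Counter(token_langs).most_common(2)
    if not top:
        return fallback
    if len(top) == 2 and top[0][1] == top[1][1]:
        return fallback
    return top[0][0]


# Hand-labeled tags for "Can I get my bill enviado al email?"
tags = ["en", "en", "en", "en", "en", "es", "es", "en"]
response_language = primary_language(tags)
```

For the example utterance the majority is English, so the agent would answer in English and not attempt to mirror the Spanish fragment.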

Pronunciation dictionaries

Per-language pronunciation dictionaries:

  • English: medical, legal, technical terms.
  • Spanish: regional variations.
  • Chinese: character-specific readings.

Each language has its own set of tricky terms.
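A per-language dictionary can be applied as a text rewrite before synthesis. This is a minimal sketch under stated assumptions: the respellings are made-up examples, `LEXICONS` and `apply_lexicon` are hypothetical names, and many vendors offer their own lexicon or phoneme features that may be a better fit than plain respelling.

```python
import re

# Hypothetical respellings; real entries should come from native-speaker review.
LEXICONS = {
    "en": {"SQL": "sequel", "IEEE": "eye triple E"},
    "es": {"WiFi": "guaifai"},
}


def apply_lexicon(text: str, language: str) -> str:
    """Replace whole-word matches with the language's respelling before synthesis."""
    entries = LEXICONS.get(language, {})
    if not entries:
        return text
    # Longest terms first so longer entries win over shorter overlapping ones.
    terms = sorted(entries, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    return pattern.sub(lambda m: entries[m.group(1)], text)
```

The word-boundary match keeps the rewrite from touching longer words that merely contain a term, so "SQLite" is left alone while "SQL" is respelled.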

Voice consistency across languages

If brand voice consistency matters:

  • Consider voice cloning extended across languages (Simba approach).
  • Or pick same-gendered, similar-style voices in each language.

Testing checklist per language

  • ✅ Native speaker listens to 10+ samples.
  • ✅ Numbers pronounced correctly.
  • ✅ Domain terms correct.
  • ✅ Regional accent appropriate.
  • ✅ Common phrases sound natural.
  • ✅ Phone-quality audio acceptable.

Edge case languages

For languages outside the top 20:

  • Quality may be poor.
  • Consider alternatives (human interpreter service).
  • Or deploy English with multilingual human handoff.

Compliance

Language access can be regulated:

  • Title VI (US federal funding). Meaningful access in multiple languages.
  • State laws. California has specific rules.
  • EU. Multilingual support is the default expectation.

Quality matters for compliance, not just UX.

See multilingual support: when and how to add a second language.

Cost of multilingual

Adding a second language:

  • Engineering: modest (config + testing).
  • Content: scripts in each language (need native speaker).
  • TTS: per-minute cost similar.
  • Ongoing: monitoring per language.

Typical: 10-20% overhead per additional language.
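As a worked example of the 10-20% figure, the sketch below assumes the overhead is a flat fraction of the single-language baseline and adds linearly per language. Both that model and the baseline amount are assumptions for illustration; real costs may not scale this cleanly.

```python
def multilingual_cost(base_cost: float, additional_languages: int, overhead: float) -> float:
    """Baseline cost plus a flat overhead fraction for each additional language."""
    if additional_languages < 0:
        raise ValueError("additional_languages must be >= 0")
    return base_cost * (1 + overhead * additional_languages)


base = 10_000.0  # made-up monthly baseline for a single-language deployment
low = multilingual_cost(base, 2, 0.10)   # two extra languages at 10% each
high = multilingual_cost(base, 2, 0.20)  # two extra languages at 20% each
```

Under these assumptions, adding two languages to a 10,000 baseline lands between roughly 12,000 and 14,000.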

Common pitfalls

Machine translation for scripts. It produces poor results. Use native speakers.

English-biased testing. Teams test mostly in English, deploy, and miss serious issues in other languages.

Regional mismatch. A Castilian Spanish voice for a Mexican audience sounds off.

Inconsistent voice style across languages. Brand whiplash.

No per-language monitoring. English works; Spanish fails silently.
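Per-language monitoring can start as a failure rate split by language, since an aggregate number can hide a small language that is failing badly. The sketch below is illustrative: what counts as a failed call, the 5% threshold, and the sample data are all assumptions.

```python
from collections import defaultdict


def failure_rates(calls: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each language to the fraction of its calls that failed."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for language, ok in calls:
        totals[language] += 1
        if not ok:
            failures[language] += 1
    return {lang: failures[lang] / totals[lang] for lang in totals}


def languages_over_threshold(rates: dict[str, float], threshold: float = 0.05) -> list[str]:
    """Return languages whose failure rate exceeds the threshold, sorted by code."""
    return sorted(lang for lang, rate in rates.items() if rate > threshold)


# Made-up call log of (language, succeeded) pairs.
calls = [("en", True)] * 98 + [("en", False)] * 2 + [("es", True)] * 8 + [("es", False)] * 2
rates = failure_rates(calls)
alerts = languages_over_threshold(rates)
```

In this sample the overall failure rate is under 4%, which would pass a 5% aggregate check, while Spanish alone fails on 2 of 10 calls and is flagged.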

FAQ

How many languages can one deployment handle? Technically, many. In practice, tune 2-5 languages well.

What if our TTS vendor doesn't support a language we need? Layer vendors: a primary plus a fallback for rare languages.

Can voice cloning work across languages? Yes. Simba clones voices across languages reasonably well.

What about machine translation of scripts? Imperfect. For production, use native translators.

How do we handle ambiguous language detection? Default to English; switch on clear signal.
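The vendor layering mentioned in the FAQ can be sketched as a priority-ordered lookup. The vendor labels and coverage sets here are placeholders, not real coverage data, and `route_vendor` is a hypothetical helper.

```python
from typing import Optional

# Placeholder vendor labels and coverage sets, in priority order.
VENDOR_LANGUAGES = [
    ("primary", {"en", "es", "fr", "de"}),
    ("fallback", {"en", "es", "vi", "th"}),
]


def route_vendor(language: str) -> Optional[str]:
    """Return the first vendor, in priority order, that covers the language."""
    for vendor, languages in VENDOR_LANGUAGES:
        if language in languages:
            return vendor
    return None
```

A `None` result marks a language no vendor covers, which is the point to consider an interpreter service or human handoff.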

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.