Why TTS Quality Plateaus and How to Push Past It
Every voice AI team eventually hits the TTS quality plateau. You pick a good TTS provider, tune some basics, and quality is... fine. Not amazing, not bad. Specific edge cases stay wrong. Certain phrases sound robotic. Numbers get weird. Tone lacks variation. Most teams stop here because further improvement requires effort that's hard to justify against marginal gains. But the gap between "fine" and "genuinely natural" affects caller experience in ways that compound, and the techniques to push past the plateau are actually well understood.
TL;DR
- Base TTS is "good enough" for most use cases; "genuinely natural" requires specific work.
- Gains come from: prompt engineering, SSML tuning, custom voices, context-aware synthesis.
- Phoneme and pronunciation dictionaries close the domain-specific gaps.
- Voice style variation (emphasis, pacing) is the next frontier.
- Measure with human listening tests, not just vendor benchmarks.
Why the plateau exists
Base TTS is:
- Trained on general speech.
- Optimized for average-case quality.
- Not domain-aware.
- Not context-aware.
Your deployment is specific; generic TTS is not tuned to it.
The plateau: generic good, domain-specific rough edges.
Gap 1: pronunciation
Names, domain terms, unusual words.
Fix: pronunciation dictionary.
{
"NovaCorp": "noh-vah-korp",
"Athena": "uh-thee-nah",
"Deepgram": "deep-gram"
}
Or SSML <phoneme> tags.
Most TTS vendors allow custom pronunciation. Add terms as discovered.
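One way to apply a dictionary like the one above is to rewrite the text before it reaches the TTS engine. The sketch below wraps each known term in the standard SSML `<sub alias="...">` tag. The `PRONUNCIATIONS` table and the `apply_pronunciations` helper are hypothetical names for illustration, and whether a given vendor honors `<sub>` (or prefers its own lexicon upload) is an assumption to check against that vendor's documentation.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon, mirroring the entries shown above.
PRONUNCIATIONS = {
    "NovaCorp": "noh-vah-korp",
    "Athena": "uh-thee-nah",
    "Deepgram": "deep-gram",
}


def apply_pronunciations(text: str, lexicon: dict[str, str]) -> str:
    """Wrap each known term in an SSML <sub> tag carrying its respelling."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Longest terms first so a longer entry wins over a shorter one.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches so "&" or "<" stay valid XML.
        parts.append(escape(text[last:match.start()]))
        term = match.group(1)
        parts.append(f"<sub alias={quoteattr(lexicon[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(apply_pronunciations("Welcome to NovaCorp & Athena support.", PRONUNCIATIONS))
```

Keeping the lexicon as data rather than hard-coding replacements makes the "add terms as discovered" habit a one-line change.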
Gap 2: emphasis
"THIS is important" vs "This is IMPORTANT": same text, different emphasis.
Fix: SSML <emphasis> tags.
<speak>
Your appointment is <emphasis level="strong">tomorrow</emphasis>.
</speak>
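When the agent's text is generated at runtime, the emphasis tag has to be added programmatically. The following is a minimal sketch, assuming the caller already knows which phrase carries the key information; `emphasize` is a hypothetical helper, and the allowed levels are the four defined by SSML 1.1.

```python
from xml.sax.saxutils import escape

# Emphasis levels defined by the SSML 1.1 specification.
EMPHASIS_LEVELS = {"strong", "moderate", "reduced", "none"}


def emphasize(text: str, key_phrase: str, level: str = "strong") -> str:
    """Return SSML with the first occurrence of key_phrase wrapped in <emphasis>."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unsupported emphasis level: {level}")
    if not key_phrase:
        return f"<speak>{escape(text)}</speak>"
    before, found, after = text.partition(key_phrase)
    if not found:
        # Phrase absent: return the text unchanged rather than guessing.
        return f"<speak>{escape(text)}</speak>"
    return (
        f"<speak>{escape(before)}"
        f'<emphasis level="{level}">{escape(found)}</emphasis>'
        f"{escape(after)}</speak>"
    )


if __name__ == "__main__":
    print(emphasize("Your appointment is tomorrow.", "tomorrow"))
```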
Gap 3: pacing
Natural speech varies pace:
- Slow for important information.
- Fast for filler.
- Pause at natural breaks.
Fix: SSML <prosody> and <break> tags.
<speak>
<prosody rate="slow">Your appointment is Thursday at 10 AM.</prosody> <break time="500ms"/>
Please bring your insurance card.
</speak>
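The same pacing rules can be generated from structured segments rather than hand-written markup. This is a sketch under stated assumptions: `paced_ssml` is a hypothetical helper, each segment is tagged by the caller with one of the SSML 1.1 named rates, and a fixed pause is placed between segments.

```python
from xml.sax.saxutils import escape

# Named speaking rates defined by SSML 1.1 for <prosody rate="...">.
RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}


def paced_ssml(segments: list[tuple[str, str]], pause_ms: int = 500) -> str:
    """Build SSML from (text, rate) pairs with a <break> between segments."""
    parts = []
    for text, rate in segments:
        if rate not in RATES:
            raise ValueError(f"unsupported rate: {rate}")
        body = escape(text)
        # "medium" is the default rate, so it needs no wrapper.
        if rate != "medium":
            body = f'<prosody rate="{rate}">{body}</prosody>'
        parts.append(body)
    joiner = f' <break time="{pause_ms}ms"/> '
    return "<speak>" + joiner.join(parts) + "</speak>"


if __name__ == "__main__":
    print(paced_ssml([
        ("Your appointment is Thursday at 10 AM.", "slow"),   # important: slow down
        ("Please bring your insurance card.", "medium"),
    ]))
```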
Gap 4: context awareness
Base TTS doesn't know what came before. "Oh right, your Tuesday appointment" needs different intonation than "Your Tuesday appointment."
Fix: context-aware prompting.
Some modern TTS systems (Cartesia, OpenAI) accept context signals. Leverage them when available.
Gap 5: brand consistency
Generic voice doesn't sound like your brand.
Fix: custom voice (cloned from actor or brand voice).
See Voice Cloning for Customer Brands: A Buyer's Guide.
Gap 6: emotion
Flat TTS feels robotic. Real conversations have:
- Warmth on greetings.
- Apologetic tone for delays.
- Enthusiasm for good news.
- Empathy for complaints.
Fix: TTS with emotional control (Simba has some; others emerging).
Still imperfect in 2026 but improving.
Gap 7: turn-taking finesse
Ending sentences with the right fall/rise. Subtle back-channel ("mm-hmm") during caller speech.
Fix: careful prosody + multi-turn conversation modeling.
Most voice agents don't do back-channels. The ones that do feel notably more human.
Gap 8: disfluencies
Real human speech has "um," "uh," brief pauses. Removing them sounds sterile; over-adding sounds fake.
Fix: sparse, context-appropriate filler insertion.
Example:
Agent: "Um, let me check that for you."
Better than flat "Let me check that for you" in some conversational contexts.
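"Sparse and context-appropriate" can be made concrete with two guards: only certain utterance types are eligible, and fillers are rate-limited. The sketch below is illustrative only; the filler list, the "lookup opener" heuristic, the 15% rate, and the four-turn gap are all assumptions, not tuned values.

```python
import random

# Hypothetical filler phrases and eligibility heuristic.
FILLERS = ("Um,", "Let's see,", "Okay,")
LOOKUP_OPENERS = ("let me check", "let me look", "one moment")


def maybe_add_filler(
    utterance: str,
    rng: random.Random,
    rate: float = 0.15,
    turns_since_last: int = 99,
    min_gap: int = 4,
) -> str:
    """Prepend a filler only to lookup-style utterances, and only occasionally."""
    eligible = (
        utterance.lower().startswith(LOOKUP_OPENERS)
        and turns_since_last >= min_gap  # avoid fillers on back-to-back turns
    )
    if not eligible or rng.random() >= rate:
        return utterance
    filler = rng.choice(FILLERS)
    # Lowercase the original first letter since it is no longer sentence-initial.
    return f"{filler} {utterance[0].lower()}{utterance[1:]}"


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so runs are repeatable
    for _ in range(5):
        print(maybe_add_filler("Let me check that for you.", rng))
    # Never eligible: not a lookup-style utterance.
    print(maybe_add_filler("Your balance is forty dollars.", rng, rate=1.0))
```

Passing the random generator in explicitly keeps the behavior testable and lets the insertion rate be adjusted per deployment.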
The layered approach
For production-grade TTS:
- Baseline: pick high-quality vendor.
- Normalize: numbers, dates, acronyms.
- Dictionary: domain terms.
- SSML: emphasis, pacing, breaks.
- Custom voice: brand consistency.
- Emotion: where applicable.
- Context: conversation-aware.
Each layer adds naturalness.
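The text-level layers in the list above can be modeled as an ordered chain of small functions, each taking and returning a string. This is a rough sketch, not a complete normalizer: the acronym table, the percent rule, and the single dictionary entry are hypothetical stand-ins for whatever rules a real deployment needs.

```python
import re
from functools import reduce
from typing import Callable

Layer = Callable[[str], str]

# Hypothetical acronym table: spell some out letter by letter, read others as words.
ACRONYMS = {"ETA": "E T A", "PIN": "pin"}


def expand_acronyms(text: str) -> str:
    """Replace known all-caps acronyms; leave unknown ones untouched."""
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: ACRONYMS.get(m.group(0), m.group(0)), text)


def spell_out_percent(text: str) -> str:
    """Turn '90%' into '90 percent' so the symbol is not skipped or misread."""
    return re.sub(r"(\d+)%", r"\1 percent", text)


def apply_dictionary(text: str) -> str:
    """Apply a domain respelling (a single illustrative entry)."""
    return text.replace("NovaCorp", "noh-vah-korp")


def run_layers(text: str, layers: list[Layer]) -> str:
    """Apply each layer in order, feeding the output of one into the next."""
    return reduce(lambda acc, layer: layer(acc), layers, text)


# Normalization runs before the dictionary so respellings are not re-normalized.
LAYERS: list[Layer] = [expand_acronyms, spell_out_percent, apply_dictionary]

if __name__ == "__main__":
    print(run_layers("NovaCorp ETA is 90% confirmed.", LAYERS))
```

Keeping each layer as a separate function means a regression can be traced to one rule, and layers can be switched off per vendor.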
Human listening tests
Vendor benchmarks mean little. Test with humans:
- 10-20 samples of your actual content.
- Multiple native-speaker listeners.
- Score naturalness, clarity, brand fit.
- Compare across TTS configurations.
Fix the failures; keep what works.
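Listener scores are easier to compare once they are aggregated per configuration and per dimension. The sketch below assumes a 1-5 rating scale and a simple tuple layout for each rating; both are illustrative choices, as are the configuration names.

```python
from collections import defaultdict
from statistics import mean

# Each rating: (configuration, dimension, score on an assumed 1-5 scale).
Rating = tuple[str, str, int]


def summarize(ratings: list[Rating]) -> dict[tuple[str, str], dict[str, float]]:
    """Group scores by (configuration, dimension) and report count and mean."""
    grouped: dict[tuple[str, str], list[int]] = defaultdict(list)
    for config, dimension, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        grouped[(config, dimension)].append(score)
    return {
        key: {"n": len(scores), "mean": round(mean(scores), 2)}
        for key, scores in grouped.items()
    }


if __name__ == "__main__":
    ratings = [
        ("baseline", "naturalness", 3),
        ("baseline", "naturalness", 4),
        ("tuned", "naturalness", 4),
        ("tuned", "naturalness", 5),
        ("tuned", "clarity", 5),
    ]
    for key, stats in sorted(summarize(ratings).items()):
        print(key, stats)
```

With only 10-20 samples and a handful of listeners, the count matters as much as the mean; small differences between configurations may be noise.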
Specific problem areas
Long sentences. Over 20 words gets weird intonation. Break up.
Nested clauses. "If X, which means Y, then Z." Restructure for TTS.
Run-on numbers. "Your order number 1234567890123" is painful. Format or use SSML.
Mid-sentence changes of direction. "I was going to say... actually, wait." TTS handles poorly. Avoid in scripted content.
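Two of these problem areas, long sentences and run-on numbers, are easy to catch automatically before a script ships. The following is a minimal lint sketch; the 20-word limit comes from the text above, while the 6-digit threshold, the group size of four, and the helper names are assumptions.

```python
import re


def group_digits(number: str, size: int = 4) -> str:
    """Split a digit string into spaced groups so TTS reads digits with pauses."""
    groups = [number[i:i + size] for i in range(0, len(number), size)]
    return ", ".join(" ".join(group) for group in groups)


def lint_script(text: str, max_words: int = 20, max_digits: int = 6) -> list[str]:
    """Return warnings for sentences and digit runs that tend to trip up TTS."""
    warnings = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        word_count = len(sentence.split())
        if word_count > max_words:
            warnings.append(f"long sentence ({word_count} words): {sentence[:40]}")
    # Flag any run of digits longer than max_digits.
    for match in re.finditer(r"\d{%d,}" % (max_digits + 1), text):
        warnings.append(f"run-on number: {match.group(0)}")
    return warnings


if __name__ == "__main__":
    print(lint_script("Your order number 1234567890123 has shipped."))
    print(group_digits("1234567890123"))
```

The lint only flags; rewriting long or nested sentences is still a job for whoever owns the script or the prompt.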
Cost-quality tradeoff
Premium TTS is ~2x the cost of budget TTS. For most voice agents:
- Worth it for customer-facing.
- Not worth it for internal tools.
Measure impact on CSAT before/after upgrade.
The quality ceiling
In 2026, listeners in blind A/B tests can still sometimes identify TTS:
- 70-85% of listeners can't tell (premium TTS).
- Experienced listeners can.
- Edge cases (emotion, long content) give it away.
Expect full parity in 2-3 more years.
Iterative improvement
Monthly discipline:
- Sample recent calls.
- Flag awkward TTS moments.
- Fix: dictionary, SSML, prompt.
- Deploy.
- Measure.
Compounds over time.
When to accept the plateau
If callers aren't complaining about TTS specifically and CSAT is high:
- Maybe don't over-invest.
- Focus on conversation design instead.
- TTS is background; content is foreground.
Common pitfalls
Over-engineering TTS. Obsessing while LLM and conversation design are the real gap.
Ignoring domain vocabulary. Pronunciations wrong; callers notice.
No human testing. Silent quality issues.
Upgrading vendor without testing. Premium doesn't automatically mean better for your content.
Static config. Set once, never review. Drift over time.
Related reading
- Text-to-Speech in 2026: The State of the Art
- How to Benchmark a Voice Agent's End-to-End Latency
- Comparing Neural TTS Architectures
- Phoneme-Level Tuning for Voice Agents
- Why Some Voices Sound Robotic Even in 2026
FAQ
How much does TTS quality affect CSAT? Meaningfully but not dominantly. Conversation quality matters more.
When should we upgrade TTS? If you hear specific quality issues, or if brand positioning requires it.
Can AI auto-tune TTS? Some vendors offer it. It's early; results vary.
What about emotional TTS? Improving. Useful for some use cases (empathy in support).
Do we need SSML? For complex content, yes. For simple conversational, often optional.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
