
Why TTS Quality Plateaus and How to Push Past It

Tyler Weitzman
March 13, 2026 · 5 min read
Speechify

Every voice AI team eventually hits the TTS quality plateau. You pick a good TTS provider, tune some basics, and quality is... fine. Not amazing, not bad. Specific edge cases stay wrong. Certain phrases sound robotic. Numbers get weird. Tone lacks variation. Most teams stop here because further improvement requires effort that's hard to justify against marginal gains. But the gap between "fine" and "genuinely natural" affects caller experience in ways that compound, and the techniques to push past the plateau are well understood.

TL;DR

  • Base TTS is "good enough" for most use cases; "genuinely natural" requires specific work.
  • Gains come from: prompt engineering, SSML tuning, custom voices, context-aware synthesis.
  • Phoneme and pronunciation dictionaries close the domain-specific gaps.
  • Voice style variation (emphasis, pacing) is the next frontier.
  • Measure with human listening tests, not just vendor benchmarks.

Why the plateau exists

Base TTS is:

  • Trained on general speech.
  • Optimized for average-case quality.
  • Not domain-aware.
  • Not context-aware.

Your deployment is specific. A generic model is not tuned to your names, your vocabulary, or your conversations.

The plateau: generic good, domain-specific rough edges.

Gap 1: pronunciation

Names, domain terms, unusual words.

Fix: pronunciation dictionary.

{
  "NovaCorp": "noh-vah-korp",
  "Athena": "uh-thee-nah",
  "Deepgram": "deep-gram"
}

Or SSML <phoneme> tags.

Most TTS vendors allow custom pronunciation. Add terms as discovered.
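As a minimal sketch of how such a dictionary could be applied before synthesis, the helper below wraps known terms in the standard SSML <sub alias="..."> element. The function name apply_lexicon is hypothetical, and whether your vendor honors <sub> (rather than its own lexicon upload) is an assumption to check against its docs.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Respellings taken from the dictionary example above.
LEXICON = {
    "NovaCorp": "noh-vah-korp",
    "Athena": "uh-thee-nah",
    "Deepgram": "deep-gram",
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Wrap each known term in <sub alias="..."> and return a <speak> document."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Longest terms first so an overlapping longer term wins over a shorter one.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches so "&" or "<" cannot break the XML.
        parts.append(escape(text[last:match.start()]))
        term = match.group(1)
        parts.append(f"<sub alias={quoteattr(lexicon[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(apply_lexicon("Welcome to NovaCorp & co.", LEXICON))
```

Matching is case-sensitive and whole-word here; a production lexicon would probably also need plural and possessive forms.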

Gap 2: emphasis

"This is important" can be spoken with the stress on "this" or on "important": same text, different emphasis.

Fix: SSML <emphasis> tags.

<speak>
  Your appointment is <emphasis level="strong">tomorrow</emphasis>.
</speak>
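If the emphasized word is chosen at runtime, the tag can be generated rather than hand-written. A small sketch, assuming the four emphasis levels defined in SSML 1.1; the helper name emphasize is hypothetical, and vendors differ in which levels they actually render.

```python
from xml.sax.saxutils import escape

# Levels defined by SSML 1.1; vendor support for each is an assumption.
EMPHASIS_LEVELS = {"strong", "moderate", "reduced", "none"}


def emphasize(sentence: str, target: str, level: str = "strong") -> str:
    """Wrap the first occurrence of `target` in an <emphasis> tag."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    before, found, after = sentence.partition(target)
    if not found:
        # Nothing to emphasize: return the sentence as plain SSML.
        return f"<speak>{escape(sentence)}</speak>"
    return (
        f"<speak>{escape(before)}"
        f'<emphasis level="{level}">{escape(target)}</emphasis>'
        f"{escape(after)}</speak>"
    )


print(emphasize("Your appointment is tomorrow.", "tomorrow"))
```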

Gap 3: pacing

Natural speech varies pace:

  • Slow for important information.
  • Fast for filler.
  • Pause at natural breaks.

Fix: SSML <prosody> and <break> tags.

<speak>
  Your appointment is Thursday at 10 AM. <break time="500ms"/>
  Please bring your insurance card.
</speak>
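The example above uses only <break>. A sketch that also slows down the key detail with <prosody rate="slow"> might look like this; it builds the markup with Python's xml.etree.ElementTree so the output stays well-formed. The function name and the sentence template are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_appointment_ssml(when: str, reminder: str, pause_ms: int = 500) -> str:
    """Say the appointment time slowly, pause, then give the reminder."""
    speak = ET.Element("speak")
    speak.text = "Your appointment is "

    # Slow down for the information the caller must remember.
    slow = ET.SubElement(speak, "prosody", rate="slow")
    slow.text = when
    slow.tail = ". "

    # Pause at the natural break before the next instruction.
    pause = ET.SubElement(speak, "break", time=f"{pause_ms}ms")
    pause.tail = " " + reminder

    return ET.tostring(speak, encoding="unicode")


print(build_appointment_ssml("Thursday at 10 AM", "Please bring your insurance card."))
```

Building the tree instead of concatenating strings means caller-supplied text is escaped automatically.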

Gap 4: context awareness

Base TTS doesn't know what came before. "Oh right, your Tuesday appointment" needs different intonation than "Your Tuesday appointment."

Fix: context-aware prompting.

Some modern TTS systems (Cartesia, OpenAI) accept context signals. Use them when available.

Gap 5: brand consistency

Generic voice doesn't sound like your brand.

Fix: custom voice (cloned from actor or brand voice).

See voice cloning for customer brands: a buyer's guide.

Gap 6: emotion

Flat TTS feels robotic. Real conversations have:

  • Warmth on greetings.
  • Apologetic tone for delays.
  • Enthusiasm for good news.
  • Empathy for complaints.

Fix: TTS with emotional control (Simba has some; others emerging).

Still imperfect in 2026 but improving.

Gap 7: turn-taking finesse

Ending sentences with the right fall/rise. Subtle back-channel ("mm-hmm") during caller speech.

Fix: careful prosody + multi-turn conversation modeling.

Most voice agents don't do back-channels. The ones that do feel notably more human.

Gap 8: disfluencies

Real human speech has "um," "uh," brief pauses. Removing them sounds sterile; over-adding sounds fake.

Fix: sparse, context-appropriate filler insertion.

Example:

Agent: "Um, let me check that for you."

Better than flat "Let me check that for you" in some conversational contexts.
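One way to keep fillers sparse is to gate them on both a probability and a minimum gap since the last one. The sketch below is an assumption about how that could work, not a tested recipe; the name maybe_add_filler and the defaults (15% chance, at least 4 turns apart) are hypothetical starting points to tune by listening.

```python
import random


def maybe_add_filler(
    text: str,
    turns_since_filler: int,
    rng: random.Random,
    probability: float = 0.15,
    min_gap: int = 4,
    filler: str = "Um",
) -> tuple[str, bool]:
    """Return (possibly prefixed text, whether a filler was added)."""
    if not text or turns_since_filler < min_gap or rng.random() >= probability:
        return text, False
    first_word = text.split(" ", 1)[0]
    # Keep "I" and "I'll" capitalized; lowercase other sentence-initial words.
    # Proper nouns at the start of a sentence would need the same exemption.
    if first_word != "I" and not first_word.startswith("I'"):
        text = text[:1].lower() + text[1:]
    return f"{filler}, {text}", True


rng = random.Random(0)  # seeded so a test run is repeatable
print(maybe_add_filler("Let me check that for you.", 5, rng, probability=1.0))
```

The caller tracks turns_since_filler and resets it to zero whenever the second return value is True.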

The layered approach

For production-grade TTS:

  1. Baseline: pick high-quality vendor.
  2. Normalize: numbers, dates, acronyms.
  3. Dictionary: domain terms.
  4. SSML: emphasis, pacing, breaks.
  5. Custom voice: brand consistency.
  6. Emotion: where applicable.
  7. Context: conversation-aware.

Each layer adds naturalness.
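The text-level layers can be sketched as a chain of plain functions, with markup applied last so escaping happens exactly once. Everything here is illustrative: the acronym table, the layer names, and run_layers are assumptions, and real normalization (dates, currency) needs far more than this.

```python
import re
from typing import Callable, Iterable
from xml.sax.saxutils import escape

Layer = Callable[[str], str]

ACRONYMS = {"ETA": "E T A"}                 # hypothetical normalization table
RESPELLINGS = {"NovaCorp": "noh-vah-korp"}  # from the pronunciation dictionary


def _replace_words(text: str, table: dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its spoken form."""
    for term, spoken in table.items():
        text = re.sub(rf"\b{re.escape(term)}\b", lambda _m, s=spoken: s, text)
    return text


def expand_acronyms(text: str) -> str:
    return _replace_words(text, ACRONYMS)


def respell(text: str) -> str:
    return _replace_words(text, RESPELLINGS)


def to_ssml(text: str) -> str:
    # Markup goes last so earlier layers only ever see plain text.
    return f"<speak>{escape(text)}</speak>"


def run_layers(text: str, layers: Iterable[Layer]) -> str:
    for layer in layers:
        text = layer(text)
    return text


print(run_layers("NovaCorp ETA is 5 minutes & counting",
                 [expand_acronyms, respell, to_ssml]))
```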

Human listening tests

Vendor benchmarks mean little. Test with humans:

  • 10–20 samples of your actual content.
  • Multiple native-speaker listeners.
  • Score naturalness, clarity, brand fit.
  • Compare across TTS configurations.

Fix the failures; keep what works.
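Scores from a listening test can be compared with a few lines of aggregation. This sketch assumes a 1-to-5 rating scale (a common mean-opinion-score convention, not something the steps above prescribe), and the sample rows are made up purely to show the shape of the data.

```python
from collections import defaultdict
from statistics import mean


def summarize_ratings(ratings):
    """Average scores per (configuration, criterion).

    `ratings` is an iterable of (config, listener, criterion, score) tuples,
    with score on an assumed 1-5 scale.
    """
    buckets = defaultdict(list)
    for config, _listener, criterion, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[(config, criterion)].append(score)

    summary = defaultdict(dict)
    for (config, criterion), scores in buckets.items():
        summary[config][criterion] = round(mean(scores), 2)
    return dict(summary)


# Made-up example rows, not real results.
sample = [
    ("baseline", "listener_a", "naturalness", 3),
    ("baseline", "listener_b", "naturalness", 4),
    ("tuned", "listener_a", "naturalness", 4),
    ("tuned", "listener_b", "naturalness", 5),
]
print(summarize_ratings(sample))
```

With only 10–20 samples and a handful of listeners, treat small differences between configurations as noise.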

Specific problem areas

Long sentences. Sentences over 20 words tend to get odd intonation. Break them up.

Nested clauses. "If X, which means Y, then Z." Restructure for TTS.

Run-on numbers. "Your order number 1234567890123" โ€” painful. Format or use SSML.
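One vendor-neutral way to format such a number is to spell it out digit by digit in small groups, so the TTS pauses at the commas. A sketch, with the helper name speak_digits and the group size of 4 as assumptions:

```python
import re


def speak_digits(number: str, group_size: int = 4) -> str:
    """Turn '1234567890123' into digit groups a TTS engine reads one by one."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    digits = re.sub(r"\D", "", number)  # drop spaces, dashes, and other separators
    groups = [digits[i:i + group_size] for i in range(0, len(digits), group_size)]
    return ", ".join(" ".join(group) for group in groups)


print("Your order number is " + speak_digits("1234567890123") + ".")
# Your order number is 1 2 3 4, 5 6 7 8, 9 0 1 2, 3.
```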

Mid-sentence changes of direction. "I was going to say... actually, wait." TTS handles poorly. Avoid in scripted content.
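The long-sentence rule is easy to check automatically before a script ships. A rough sketch: the splitter is naive (abbreviations like "Dr." will trip it), and the function name long_sentences is hypothetical.

```python
import re


def long_sentences(script: str, max_words: int = 20) -> list[str]:
    """Return the sentences in `script` that exceed `max_words` words."""
    # Naive split: end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    return [s for s in sentences if len(s.split()) > max_words]


script = (
    "Thanks for calling. "
    "If your appointment needs to move because of a scheduling conflict, which "
    "happens more often than you might think, then we can find another time "
    "that works for you this week."
)
for sentence in long_sentences(script):
    print("Too long:", sentence)
```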

Cost-quality tradeoff

Premium TTS costs roughly 2x as much as budget TTS. For most voice agents:

  • Worth it for customer-facing.
  • Not worth it for internal tools.

Measure impact on CSAT before/after upgrade.

The quality ceiling

In 2026, listeners in blind A/B tests can still sometimes identify TTS:

  • 70-85% of listeners can't tell (premium TTS).
  • Experienced listeners can.
  • Edge cases (emotion, long content) give it away.

Expect full parity in 2–3 more years.

Iterative improvement

Monthly discipline:

  • Sample recent calls.
  • Flag awkward TTS moments.
  • Fix: dictionary, SSML, prompt.
  • Deploy.
  • Measure.

Compounds over time.

When to accept the plateau

If callers aren't complaining about TTS specifically and CSAT is high:

  • Maybe don't over-invest.
  • Focus on conversation design instead.
  • TTS is background; content is foreground.

Common pitfalls

Over-engineering TTS. Obsessing while LLM and conversation design are the real gap.

Ignoring domain vocabulary. Pronunciations wrong; callers notice.

No human testing. Silent quality issues.

Upgrading vendor without testing. Premium doesn't automatically mean better for your content.

Static config. Set once, never review. Drift over time.

FAQ

How much does TTS quality affect CSAT? Meaningfully but not dominantly. Conversation quality matters more.

When should we upgrade TTS? If you hear specific quality issues, or if brand positioning requires it.

Can AI auto-tune TTS? Some vendors offer it. It's early, and results vary.

What about emotional TTS? Improving. Useful for some use cases (empathy in support).

Do we need SSML? For complex content, yes. For simple conversational, often optional.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
