
Comparing Neural TTS Architectures


Tyler Weitzman
March 20, 2026 · 5 min read
Speechify

Neural TTS has evolved rapidly since 2018: Tacotron-style models paired with WaveNet-style vocoders gave way to VALL-E-style neural codec models, which were then joined by state-space, flow-matching, and diffusion-based systems. Each architectural shift brought real quality improvements. For voice agent engineers evaluating TTS options in 2026, understanding the architectures underneath the vendors helps make better decisions about latency, quality, voice cloning, and cost.

TL;DR

  • Major architectures in 2026: neural codec (VALL-E), state-space (Sonic), flow matching, diffusion, autoregressive.
  • Tradeoffs: quality, latency, cost, voice cloning ability.
  • Cartesia (Sonic) pioneered low-latency state-space models for TTS.
  • Simba and similar systems use neural codec + autoregressive approaches.
  • Open source (Orpheus, XTTS v2) is competitive.

The architectural evolution

Tacotron era (~2018-2020).

  • Text → mel-spectrogram (Tacotron).
  • Mel → waveform (WaveNet).
  • High quality, slow inference.

Neural codec era (~2021-2023).

  • Text → codec tokens (VALL-E, Bark).
  • Tokens → waveform (neural codec decoder).
  • Good quality, easy voice cloning, moderate latency.

State-space / flow matching (~2024-2026).

  • End-to-end text-to-audio.
  • Lower latency.
  • Quality matching best prior approaches.
  • Cartesia's Sonic is a key example.

Each shift delivered one of two things:

  • Better quality at the same compute.
  • The same quality at less compute (lower latency and cost).

Neural codec models

How they work:

  1. Audio is encoded as discrete tokens (like LLM tokens, but for audio).
  2. A text-to-token model generates a token sequence from the text.
  3. A decoder converts the tokens back to a waveform.
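
The three steps above can be sketched with a toy codec. This is only an illustration under stated assumptions: the four-entry `CODEBOOK`, the scalar "frames", and the function names `encode` and `decode` are all made up for readability. Real neural codecs learn their codebooks and typically use several codebooks per frame.

```python
# Toy scalar "codec": each audio frame is snapped to its nearest codebook entry.
# CODEBOOK, encode, and decode are hypothetical names for illustration only.
CODEBOOK = [-0.75, -0.25, 0.25, 0.75]

def encode(frames):
    """Step 1: map each continuous frame value to a discrete token id."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
            for x in frames]

def decode(tokens):
    """Step 3: map token ids back to (lossy) frame values."""
    return [CODEBOOK[t] for t in tokens]

frames = [0.8, 0.1, -0.3, -0.9]
tokens = encode(frames)         # in TTS, a text-to-token model (step 2) predicts these
reconstructed = decode(tokens)  # close to, but not identical to, the input frames
```

The point of the sketch is that the text model never has to predict raw samples; it only has to predict a short sequence of integers.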

Advantages:

  • Compact representation.
  • Voice cloning: condition on tokens from a short reference clip to match the speaker.
  • Efficient inference.

Examples: VALL-E, Bark, Seed-TTS.

State-space models (Sonic-style)

How they work:

  • Direct text-to-audio with a state-space architecture.
  • State-space models are efficient for long sequences: they carry a fixed-size state forward instead of attending over the full history.
  • Audio is generated in chunks with low latency.
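
Why a state-space model streams well can be seen in a minimal linear recurrence. The sketch below is a scalar toy, not Sonic's architecture: the coefficients `a`, `b`, `c` and the function name `ssm_run` are arbitrary assumptions. It shows that processing a sequence in chunks, carrying only the state across, gives the same output as processing it in one pass.

```python
# Minimal scalar linear state-space recurrence:
#   h[t] = a * h[t-1] + b * u[t],   y[t] = c * h[t]
# The coefficients a, b, c are arbitrary illustrative values, not trained weights.

def ssm_run(inputs, state=0.0, a=0.5, b=1.0, c=2.0):
    """Process a chunk of inputs; return (outputs, final_state)."""
    outputs = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs, state

signal = [1.0, 0.0, 2.0, 1.0, 0.5, 0.0]

# One pass over the whole sequence.
full, _ = ssm_run(signal)

# The same sequence in two chunks, carrying only the fixed-size state across.
first, carried = ssm_run(signal[:3])
second, _ = ssm_run(signal[3:], state=carried)
streamed = first + second   # identical to `full`
```

Because the carried state has a fixed size, the cost of emitting the next chunk does not grow with how much audio has already been produced.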

Advantages:

  • Extremely low first-audio latency (sub-100ms possible).
  • Efficient streaming.
  • Competitive quality.

Examples: Cartesia's Sonic.

Flow matching / diffusion

How they work:

  • Learn a mapping (a velocity field or a denoising step) that transforms noise into audio.
  • Generate by refining iteratively over multiple steps.
  • Quality is often very high.
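
A one-dimensional sketch of flow-matching sampling along a straight noise-to-data path makes the iteration concrete. In a real system a trained network predicts the velocity; here `velocity` is an analytic stand-in that already knows the target, an assumption made purely so the example runs. The names `velocity` and `sample` are hypothetical.

```python
import random

def velocity(x, t, target):
    """Stand-in for a learned velocity field on a straight noise-to-data path."""
    return (target - x) / (1.0 - t)

def sample(target, num_steps, seed=0):
    """Euler-integrate from a noise sample at t=0 toward data at t=1."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)      # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):   # one "network call" per step
        t = k * dt
        x = x + dt * velocity(x, t, target)
    return x

value = sample(target=0.3, num_steps=8)   # lands (numerically) on the target
```

Each loop iteration corresponds to one model evaluation, which is why the number of steps is the main lever on latency for this family.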

Advantages:

  • Highest peak quality.
  • Good at fine acoustic details.

Disadvantages:

  • Latency typically higher (iterative).
  • More compute per sample.

Autoregressive models

How they work:

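  • Generate audio token by token, usually over neural codec tokens, so this category overlaps with the codec models above.
  • Each token is conditioned on the previous ones.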
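
The generation loop can be sketched in a few lines. `next_token` stands in for a trained model; its rule (sum of the last two tokens modulo the vocabulary size) is invented for the example, as are `VOCAB_SIZE` and `END_TOKEN`.

```python
# Toy autoregressive loop. `next_token` is a made-up stand-in for a trained model.
VOCAB_SIZE = 8
END_TOKEN = 0

def next_token(history):
    # A real model conditions on the whole history; this toy rule
    # only looks at the last two tokens.
    return (history[-1] + history[-2]) % VOCAB_SIZE

def generate(prompt, max_tokens):
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)   # cannot start until the previous token exists
        if tok == END_TOKEN:
            break
        tokens.append(tok)
    return tokens

audio_tokens = generate(prompt=[1, 2], max_tokens=10)
```

The loop cannot be parallelized across time steps, since each call needs the output of the one before it.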

Advantages:

  • Conceptually simple.
  • Good quality.
  • Natural for voice cloning.

Disadvantages:

  • Generation is serial, so it can be slow.
  • Longer outputs take proportionally longer to finish.

Vendor architectures (approximate)

  • Cartesia (Sonic): state-space, low-latency.
  • Simba: hybrid; neural codec + autoregressive components.
  • OpenAI Realtime: integrated voice model (multiple approaches).
  • Google WaveNet / Neural2: evolved from WaveNet.
  • Deepgram Aura: proprietary; fast.
  • XTTS v2 (open): neural codec-based.
  • Orpheus (open): newer, high quality.

Specifics evolve; check current docs.

The quality-latency frontier

Tradeoffs:

  • Highest quality, higher latency: diffusion-based, autoregressive with care.
  • Lowest latency, high quality: state-space (Cartesia Sonic).
  • Mid-range: most neural codec approaches.

For voice agents, latency matters most, so state-space or efficient neural codec models usually win.

Voice cloning by architecture

  • Neural codec: natural fit. Reference audio → tokens → cloning.
  • State-space: possible, sometimes less flexible.
  • Diffusion: possible.
  • Autoregressive: standard approach.

Neural codec models lead on cloning ease.

Multilingual by architecture

All modern architectures support multilingual synthesis. Quality depends more on training data than on architecture.

Some models (VALL-E X and other multilingual variants) have specific multilingual optimizations.

Emotional range

Emotional range is mostly a question of training data and conditioning, not architecture. All modern architectures can be tuned for emotion; none fully solves it.

Open source vs proprietary

  • Open source (XTTS v2, Orpheus): neural codec, competitive quality, runs locally.
  • Proprietary: tends to have a slight edge on quality and a significant edge on UX.

The gap is narrowing. By 2028, open-source parity is likely for many use cases.

See open-source vs proprietary voice agent stacks.

Inference cost

Approximate per-minute compute and pricing:

  • Budget TTS (Deepgram Aura): low-end GPU; sub-$0.05/min wholesale.
  • Premium (Simba): higher compute; $0.08-$0.15/min typical.
  • State-space (Cartesia): efficient; competitive pricing.
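
To see what these per-minute figures mean at scale, multiply by volume. The sketch below uses the prices quoted above; the monthly volume of 100,000 minutes and the function name `monthly_cost` are assumptions for illustration.

```python
def monthly_cost(minutes_per_month, price_per_minute):
    """Synthesis spend in dollars for a given volume and per-minute price."""
    return minutes_per_month * price_per_minute

MINUTES = 100_000   # hypothetical monthly volume

budget = monthly_cost(MINUTES, 0.05)        # $0.05/min, the ceiling quoted for budget TTS
premium_low = monthly_cost(MINUTES, 0.08)   # low end of the premium range
premium_high = monthly_cost(MINUTES, 0.15)  # high end of the premium range

print(f"budget: up to ${budget:,.0f}")
print(f"premium: ${premium_low:,.0f} to ${premium_high:,.0f}")
```

At this volume the spread between tiers is several thousand dollars a month, which is why cost belongs in the evaluation even when it is not the first priority.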

Architecture affects the underlying compute cost behind these prices.

Choosing for voice agents

For voice agents, prioritize in this order:

  1. Latency (streaming first-audio under 200ms).
  2. Quality (natural-sounding).
  3. Multilingual (if needed).
  4. Cost (at scale).
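
First-audio latency is straightforward to measure against any streaming response. In the sketch below, `fake_tts_stream` is a hypothetical stand-in for a vendor's chunk iterator (its delay and chunk size are invented); swap in the real stream to test against the 200ms target.

```python
import time

def fake_tts_stream(num_chunks=3, delay_s=0.01):
    """Hypothetical stand-in for a vendor's streaming response."""
    for _ in range(num_chunks):
        time.sleep(delay_s)      # simulated synthesis + network time
        yield b"\x00" * 320      # placeholder audio bytes

def time_to_first_audio(stream):
    """Return (seconds until the first chunk arrived, total chunks received)."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _chunk in stream:
        if first_at is None:
            first_at = time.perf_counter() - start
        count += 1
    return first_at, count

ttfa, chunks = time_to_first_audio(fake_tts_stream())
print(f"first audio after {ttfa * 1000:.0f} ms, {chunks} chunks")
```

Run the measurement from the same region and network path your agent will use, since the figure includes network time as well as synthesis time.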

Cartesia, Simba, and Deepgram Aura are all competitive. Test on your content.

Context-aware synthesis

Newer architectures increasingly accept context:

  • Prior turns.
  • Speaker state (emotion).
  • Desired tone.

This enables more natural conversation. Not all vendors expose these inputs.
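
One way to keep context optional in your own code is a request object that drops unset fields. Everything here is hypothetical: `SynthesisRequest`, its field names, and `to_payload` do not correspond to any vendor's API and would need mapping to whatever the chosen vendor actually accepts.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class SynthesisRequest:
    """Hypothetical request shape; no vendor's API is implied."""
    text: str
    prior_turns: list = field(default_factory=list)  # earlier dialogue, oldest first
    speaker_emotion: Optional[str] = None            # e.g. "frustrated"
    desired_tone: Optional[str] = None               # e.g. "calm"

    def to_payload(self):
        # Drop unset or empty fields so a vendor without context support
        # receives only the text.
        return {k: v for k, v in asdict(self).items() if v not in (None, [])}

request = SynthesisRequest(
    text="I can help with that refund.",
    prior_turns=["My order never arrived."],
    desired_tone="calm",
)
payload = request.to_payload()
```

Keeping context in a single structure like this makes it easier to compare vendors that expose different subsets of it.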

The voice cloning ethics angle

Neural architectures make voice cloning trivial. This has regulatory implications.

See voice cloning ethics: a practical framework.

Research directions

  • Zero-shot multilingual. One voice, all languages.
  • Real-time voice conversion during call.
  • Context-aware emotion.
  • Fully streaming diffusion.
  • Extreme latency reduction.

These are active research areas for 2026-2028.

Evaluating vendors

Don't make architecture the primary criterion. Evaluate:

  • Quality on your content.
  • Latency in your conditions.
  • Voice cloning (if needed).
  • Language coverage.
  • Cost at scale.
  • Support and SLAs.

Architecture is interesting but not the deciding factor.

Common pitfalls

Buying on architecture hype. "Diffusion-based!" doesn't always mean better.

Ignoring real-world tests. An architecture can be strong in benchmarks and poor on your content.

Not considering voice cloning needs. Some architectures are better suited to cloning than others.

Locking in before testing. Multi-vendor evaluation is standard practice.

FAQ

Does architecture affect quality in blind tests? Less than it used to. The best systems of each architecture are very close.

Will open-source catch up completely? Probably. The current gap is mostly in operational polish, not core quality.

What about retrieval-augmented TTS? It is emerging and still experimental: it retrieves voice examples to use as context.

Do I need to understand architectures to evaluate TTS? No. Listening tests and use-case fit matter more than model internals.

What's next in TTS? Real-time voice conversion and emotional nuance are the frontiers.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
