Neural TTS has evolved rapidly since 2018. Tacotron-style models paired with WaveNet-style vocoders gave way to VALL-E-style neural codec models, which were joined by flow-matching, diffusion, and state-space systems. Each architecture shift brought real quality improvements. For voice agent engineers evaluating TTS options in 2026, understanding the architectures underneath the vendors helps in making better decisions about latency, quality, voice cloning, and cost.

TL;DR

Major architectures in 2026: neural codec (VALL-E), state-space (Sonic), flow matching / diffusion, autoregressive.
Tradeoffs: quality, latency, cost, voice cloning ability.
Cartesia (Sonic) pioneered low-latency state-space models for TTS.
Simba and similar use neural codec + autoregressive approaches.
Open source (Orpheus, XTTS v2) is competitive.

The architectural evolution

Tacotron era (~2018-2020).

Text → mel-spectrogram (Tacotron).
Mel → waveform (WaveNet).
High quality, slow inference.

Neural codec era (~2021-2023).

Text → codec tokens (VALL-E, Bark).
Tokens → waveform (neural codec decoder).
Quality good, voice cloning easy, latency moderate.

State-space / flow matching (~2024-2026).

End-to-end text-to-audio.
Lower latency.
Quality matching best prior approaches.
Cartesia's Sonic is a key example.

Each shift:

Better quality at same compute.
Or same quality at less compute (lower latency / cost).

Neural codec models

How they work:

Audio encoded as discrete tokens (like LLM tokens, but for audio).
Text-to-token model generates token sequence from text.
Decoder converts tokens back to waveform.
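
The tokenize-then-decode idea can be sketched with a toy vector quantizer. This is a minimal illustration, not how any particular codec is built: the codebook here is random rather than learned, the frame and codebook sizes are arbitrary, and production codecs use trained encoders and decoders with several quantizer stages.

```python
import random


def nearest_index(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def encode(frames, codebook):
    """Audio frames -> discrete tokens (one codebook index per frame)."""
    return [nearest_index(frame, codebook) for frame in frames]


def decode(tokens, codebook):
    """Discrete tokens -> approximate frames (a plain codebook lookup)."""
    return [codebook[token] for token in tokens]


rng = random.Random(0)
DIM, CODEBOOK_SIZE = 8, 256  # toy sizes, chosen only for illustration

# Random codebook standing in for a learned one.
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

# Fake "audio frames": slightly noisy copies of three codebook entries.
frames = [[x + rng.gauss(0, 0.01) for x in codebook[i]] for i in (3, 200, 17)]

tokens = encode(frames, codebook)
reconstructed = decode(tokens, codebook)
print(tokens)
```

The text-to-token model's job is to predict sequences like `tokens` from text; the decoder's job is the lookup-and-reconstruct step.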

Advantages:

Compact representation.
Voice cloning: match a speaker's token patterns from reference.
Efficient inference.

Examples: VALL-E, Bark, Seed-TTS.

State-space models (Sonic-style)

How they work:

Direct text-to-audio with state-space architecture.
State-space models efficient for long sequences.
Generate audio in chunks with low latency.
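
Why state-space models stream well can be shown with a generic linear recurrence. This sketch is an assumption-heavy toy: the parameters `A`, `B`, `C` are fixed by hand rather than learned, and it says nothing about Sonic's actual internals. The point is only that a fixed-size state carries everything needed from one chunk to the next.

```python
def ssm_step(state, x, a, b, c):
    """One step of a diagonal linear state-space recurrence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over state channels)
    y_t = sum(c * h_t)
    """
    new_state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y


def run_chunk(state, chunk, a, b, c):
    """Process one chunk of inputs, returning the updated state and outputs."""
    outputs = []
    for x in chunk:
        state, y = ssm_step(state, x, a, b, c)
        outputs.append(y)
    return state, outputs


# Hand-picked toy parameters (a real model learns these).
A = [0.9, 0.5, -0.3]
B = [1.0, 0.5, 0.25]
C = [0.2, 0.3, 0.5]
inputs = [0.1 * i for i in range(12)]

# Process the whole sequence at once.
_, full = run_chunk([0.0] * 3, inputs, A, B, C)

# Process the same sequence in chunks of 4, carrying only the state.
state = [0.0] * 3
streamed = []
for start in range(0, len(inputs), 4):
    state, out = run_chunk(state, inputs[start:start + 4], A, B, C)
    streamed.extend(out)

print(streamed == full)
```

Memory per step stays constant no matter how long the sequence gets, which is what makes chunked, low-latency output practical.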

Advantages:

Extremely low first-audio latency (sub-100ms possible).
Efficient streaming.
Quality competitive.

Examples: Cartesia's Sonic.

Flow matching / diffusion

How they work:

Learn a mapping from noise to audio: a velocity field (flow matching) or a denoiser (diffusion).
Refine the sample over multiple steps during generation.
Quality often very high.
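
The iterative part can be sketched as Euler integration of a velocity field from noise toward a target. This is a toy under stated assumptions: `velocity` is a closed-form stand-in for a learned network and points at a single known target, so any step count lands on it. In a real model each step is a full network call, and using fewer steps trades quality for speed.

```python
import random


def velocity(x, t, target):
    """Closed-form velocity for a straight-line path toward one target.

    Stand-in for a learned network v(x, t); not a trained model.
    """
    return [(target_i - x_i) / (1.0 - t) for x_i, target_i in zip(x, target)]


def sample(target, num_steps, rng):
    """Integrate from Gaussian noise (t=0) to the target (t=1) with Euler steps."""
    x = [rng.gauss(0, 1) for _ in target]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                          # t stays strictly below 1
        v = velocity(x, t, target)          # one "network call" per step
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


target = [0.5, -0.25, 1.0]  # hypothetical 3-sample "waveform"
result = sample(target, num_steps=8, rng=random.Random(0))
print([round(value, 6) for value in result])
```

Latency scales with `num_steps`, which is why the step count is the main knob on these systems.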

Advantages:

Highest peak quality.
Good at fine acoustic details.

Disadvantages:

Latency typically higher (iterative).
More compute per sample.

Autoregressive models

How they work:

Generate audio token-by-token.
Each token conditional on previous.
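
The serial dependency looks like this in code. A minimal sketch: `next_token` is a hypothetical stand-in for a model forward pass, and the token rate in `forward_passes_for` is an assumed figure, not a property of any specific model.

```python
def next_token(history, vocab_size):
    """Hypothetical stand-in for one model forward pass.

    A real model would run a neural network here; this toy mixes the
    history into a deterministic token id so the loop is runnable.
    """
    return (sum(history) * 31 + len(history)) % vocab_size


def generate(prompt_tokens, num_new_tokens, vocab_size=1024):
    """Generate tokens one at a time; each step needs all previous tokens."""
    tokens = list(prompt_tokens)
    forward_passes = 0
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens, vocab_size))
        forward_passes += 1
    return tokens[len(prompt_tokens):], forward_passes


def forward_passes_for(audio_seconds, tokens_per_second):
    """Serial steps needed for a clip, given an assumed token rate."""
    return int(audio_seconds * tokens_per_second)


audio_tokens, passes = generate([1, 2, 3], num_new_tokens=50)
print(len(audio_tokens), passes)
print(forward_passes_for(audio_seconds=4, tokens_per_second=50))
```

Because step i cannot start until step i-1 finishes, total generation time grows with clip length, while streaming can still begin as soon as the first tokens are decoded.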

Advantages:

Conceptually simple.
Good quality.
Natural for voice cloning.

Disadvantages:

Serial generation; can be slow.
Longer sequences add latency.

Vendor architectures (approximate)

Cartesia (Sonic): state-space, low-latency.
Simba: hybrid; neural codec + autoregressive components.
OpenAI Realtime: integrated voice model (multiple approaches).
Google WaveNet / Neural2: evolved from WaveNet.
Deepgram Aura: proprietary; fast.
XTTS v2 (open): neural codec-based.
Orpheus (open): newer, high quality.

Specifics evolve; check current docs.

The quality-latency frontier

Tradeoffs:

Highest quality, higher latency: diffusion-based, or carefully tuned autoregressive.
Lowest latency, high quality: state-space (Cartesia Sonic).
Mid-range: most neural codec approaches.

For voice agents, latency matters. State-space or efficient neural codec usually wins.

Voice cloning by architecture

Neural codec: natural fit. Reference audio → tokens → cloning.
State-space: possible, sometimes less flexible.
Diffusion: possible.
Autoregressive: standard approach.

Neural codec leads on cloning ease.

Multilingual by architecture

All modern architectures support multilingual synthesis. Quality depends more on training data than on architecture.

Some architectures (VALL-E X, multilingual variants) have specific multilingual optimizations.

Emotional range

Mostly a training-data + conditioning question, not architecture. All modern architectures can be tuned for emotion; none fully solve it.

Open source vs proprietary

Open source (XTTS v2, Orpheus): neural codec, competitive quality, runs locally.
Proprietary: tends to have a slight edge on quality and a significant edge on UX.

The gap is narrowing. By 2028, open-source parity is likely for many use cases.

See open-source vs proprietary voice agent stacks.

Inference cost

Approximate per-minute cost, which tracks underlying compute:

Budget TTS (Deepgram Aura): low-end GPU; sub-$0.05/min wholesale.
Premium (Simba): higher compute; $0.08-$0.15/min typical.
State-space (Cartesia): efficient; competitive pricing.

Architecture affects underlying cost.
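
To see what these per-minute figures mean at scale, a quick back-of-the-envelope calculation helps. The prices are the approximate ones listed above; the monthly volume is a hypothetical number chosen only for illustration.

```python
def monthly_tts_cost(minutes_per_month, price_per_minute):
    """Monthly spend in dollars for a given synthesized-audio volume."""
    return minutes_per_month * price_per_minute


# Per-minute prices taken from the approximate figures above.
PRICES = {
    "budget (upper bound)": 0.05,
    "premium (low end)": 0.08,
    "premium (high end)": 0.15,
}

MINUTES_PER_MONTH = 100_000  # hypothetical volume, not from any vendor

for tier, price in PRICES.items():
    cost = monthly_tts_cost(MINUTES_PER_MONTH, price)
    print(f"{tier}: ${cost:,.0f}/month")
```

At this assumed volume, the spread between the cheapest and most expensive tier is about $10,000 per month, so small per-minute differences matter once traffic grows.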

Choosing for voice agents

For voice agents, in priority order:

Latency (streaming first-audio under 200ms).
Quality (natural-sounding).
Multilingual (if needed).
Cost (at scale).

Cartesia, Simba, and Deepgram Aura are all competitive. Test on your content.
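
Checking the first-audio target is easy to do yourself. A minimal sketch, assuming the vendor client gives you an iterator of audio byte chunks: `fake_stream` is a hypothetical stand-in for that iterator and is not any vendor's API.

```python
import time


def time_to_first_audio(chunks, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start, chunk
    raise RuntimeError("stream ended before producing audio")


def fake_stream(delay_s=0.05, n_chunks=3):
    """Hypothetical stand-in for a vendor's streaming response."""
    time.sleep(delay_s)           # simulated network + model latency
    for _ in range(n_chunks):
        yield b"\x00" * 320       # 320 bytes of silence per chunk


latency_s, first_chunk = time_to_first_audio(fake_stream(delay_s=0.02))
print(f"first audio after {latency_s * 1000:.0f} ms, {len(first_chunk)} bytes")
print("meets 200 ms target:", latency_s < 0.2)
```

Run it from the region your agent runs in and over many requests; a single measurement from a laptop says little about production latency.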

Context-aware synthesis

Newer architectures increasingly accept context:

Prior turns.
Speaker state (emotion).
Desired tone.

This enables more natural conversation. Not all vendors expose these inputs.

The voice cloning ethics angle

Neural architectures make voice cloning trivial. This has regulatory implications.

See voice cloning ethics: a practical framework.

Research directions

Zero-shot multilingual. One voice, all languages.
Real-time voice conversion during call.
Context-aware emotion.
Fully streaming diffusion.
Extreme latency reduction.

These are active research areas for 2026-2028.

Evaluating vendors

Don't make architecture the primary criterion. Evaluate:

Quality on your content.
Latency in your conditions.
Voice cloning (if needed).
Language coverage.
Cost at scale.
Support and SLAs.

Architecture is interesting but not the deciding factor.
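
One way to keep a multi-criterion comparison honest is a simple weighted scorecard. This is a sketch, not a recommended methodology: the weights, the vendor names, and every score below are hypothetical placeholders for numbers you would get from your own tests, each normalized to 0 to 1 with higher meaning better.

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (0 to 1, higher is better) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights; set these from your own priorities.
WEIGHTS = {"quality": 3, "latency": 3, "cloning": 1, "languages": 1, "cost": 2, "support": 1}

# Hypothetical vendors and scores, for illustration only.
candidates = {
    "vendor_a": {"quality": 0.9, "latency": 0.6, "cloning": 0.8,
                 "languages": 0.7, "cost": 0.5, "support": 0.8},
    "vendor_b": {"quality": 0.8, "latency": 0.9, "cloning": 0.5,
                 "languages": 0.7, "cost": 0.7, "support": 0.6},
}

ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name], WEIGHTS),
                reverse=True)
for name in ranked:
    print(name, round(weighted_score(candidates[name], WEIGHTS), 3))
```

Architecture does not appear as a criterion; it only shows up indirectly through the measured latency, quality, and cost.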

Common pitfalls

Buying on architecture hype. "Diffusion-based!" doesn't always mean better.

Ignoring real-world tests. An architecture can be strong in benchmarks and poor on your content.

Not considering voice cloning needs. Some architectures are better at cloning than others.

Locking in before testing. Multi-vendor evaluation is standard practice.

FAQ

Does architecture affect quality in blind tests? Less than it used to. Best of each architecture is very close.

Will open-source catch up completely? Probably. Current gap mostly in operational polish, not core quality.

What about retrieval-augmented TTS? Emerging. Pulls from voice examples for context. Experimental.

Do I need to understand architectures to evaluate TTS? No. Ear + use-case fit matter more than model internals.

What's next in TTS? Real-time voice conversion and emotional nuance are the frontiers.

Comparing Neural TTS Architectures


More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support
