Comparing Neural TTS Architectures
Neural TTS has evolved rapidly since 2018: Tacotron-style acoustic models paired with WaveNet-style vocoders gave way to VALL-E-style neural codec models, which in turn gave way to state-space, flow-matching, and diffusion-based systems. Each architecture shift brought real quality improvements. For voice agent engineers evaluating TTS options in 2026, understanding the architectures underneath the vendors helps make better decisions about latency, quality, voice cloning, and cost.
TL;DR
- Major architectures in 2026: neural codec (VALL-E), state-space (Sonic), flow matching, diffusion, autoregressive.
- Tradeoffs: quality, latency, cost, voice cloning ability.
- Cartesia (Sonic) pioneered low-latency state-space models for TTS.
- Simba and similar use neural codec + autoregressive approaches.
- Open-source models (Orpheus, XTTS v2) are competitive.
The architectural evolution
Tacotron era (~2018-2020).
- Text → mel-spectrogram (Tacotron).
- Mel → waveform (WaveNet).
- High quality, slow inference.
Neural codec era (~2021-2023).
- Text → codec tokens (VALL-E, Bark).
- Tokens → waveform (neural codec decoder).
- Quality good, voice cloning easy, latency moderate.
State-space / flow matching (~2024-2026).
- End-to-end text-to-audio.
- Lower latency.
- Quality matching best prior approaches.
- Cartesia's Sonic is a key example.
Each shift:
- Better quality at same compute.
- Or same quality at less compute (lower latency / cost).
Neural codec models
How they work:
- Audio encoded as discrete tokens (like LLM tokens, but for audio).
- Text-to-token model generates token sequence from text.
- Decoder converts tokens back to waveform.
Advantages:
- Compact representation.
- Voice cloning: match a speaker's token patterns from reference.
- Efficient inference.
Examples: VALL-E, Bark, Seed-TTS.
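The encode/decode round trip above can be sketched with a toy vector quantizer. This is a minimal illustration, not how any named model is implemented: the four-entry `CODEBOOK` and the two-dimensional "frames" are made-up stand-ins for a learned codec, which would use far larger codebooks and learned encoders and decoders.

```python
# Toy vector quantizer standing in for a learned neural codec.
# CODEBOOK values are invented for illustration only.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]


def encode(frames):
    """Map each frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, code))
            for code in CODEBOOK
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def decode(tokens):
    """Look each token up in the codebook to get approximate frames back."""
    return [CODEBOOK[t] for t in tokens]


# Hypothetical audio features; real codecs operate on learned latents.
frames = [(0.1, -0.05), (0.9, 0.2), (0.2, 0.8), (1.1, 0.95)]
tokens = encode(frames)         # discrete IDs a text-to-token model would predict
reconstructed = decode(tokens)  # lossy reconstruction of the input frames
```

The point of the sketch is the interface: once audio is a sequence of integer IDs, a text-to-token model can be trained much like a language model, and the decoder handles the return trip to a waveform.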
State-space models (Sonic-style)
How they work:
- Direct text-to-audio with state-space architecture.
- State-space models efficient for long sequences.
- Generate audio in chunks with low latency.
Advantages:
- Extremely low first-audio latency (sub-100ms possible).
- Efficient streaming.
- Quality competitive.
Examples: Cartesia's Sonic.
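A rough sketch of why state-space models stream well: the recurrence carries a fixed-size state, so a chunk can be processed as soon as it arrives and the next chunk resumes from the saved state. The scalar coefficients `a`, `b`, `c` below are arbitrary assumptions; production models use learned, high-dimensional parameters, and nothing here reflects Sonic's actual internals.

```python
def ssm_stream(inputs, state=0.0, a=0.9, b=0.1, c=1.0):
    """Run a scalar linear state-space recurrence over one chunk.

    x_t = a * x_{t-1} + b * u_t
    y_t = c * x_t

    Returns (outputs, final_state) so a later chunk can resume.
    """
    outputs = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs, state


signal = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

# Process everything at once.
full, _ = ssm_stream(signal)

# Process in two chunks, carrying only the state between them.
first, carried = ssm_stream(signal[:3])
second, _ = ssm_stream(signal[3:], state=carried)
streamed = first + second
```

Chunked and full processing give identical outputs, and the per-step cost does not grow with how much audio came before. That is the property that makes low first-audio latency practical.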
Flow matching / diffusion
How they work:
- Learn a flow from noise to audio.
- Iteratively denoise during generation.
- Quality often very high.
Advantages:
- Highest peak quality.
- Good at fine acoustic details.
Disadvantages:
- Latency typically higher (iterative).
- More compute per sample.
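The iterative cost can be sketched as a simple Euler integration from noise toward a target. Everything here is a toy assumption: `toy_velocity` stands in for a trained network and points straight at a known target, which a real model never has access to. What carries over is the loop shape, where each step is one full model evaluation.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a trained network: velocity along the straight path to target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(target, num_steps, seed=0):
    """Integrate from Gaussian noise toward target with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / num_steps
    evaluations = 0
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(x, t, target)  # one "network call" per step
        evaluations += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, evaluations


target = [0.5, -0.25]
result, calls = sample(target, num_steps=8)
```

In this toy the path is a straight line, so even a few steps land on the target. With a learned velocity field the step count becomes a quality-versus-latency knob, and `calls` is what shows up on the latency budget.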
Autoregressive models
How they work:
- Generate audio token-by-token.
- Each token conditional on previous.
Advantages:
- Conceptually simple.
- Good quality.
- Natural for voice cloning.
Disadvantages:
- Serial generation; can be slow.
- Longer sequences add latency.
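A back-of-the-envelope way to see the serial cost: one second of audio needs a fixed number of tokens, and each token waits on the previous one. The frame rate, tokens per frame, and per-step time below are illustrative assumptions, not measurements of any particular model.

```python
def generation_time_ms(audio_seconds, frame_rate_hz, step_ms, tokens_per_frame=1):
    """Serial decode time: every token is one sequential model step."""
    steps = audio_seconds * frame_rate_hz * tokens_per_frame
    return steps * step_ms


def real_time_factor(frame_rate_hz, step_ms, tokens_per_frame=1):
    """Seconds of compute per second of audio; below 1.0 keeps up with playback."""
    return generation_time_ms(1.0, frame_rate_hz, step_ms, tokens_per_frame) / 1000.0


# Assumed: 75 frames per second of audio, 5 ms per decode step.
rtf_single = real_time_factor(frame_rate_hz=75, step_ms=5.0)

# Same assumptions, but 4 tokens decoded serially per frame.
rtf_multi = real_time_factor(frame_rate_hz=75, step_ms=5.0, tokens_per_frame=4)
```

Under these assumptions the single-token case runs faster than real time (0.375), while four serial tokens per frame falls behind playback (1.5). This is why token rate and per-step speed matter so much for autoregressive designs.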
Vendor architectures (approximate)
- Cartesia (Sonic): state-space, low-latency.
- Simba: hybrid; neural codec + autoregressive components.
- OpenAI Realtime: integrated voice model (multiple approaches).
- Google WaveNet / Neural2: evolved from WaveNet.
- Deepgram Aura: proprietary; fast.
- XTTS v2 (open): neural codec-based.
- Orpheus (open): newer, high quality.
Specifics evolve; check current docs.
The quality-latency frontier
Tradeoffs:
- Highest quality, higher latency: diffusion-based, or carefully tuned autoregressive.
- Lowest latency, high quality: state-space (Cartesia Sonic).
- Mid-range: most neural codec approaches.
For voice agents, latency matters. State-space or efficient neural codec usually wins.
Voice cloning by architecture
- Neural codec: natural fit. Reference audio → tokens → cloning.
- State-space: possible, sometimes less flexible.
- Diffusion: possible.
- Autoregressive: standard approach.
Neural codec leads on cloning ease.
Multilingual by architecture
All modern architectures support multilingual synthesis. Quality depends more on training data than on architecture.
Some architectures (VALL-E X, multilingual variants) have specific multilingual optimizations.
Emotional range
Mostly a training-data + conditioning question, not architecture. All modern architectures can be tuned for emotion; none fully solve it.
Open source vs proprietary
- Open source (XTTS v2, Orpheus): neural codec, competitive quality, runs locally.
- Proprietary: tend to have slight edge on quality, significant edge on UX.
Gap narrowing. By 2028, open-source parity likely for many use cases.
See open-source vs proprietary voice agent stacks.
Inference cost
Approximate per-minute compute and cost:
- Budget TTS (Deepgram Aura): low-end GPU; sub-$0.05/min wholesale.
- Premium (Simba): higher compute; $0.08-$0.15/min typical.
- State-space (Cartesia): efficient; competitive pricing.
Architecture affects underlying cost.
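To see what those per-minute figures mean at volume, a quick projection helps. The rates are the approximate ones listed above; the 100,000 minutes per month is an assumed workload, and prices are kept in integer cents to avoid floating-point noise.

```python
def monthly_cost_dollars(minutes_per_month, cents_per_minute):
    """Project monthly TTS spend from a per-minute rate given in cents."""
    return minutes_per_month * cents_per_minute / 100


MINUTES = 100_000  # assumed monthly volume, for illustration only

budget = monthly_cost_dollars(MINUTES, 5)         # $0.05/min
premium_low = monthly_cost_dollars(MINUTES, 8)    # $0.08/min
premium_high = monthly_cost_dollars(MINUTES, 15)  # $0.15/min

print(f"budget:  ${budget:,.0f}")
print(f"premium: ${premium_low:,.0f} to ${premium_high:,.0f}")
```

At this volume the spread between tiers is several thousand dollars a month, which is usually the point where architecture-driven cost differences start to matter.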
Choosing for voice agents
For voice agents, priority:
- Latency (streaming first-audio under 200ms).
- Quality (natural-sounding).
- Multilingual (if needed).
- Cost (at scale).
Cartesia, Simba, Deepgram Aura all competitive. Test on your content.
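A minimal sketch of measuring streaming first-audio latency against the 200ms target. `fake_tts_stream` is a hypothetical stand-in for a vendor's streaming response; in a real test the timer should start when the request is sent, and the iterator would come from the vendor's SDK or HTTP client.

```python
import time


def fake_tts_stream(first_chunk_delay_s=0.05, chunks=3):
    """Hypothetical vendor stream: waits, then yields placeholder audio bytes."""
    time.sleep(first_chunk_delay_s)
    for _ in range(chunks):
        yield b"\x00" * 320


def time_to_first_audio(stream):
    """Return (milliseconds until the first chunk arrives, that chunk)."""
    start = time.perf_counter()
    first_chunk = next(iter(stream))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first_chunk


elapsed_ms, first_chunk = time_to_first_audio(fake_tts_stream())
within_budget = elapsed_ms < 200.0
```

Run this many times, from the region where the agent is deployed, and look at the tail (p95 or p99) and not only the average. Callers notice the slow responses.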
Context-aware synthesis
Newer architectures increasingly accept context:
- Prior turns.
- Speaker state (emotion).
- Desired tone.
This enables more natural conversation. Not all vendors expose these controls.
The voice cloning ethics angle
Neural architectures make voice cloning trivial. This has regulatory implications.
See voice cloning ethics: a practical framework.
Research directions
- Zero-shot multilingual. One voice, all languages.
- Real-time voice conversion during call.
- Context-aware emotion.
- Fully streaming diffusion.
- Extreme latency reduction.
These are active research areas for 2026-2028.
Evaluating vendors
Don't make architecture the primary criterion. Evaluate:
- Quality on your content.
- Latency in your conditions.
- Voice cloning (if needed).
- Language coverage.
- Cost at scale.
- Support and SLAs.
Architecture is interesting but not the deciding factor.
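One way to keep a multi-vendor evaluation honest is a weighted scorecard over the criteria above. The weights, the 1-5 scores, and the `vendor_a` / `vendor_b` names are placeholders, not assessments of any real product; fill them in from your own tests.

```python
# Placeholder weights (sum to 100); adjust to your priorities.
WEIGHTS = {
    "quality": 30,
    "latency": 30,
    "cloning": 10,
    "languages": 10,
    "cost": 15,
    "support": 5,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of 1-5 scores across all criteria."""
    total = sum(weights[name] * scores[name] for name in weights)
    return total / sum(weights.values())


# Hypothetical scores from your own listening tests and measurements.
candidates = {
    "vendor_a": {"quality": 5, "latency": 3, "cloning": 4,
                 "languages": 4, "cost": 2, "support": 4},
    "vendor_b": {"quality": 4, "latency": 5, "cloning": 3,
                 "languages": 3, "cost": 4, "support": 3},
}

ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]),
                reverse=True)
```

Note that architecture does not appear as a criterion. It only shows up indirectly, through the latency, quality, and cost you measure.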
Common pitfalls
Buying on architecture hype. "Diffusion-based!" Doesn't always mean better.
Ignoring real-world tests. An architecture can be strong in benchmarks and still perform poorly on your content.
Not considering voice cloning needs. Some architectures better than others.
Locking in before testing. Multi-vendor evaluation is standard practice.
Related reading
- Text-to-Speech in 2026: The State of the Art
- Phoneme-Level Tuning for Voice Agents
- Why Some Voices Sound Robotic Even in 2026
- How TTS Models Handle Numbers, Dates, and Acronyms
- Latency Engineering for Real-Time Voice Agents
FAQ
Does architecture affect quality in blind tests? Less than it used to. Best of each architecture is very close.
Will open-source catch up completely? Probably. Current gap mostly in operational polish, not core quality.
What about retrieval-augmented TTS? Emerging. Pulls from voice examples for context. Experimental.
Do I need to understand architectures to evaluate TTS? No. Ear + use-case fit matter more than model internals.
What's next in TTS? Real-time voice conversion and emotional nuance are the frontiers.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
