Speech Technology
TTS, STT, voice cloning, latency engineering, and the hard parts of making AI sound human.
24 articles
Streaming Audio Over WebRTC for Voice Agents
WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport: lower latency than webhooks, built-in encryption, native NAT traversal, and cross-platform support.
How to Benchmark a Voice Agent's End-to-End Latency
Vendor-reported latency is a lab number. What matters for your voice agent is measured latency in your production environment, under real network conditions, with your actual content.
Comparing Neural TTS Architectures
Neural TTS has evolved rapidly since 2018: Tacotron-style acoustic models paired with WaveNet-style vocoders gave way to VALL-E-style neural codec models, which in turn gave way to flow-matching and diffusion-based systems. Each architecture shift brought real quality improvements.
Phoneme-Level Tuning for Voice Agents
Most voice agent quality work happens at the text level — prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language.
Why Some Voices Sound Robotic Even in 2026
TTS in 2026 should sound natural. Most of the time it does. But occasionally a synthetic voice still gives itself away — a weird pause, a flat delivery, a strange pronunciation. Understanding why it happens, and what to do about it, is part of the voice engineering discipline.
Voice Cloning for Customer Brands: A Buyer's Guide
Voice cloning has become cheap enough that every company with a voice channel is asking the same question: should we use a custom brand voice instead of a stock voice model?
How Sample Rate Affects Voice Agent Quality
Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz.
Echo Cancellation in Real-Time Voice AI
Echo in voice agent calls unfolds like this: the agent starts speaking, the caller's speaker plays the agent's voice, the caller's microphone picks it up, the audio flows back to the agent, the agent's STT transcribes its own speech, the agent gets confused, and the conversation breaks down.
How Background Noise Affects Voice Agent Accuracy
Production voice agents live in noisy environments. Callers call from cars, offices, restaurants, kitchens with running faucets, grocery stores with loud music, outdoor job sites. Real audio has sirens, barking dogs, other conversations, and TV in the background.
Audio Codecs for Voice Agents: Opus, PCMU, and More
Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform.
Diarization: Knowing Who's Speaking in a Voice Conversation
Speaker diarization is the task of answering "who spoke when?" Given audio with multiple speakers, diarization outputs time-stamped segments labeled by speaker. For most voice agent use cases — one caller, one agent — diarization is trivial (channel-based separation works).
Voice Activity Detection in Production Voice Agents
Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. Because it sits upstream of STT, LLM, and TTS, bad VAD can ruin an otherwise excellent voice agent.
The Engineering Behind Sub-Second Voice Agents
Sub-second voice agents, with end-to-end latency under 1000ms from caller speech end to agent speech start, used to be aspirational. In 2026 that is table stakes for production voice AI, and leading deployments are hitting sub-500ms.
How STT Handles Disfluencies and Filler Words
Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct.
Multilingual TTS: Choosing a Voice Model
Multilingual text-to-speech in 2026 is good but uneven. English is excellent. Spanish, French, German, Mandarin, and Japanese are strong. Beyond the top 10 languages, quality drops noticeably.
Why TTS Quality Plateaus and How to Push Past It
Every voice AI team eventually hits the TTS quality plateau. You pick a good TTS provider, tune some basics, and quality is... fine. Not amazing, not bad. Specific edge cases stay wrong. Certain phrases sound robotic. Numbers get weird. Tone lacks variation.
How TTS Models Handle Numbers, Dates, and Acronyms
Numbers, dates, and acronyms are the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" a dollar amount or just a string of digits? Is "Dr." "Doctor" or "Drive"?
Streaming STT: How to Cut Recognition Latency
Non-streaming speech-to-text works for transcription — you submit audio, wait, get a transcript. That pattern is fine for batch use cases but fatal for voice agents.
Streaming TTS: How to Cut First-Audio Latency
First-audio latency — the time from when the TTS receives text to when the caller hears the first sound — is one of the biggest levers in voice agent latency optimization.
Latency Engineering for Real-Time Voice Agents
Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional.
Voice Cloning Ethics: A Practical Framework
Voice cloning technology moved from research lab to commodity in roughly 18 months. The legal framework has lagged, the industry's ethical consensus has lagged further, and individual practitioners are left to make judgment calls in a space where the wrong choice harms real people.
Voice Cloning: How It Works and Why It Matters
Voice cloning — the technology to replicate a specific person's voice from a short audio sample — has been one of the most disruptive developments in voice AI. In 2022 it was a research curiosity requiring hours of training data.
Speech-to-Text Word Error Rate Explained
Word Error Rate — WER — is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number really means in production.
Text-to-Speech in 2026: The State of the Art
Text-to-speech in 2026 has crossed a threshold most people alive today didn't expect to see. Blind A/B tests consistently show that 70–85% of listeners can't reliably distinguish synthetic voices from real recordings of humans.