Speech Technology

TTS, STT, voice cloning, latency engineering, and the hard parts of making AI sound human.

24 articles

🔊 Speech Tech

Streaming Audio Over WebRTC for Voice Agents

WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport: lower latency than webhooks, built-in encryption, native NAT traversal, and cross-platform support.

Tyler Weitzman · Mar 21, 2026 · 5 min
🔊 Speech Tech

How to Benchmark a Voice Agent's End-to-End Latency

Vendor-reported latency is a lab number. What matters for your voice agent is measured latency in your production environment, under real network conditions, with your actual content.

Tyler Weitzman · Mar 21, 2026 · 5 min
🔊 Speech Tech

Comparing Neural TTS Architectures

Neural TTS has evolved rapidly since 2018 — Tacotron gave way to WaveNet-style vocoders, which gave way to VALL-E-style neural codec models, which gave way to flow-matching and diffusion-based systems. Each architecture shift brought real quality improvements.

Tyler Weitzman · Mar 20, 2026 · 5 min
🔊 Speech Tech

Phoneme-Level Tuning for Voice Agents

Most voice agent quality work happens at the text level — prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language.

Tyler Weitzman · Mar 19, 2026 · 4 min
🔊 Speech Tech

Why Some Voices Sound Robotic Even in 2026

TTS in 2026 should sound natural. Most of the time it does. But occasionally a synthetic voice still gives itself away — a weird pause, a flat delivery, a strange pronunciation. Understanding why it happens, and what to do about it, is part of the voice engineering discipline.

Tyler Weitzman · Mar 19, 2026 · 5 min
🔊 Speech Tech

Voice Cloning for Customer Brands: A Buyer's Guide

Voice cloning has become cheap enough that every company with a voice channel is asking the same question: should we use a custom brand voice instead of a stock voice model?

Cliff Weitzman · Mar 18, 2026 · 5 min
🔊 Speech Tech

How Sample Rate Affects Voice Agent Quality

Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz.

Tyler Weitzman · Mar 18, 2026 · 5 min
🔊 Speech Tech

Echo Cancellation in Real-Time Voice AI

Echo in voice agent calls sounds like this: agent starts speaking, caller's speaker plays agent's voice, caller's microphone picks up agent's voice, the audio flows back to the agent, agent's STT transcribes its own speech, agent gets confused, conversation breaks down.

Tyler Weitzman · Mar 17, 2026 · 5 min
🔊 Speech Tech

How Background Noise Affects Voice Agent Accuracy

Production voice agents live in noisy environments. Callers call from cars, offices, restaurants, kitchens with running faucets, grocery stores with loud music, outdoor job sites. Real audio has sirens, barking dogs, other conversations, and TV in the background.

Tyler Weitzman · Mar 17, 2026 · 4 min
🔊 Speech Tech

Audio Codecs for Voice Agents: Opus, PCMU, and More

Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform.

Tyler Weitzman · Mar 16, 2026 · 5 min
🔊 Speech Tech

Diarization: Knowing Who's Speaking in a Voice Conversation

Speaker diarization is the task of answering "who spoke when?" Given audio with multiple speakers, diarization outputs time-stamped segments labeled by speaker. For most voice agent use cases — one caller, one agent — diarization is trivial (channel-based separation works).

Tyler Weitzman · Mar 16, 2026 · 5 min
🔊 Speech Tech

Voice Activity Detection in Production Voice Agents

Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. Because it sits upstream of STT, LLM, and TTS, bad VAD can ruin an otherwise excellent voice agent.

Tyler Weitzman · Mar 15, 2026 · 5 min
🔊 Speech Tech

The Engineering Behind Sub-Second Voice Agents

Sub-second voice agents, meaning end-to-end latency under 1000ms from the end of caller speech to the start of agent speech, used to be aspirational. In 2026 sub-second latency is table stakes for production voice AI, and leading deployments are hitting sub-500ms.

Tyler Weitzman · Mar 15, 2026 · 4 min
🔊 Speech Tech

How STT Handles Disfluencies and Filler Words

Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct.

Tyler Weitzman · Mar 14, 2026 · 5 min
🔊 Speech Tech

Multilingual TTS: Choosing a Voice Model

Multilingual text-to-speech in 2026 is good but uneven. English is excellent. Spanish, French, German, Mandarin, and Japanese are strong. Beyond the top 10 languages, quality drops noticeably.

Tyler Weitzman · Mar 14, 2026 · 4 min
🔊 Speech Tech

Why TTS Quality Plateaus and How to Push Past It

Every voice AI team eventually hits the TTS quality plateau. You pick a good TTS provider, tune some basics, and quality is... fine. Not amazing, not bad. Specific edge cases stay wrong. Certain phrases sound robotic. Numbers get weird. Tone lacks variation.

Tyler Weitzman · Mar 13, 2026 · 5 min
🔊 Speech Tech

How TTS Models Handle Numbers, Dates, and Acronyms

Numbers, dates, and acronyms are the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" dollars or just numbers? Is "Dr." "Doctor" or "Drive"?

Tyler Weitzman · Mar 13, 2026 · 5 min
🔊 Speech Tech

Streaming STT: How to Cut Recognition Latency

Non-streaming speech-to-text works for transcription — you submit audio, wait, get a transcript. That pattern is fine for batch use cases but fatal for voice agents.

Tyler Weitzman · Mar 12, 2026 · 5 min
🔊 Speech Tech

Streaming TTS: How to Cut First-Audio Latency

First-audio latency — the time from when the TTS receives text to when the caller hears the first sound — is one of the biggest levers in voice agent latency optimization.

Tyler Weitzman · Mar 12, 2026 · 5 min
🔊 Speech Tech

Latency Engineering for Real-Time Voice Agents

Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional.

Tyler Weitzman · Mar 11, 2026 · 5 min
🔊 Speech Tech

Voice Cloning Ethics: A Practical Framework

Voice cloning technology moved from research lab to commodity in roughly 18 months. The legal framework has lagged, industry ethical consensus has lagged further, and individual practitioners are left to make judgment calls in a space where the wrong choice harms real people.

Cliff Weitzman · Mar 10, 2026 · 6 min
🔊 Speech Tech

Voice Cloning: How It Works and Why It Matters

Voice cloning — the technology to replicate a specific person's voice from a short audio sample — has been one of the most disruptive developments in voice AI. In 2022 it was a research curiosity requiring hours of training data.

Tyler Weitzman · Mar 10, 2026 · 5 min
🔊 Speech Tech

Speech-to-Text Word Error Rate Explained

Word Error Rate — WER — is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number really means in production.

Tyler Weitzman · Mar 9, 2026 · 5 min
🔊 Speech Tech

Text-to-Speech in 2026: The State of the Art

Text-to-speech in 2026 has crossed a threshold most people alive today didn't expect to see. Blind A/B tests consistently show that 70–85% of listeners can't reliably distinguish synthetic voices from real recordings of humans.

Tyler Weitzman · Mar 9, 2026 · 4 min