
Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that serves millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
Articles by Tyler Weitzman (98)
Open-Source vs Proprietary Voice Agent Stacks
The open-source voice AI stack in 2026 is genuinely good. Whisper and its derivatives handle STT. Open-weight LLMs like Llama 3/4, Qwen, and Mistral handle the reasoning. Open-source TTS (XTTS, StyleTTS, Orpheus-class) handles output.
Build vs Buy: When to Build Your Own Voice Agent
Build-vs-buy for voice agents in 2026 is a different conversation than it was two years ago. Then, the open-source stack was rough and most serious deployments ended up building.
Voice Agents for Developer Support
Developer support is a strange category. Developers don't generally want to call anyone. They want Stack Overflow, they want clear docs, they want an LLM that can read their code.
Compliance and Accessibility for Government Voice AI
Government voice AI has two compliance layers most commercial deployments don't: a set of federal accessibility standards that are legally binding (Section 508, ADA), and a patchwork of privacy and security rules that vary by agency, level of government, and type of data.
Voice Agents for Loan Servicing and Collections
Loan servicing and collections is one of the highest-volume, most-regulated phone channels in finance. Every month, hundreds of millions of calls flow between lenders and borrowers about payments due, payments missed, hardship, and resolution.
Compliance Considerations for AI Voice in Banking
Banking is the most heavily regulated industry where voice AI is seeing meaningful deployment. A misstep on compliance here doesn't just create legal exposure — it triggers regulator attention that can chill your entire program.
HIPAA Compliance for AI Voice Agents in Healthcare
HIPAA compliance is the first gate for any voice AI deployment in US healthcare. Get it wrong and you're exposed to federal penalties, state attorney-general actions, and class-action litigation.
How to Integrate Voice Agents with a Custom REST API
Most voice agent integrations are with off-the-shelf systems — Salesforce, HubSpot, Zendesk, Stripe. But eventually every production deployment needs to integrate with a custom internal API — the billing system, the proprietary order management, the ops dashboard that only your…
Sending Voice Agent Transcripts to Slack
Slack is where most teams live in 2026, and for voice agent deployments, getting call transcripts and key events into Slack closes a critical ops loop. Escalations land in the right channel with context. QA reviews happen where the team already works.
Connecting Voice Agents to Snowflake or BigQuery
Voice agent deployments generate a lot of data. Every call produces a transcript, metadata (duration, outcome, caller info), function-call traces, sentiment signals, and operational metrics.
How to Port a Phone Number to Your Voice Agent
You already have a phone number that customers know. Your main line, published on your website, business cards, Google. You can't afford to change it.
Setting Up Toll-Free Verification for AI Calling
Toll-free numbers (800, 888, 877, 866, 855, 844, 833) carry a compliance requirement that catches many voice AI deployments off-guard: before you can reliably send SMS or initiate high-volume outbound voice traffic from a toll-free number, you need carrier verification.
SIP vs WebRTC for Voice Agents
SIP and WebRTC are the two dominant technologies for real-time voice in 2026. Most voice agent deployments use one, the other, or both. Deciding which to use for a given integration depends on where the call originates, what network conditions you expect, and how much control…
How to Use Twilio Studio with AI Voice Agents
Twilio Studio is Twilio's visual flow builder for call (and SMS) workflows. It lets you drag-and-drop a call flow — gather digits, branch on logic, route to agents, trigger webhooks — without writing code.
Bring Your Own Twilio: Pros, Cons, and Setup
Bring Your Own Twilio (BYO) is the architecture where your voice agent platform (Vapi, Retell, Simba) connects to your Twilio account rather than using the vendor's managed Twilio setup.
Connecting Voice Agents to Stripe for Payments
Taking payments over the phone is a workflow that voice agents get asked to handle constantly — bill payments, copays, service fees, subscription changes, you name it.
Sending SMS Follow-Ups from Voice Agents
SMS follow-ups are one of the highest-ROI additions to any voice agent deployment. The caller just had a conversation; they know the appointment time, the tracking link, the next step. But people forget.
Calendar Integrations: Cal.com, Google, Outlook
Voice agents that book, reschedule, or cancel appointments live or die on their calendar integration. A voice agent that guesses at availability or writes to the wrong calendar breaks the workflow it was built for.
Webhooks 101 for Voice Agents
Webhooks are the backbone of voice agent integrations. When your voice agent needs to call a CRM, update a ticket, send an SMS, or trigger any external action, it does so via HTTP — and most of those HTTP calls are structured as webhooks or webhook-like REST operations.
Connecting Voice Agents to Zendesk
Zendesk is the dominant ticketing and support platform for mid-market and enterprise customer service, and it's where most voice agent-handled support interactions need to land.
Connecting Voice Agents to Intercom
Intercom is the messaging-first customer communication platform that a lot of SaaS companies run their support on. Historically chat-centric, it's expanded to cover email, a light voice layer, and AI-native tools (Fin).
Connecting Voice Agents to HubSpot CRM
HubSpot is the CRM of choice for a large share of SMB and mid-market SaaS companies, and increasingly for mid-market customers in other verticals. Its API is cleaner than Salesforce's, its data model is simpler, and integrations are generally less painful.
Connecting Voice Agents to Salesforce CRM
Salesforce is the de facto CRM for most mid-market and enterprise companies deploying voice AI. If your agent is doing anything meaningful in a business context — handling sales inquiries, supporting customers, qualifying leads, processing orders — there's a good chance the…
SIP Trunking 101 for Voice Agent Builders
SIP trunking is the unsexy plumbing that makes voice agents work at scale. It's the protocol and infrastructure that lets calls move between the public phone network and your voice AI without relying on a telephony provider's proprietary APIs.
Twilio + Voice Agents: A Complete Guide
Twilio is the dominant telephony backbone under most voice agent deployments. If you're building on Vapi, Retell, Simba, or OpenAI Realtime, odds are your calls flow through Twilio at some point.
Streaming Audio Over WebRTC for Voice Agents
WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport: lower latency than WebSockets, built-in encryption, native NAT traversal, cross-platform.
How to Benchmark a Voice Agent's End-to-End Latency
Vendor-reported latency is a lab number. What matters for your voice agent is measured latency in your production environment, under real network conditions, with your actual content.
Comparing Neural TTS Architectures
Neural TTS has evolved rapidly since 2018: Tacotron-style acoustic models paired with WaveNet-style vocoders gave way to VALL-E-style neural codec models, which gave way to flow-matching and diffusion-based systems. Each architecture shift brought real quality improvements.
Phoneme-Level Tuning for Voice Agents
Most voice agent quality work happens at the text level — prompt engineering, SSML, pronunciation dictionaries. But sometimes the right layer is deeper: phonemes, the individual sound units of spoken language.
Why Some Voices Sound Robotic Even in 2026
TTS in 2026 should sound natural. Most of the time it does. But occasionally a synthetic voice still gives itself away — a weird pause, a flat delivery, a strange pronunciation. Understanding why it happens, and what to do about it, is part of the voice engineering discipline.
How Sample Rate Affects Voice Agent Quality
Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz.
Echo Cancellation in Real-Time Voice AI
Echo in voice agent calls sounds like this: agent starts speaking, caller's speaker plays agent's voice, caller's microphone picks up agent's voice, the audio flows back to the agent, agent's STT transcribes its own speech, agent gets confused, conversation breaks down.
How Background Noise Affects Voice Agent Accuracy
Production voice agents live in noisy environments. Callers call from cars, offices, restaurants, kitchens with running faucets, grocery stores with loud music, outdoor job sites. Real audio has sirens, barking dogs, other conversations, and TV in the background.
Audio Codecs for Voice Agents: Opus, PCMU, and More
Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform.
Diarization: Knowing Who's Speaking in a Voice Conversation
Speaker diarization is the task of answering "who spoke when?" Given audio with multiple speakers, diarization outputs time-stamped segments labeled by speaker. For most voice agent use cases — one caller, one agent — diarization is trivial (channel-based separation works).
Voice Activity Detection in Production Voice Agents
Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. It sits upstream of STT, LLM, and TTS, so bad VAD can ruin an otherwise excellent voice agent.
The Engineering Behind Sub-Second Voice Agents
Sub-second voice agents — end-to-end latency under 1000ms from caller speech end to agent speech start — used to be aspirational. In 2026 it's table stakes for production voice AI, and leading deployments are hitting sub-500ms.
How STT Handles Disfluencies and Filler Words
Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct.
Multilingual TTS: Choosing a Voice Model
Multilingual text-to-speech in 2026 is good but uneven. English is excellent. Spanish, French, German, Mandarin, Japanese are strong. Beyond the top 10 languages, quality drops noticeably.
Why TTS Quality Plateaus and How to Push Past It
Every voice AI team eventually hits the TTS quality plateau. You pick a good TTS provider, tune some basics, and quality is... fine. Not amazing, not bad. Specific edge cases stay wrong. Certain phrases sound robotic. Numbers get weird. Tone lacks variation.
How TTS Models Handle Numbers, Dates, and Acronyms
Numbers, dates, and acronyms are the trickiest content for TTS. "Dr. Smith will see you on 3/12/2026 for your $47.50 copay" seems simple until you realize the model has to decide: is "3/12" a date or a fraction? Is "$47.50" dollars or just numbers? Is "Dr." "Doctor" or "Drive"?
Streaming STT: How to Cut Recognition Latency
Non-streaming speech-to-text works for transcription — you submit audio, wait, get a transcript. That pattern is fine for batch use cases but fatal for voice agents.
Streaming TTS: How to Cut First-Audio Latency
First-audio latency — the time from when the TTS receives text to when the caller hears the first sound — is one of the biggest levers in voice agent latency optimization.
Latency Engineering for Real-Time Voice Agents
Latency is what separates voice agents that feel conversational from those that feel broken. Humans expect responses within 700ms of finishing a sentence — anything longer triggers a "did they hear me?" reaction. Sub-500ms feels alive. Sub-300ms feels exceptional.
Voice Cloning: How It Works and Why It Matters
Voice cloning — the technology to replicate a specific person's voice from a short audio sample — has been one of the most disruptive developments in voice AI. In 2022 it was a research curiosity requiring hours of training data.
Speech-to-Text Word Error Rate Explained
Word Error Rate — WER — is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number really means in production.
Text-to-Speech in 2026: The State of the Art
Text-to-speech in 2026 has crossed a threshold most people alive today didn't expect to see. Blind A/B tests consistently show that 70–85% of listeners can't reliably distinguish synthetic voices from real recordings of humans.
Connecting Voice Lead Qual to HubSpot
HubSpot is the CRM of choice for most SMB and mid-market SaaS, and voice-AI-qualified leads land there more often than in Salesforce for that segment. HubSpot's data model is cleaner than Salesforce's, the API is friendlier, and the integration workload is lower.
Connecting Voice Lead Qual to Salesforce
Salesforce is where most enterprise sales teams live. For voice-AI-qualified leads to generate real pipeline, they have to land in Salesforce cleanly — right object type, right owner, right stage, right custom fields populated.
How to Score Leads From a Voice Conversation
A voice conversation is a rich source of signal for lead scoring — far richer than a form submission or a website visit. The caller tells you their role, their company, their need, their timeline, and their tone.
Outbound Agent Metrics That Actually Matter
Outbound voice AI deployments can produce dashboards dense with metrics. Calls dialed, calls answered, average handle time, average time to first word, sentiment score, coverage rate, disposition breakdown, opt-out rate, compliance incident rate. Many of these are interesting.
How to Build a Compliant Outbound Voice Agent in 30 Days
Getting an outbound AI voice agent live in 30 days sounds ambitious — but it's achievable for focused deployments. The critical path is compliance setup (TCPA, A2P 10DLC, number verification), not technology. The voice AI itself can be configured in a week.
Caller ID and Trust: Why Numbers Get Marked as Spam
You deploy an outbound voice AI campaign. First week goes great. Second week, answer rates drop 40%. Third week, your phone numbers start showing up as "Scam Likely" on caller ID. What happened?
DTMF and IVR Navigation for Outbound Voice Agents
Outbound voice agents calling businesses often encounter IVR systems — "press 1 for sales, press 2 for support" phone trees that the AI needs to navigate to reach the right person.
A2P 10DLC Explained for Voice Agent Builders
If your voice agent sends SMS from a standard 10-digit US phone number, A2P 10DLC compliance is part of your stack — whether you know it or not.
TCPA Compliance for AI-Powered Outbound Calls
TCPA — the Telephone Consumer Protection Act — is the federal law that governs automated and pre-recorded outbound calls in the United States. AI-generated voice calls fall squarely under TCPA's stricter rules for "artificial or prerecorded voice" messages.
How to Tag and Categorize AI Conversations
Conversation tagging is what turns thousands of AI-handled calls into actionable insight. Every call should get tagged with intent, outcome, sentiment, and any anomalies — automatically, consistently, and in a way that supports both real-time routing and after-the-fact…
Quality Assurance for AI Voice Support
Quality assurance for AI voice support is mostly the same as QA for human contact centers — but with different staffing, different tools, and a much higher possible cadence. Done well, AI QA closes the loop between observation and prompt iteration in days instead of months.
Why First-Contact Resolution Is the North Star for AI Support
If you can only track one metric for AI customer support, it should be First-Contact Resolution (FCR). Not deflection. Not handle time. Not even CSAT.
CSAT for AI Agents: Benchmarks and Frameworks
Customer Satisfaction (CSAT) is the closest thing to a north star for support agents. Tracking it for AI agents specifically — and comparing it against human-handled equivalents — is the single most useful operational habit for any team running customer-facing AI.
What Is AI Deflection (and How to Measure It)
"Deflection" is the most-cited and most-misunderstood metric in AI customer support. Vendors quote 80% deflection rates. Buyers don't always know what that means or how to verify it.
How to Handle Personally Identifiable Information in Voice Agents
Voice agents collect PII constantly — names, phone numbers, addresses, dates of birth, account numbers, sometimes even social security numbers and credit cards. Handling this responsibly isn't optional.
Open-Source vs Closed-Source LLMs for Voice Agents
The open-source LLM ecosystem caught up to closed models faster than anyone expected. Llama 3.3, Mistral, Qwen — all good enough for most voice agent use cases.
How LLMs Decide What to Say Next in a Voice Conversation
Step inside the LLM's "head" for a moment and look at how it picks what to say on each turn of a voice call. The answer is less mysterious than the term "AI" suggests and more interesting than "next-token prediction" implies.
Red-Teaming Your Voice Agent
Red-teaming is the practice of deliberately trying to break your voice agent before adversaries (or just confused customers) do it for you. Most teams skip it. The ones that do it find embarrassing failures fast — and fix them before they cost real money.
Building a Conversation Memory Layer for Voice Agents
The model has no memory beyond what you put in its context window. For a 5-minute support call this is fine. For longer calls, multi-call interactions, or agents that need to remember preferences across sessions, you need an explicit memory layer.
Why Context Windows Matter Less Than You Think for Voice
LLM marketing has been all about context window expansion — 128K, 200K, 1M, 2M tokens. For voice agents, this race mostly doesn't matter. Voice conversations rarely exceed 5,000 tokens of meaningful context.
How to A/B Test Voice Agent Prompts
Most teams don't A/B test voice agent prompts. They tweak the prompt, listen to a few calls, and ship if it "feels better." This works until it doesn't — until a tweak that helps one use case silently breaks another.
Streaming LLM Outputs to Voice: The Engineering
Streaming the LLM's output to TTS as it generates is the difference between a snappy voice agent and a sluggish one. The basic idea is simple: don't wait for the model to finish thinking before you start speaking.
The Role of Embeddings in Voice Agent Knowledge
Embeddings are the numerical representations of text that make retrieval-augmented generation work. Most voice agent builders never have to think about embeddings directly — their platform handles them.
Multi-Agent Architectures for Customer Service
When a single agent gets too complex — too many intents, too many tools, conflicting style requirements — teams reach for multi-agent architectures. A "router" or "supervisor" routes turns to specialized sub-agents (a billing expert, a tech support expert, a returns expert).
How to Stop a Voice Agent from Hallucinating
Hallucination is the failure mode that scares everyone off voice AI faster than anything else. The agent confidently tells a customer the wrong policy, the wrong price, or makes up a refund.
Designing System Prompts for Multi-Turn Voice Conversations
The system prompt is the single most-iterated artifact in any production voice agent. It's where most of the agent's personality, rules, and reliability live. Most teams underinvest here, treating the prompt as a "set it and forget it" string.
Tool Use vs Function Calling: What's the Difference?
You'll hear "tool use" and "function calling" used interchangeably in voice agent docs. They mean roughly the same thing. The reason both terms exist is mostly historical — different vendors named the same idea differently.
Why Smaller LLMs Often Win for Voice Agents
There's a strong reflex in AI: bigger model = better outcome. For voice agents specifically, this reflex is often wrong. A fast 8B parameter model with sub-200ms time-to-first-token can outperform a 70B frontier model on nearly every voice metric that matters.
Guardrails for Voice Agents: A Pragmatic Take
Guardrails are the rules that prevent your voice agent from doing things it shouldn't — agreeing to refunds it can't authorize, giving medical advice, leaking PII, or making up policies.
Retrieval-Augmented Generation for Voice Agents
RAG — retrieval-augmented generation — is the standard pattern for grounding an LLM in a specific knowledge base. For voice agents, RAG works the same as for chatbots, with one crucial difference: every millisecond of retrieval latency shows up in the conversation.
LLM Evaluation for Conversational Agents
You can't tune what you can't measure. Evaluation is the unsexy work that separates voice agent teams shipping production-quality work from teams flying blind. Most teams underinvest here for the first few months, then have a wake-up moment when something breaks.
How to Give a Voice Agent Long-Term Memory
By default, voice agents have no memory beyond the current call. The caller hangs up, the agent forgets everything. For many use cases this is fine. For loyalty-driven businesses where the same caller comes back repeatedly, it's a missed opportunity.
Prompt Engineering for Voice (vs Text) Agents
If you've written prompts for chatbots, you have a head start on voice agents — but only halfway. The fundamentals of clear instructions and tool definitions carry over. The style guide, the latency considerations, and the failure-mode handling are very different.
Function Calling for Voice Agents: A Practical Guide
Function calling is the feature that turns a voice agent from a chatbot with audio into an actual worker. Without it, the agent can talk about looking up your account; with it, the agent can actually do it.
How Large Language Models Power Voice Agents
When people ask "what's inside a voice agent?" they usually want to hear about the LLM. That's fair — the LLM is the most visible new piece of the stack.
The Hidden Complexity of Numbers in Voice Agents
Numbers are the most underestimated source of pain in voice AI. Phone numbers, account numbers, dates, prices, addresses — all of them have edge cases that turn a clean conversation into a back-and-forth of "no, one nine seven, not nineteen seven." The fix isn't a better LLM;…
How Voice Agents Handle Accents and Dialects
Voice AI is great at standard American English. It's pretty good at standard British, Australian, and Indian English. It's variably good at everything else.
How to Measure Voice Agent Quality
Most voice agent teams measure the wrong things. They watch deflection rate and call duration; they ignore the quality of what happened inside the call. The result: agents that look good on dashboards and feel bad on the phone.
The Difference Between Streaming and Non-Streaming Voice Agents
Streaming is the most underrated word in voice AI. The difference between a streaming and a non-streaming pipeline is the difference between a voice agent that feels alive and one that feels like a slow walkie-talkie.
How Voice Agents Recover from Misunderstandings
Real conversations have misunderstandings. The agent mishears a name, asks the wrong clarifying question, or jumps to the wrong intent. How the agent recovers matters more than how often it stumbles. A graceful recovery can leave the caller feeling like the agent is competent.
How Voice Agents Decide When to Stop Talking
A voice agent that doesn't know when to shut up is one of the most annoying things in software. Even if every word is right, an agent that talks past the moment when the caller wanted to interject feels worse than no agent at all.
Synchronous vs Asynchronous Voice Agents
Most voice agents are synchronous: a real-time phone call where the agent and the caller exchange turns immediately. But there's a quietly growing class of asynchronous voice agents — voice messaging, voicemail-style interactions, scheduled callbacks.
What Makes a Voice Agent "Production Ready"
A voice agent that works in a demo is a different product from one that works in production. The demo only has to handle the happy path with a friendly tester.
Why Voice Agents Sound More Human Every Year
Five years ago, you could spot a synthetic voice in three seconds. Today the best ones can run a 5-minute conversation without anyone noticing.
How Voice Agents Differ from Voice Assistants
Siri, Alexa, and Google Assistant are voice assistants. The system that picks up your dentist's phone and books your cleaning is a voice agent. Both involve talking to a computer, but they're different products with different design constraints.
How Voice Agents Handle Interruptions Gracefully
Interruption handling is the single most-felt UX detail in voice AI. Done well, the agent feels conversational and responsive. Done poorly, the agent runs over you, doesn't notice, and you end up shouting at your phone. This is the engineering and design behind getting it right.
The Anatomy of a Voice Agent Pipeline
If you took every voice agent in production today and dissected them, you'd find roughly the same skeleton. The names change. The vendors change. The plumbing details vary.
Turn-Taking and Barge-In: The Mechanics of Natural Conversation
Two humans on a phone call don't take turns the way a tennis match does. They overlap. They interrupt. They finish each other's sentences. They leave 200ms gaps between turns and call it polite. A voice agent that can't do this — even if every word is correct — feels broken.
Latency in Voice AI: Why Sub-500ms Matters
When two humans talk, the gap between one person finishing a sentence and the other starting their reply is tiny — usually around 200ms. Sometimes the next person starts speaking before the first person has actually finished, predicting the end of the sentence.
Voice Agents vs Chatbots: When to Use Which
A chatbot is a turn-based text exchange with no real-time pressure. A voice agent is a real-time spoken conversation with a tight latency budget and a much messier input channel.
How a Conversational Voice Agent Actually Works (Under the Hood)
If you open the box on a modern voice agent, you'll find roughly four moving parts: a streaming speech recognizer, a language model, a text-to-speech engine, and a turn-taking referee that decides whose turn it is to speak. None of that is exotic on its own.