Voice Activity Detection in Production Voice Agents
Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. It sits upstream of STT, LLM, and TTS, so bad VAD can ruin an otherwise excellent voice agent. Make VAD too aggressive and you cut callers off mid-sentence. Make it too conservative and responses feel sluggish. Good VAD is one of the most underrated quality levers.
TL;DR
- VAD distinguishes speech from non-speech in the audio stream.
- Endpointing (VAD + timing) decides when the caller has finished.
- Tune aggressively for fast conversation; conservatively for information capture.
- Modern VAD uses lightweight neural networks; works well on phone audio.
- Bad VAD manifests as interruption or sluggishness; test both.
What VAD does
Continuously:
- Analyzes audio stream for speech.
- Outputs per-frame speech/silence probability.
- Feeds endpointing logic.
Endpointing:
- When speech started → feed STT.
- When speech paused → wait.
- When silence persists → endpoint (caller done).
The VAD-endpointing relationship
VAD: low-level (this frame is speech? yes/no).
Endpointing: higher-level (caller is done speaking? yes/no/maybe).
Endpointing uses VAD output plus timing logic:
- How much silence = end?
- What counts as "speech returned"?
- Confidence thresholds.
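The relationship above can be sketched as a small state machine that consumes per-frame speech probabilities and emits turn events. This is a minimal illustration, not any vendor's implementation: the class name `Endpointer`, the 20 ms frame size, the 0.5 threshold, and the 600 ms window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpointer:
    """Turns per-frame speech probabilities into start/end-of-turn events."""
    frame_ms: int = 20                 # assumed frame duration
    speech_threshold: float = 0.5      # probability at or above this counts as speech
    silence_to_endpoint_ms: int = 600  # how much silence ends the turn
    in_speech: bool = False
    silence_ms: int = 0

    def push(self, speech_prob: float) -> Optional[str]:
        """Consume one frame; return 'speech_start', 'endpoint', or None."""
        if speech_prob >= self.speech_threshold:
            self.silence_ms = 0        # any speech resets the silence timer
            if not self.in_speech:
                self.in_speech = True
                return "speech_start"
            return None
        if self.in_speech:
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.silence_to_endpoint_ms:
                self.in_speech = False
                self.silence_ms = 0
                return "endpoint"
        return None
```

A short pause resets nothing on its own; only speech returning before the window expires keeps the turn open. That is the "wait" state between "speech paused" and "caller done".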
Tuning windows
Silence-to-endpoint:
- Fast (300-500ms): quick response, cuts off thinking callers.
- Medium (500-700ms): balanced.
- Slow (700-1000ms): patient, feels sluggish.
Pick per use case.
VAD algorithms
Energy-based. Simple. Threshold on audio energy. Susceptible to noise.
Neural VAD. Modern. Small neural network classifies frames. Robust to noise.
Combined. Energy as first-pass, neural for confidence.
Most production VAD is neural. Silero VAD is a common neural example; WebRTC VAD is an older alternative based on Gaussian mixture models rather than a neural network.
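To make the energy-based approach concrete, here is a minimal sketch in pure Python. It assumes samples are floats in [-1.0, 1.0], a frame of 160 samples (20 ms at 8 kHz telephony audio), and an arbitrary threshold of 0.02; none of these values come from a specific product.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_vad(samples: Sequence[float], frame_len: int = 160,
               threshold: float = 0.02) -> List[bool]:
    """Label each full frame as speech (True) when its RMS exceeds the threshold."""
    flags = []
    # Step through the signal one full frame at a time; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) > threshold)
    return flags
```

The weakness is visible in the code: anything loud enough crosses the threshold, whether it is speech, typing, or traffic. That is why an energy gate is usually only a first pass in front of a neural classifier.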
Handling noise
Common challenges:
- Background conversation. Detected as speech.
- Keyboard typing.
- Music on hold.
- Dogs barking, TV, traffic.
Good VAD filters out noise; bad VAD interprets it as speech, which delays endpoints.
See how background noise affects voice agent accuracy.
Where VAD runs
Client-side: In the voice agent's endpoint (Twilio layer, Vapi, etc.). Lowest latency.
Server-side: STT engine handles internally. Less control.
Hybrid: Client detects speech start; server endpoints. Optimal.
Silence tolerance
Callers pause mid-thought:
- Brief (under 300ms): part of their utterance.
- Medium (300-600ms): usually thinking.
- Long (600-1000ms): often done.
Endpointing logic treats these differently.
False starts and abandonments
Caller says "I want to..." then pauses:
- VAD detects speech end after pause.
- Endpointing fires.
- Agent responds to "I want to."
- Caller continues: "...book an appointment."
Handling:
- Accept the interruption as a new turn.
- Or extend endpoint window when speech was clearly incomplete.
Context-aware endpointing
More sophisticated:
- Classifier analyzes the transcript.
- If sentence feels incomplete: wait longer.
- If sentence feels complete: endpoint fast.
"I want to book" → probably incomplete → extend. "Yes." → complete → short endpoint.
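A rough sketch of the idea follows. It uses hand-written word lists as a stand-in for the classifier, which is an assumption made to keep the example self-contained; the function name `endpoint_window_ms`, the word lists, and the 400/600/1000 ms windows are all hypothetical.

```python
# Illustrative word lists, not a trained classifier; replace with a real model for production.
DANGLING_ENDINGS = {"to", "and", "but", "or", "the", "a", "my", "for",
                    "with", "because", "book", "want", "need"}
COMPLETE_REPLIES = {"yes", "no", "yeah", "okay", "ok", "correct", "thanks"}


def endpoint_window_ms(transcript: str, short_ms: int = 400,
                       base_ms: int = 600, long_ms: int = 1000) -> int:
    """Pick a silence window from how complete the partial transcript looks."""
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    words = [w for w in words if w]
    if not words:
        return base_ms
    # A one- or two-word confirmation is probably a finished turn.
    if len(words) <= 2 and words[-1] in COMPLETE_REPLIES:
        return short_ms
    # A trailing connector or object-hungry verb suggests more is coming.
    if words[-1] in DANGLING_ENDINGS:
        return long_ms
    return base_ms
```

With these lists, `endpoint_window_ms("I want to book")` returns the long window and `endpoint_window_ms("Yes.")` returns the short one.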
Barge-in detection
Caller speaks while agent is speaking:
- VAD detects caller speech during TTS.
- System stops TTS (near-instant).
- System processes new caller input.
See turn-taking and barge-in: the mechanics of natural conversation.
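The barge-in steps can be sketched as a small monitor that watches VAD output while TTS is playing. The `stop_tts` callback is a hypothetical hook into whatever playback layer is in use, and requiring a few consecutive speech frames before stopping is an added assumption to avoid reacting to a single noisy frame.

```python
from typing import Callable


class BargeInMonitor:
    """Stops TTS playback once the caller has clearly started speaking."""

    def __init__(self, stop_tts: Callable[[], None],
                 min_speech_frames: int = 3, threshold: float = 0.5) -> None:
        self.stop_tts = stop_tts              # hypothetical callback that halts playback
        self.min_speech_frames = min_speech_frames
        self.threshold = threshold
        self.run = 0                          # consecutive speech frames seen during TTS

    def on_frame(self, speech_prob: float, tts_playing: bool) -> bool:
        """Process one VAD frame; return True if barge-in fired on this frame."""
        if not tts_playing:
            self.run = 0
            return False
        self.run = self.run + 1 if speech_prob >= self.threshold else 0
        if self.run >= self.min_speech_frames:
            self.run = 0
            self.stop_tts()
            return True
        return False
```

The debounce trades a few frames of latency for fewer false stops; with 20 ms frames, three frames adds about 60 ms before TTS is cut.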
VAD quality metrics
- False positive rate. Triggers on non-speech.
- False negative rate. Misses real speech.
- Latency. Time from speech start to VAD detection.
- Robustness to noise.
Balance these per use case.
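The first three metrics can be computed from per-frame labels. The sketch below assumes hand-labeled ground truth (`actual`) aligned frame-for-frame with VAD output (`predicted`) and an assumed 20 ms frame; the function names are illustrative.

```python
from typing import Dict, Optional, Sequence


def frame_error_rates(predicted: Sequence[bool],
                      actual: Sequence[bool]) -> Dict[str, float]:
    """False positive and false negative rates over aligned frame labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same number of frames")
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if a and not p)
    silence_frames = sum(1 for a in actual if not a)
    speech_frames = len(actual) - silence_frames
    return {
        "false_positive_rate": false_pos / silence_frames if silence_frames else 0.0,
        "false_negative_rate": false_neg / speech_frames if speech_frames else 0.0,
    }


def detection_latency_ms(predicted: Sequence[bool], actual: Sequence[bool],
                         frame_ms: int = 20) -> Optional[int]:
    """Delay from the first true speech frame to the first detection at or after it."""
    try:
        first_speech = list(actual).index(True)
    except ValueError:
        return None  # no speech in the ground truth
    for i in range(first_speech, len(predicted)):
        if predicted[i]:
            return (i - first_speech) * frame_ms
    return None  # speech was never detected
```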
Testing VAD
- Sample real calls across various environments.
- Verify endpoints match caller intent.
- Check for mid-sentence cutoffs.
- Check for delayed responses.
- Noise scenarios.
An automated testing harness helps.
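One check such a harness might run is flagging endpoints that were followed almost immediately by more caller speech, a likely sign of a mid-sentence cutoff. The event format (a time-sorted list of `(timestamp_ms, name)` tuples) and the 1000 ms resume window are assumptions for this sketch.

```python
from typing import List, Sequence, Tuple


def likely_cutoffs(events: Sequence[Tuple[int, str]],
                   resume_within_ms: int = 1000) -> List[int]:
    """Return timestamps of endpoints followed quickly by a new speech start.

    `events` is a time-sorted list of (timestamp_ms, name) pairs where name is
    'speech_start' or 'endpoint'.
    """
    flagged = []
    for (t, name), (t_next, name_next) in zip(events, events[1:]):
        resumed_quickly = t_next - t <= resume_within_ms
        if name == "endpoint" and name_next == "speech_start" and resumed_quickly:
            flagged.append(t)
    return flagged
```

Running this over sampled calls gives a rough cutoff count per environment, which can then be compared before and after a tuning change.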
Use-case tuning
Customer support: Medium endpointing (600-700ms). Caller may pause when explaining.
Transactional (booking, payment): Faster endpointing (450-550ms). Short answers expected.
Discovery / qualification: Medium-long endpointing (700ms). Caller is thinking.
Outbound sales: Fast (500ms). Keep pace.
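These ranges can live in configuration rather than code. A minimal sketch, assuming the midpoint of each range as the starting value (a choice made for this example, not a recommendation from any vendor) and hypothetical preset names:

```python
from typing import Dict, Tuple

# Silence-to-endpoint ranges in milliseconds, taken from the use cases above.
ENDPOINT_RANGES_MS: Dict[str, Tuple[int, int]] = {
    "customer_support": (600, 700),
    "transactional": (450, 550),
    "discovery": (700, 700),
    "outbound_sales": (500, 500),
}


def silence_window_ms(use_case: str) -> int:
    """Starting silence window for a use case: the midpoint of its range."""
    low, high = ENDPOINT_RANGES_MS[use_case]
    return (low + high) // 2
```

Treat the result as a starting point and adjust it against real call audio.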
Language considerations
Some languages have:
- Longer average sentences (endpointing too fast = cutoffs).
- More filler (mistaken for real speech).
- Different speech rhythms.
Tune per language.
Elderly and accessibility
Slower speakers:
- Need longer endpointing.
- Deserve patience.
Either tune conservatively by default, or detect pace and adapt.
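One way to "detect pace and adapt" is to track pauses the caller resumed from and widen the window accordingly. The sketch below is an assumption-heavy illustration: the class name, the 1.5x margin, the clamp limits, and the three-pause warm-up are all invented for the example.

```python
from collections import deque


class AdaptiveWindow:
    """Scales the endpoint window to the caller's observed mid-utterance pauses."""

    def __init__(self, base_ms: int = 600, min_ms: int = 400, max_ms: int = 1200,
                 margin: float = 1.5, history: int = 20) -> None:
        self.base_ms = base_ms
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.margin = margin
        self.pauses = deque(maxlen=history)  # recent pauses followed by more speech

    def observe_pause(self, pause_ms: int) -> None:
        """Record a pause after which the caller kept talking."""
        self.pauses.append(pause_ms)

    def window_ms(self) -> int:
        """Current silence window: longest recent pause times margin, clamped."""
        if len(self.pauses) < 3:
            return self.base_ms  # not enough evidence yet; use the default
        target = max(self.pauses) * self.margin
        return int(min(self.max_ms, max(self.min_ms, target)))
```

Pauses longer than the current window only show up when a caller resumes after being cut off, so those resumptions are worth recording too.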
Common pitfalls
Default VAD. Vendor defaults are conservative. Consider tuning.
Same VAD for all use cases. One-size fits nothing perfectly.
Ignoring noise environments. Works in a quiet room; fails from cars, busy offices, and homes with kids.
No testing with real audio. Silent quality issues.
Over-aggressive. Caller cut off mid-thought. Complaints.
The mic and network factor
VAD quality depends on:
- Microphone quality. Varies widely.
- Network quality. Dropped packets hurt.
- Audio codec. Some codecs degrade VAD signals.
Client-side quality influences server-side VAD.
Debugging
When VAD fails:
- Listen to the audio.
- Check VAD output per frame.
- Check endpoint decisions.
- Identify pattern (always cuts off? only with background noise?).
Fix by tuning parameters.
Related reading
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
- Comparing Neural TTS Architectures
FAQ
Should we tune VAD per caller? Possible but complex. Usually per-deployment is enough.
Can AI learn optimal endpointing? Research area. Manual tuning dominates in 2026.
What about bilingual callers switching languages? VAD is language-agnostic at the frame level, and so is silence-based endpointing, although the best silence window can still differ by language.
How often should we review VAD? Monthly sampling; deeper when adding use cases.
Do VAD tools work over VoIP? Yes — designed for it. Test specifically on your audio pipeline.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
