Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. It sits upstream of STT, LLM, and TTS, so bad VAD can ruin an otherwise excellent voice agent. Tune it too aggressively and you cut callers off mid-sentence. Tune it too conservatively and responses feel sluggish. Good VAD is one of the most underrated quality levers.

TL;DR

VAD distinguishes speech from non-speech in the audio stream.
Endpointing (VAD + timing) decides when the caller has finished.
Tune aggressively for fast conversation; conservatively for information capture.
Modern VAD uses lightweight neural networks; works well on phone audio.
Bad VAD manifests as interruption or sluggishness; test both.

What VAD does

Continuously:

Analyzes audio stream for speech.
Outputs per-frame speech/silence probability.
Feeds endpointing logic.

Endpointing:

When speech started → feed STT.
When speech paused → wait.
When silence persists → endpoint (caller done).

The VAD-endpointing relationship

VAD: low-level (this frame is speech? yes/no).

Endpointing: higher-level (caller is done speaking? yes/no/maybe).

Endpointing uses VAD output plus timing logic:

How much silence = end?
What counts as "speech returned"?
Confidence thresholds.
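
As a rough illustration of how timing logic can sit on top of per-frame VAD output, here is a minimal sketch of a silence-timer endpointer. The class name, the 20ms frame size, the 0.5 speech threshold, and the 600ms window are all assumptions chosen for the example, not values from any particular vendor.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Turns per-frame speech probabilities into start/endpoint events."""
    frame_ms: int = 20                 # assumed frame duration
    speech_threshold: float = 0.5      # probability at or above this counts as speech
    silence_to_endpoint_ms: int = 600  # how much silence means "caller done"
    in_speech: bool = False
    silence_ms: int = 0

    def push(self, speech_prob):
        """Consume one frame; return "start", "endpoint", or None."""
        if speech_prob >= self.speech_threshold:
            self.silence_ms = 0        # speech returned: reset the silence timer
            if not self.in_speech:
                self.in_speech = True
                return "start"
            return None
        if self.in_speech:
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.silence_to_endpoint_ms:
                self.in_speech = False
                self.silence_ms = 0
                return "endpoint"
        return None


def run_endpointer(probs, **kwargs):
    """Feed a list of probabilities through a fresh Endpointer."""
    ep = Endpointer(**kwargs)
    return [(i, event) for i, p in enumerate(probs) if (event := ep.push(p))]
```

A 200ms pause followed by more speech only resets the timer; the endpoint fires once silence has lasted the full window.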

Tuning windows

Silence-to-endpoint:

Fast (300-500ms): quick response, but can cut off callers who pause to think.
Medium (500-700ms): balanced.
Slow (700-1000ms): patient, but can feel sluggish.

Pick per use case.

VAD algorithms

Energy-based. Simple. Threshold on audio energy. Susceptible to noise.

Neural VAD. Modern. Small neural network classifies frames. Robust to noise.

Combined. Energy as first-pass, neural for confidence.

Most production VAD is neural, for example Silero VAD. WebRTC VAD is an older, non-neural option based on Gaussian mixture models.
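
To make the energy-based approach concrete, here is a minimal sketch that thresholds the RMS energy of each frame. The function names, the 160-sample frame (20ms at 8kHz), and the threshold of 500 are assumptions for illustration; real thresholds depend on the audio levels in your pipeline.

```python
import math


def frame_rms(samples):
    """Root-mean-square energy of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_vad(samples, frame_len=160, threshold=500.0):
    """Return one speech flag per full frame (partial trailing frames are dropped)."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) >= threshold)
    return flags
```

Any loud frame passes the threshold, whether it is speech or a slammed door, which is why this approach is susceptible to noise.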

Handling noise

Common challenges:

Background conversation. Detected as speech.
Keyboard typing.
Music on hold.
Dogs barking, TV, traffic.

Good VAD filters out noise; bad VAD interprets it as speech, which delays endpoints.

See how background noise affects voice agent accuracy.

Where VAD runs

Client-side: In the voice agent's endpoint (Twilio layer, Vapi, etc.). Lowest latency.

Server-side: STT engine handles internally. Less control.

Hybrid: Client detects speech start; server handles endpointing. Often the best balance of latency and control.

Silence tolerance

Callers pause mid-thought:

Brief (under 300ms): part of their utterance.
Medium (300-600ms): usually thinking.
Long (600-1000ms): often done.

Endpointing logic treats these differently.
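
One way to see these bands in your own traffic is to measure the pauses between speech runs and bucket them. The sketch below assumes per-frame speech flags and a 20ms frame; the function names and bucket labels are hypothetical, and the boundaries simply mirror the ranges listed above.

```python
def pauses_ms(speech_flags, frame_ms=20):
    """Lengths (ms) of silence runs that sit between two speech runs."""
    pauses, run, seen_speech = [], 0, False
    for is_speech in speech_flags:
        if is_speech:
            if seen_speech and run:
                pauses.append(run * frame_ms)
            run, seen_speech = 0, True
        elif seen_speech:
            run += 1           # only count silence after speech has begun
    return pauses              # trailing silence is not a mid-utterance pause


def classify_pause(pause_ms):
    """Bucket a pause using the bands from the list above."""
    if pause_ms < 300:
        return "within_utterance"
    if pause_ms < 600:
        return "thinking"
    return "likely_done"
```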

False starts and abandonments

Caller says "I want to..." then pauses:

VAD detects speech end after pause.
Endpointing fires.
Agent responds to "I want to."
Caller continues: "...book an appointment."

Handling:

Accept the interruption as a new turn.
Or extend endpoint window when speech was clearly incomplete.

Context-aware endpointing

More sophisticated:

Classifier analyzes the transcript.
If sentence feels incomplete: wait longer.
If sentence feels complete: endpoint fast.

"I want to book" → probably incomplete → extend. "Yes." → complete → short endpoint.
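
A production system would likely use a trained classifier here. As a toy stand-in, the sketch below picks an endpoint window from the last word of the partial transcript. The word lists, the function name, and the 400/600/1000ms values are all assumptions for illustration.

```python
# Toy word lists: a real system would replace these with a trained model.
DANGLING_ENDINGS = {"to", "and", "or", "but", "the", "a", "an", "my",
                    "for", "with", "book", "want", "need"}
COMPLETE_SHORT_ANSWERS = {"yes", "no", "yeah", "nope", "okay", "ok", "correct"}


def endpoint_window_ms(transcript, base_ms=600, short_ms=400, extended_ms=1000):
    """Choose a silence-to-endpoint window from the partial transcript."""
    words = [w.strip(".,!?…").lower() for w in transcript.split()]
    words = [w for w in words if w]
    if not words:
        return base_ms
    if words[-1] in DANGLING_ENDINGS:
        return extended_ms      # sentence looks unfinished: wait longer
    if len(words) == 1 and words[0] in COMPLETE_SHORT_ANSWERS:
        return short_ms         # short, complete answer: endpoint fast
    return base_ms
```

With these lists, "I want to book" gets the extended window and "Yes." gets the short one, matching the examples above.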

Barge-in detection

Caller speaks while agent is speaking:

VAD detects caller speech during TTS.
System stops TTS (near-instant).
System processes new caller input.
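
The detection step can be sketched as a scan over caller-side speech probabilities while TTS is playing. Requiring a few consecutive speech frames before stopping TTS is an assumption added here to avoid reacting to a single noisy frame; the function name and default values are hypothetical.

```python
def barge_in_frame(caller_probs, tts_active, threshold=0.6, min_speech_frames=3):
    """Return the frame index at which TTS should stop, or None.

    caller_probs: per-frame speech probability for the caller's audio.
    tts_active:   per-frame flag, True while the agent is speaking.
    """
    consecutive = 0
    for i, (prob, speaking) in enumerate(zip(caller_probs, tts_active)):
        if speaking and prob >= threshold:
            consecutive += 1
            if consecutive >= min_speech_frames:
                return i
        else:
            consecutive = 0     # streak broken by silence or by TTS ending
    return None
```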

See turn-taking and barge-in: the mechanics of natural conversation.

VAD quality metrics

False positive rate. Triggers on non-speech.
False negative rate. Misses real speech.
Latency. Time from speech start to VAD detection.
Robustness to noise.

Balance these per use case.
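
Given hand-labelled frames, the first three metrics can be computed directly. This is a minimal sketch assuming two equal-length lists of booleans (VAD output and ground truth) and a 20ms frame; the function names are hypothetical.

```python
def frame_metrics(predicted, actual):
    """False positive and false negative rates over labelled frames."""
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    non_speech = sum(not a for a in actual)
    speech = sum(bool(a) for a in actual)
    return {
        "false_positive_rate": fp / non_speech if non_speech else 0.0,
        "false_negative_rate": fn / speech if speech else 0.0,
    }


def onset_latency_ms(predicted, actual, frame_ms=20):
    """Time from the first true speech frame to the first detection at or after it."""
    if True not in actual:
        return None
    onset = actual.index(True)
    for i in range(onset, len(predicted)):
        if predicted[i]:
            return (i - onset) * frame_ms
    return None                 # speech was never detected
```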

Testing VAD

Sample real calls across various environments.
Verify endpoints match caller intent.
Check for mid-sentence cutoffs.
Check for delayed responses.
Include noise scenarios.

An automated testing harness helps.
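
A harness can be as simple as comparing where the endpoint fired against an annotated end of utterance for each sampled call. The sketch below assumes each case is a dict with hypothetical keys `env`, `true_end_ms`, and `endpoint_ms`; the 1000ms delay budget is also an assumption.

```python
from collections import Counter, defaultdict


def check_endpoint(true_end_ms, endpoint_ms, max_delay_ms=1000):
    """Classify one endpoint decision against the annotated utterance end."""
    if endpoint_ms is None:
        return "missed"
    if endpoint_ms < true_end_ms:
        return "cutoff"         # fired while the caller was still speaking
    if endpoint_ms - true_end_ms > max_delay_ms:
        return "delayed"        # fired, but the response would feel sluggish
    return "ok"


def summarize(cases, max_delay_ms=1000):
    """Count verdicts per environment so noise-specific failures stand out."""
    by_env = defaultdict(Counter)
    for case in cases:
        verdict = check_endpoint(case["true_end_ms"], case["endpoint_ms"], max_delay_ms)
        by_env[case["env"]][verdict] += 1
    return {env: dict(counts) for env, counts in by_env.items()}
```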

Use-case tuning

Customer support: Medium endpointing (600-700ms). Caller may pause when explaining.

Transactional (booking, payment): Faster endpointing (450-550ms). Short answers expected.

Discovery / qualification: Medium-long endpointing (700ms). Caller is thinking.

Outbound sales: Fast (500ms). Keep pace.
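
In configuration terms, this can be a small preset table keyed by use case. The key names and the fallback default are hypothetical; where the text gives a range, the sketch uses its midpoint as a starting value to tune from.

```python
# Starting points taken from the ranges above; tune against real calls.
ENDPOINT_PRESETS_MS = {
    "customer_support": 650,   # midpoint of 600-700ms
    "transactional": 500,      # midpoint of 450-550ms
    "discovery": 700,
    "outbound_sales": 500,
}


def endpoint_ms_for(use_case, default_ms=600):
    """Look up the silence-to-endpoint window for a use case."""
    return ENDPOINT_PRESETS_MS.get(use_case, default_ms)
```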

Language considerations

Some languages have:

Longer average sentences (endpointing too fast = cutoffs).
More filler (mistaken for real speech).
Different speech rhythms.

Tune per language.

Elderly and accessibility

Slower speakers:

Need longer endpointing.
Deserve patience.

Either tune conservatively by default, or detect pace and adapt.
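
Detecting pace and adapting could look like the sketch below: widen the endpoint window when the caller's recent mid-utterance pauses are long, and never go below the deployment default. The function name, the 1.5x margin, and the 1500ms cap are assumptions for illustration.

```python
def adapted_endpoint_ms(recent_pauses_ms, base_ms=600, margin=1.5, max_ms=1500):
    """Scale the endpoint window to the caller's observed pausing pace."""
    if not recent_pauses_ms:
        return base_ms
    ordered = sorted(recent_pauses_ms)
    typical = ordered[len(ordered) // 2]       # upper median of recent pauses
    # Never shorter than the default, never longer than the cap.
    return int(min(max_ms, max(base_ms, typical * margin)))
```

A caller whose typical pause is 700ms would get a 1050ms window under these defaults, while a fast speaker stays at the 600ms base.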

Common pitfalls

Default VAD. Vendor defaults tend to be conservative and are not tuned for your use case. Consider tuning them.

Same VAD for all use cases. One-size fits nothing perfectly.

Ignoring noise environments. Works in a quiet office; fails on calls from cars, busy offices, or homes with kids.

No testing with real audio. Silent quality issues.

Over-aggressive. Caller cut off mid-thought. Complaints.

The mic and network factor

VAD quality depends on:

Microphone quality. Varies widely.
Network quality. Dropped packets hurt.
Audio codec. Some codecs degrade VAD signals.

Client-side quality influences server-side VAD.

Debugging

When VAD fails:

Listen to the audio.
Check VAD output per frame.
Check endpoint decisions.
Identify pattern (always cuts off? only with background noise?).

Fix by tuning parameters.
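
When checking VAD output per frame, it often helps to collapse the frames into timed segments that can be lined up against the audio by ear. This is a minimal sketch assuming per-frame probabilities, a 20ms frame, and a 0.5 threshold; the function name is hypothetical.

```python
from itertools import groupby


def segments(probs, frame_ms=20, threshold=0.5):
    """Collapse per-frame probabilities into (start_ms, end_ms, label) runs."""
    out, t = [], 0
    for is_speech, group in groupby(probs, key=lambda p: p >= threshold):
        duration = len(list(group)) * frame_ms
        out.append((t, t + duration, "speech" if is_speech else "silence"))
        t += duration
    return out
```

Many very short speech segments inside what sounds like silence usually point at noise crossing the threshold rather than at the endpoint window.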

FAQ

Should we tune VAD per caller? Possible but complex. Usually per-deployment is enough.

Can AI learn optimal endpointing? Research area. Manual tuning dominates in 2026.

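What about bilingual callers switching languages? VAD is language-agnostic at the frame level, and so is a silence-timer endpointer. The best silence window can still differ by language (see Language considerations).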

How often should we review VAD? Monthly sampling; deeper when adding use cases.

Do VAD tools work over VoIP? Yes — designed for it. Test specifically on your audio pipeline.

Voice Activity Detection in Production Voice Agents

