Voice Activity Detection in Production Voice Agents

Tyler Weitzman
March 15, 2026 · 5 min read
Speechify

Voice Activity Detection (VAD) is the unglamorous infrastructure that decides when the caller has started speaking, when they've paused, and when they're definitively done. It sits upstream of STT, LLM, and TTS, so bad VAD can ruin an otherwise excellent voice agent. Tune it too aggressively and you cut callers off mid-sentence. Tune it too conservatively and responses feel sluggish. Good VAD is one of the most underrated quality levers.

TL;DR

  • VAD distinguishes speech from non-speech in the audio stream.
  • Endpointing (VAD + timing) decides when the caller has finished.
  • Tune aggressively for fast conversation; conservatively for information capture.
  • Modern VAD uses lightweight neural networks; works well on phone audio.
  • Bad VAD manifests as interruption or sluggishness; test both.

What VAD does

Continuously:

  • Analyzes audio stream for speech.
  • Outputs per-frame speech/silence probability.
  • Feeds endpointing logic.

Endpointing:

  • When speech started → feed STT.
  • When speech paused → wait.
  • When silence persists → endpoint (caller done).
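The flow above can be sketched as a small state machine. This is a minimal sketch, not any vendor's implementation: `Endpointer` is a hypothetical name, and the 20 ms frame size, 0.5 speech threshold, and 600 ms silence window are assumed values.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Turns per-frame speech probabilities into turn events."""
    frame_ms: int = 20               # assumed frame duration
    speech_threshold: float = 0.5    # assumed probability cutoff
    endpoint_silence_ms: int = 600   # assumed silence-to-endpoint window
    in_utterance: bool = False
    silence_ms: int = 0

    def update(self, speech_prob: float) -> str:
        """Return 'idle', 'start', 'speech', 'pause', or 'endpoint' for one frame."""
        is_speech = speech_prob >= self.speech_threshold
        if not self.in_utterance:
            if is_speech:
                self.in_utterance = True
                self.silence_ms = 0
                return "start"       # speech started: begin feeding STT
            return "idle"
        if is_speech:
            self.silence_ms = 0      # speech returned: reset the silence timer
            return "speech"
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.endpoint_silence_ms:
            self.in_utterance = False
            self.silence_ms = 0
            return "endpoint"        # silence persisted: caller is done
        return "pause"               # speech paused: keep waiting
```

Feed it one probability per frame and act on "start" and "endpoint"; the other return values are useful mainly for logging.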

The VAD-endpointing relationship

VAD: low-level (this frame is speech? yes/no).

Endpointing: higher-level (caller is done speaking? yes/no/maybe).

Endpointing uses VAD output plus timing logic:

  • How much silence = end?
  • What counts as "speech returned"?
  • Confidence thresholds.

Tuning windows

Silence-to-endpoint:

  • Fast (300-500ms): quick response, cuts off thinking callers.
  • Medium (500-700ms): balanced.
  • Slow (700-1000ms): patient, feels sluggish.

Pick per use case.

VAD algorithms

Energy-based. Simple. Threshold on audio energy. Susceptible to noise.
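A minimal sketch of the energy approach, assuming 16-bit PCM frames and a fixed threshold of -40 dBFS (an illustrative value, not a recommendation). The function names are hypothetical.

```python
import math
from typing import Sequence


def rms_dbfs(frame: Sequence[int], full_scale: int = 32768) -> float:
    """RMS level of a 16-bit PCM frame, in dB relative to full scale."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_square) / full_scale)


def energy_vad(frame: Sequence[int], threshold_dbfs: float = -40.0) -> bool:
    """Flag the frame as speech when its level exceeds the threshold."""
    return rms_dbfs(frame) > threshold_dbfs
```

Any sound above the threshold passes, whether it is a voice, a keyboard, or traffic. That is the noise susceptibility in practice.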

Neural VAD. Modern. Small neural network classifies frames. Robust to noise.

Combined. Energy as first-pass, neural for confidence.
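One way to read the combined approach is as a two-stage gate: a cheap energy check runs on every frame, and the neural model runs only on frames that pass it. The sketch below assumes both stages are supplied as callables; `energy_gate` and `neural_prob` are hypothetical placeholders, not a real library's API.

```python
from typing import Callable, Sequence

Frame = Sequence[int]


def combined_vad(
    frame: Frame,
    energy_gate: Callable[[Frame], bool],    # cheap first pass (assumed)
    neural_prob: Callable[[Frame], float],   # speech probability in [0, 1] (assumed)
    prob_threshold: float = 0.5,             # assumed cutoff
) -> bool:
    """Return True only if the frame passes the energy gate and the neural check."""
    if not energy_gate(frame):
        return False                         # quiet frame: skip the neural model
    return neural_prob(frame) >= prob_threshold
```

The design choice here is cost: quiet frames never reach the model, while loud non-speech frames still get a second opinion.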

Most production VAD is neural; Silero VAD is a common example. WebRTC VAD is an older, non-neural option based on Gaussian mixture models.

Handling noise

Common challenges:

  • Background conversation. Detected as speech.
  • Keyboard typing.
  • Music on hold.
  • Dogs barking, TV, traffic.

Good VAD filters out noise; bad VAD interprets it as speech, which delays endpoints.

See how background noise affects voice agent accuracy.

Where VAD runs

Client-side: At the voice agent's edge (Twilio layer, Vapi, etc.). Lowest latency.

Server-side: STT engine handles it internally. Less control.

Hybrid: Client detects speech start; server decides the endpoint. Usually the best trade-off.

Silence tolerance

Callers pause mid-thought:

  • Brief (under 300ms): part of their utterance.
  • Medium (300-600ms): usually thinking.
  • Long (600-1000ms): often done.

Endpointing logic treats these differently.

False starts and abandonments

Caller says "I want to..." then pauses:

  • VAD detects speech end after pause.
  • Endpointing fires.
  • Agent responds to "I want to."
  • Caller continues: "...book an appointment."

Handling:

  • Accept the caller's continuation as a barge-in and treat it as a new turn.
  • Or extend the endpoint window when the speech was clearly incomplete.

Context-aware endpointing

More sophisticated:

  • Classifier analyzes the transcript.
  • If sentence feels incomplete: wait longer.
  • If sentence feels complete: endpoint fast.

"I want to book" → probably incomplete → extend. "Yes." → complete → short endpoint.
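A minimal sketch of the idea, assuming a completeness score between 0 and 1 is available. Here `toy_completeness` is a hand-written stand-in for a real classifier, and the word lists and the 300 ms and 1000 ms bounds are illustrative assumptions.

```python
# Hypothetical word lists; a real system would use a trained classifier.
DANGLING = {"to", "and", "but", "or", "the", "a", "an", "my", "for", "with", "of", "um", "uh"}
COMPLETE_REPLIES = {"yes", "no", "yeah", "nope", "okay", "ok", "correct", "thanks"}


def toy_completeness(transcript: str) -> float:
    """Crude stand-in score: 0.0 = clearly incomplete, 1.0 = clearly complete."""
    words = transcript.lower().strip().rstrip(".!?").split()
    if not words or words[-1] in DANGLING:
        return 0.0
    if len(words) == 1 and words[0] in COMPLETE_REPLIES:
        return 1.0
    return 0.5  # unsure: fall between the two windows


def endpoint_window_ms(completeness: float, fast_ms: int = 300, slow_ms: int = 1000) -> int:
    """Map the score to a silence window: complete means fast, incomplete means slow."""
    completeness = min(max(completeness, 0.0), 1.0)
    return round(slow_ms - completeness * (slow_ms - fast_ms))
```

With these assumptions, "Yes." gets the fast window, "I want to" gets the slow one, and "I want to book" lands in between.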

Barge-in detection

Caller speaks while agent is speaking:

  • VAD detects caller speech during TTS.
  • System stops TTS (near-instant).
  • System processes new caller input.
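A minimal sketch of the detection step. It adds one assumption that is not in the list above: a short debounce, so a single noisy frame does not stop playback. `BargeInDetector` and its defaults are hypothetical.

```python
class BargeInDetector:
    """Signals barge-in after several consecutive speech frames during TTS."""

    def __init__(self, min_speech_frames: int = 3, threshold: float = 0.5):
        self.min_speech_frames = min_speech_frames  # assumed debounce length
        self.threshold = threshold                  # assumed probability cutoff
        self._run = 0                               # consecutive speech frames so far

    def update(self, speech_prob: float, tts_playing: bool) -> bool:
        """Return True when the caller should be treated as interrupting."""
        if not tts_playing:
            self._run = 0
            return False
        if speech_prob >= self.threshold:
            self._run += 1
        else:
            self._run = 0
        return self._run >= self.min_speech_frames
```

When `update` returns True, the caller of this class would stop TTS and route the incoming audio to STT. A longer debounce means fewer false stops but a slower reaction.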

See turn-taking and barge-in: the mechanics of natural conversation.

VAD quality metrics

  • False positive rate. Triggers on non-speech.
  • False negative rate. Misses real speech.
  • Latency. Time from speech start to VAD detection.
  • Robustness to noise.

Balance these per use case.
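The first three metrics can be computed from frame-level labels. This sketch assumes hand-labeled ground truth aligned frame by frame with the VAD output, and an assumed 20 ms frame; the function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def frame_error_rates(truth: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) over aligned frames."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    negatives = sum(1 for t in truth if not t)
    positives = len(truth) - negatives
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def onset_latency_ms(truth: Sequence[bool], predicted: Sequence[bool],
                     frame_ms: int = 20) -> Optional[int]:
    """Delay from the first true speech frame to the first detection at or after it."""
    onset = next((i for i, t in enumerate(truth) if t), None)
    if onset is None:
        return None
    for i in range(onset, len(predicted)):
        if predicted[i]:
            return (i - onset) * frame_ms
    return None  # speech was never detected
```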

Testing VAD

  • Sample real calls across various environments.
  • Verify endpoints match caller intent.
  • Check for mid-sentence cutoffs.
  • Check for delayed responses.
  • Noise scenarios.

Automated testing harness helps.
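A harness can start very small. The sketch below assumes each test call has an annotated time at which the caller actually finished, and compares it with the time the endpoint fired. The 800 ms tolerance and the function names are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_endpoint(endpoint_ms: int, speech_end_ms: int, max_delay_ms: int = 800) -> str:
    """Label one endpoint decision against the annotated end of speech."""
    if endpoint_ms < speech_end_ms:
        return "cutoff"   # fired while the caller was still talking
    if endpoint_ms - speech_end_ms > max_delay_ms:
        return "late"     # caller waited too long for a response
    return "ok"


def summarize(cases: Iterable[Tuple[int, int]]) -> Counter:
    """Count outcomes over (endpoint_ms, speech_end_ms) pairs."""
    return Counter(classify_endpoint(e, s) for e, s in cases)
```

Tracking the "cutoff" and "late" counts separately, per noise scenario, shows which direction a tuning change moved things.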

Use-case tuning

Customer support: Medium endpointing (600-700ms). Caller may pause when explaining.

Transactional (booking, payment): Faster endpointing (450-550ms). Short answers expected.

Discovery / qualification: Medium-long endpointing (700ms). Caller is thinking.

Outbound sales: Fast (500ms). Keep pace.
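These starting points fit naturally in a small config table. The ranges below are the ones listed above; the key names and the choice of the midpoint as a starting value are assumptions.

```python
# Silence-to-endpoint ranges in milliseconds, per use case (from the list above).
ENDPOINT_WINDOWS_MS = {
    "customer_support": (600, 700),
    "transactional": (450, 550),
    "discovery": (700, 700),
    "outbound_sales": (500, 500),
}


def starting_window_ms(use_case: str) -> int:
    """Pick the midpoint of the range as a first value to tune from."""
    low, high = ENDPOINT_WINDOWS_MS[use_case]
    return (low + high) // 2
```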

Language considerations

Some languages have:

  • Longer average sentences (endpointing too fast = cutoffs).
  • More filler (mistaken for real speech).
  • Different speech rhythms.

Tune per language.

Elderly and accessibility

Slower speakers:

  • Need longer endpointing.
  • Deserve patience.

Either tune conservatively by default, or detect pace and adapt.
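One possible shape for "detect pace and adapt" is to remember how long this caller pauses before resuming, then keep the endpoint window comfortably above that. This is a sketch under assumptions: the class name, the 1.5x margin, and the 600 ms and 1500 ms bounds are all illustrative.

```python
from collections import deque


class AdaptiveWindow:
    """Widens the endpoint window for callers who pause longer before resuming."""

    def __init__(self, base_ms: int = 600, max_ms: int = 1500,
                 margin: float = 1.5, history: int = 10):
        self.base_ms = base_ms      # assumed default window
        self.max_ms = max_ms        # assumed upper bound
        self.margin = margin        # assumed headroom over observed pauses
        self.pauses = deque(maxlen=history)

    def observe_pause(self, pause_ms: int) -> None:
        """Record a pause after which the caller resumed speaking."""
        self.pauses.append(pause_ms)

    def window_ms(self) -> int:
        """Current window: longest recent pause plus margin, within bounds."""
        if not self.pauses:
            return self.base_ms
        target = int(max(self.pauses) * self.margin)
        return min(max(target, self.base_ms), self.max_ms)
```

A pause that outlasted the current window (the agent endpointed, then the caller kept going) is the most informative sample to record.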

Common pitfalls

Default VAD. Vendor defaults are conservative. Consider tuning.

Same VAD for all use cases. One size fits nothing perfectly.

Ignoring noise environments. Works in a quiet office; fails from cars, busy offices, and homes with kids.

No testing with real audio. Silent quality issues.

Over-aggressive. Caller cut off mid-thought. Complaints.

The mic and network factor

VAD quality depends on:

  • Microphone quality. Varies widely.
  • Network quality. Dropped packets hurt.
  • Audio codec. Some codecs degrade VAD signals.

Client-side quality influences server-side VAD.

Debugging

When VAD fails:

  • Listen to the audio.
  • Check VAD output per frame.
  • Check endpoint decisions.
  • Identify pattern (always cuts off? only with background noise?).

Fix by tuning parameters.
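For the per-frame check, it helps to collapse raw VAD output into something readable next to the audio. A minimal sketch, assuming boolean per-frame decisions and a 20 ms frame; both helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def speech_segments(flags: Sequence[bool], frame_ms: int = 20) -> List[Tuple[int, int]]:
    """Collapse per-frame decisions into (start_ms, end_ms) speech segments."""
    segments = []
    start = None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i                                     # segment opens
        elif not is_speech and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None                                  # segment closes
    if start is not None:                                 # audio ended mid-speech
        segments.append((start * frame_ms, len(flags) * frame_ms))
    return segments


def timeline(flags: Sequence[bool]) -> str:
    """One character per frame: '#' for speech, '.' for silence."""
    return "".join("#" if f else "." for f in flags)
```

Gaps between segments are the silences the endpointer saw, so a cutoff usually shows up as a gap slightly longer than the configured window.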

FAQ

Should we tune VAD per caller? Possible but complex. Usually per-deployment is enough.

Can AI learn optimal endpointing? Research area. Manual tuning dominates in 2026.

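What about bilingual callers switching languages? Frame-level VAD is language-agnostic, and so is purely silence-based endpointing. Per-language timing windows and transcript-aware endpointing are not, so test them on calls that switch languages.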

How often should we review VAD? Monthly sampling; deeper when adding use cases.

Do VAD tools work over VoIP? Yes — designed for it. Test specifically on your audio pipeline.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
