How Voice Agents Decide When to Stop Talking
A voice agent that doesn't know when to shut up is one of the most annoying things in software. Even if every word is right, an agent that talks past the moment when the caller wanted to interject feels worse than no agent at all. This is about the surprisingly tricky engineering of "stop talking now."
TL;DR
- Three signals a voice agent uses to decide when to stop talking: end of generated response, barge-in detection, and explicit "wrap it up" cues from the user.
- Most failures come from one of three places: TTS that ignores cancellation, LLMs that keep generating after they should stop, and audio buffers that take 500ms to flush.
- Sub-200ms response to barge-in is the bar.
When the agent should stop
Three cases:
1. The reply is finished. Naturally, the agent stops at the end of its generated text. This is the easy case.
2. The caller starts talking. Barge-in. The agent should stop within 200ms.
3. The caller signals "enough." "OK, got it." "Sure, sure." "Yeah." These are short utterances that don't add new info but signal the caller has heard enough.
The first is automatic; the second and third require active detection.
The barge-in case
Barge-in handling is the most-felt of the three. The mechanics:
1. While TTS is playing, a separate listener watches the input audio for VAD activity.
2. When VAD fires for >100ms, the system declares barge-in.
3. The TTS stream is cancelled.
4. The audio buffered at the telephony provider is flushed.
5. The LLM generation (if still running) is cancelled.
6. The conversation state is updated to reflect that the agent's turn was interrupted.
Each of these can fail. The most common failure is step 4, the flush: audio buffered at the telephony edge keeps playing for 300 to 500ms after the agent "stopped." Fixing this requires provider-specific handling.
Full details in how voice agents handle interruptions gracefully.
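The cancel, flush, and state-update steps listed above can be sketched with asyncio tasks. This is a minimal sketch, not a provider integration: `TurnState`, `cancel_task`, `flush_provider_audio`, and `handle_barge_in` are hypothetical names, and the flush is a placeholder for whatever clear-buffer mechanism your telephony provider actually offers.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnState:
    """Hypothetical record of the agent's current turn."""
    interrupted: bool = False
    events: List[str] = field(default_factory=list)


async def flush_provider_audio(state: TurnState) -> None:
    # Placeholder (assumption): replace with the provider-specific call
    # that discards audio already queued at the telephony edge.
    state.events.append("flush_provider_audio")


async def cancel_task(task: Optional[asyncio.Task], name: str, state: TurnState) -> None:
    """Cancel a running task and wait until it has actually stopped."""
    if task is None or task.done():
        return
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task acknowledged the cancellation
    state.events.append(f"cancel_{name}")


async def handle_barge_in(
    tts_task: Optional[asyncio.Task],
    llm_task: Optional[asyncio.Task],
    state: TurnState,
) -> float:
    """Cancel TTS, flush provider audio, cancel the LLM, mark the turn interrupted.

    Returns the elapsed time in milliseconds so it can be logged as time-to-stop.
    """
    started = time.monotonic()
    await cancel_task(tts_task, "tts", state)
    await flush_provider_audio(state)
    await cancel_task(llm_task, "llm", state)
    state.interrupted = True
    return (time.monotonic() - started) * 1000.0
```

Awaiting each cancelled task before moving on is a deliberate choice in this sketch: it confirms the task has really stopped, which matters when the failure mode is a component that ignores cancellation.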
The "got it" case
A subtle case: the caller is paying attention, signals understanding with a quick "uh-huh" or "right," but doesn't actually want to take the floor. A naive barge-in handler treats this as an interruption and stops the agent mid-thought. A smart system suppresses barge-in for these short backchannel utterances.
The signal is usually:
- Utterance is under 0.5 seconds
- Audio energy is moderate (not assertive)
- Common backchannel words ("yeah," "right," "mhm")
Tuning the barge-in sensitivity is one of the highest-leverage UX decisions.
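A rough sketch of how the three backchannel signals could be combined into a single check. The `Utterance` fields, the `is_backchannel` name, and the -20 dBFS level threshold are assumptions for illustration; only the 0.5 second limit and the example words come from the list above, and real thresholds need tuning on your own call audio.

```python
from dataclasses import dataclass

# Words taken from the examples in this article; extend for your callers.
BACKCHANNEL_WORDS = {"yeah", "right", "mhm", "uh-huh"}


@dataclass
class Utterance:
    transcript: str    # text from the speech recognizer
    duration_s: float  # length of the detected speech, in seconds
    level_dbfs: float  # average loudness in dBFS (assumed to be available)


def is_backchannel(
    utterance: Utterance,
    max_duration_s: float = 0.5,
    max_level_dbfs: float = -20.0,  # assumed cutoff for "not assertive"
) -> bool:
    """Return True when the utterance looks like a backchannel, not a turn attempt."""
    words = [w.strip(".,!?").lower() for w in utterance.transcript.split()]
    if not words:
        return False
    return (
        utterance.duration_s < max_duration_s
        and utterance.level_dbfs <= max_level_dbfs
        and all(word in BACKCHANNEL_WORDS for word in words)
    )
```

For example, `is_backchannel(Utterance("mhm", 0.3, -30.0))` returns `True`, so the agent keeps talking, while `Utterance("wait, stop", 0.4, -28.0)` fails the word check and is treated as a real interruption.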
When the agent should NOT stop talking
The flip side is the set of cases where the agent should keep going even though something seemed to interrupt:
- Caller coughs or sneezes
- Background noise (door slamming, dog barking)
- Brief overlap that's a backchannel, not a turn attempt
- The agent is reading back a critical confirmation ("your order number is...") that you don't want interrupted
Most platforms expose a "barge-in sensitivity" or "minimum interrupt duration" setting. The right number for a phone agent is usually around 200ms; for browser-based agents with cleaner audio, you can go lower.
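A "minimum interrupt duration" setting amounts to a debounce on the VAD output. The sketch below assumes the VAD emits one speech/no-speech decision per fixed-size frame (20ms here); `BargeInDetector` and its parameters are hypothetical names, not any platform's API.

```python
class BargeInDetector:
    """Declare barge-in only after continuous speech of a minimum duration."""

    def __init__(self, min_interrupt_ms: int = 200, frame_ms: int = 20) -> None:
        self.min_interrupt_ms = min_interrupt_ms
        self.frame_ms = frame_ms
        self._speech_ms = 0  # length of the current unbroken run of speech

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame decision; return True once barge-in should fire."""
        if is_speech:
            self._speech_ms += self.frame_ms
        else:
            self._speech_ms = 0  # any non-speech frame resets the run
        return self._speech_ms >= self.min_interrupt_ms
```

With these defaults, a 100ms burst (five speech frames followed by silence) never fires, while ten consecutive speech frames do. Lowering `min_interrupt_ms` makes the agent quicker to yield but more likely to stop for noise.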
The "let me finish this sentence" pattern
Some voice agents implement a soft barge-in: when the caller starts talking, the agent finishes the current word (or current syllable) rather than cutting off mid-sound. This sounds smoother but adds 100 to 300ms of perceived latency to barge-in.
The right choice depends on your audience. For impatient B2C customers, hard cutoff is better. For more formal contexts, the polite "let me finish" approach can feel more natural.
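One way to sketch a soft barge-in is to pick the stop point from word-boundary timestamps. This assumes the TTS engine reports the playback offset at which each word ends, which not every engine does; `soft_stop_time_ms` and the 300ms cap are illustrative choices, not a standard.

```python
from bisect import bisect_left
from typing import Sequence


def soft_stop_time_ms(
    word_end_ms: Sequence[int],  # ascending playback offsets where each word ends
    barge_in_ms: int,            # playback offset at which barge-in was declared
    max_extra_ms: int = 300,     # longest extra playback we are willing to allow
) -> int:
    """Return the playback offset at which the agent should stop talking."""
    # Index of the first word that ends at or after the barge-in point.
    i = bisect_left(word_end_ms, barge_in_ms)
    if i == len(word_end_ms):
        return barge_in_ms  # already past the last word; nothing to finish
    boundary = word_end_ms[i]
    if boundary - barge_in_ms > max_extra_ms:
        return barge_in_ms  # the word is too long to finish; fall back to hard cutoff
    return boundary
```

With word ends at 400, 900, and 1500ms, a barge-in at 700ms stops at 900ms (200ms of extra speech), while a barge-in at 1000ms cuts off immediately because finishing the word would take 500ms.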
How to measure
Three metrics worth tracking on every agent:
- Time-to-stop on barge-in. Median should be under 200ms; p99 under 400ms.
- False barge-in rate. Percentage of barge-in events triggered by something other than caller speech (noise, echo, backchannel). Should be under 5%.
- Missed barge-in rate. Percentage of caller interruptions the agent failed to respond to. Should be under 1%.
If any of these three is off, you have a turn-taking bug worth fixing.
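A small sketch of how these three metrics could be computed from call logs and checked against the targets above. The function names and input shapes are assumptions, and the p99 uses the nearest-rank definition, which is one of several common percentile conventions.

```python
import math
from statistics import median
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def turn_taking_report(
    stop_times_ms: Sequence[float],  # time-to-stop for each handled barge-in
    false_barge_ins: int,            # barge-ins triggered by noise, echo, backchannel
    total_barge_ins: int,            # all barge-in events the system declared
    missed_interruptions: int,       # caller interruptions the agent ignored
    total_interruptions: int,        # all real caller interruptions
) -> Dict[str, float]:
    return {
        "median_stop_ms": median(stop_times_ms),
        "p99_stop_ms": percentile(stop_times_ms, 99),
        "false_barge_in_rate": false_barge_ins / total_barge_ins,
        "missed_barge_in_rate": missed_interruptions / total_interruptions,
    }


# Targets from this article: 200ms median, 400ms p99, 5% false, 1% missed.
TARGETS = {
    "median_stop_ms": 200.0,
    "p99_stop_ms": 400.0,
    "false_barge_in_rate": 0.05,
    "missed_barge_in_rate": 0.01,
}


def failing_metrics(report: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are not under their target."""
    return [name for name, limit in TARGETS.items() if not report[name] < limit]
```

Counting false and missed barge-ins requires labeled calls, so in practice these two rates usually come from a reviewed sample rather than from every call.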
Related reading
- What Is a Voice Agent? A 2026 Primer
- How to Measure Voice Agent Quality
- First-Time Builder's Guide to Voice Agents
- Why Voice AI Will Transform Phone Channels by 2030
- Voice Agent Use Cases: A Field Guide
FAQ
Why is barge-in latency so much harder than other latency? Because the audio is already in flight when the cancel decision is made. You can stop generating new audio instantly, but the audio buffered at Twilio's edge will keep playing for whatever's already queued.
Can I use the same approach for chat agents? Conceptually yes (when the user starts typing while the agent is "typing," the agent should stop). In practice, chat barge-in is much less of a UX issue than voice.
Do all telephony providers handle barge-in cleanup the same way? No. Twilio, Plivo, Bandwidth, and plain SIP trunks all have different cleanup mechanisms. Test on your specific provider.
What's the relationship between barge-in and the endpointer? The endpointer decides when the caller is done; the barge-in handler decides when the agent should stop. Both listen to the caller's input audio, but the barge-in handler does so while the agent's output is playing and acts on that output stream. Both depend on robust VAD.
Is sub-100ms barge-in achievable? Yes, on WebRTC-based agents where you control the full pipeline. Over PSTN it's harder due to the buffered audio.
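To make the buffered-audio point from the first FAQ answer concrete, here is the arithmetic for how long already-queued audio keeps playing. The defaults assume standard telephone audio (G.711: 8,000 samples per second, one byte per sample, mono); `queued_playout_ms` is a hypothetical helper, and how many bytes are actually queued depends on your provider.

```python
def queued_playout_ms(
    queued_bytes: int,
    sample_rate_hz: int = 8000,  # G.711 telephone audio
    bytes_per_sample: int = 1,   # 8-bit mu-law or A-law
    channels: int = 1,
) -> float:
    """Milliseconds of audio that will still play if the queue is not flushed."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return 1000.0 * queued_bytes / bytes_per_second
```

Under these assumptions, 4,000 queued bytes is 500ms of speech that the caller still hears after the agent has "stopped," and a single 160-byte packet is 20ms.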

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
