How Voice Agents Handle Interruptions Gracefully
Interruption handling is the single most-felt UX detail in voice AI. Done well, the agent feels conversational and responsive. Done poorly, the agent runs over you, doesn't notice, and you end up shouting at your phone. This is the engineering and design behind getting it right.
TL;DR
- Barge-in detection runs in parallel with TTS playback: the agent must keep listening even while it's talking.
- The hardest part isn't detection; it's the cleanup (stopping audio, cancelling LLM tokens, updating state).
- Sub-200ms barge-in response is the bar. Anything slower and the caller has to repeat themselves.
- Different telephony providers behave differently. Twilio, Plivo, and SIP all need slightly different handling.
What "graceful" means
Three things have to happen the moment the user starts talking over the agent:
- The agent's audio stops within 200ms. Not 500ms. Not "the next sentence." Now.
- The LLM stops generating. No orphan tokens that get half-spoken on the next turn.
- The conversation state knows the agent was interrupted. The transcript should reflect the partial utterance, not the full intended one.
If any of these fails, the result is awkward. The agent talks over the user, or starts the next turn with the leftover sentence, or tries to defend a position the user already moved past.
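One way to record the partial utterance is to estimate how much of the text was actually played. This is a minimal sketch, assuming your TTS engine gives you no word timestamps and that speech rate is roughly uniform across the utterance. The names `estimate_heard_text`, `played_ms`, and `total_ms` are hypothetical, not from any particular SDK.

```python
def estimate_heard_text(full_text: str, played_ms: float, total_ms: float) -> tuple[str, int]:
    """Return (text the caller probably heard, character offset N)."""
    if total_ms <= 0 or played_ms <= 0:
        return "", 0
    fraction = min(played_ms / total_ms, 1.0)
    cut = int(len(full_text) * fraction)
    if cut >= len(full_text):
        return full_text, len(full_text)
    # Back up to the previous word boundary so the transcript never ends mid-word.
    boundary = full_text.rfind(" ", 0, cut + 1)
    cut = boundary if boundary > 0 else 0
    return full_text[:cut], cut


heard, n = estimate_heard_text(
    "We recommend Plan B because it is cheaper", played_ms=1000, total_ms=2000
)
print(heard, n)
```

If your TTS provider does return word or character timings, use those instead; the proportional estimate is only a fallback.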
The detection side
Detecting barge-in is mostly a voice activity detection (VAD) problem. Even while the agent is talking, you keep a separate listener watching the input audio. When the user's microphone level stays above a threshold for more than ~100ms, you fire a barge-in event.
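The threshold-plus-sustain idea can be sketched in a few lines. This is a simplified energy detector, not a production VAD (real systems usually use a trained model). The class name `BargeInDetector`, the 20ms frame size, and the RMS threshold of 500 are all assumptions for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    threshold_rms: float = 500.0  # assumed level for 16-bit PCM; tune per channel
    sustain_ms: int = 100         # the ~100ms sustain window described above
    frame_ms: int = 20            # assumed frame size
    _loud_ms: int = field(default=0, init=False, repr=False)

    def process_frame(self, samples: list[int]) -> bool:
        """Feed one frame of input PCM; return True on the frame where barge-in fires."""
        rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
        if rms < self.threshold_rms:
            self._loud_ms = 0  # any quiet frame resets the sustain window
            return False
        before = self._loud_ms
        self._loud_ms += self.frame_ms
        # Fire exactly once, when the loud run first reaches the sustain window.
        return before < self.sustain_ms <= self._loud_ms


detector = BargeInDetector()
loud_frame = [1000] * 160  # 20ms of 8kHz audio at a constant level
fired = [detector.process_frame(loud_frame) for _ in range(6)]
print(fired)
```

The detector runs on the input stream the whole time the agent is speaking, which is why it has to live outside the TTS playback loop.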
The complications:
- Echo cancellation. If the agent's audio is bleeding back into the microphone (common on speakerphone), you need echo cancellation to filter it out before VAD. Otherwise the agent triggers its own barge-in.
- Background noise. A barking dog or kid in the room shouldn't fire barge-in. Most VAD systems handle this with a higher threshold and a brief sustain check.
- Cough/sneeze. Same problem; same solution.
Production-grade voice agents tune these thresholds for their channel; phone audio is more forgiving than browser microphone audio.
The cleanup side
This is where most implementations fall apart. When barge-in fires, you need to:
- Stop the TTS stream. Cancel the WebSocket or SSE connection that's feeding audio chunks to the telephony layer. This only stops future audio; chunks already in flight will still play.
- Flush the telephony buffer. On Twilio Media Streams, sending a "clear" message on the stream WebSocket does this (the TwiML <Stop> verb ends a stream rather than flushing it); Plivo has a similar mechanism; with SIP you typically stop sending RTP and discard whatever your own stack still has queued. Without this, the agent keeps talking for 300 to 500ms after barge-in fires because the audio is buffered at the provider's edge.
- Cancel the LLM generation. If the LLM is still streaming tokens, cancel that stream. Otherwise you waste compute and risk orphan tokens.
- Update conversation state. Mark the agent's last turn as "interrupted at character N" so the next turn knows what the user actually heard. This matters because the LLM's next reply should not assume the user heard the full sentence.
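The four cleanup steps can be sketched as a single handler. This assumes an asyncio-based agent where TTS streaming and LLM generation each run as their own task. `TurnState`, `handle_barge_in`, and the `flush_telephony` callback are hypothetical names, and the flush itself is provider-specific.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class TurnState:
    full_text: str
    interrupted: bool = False
    heard_chars: Optional[int] = None  # "interrupted at character N"


async def handle_barge_in(
    tts_task: asyncio.Task,
    llm_task: asyncio.Task,
    flush_telephony: Callable[[], Awaitable[None]],
    state: TurnState,
    heard_chars: int,
) -> None:
    tts_task.cancel()          # 1. stop feeding new audio chunks
    await flush_telephony()    # 2. drop audio already buffered downstream
    llm_task.cancel()          # 3. stop token generation
    # Wait for both tasks to unwind so no orphan tokens arrive after this point.
    await asyncio.gather(tts_task, llm_task, return_exceptions=True)
    state.interrupted = True   # 4. record what the caller actually heard
    state.heard_chars = heard_chars


async def demo() -> TurnState:
    async def long_running() -> None:
        await asyncio.sleep(3600)  # stand-in for a streaming TTS or LLM call

    async def flush() -> None:
        pass  # stand-in for the provider-specific flush

    tts = asyncio.create_task(long_running())
    llm = asyncio.create_task(long_running())
    await asyncio.sleep(0)  # let both tasks start
    state = TurnState(full_text="We recommend Plan B because it is cheaper")
    await handle_barge_in(tts, llm, flush, state, heard_chars=19)
    return state


print(asyncio.run(demo()))
```

Waiting on both cancelled tasks before touching state is the design choice that matters here: it guarantees nothing from the old turn can leak into the next one.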
We have a deeper piece on the anatomy of a voice agent pipeline that covers where barge-in fits in the larger architecture.
What the user should hear
When barge-in is handled well, the user experience is:
- They start talking.
- The agent stops talking within ~150ms.
- A brief silence (~200ms) while the agent processes.
- The agent's reply addresses what the user just said.
What the user should not hear:
- The agent talking over them for 500+ms after they start.
- A weird audio cut-off mid-syllable that sounds like a glitch.
- The agent's next turn starting with the leftover sentence ("...and that's why we recommend Plan B. Sorry, what did you say?").
Different telephony providers, different quirks
A non-exhaustive list of provider-specific behavior I've seen in production:
- Twilio. Barge-in works cleanly via the streaming media endpoint. Sending a "clear" message on a bidirectional stream drops buffered audio; without it, you get the 500ms tail.
- Plivo. Similar to Twilio but the buffer flush is on a different control message.
- Bandwidth. Robust but the API surface is smaller; some teams have to manually manage the audio stream lifecycle.
- Bring-your-own-SIP. Most flexible but requires you to implement the audio buffer management yourself.
- WebRTC. Cleanest because you control the full audio pipeline; barge-in is essentially instant.
The takeaway: assume your barge-in implementation will need provider-specific tuning. Test on real calls, not just simulated ones.
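One way to keep provider quirks out of the agent logic is a small flush interface with one implementation per provider. The sketch below is an assumption about how you might structure it. The Twilio message shape (`{"event": "clear", "streamSid": ...}`) follows my reading of the Media Streams WebSocket docs for bidirectional streams, so verify it against the current reference. The class names and the `send` callback are hypothetical, and other providers are left out because their control messages aren't named here.

```python
import json
from collections import deque
from typing import Callable, Protocol


class BufferFlusher(Protocol):
    def flush(self) -> None: ...


class TwilioMediaStreamFlusher:
    """Asks Twilio to drop audio it has buffered for a bidirectional stream."""

    def __init__(self, send: Callable[[str], None], stream_sid: str) -> None:
        self._send = send  # e.g. your WebSocket's send function
        self._stream_sid = stream_sid

    def flush(self) -> None:
        self._send(json.dumps({"event": "clear", "streamSid": self._stream_sid}))


class LocalQueueFlusher:
    """For bring-your-own-SIP or WebRTC, where you own the outbound frame queue."""

    def __init__(self, outbound_frames: deque) -> None:
        self._frames = outbound_frames

    def flush(self) -> None:
        self._frames.clear()


def on_barge_in(flusher: BufferFlusher) -> None:
    flusher.flush()  # the agent logic never needs to know which provider it is


sent: list[str] = []
on_barge_in(TwilioMediaStreamFlusher(sent.append, stream_sid="MZ-example"))
print(sent)
```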
When barge-in should not fire
A good system distinguishes "the user is talking" from "there's noise on the line." False positives (the agent stopping when it shouldn't) are nearly as bad as false negatives.
Common cases where barge-in should be suppressed:
- The user says "uh-huh" or "mm-hmm" as a backchannel to confirm they're listening. The agent should keep talking.
- Brief environmental sounds (door close, dog bark).
- The agent is in the middle of reading back a critical confirmation ("your order number is 4521..."). Some agents intentionally suppress barge-in for confirmations.
Most platforms expose a "barge-in sensitivity" setting for this. Tuning it for your channel and use case is a real piece of work.
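If your platform doesn't expose that knob, the suppression rules above can be approximated in your own code. This is a sketch under assumptions: `SpeechEvent`, `should_barge_in`, and the `agent_in_protected_segment` flag are hypothetical, the 100ms default mirrors the sustain window mentioned earlier, and the transcript check only helps if your STT produces a partial result quickly enough.

```python
from dataclasses import dataclass

# Backchannels named in the text; extend this set for your language and callers.
BACKCHANNELS = {"uh-huh", "mm-hmm"}


@dataclass
class SpeechEvent:
    duration_ms: int
    partial_transcript: str = ""  # empty if STT has produced nothing yet


def should_barge_in(
    event: SpeechEvent,
    *,
    min_duration_ms: int = 100,
    agent_in_protected_segment: bool = False,
) -> bool:
    if agent_in_protected_segment:
        return False  # e.g. reading back an order number
    if event.duration_ms < min_duration_ms:
        return False  # door close, dog bark, cough
    text = event.partial_transcript.strip().lower().strip(".,!?")
    if text in BACKCHANNELS:
        return False  # the caller is just signalling that they're listening
    return True


print(should_barge_in(SpeechEvent(400, "no, the first one")))
print(should_barge_in(SpeechEvent(400, "Uh-huh.")))
```

Raising `min_duration_ms` filters more noise and backchannels but eats directly into the response-time budget, so it is a tradeoff rather than a free setting.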
The acknowledgment trick
A subtle move some voice agents use: when barge-in fires, the agent emits a quick acknowledgment ("oh, sorry, go ahead") before listening. This makes the interruption feel intentional rather than glitchy. Used sparingly, it's nice. Overdone, it's annoying.
The right cadence is roughly: acknowledge on ~30% of barge-ins, especially the longer ones. For short corrections ("no, the first one"), no acknowledgment needed.
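That cadence can be expressed as a small policy function. This is a rough sketch with assumed numbers: the 1500ms cutoff for a "short correction" is a placeholder, `should_acknowledge` is a hypothetical name, and because short interruptions never get an acknowledgment, the overall rate across all barge-ins will land below `rate`, so tune both values against real calls.

```python
import random
from typing import Callable


def should_acknowledge(
    user_speech_ms: int,
    *,
    rate: float = 0.3,                # the ~30% cadence from the text
    short_correction_ms: int = 1500,  # assumed cutoff for "no, the first one"
    roll: Callable[[], float] = random.random,
) -> bool:
    if user_speech_ms < short_correction_ms:
        return False  # short corrections get no acknowledgment
    return roll() < rate


print(should_acknowledge(500))
```

Passing `roll` in as a parameter keeps the policy deterministic in tests while staying random in production.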
Related reading
- Turn-Taking and Barge-In: The Mechanics of Natural Conversation
- How Voice Agents Decide When to Stop Talking
- What Is a Voice Agent? A 2026 Primer
- How a Conversational Voice Agent Actually Works (Under the Hood)
- The Hidden Complexity of Numbers in Voice Agents
FAQ
How fast should barge-in be? Sub-200ms is the bar. Above 300ms and callers complain.
Can I disable barge-in for sensitive confirmations? Yes. Most platforms let you suppress barge-in for specific TTS segments. Use this sparingly; users are quick to learn that the agent can be interrupted, and disabling it feels off.
Why does my agent keep getting barge-in from its own audio? Echo cancellation isn't working on the input stream. This is common on speakerphone calls. The fix is in the audio capture layer, not the agent logic.
What about backchannel sounds like "uh-huh"? Use a barge-in sensitivity that requires longer or louder utterances to fire. Most platforms expose this as a knob.
Does barge-in work in browser-based voice agents (no phone)? Yes. The implementation is actually cleaner over WebRTC because you control the full audio pipeline.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
