πŸŽ™οΈ Voice AI Fundamentals

How Voice Agents Handle Interruptions Gracefully

Interruption handling is the single most-felt UX detail in voice AI. Done well, the agent feels conversational and responsive. Done poorly, the agent runs over you, doesn't notice, and you end up shouting at your phone. This is the engineering and design behind getting it right.

Tyler Weitzman
January 4, 2026 · 6 min read
Speechify

TL;DR

  • Barge-in detection runs in parallel with TTS playback: the agent must keep listening even while it's talking.
  • The hardest part isn't detection; it's the cleanup (stopping audio, cancelling LLM tokens, updating state).
  • Sub-200ms barge-in response is the bar. Anything slower and the caller has to repeat themselves.
  • Different telephony providers behave differently. Twilio, Plivo, and SIP all need slightly different handling.

What "graceful" means

Three things have to happen the moment the user starts talking over the agent:

  1. The agent's audio stops within 200ms. Not 500ms. Not "the next sentence." Now.
  2. The LLM stops generating. No orphan tokens that get half-spoken on the next turn.
  3. The conversation state knows the agent was interrupted. The transcript should reflect the partial utterance, not the full intended one.

If any of these fails, the result is awkward. The agent talks over the user, or starts the next turn with the leftover sentence, or tries to defend a position the user already moved past.
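The third requirement can be sketched as a function that keeps only the part of the reply whose audio actually played. This is a minimal sketch, assuming the TTS layer can report the text and duration of each audio chunk; `SpokenChunk` and `heard_text` are hypothetical names, not part of any particular SDK.

```python
from dataclasses import dataclass


@dataclass
class SpokenChunk:
    """One TTS audio chunk and the reply text it voices (hypothetical type)."""
    text: str          # the text this chunk speaks
    duration_ms: int   # how long the chunk's audio lasts


def heard_text(chunks: list[SpokenChunk], played_ms: int) -> str:
    """Return the part of the reply the caller plausibly heard.

    Chunks that finished playing are kept whole. The chunk that was cut off
    is trimmed in proportion to how much of its audio played, then backed up
    to the last word boundary so the transcript contains no half-words.
    """
    heard = []
    remaining = played_ms
    for chunk in chunks:
        if remaining >= chunk.duration_ms:
            heard.append(chunk.text)
            remaining -= chunk.duration_ms
            continue
        if remaining > 0 and chunk.duration_ms > 0:
            cut = len(chunk.text) * remaining // chunk.duration_ms
            partial = chunk.text[:cut]
            # Drop a trailing partial word if the cut landed inside one.
            if cut < len(chunk.text) and not chunk.text[cut].isspace():
                partial = partial.rsplit(" ", 1)[0] if " " in partial else ""
            heard.append(partial)
        break
    return "".join(heard).rstrip()
```

The proportional trim is only an estimate, since speech rate is not uniform across a chunk. If the TTS engine exposes word-level timestamps, those would be a more accurate source.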

The detection side

Detecting barge-in is mostly a voice activity detection (VAD) problem. Even while the agent is talking, you keep a separate listener watching the input audio. When the user's microphone level stays above a threshold for more than ~100ms, you fire a barge-in event.

The complications:

  • Echo cancellation. If the agent's audio is bleeding back into the microphone (common on speakerphone), you need echo cancellation to filter it out before VAD. Otherwise the agent triggers its own barge-in.
  • Background noise. A barking dog or kid in the room shouldn't fire barge-in. Most VAD systems handle this with a higher threshold and a brief sustain check.
  • Cough/sneeze. Same problem; same solution.

Production-grade voice agents tune these thresholds for their channel; phone audio is more forgiving than browser microphone audio.
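The threshold-plus-sustain rule can be sketched in a few lines. This is an illustration under stated assumptions, not a production VAD: it uses plain RMS energy on frames that have already been echo-cancelled, and the `threshold`, `sustain_ms`, and `frame_ms` defaults are placeholder values to tune per channel. Real systems more often use a trained VAD model in place of the energy check.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    """Energy-threshold detector with a sustain check (illustrative values)."""
    threshold: float = 0.05   # assumed RMS level on a 0.0-1.0 sample scale
    sustain_ms: int = 100     # sound must persist this long before firing
    frame_ms: int = 20        # duration of each audio frame
    _active_ms: int = field(default=0, init=False, repr=False)

    def process(self, frame: list[float]) -> bool:
        """Feed one frame of samples; return True on the frame where barge-in fires."""
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0
        if rms < self.threshold:
            self._active_ms = 0   # a quiet frame resets the sustain timer
            return False
        previous = self._active_ms
        self._active_ms += self.frame_ms
        # Fire once, on the frame that first reaches the sustain window.
        return previous < self.sustain_ms <= self._active_ms
```

Because a single quiet frame resets the timer, a short cough or door slam that ends before the sustain window never fires. Raising `sustain_ms` trades fewer false positives for slower barge-in.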

The cleanup side

This is where most implementations fall apart. When barge-in fires, you need to:

Stop the TTS stream. Cancel the WebSocket or SSE connection that's feeding audio chunks to the telephony layer. If the chunks are already in flight, this only stops future audio.

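Flush the telephony buffer. On Twilio bidirectional Media Streams, this is the "clear" message sent over the stream's WebSocket (the <Stop> verb ends the stream rather than flushing it); Plivo has a similar mechanism; with raw SIP you have to stop the outbound audio and discard whatever you have queued yourself. Without this, the agent keeps talking for 300–500ms after barge-in fires because the audio is buffered at the provider's edge.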

Cancel the LLM generation. If the LLM is still streaming tokens, cancel that stream. Otherwise you waste compute and risk orphan tokens.

Update conversation state. Mark the agent's last turn as "interrupted at character N" so the next turn knows what the user actually heard. This matters because the LLM's next reply should not assume the user heard the full sentence.
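The four steps can be sketched as one asyncio coroutine. This is a minimal sketch, assuming the TTS feed and the LLM stream each run as an `asyncio.Task`; `AgentTurn`, `handle_barge_in`, and the `flush_provider` callback are hypothetical names standing in for whatever your stack provides.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class AgentTurn:
    """The reply the agent is currently speaking (hypothetical type)."""
    text: str
    interrupted_at: Optional[int] = None  # character offset the caller heard up to


async def handle_barge_in(
    turn: AgentTurn,
    tts_task: asyncio.Task,
    llm_task: asyncio.Task,
    flush_provider: Callable[[], Awaitable[None]],
    chars_played: int,
) -> None:
    """Run the cleanup steps in the order listed above."""
    # 1. Stop feeding new audio to the telephony layer.
    tts_task.cancel()
    # 2. Drop audio already buffered at the provider (provider-specific call).
    await flush_provider()
    # 3. Stop token generation so nothing leaks into the next turn.
    llm_task.cancel()
    # Wait for both tasks to unwind; their CancelledError is expected here.
    await asyncio.gather(tts_task, llm_task, return_exceptions=True)
    # 4. Record how much of the reply was actually heard.
    turn.interrupted_at = min(chars_played, len(turn.text))
```

Waiting on `gather` before updating state means the next turn cannot start while either stream is still winding down, which is one way to avoid orphan tokens.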

We have a deeper piece on the anatomy of a voice agent pipeline that covers where barge-in fits in the larger architecture.

What the user should hear

When barge-in is handled well, the user experience is:

  • They start talking.
  • The agent stops talking within ~150ms.
  • A brief silence (~200ms) while the agent processes.
  • The agent's reply addresses what the user just said.

What the user should not hear:

  • The agent talking over them for 500+ms after they start.
  • A weird audio cut-off mid-syllable that sounds like a glitch.
  • The agent's next turn starting with the leftover sentence ("…and that's why we recommend Plan B. Sorry, what did you say?").
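These expectations can be checked against call logs. The sketch below assumes you record four timestamps per interrupted turn; the `BargeInTiming` record and `summarize` helper are hypothetical, and the 200ms default comes from the bar stated in this article.

```python
from dataclasses import dataclass


@dataclass
class BargeInTiming:
    """Millisecond timestamps from one interrupted turn (hypothetical log record)."""
    user_speech_start_ms: int
    agent_audio_stop_ms: int
    user_speech_end_ms: int
    agent_reply_start_ms: int


def summarize(t: BargeInTiming, stop_budget_ms: int = 200) -> dict:
    """Compute how long the agent talked over the caller and the pause before its reply."""
    stop_latency = t.agent_audio_stop_ms - t.user_speech_start_ms
    return {
        "stop_latency_ms": stop_latency,
        "within_budget": stop_latency < stop_budget_ms,
        "response_gap_ms": t.agent_reply_start_ms - t.user_speech_end_ms,
    }
```

Running this over real calls, rather than simulated ones, shows where the provider-side buffer is adding to the tail.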

Different telephony providers, different quirks

A non-exhaustive list of provider-specific behavior I've seen in production:

  • Twilio. Barge-in works cleanly via the streaming media endpoint. Sending a "clear" message on the stream's WebSocket drops buffered audio; without it, you get the 500ms tail.
  • Plivo. Similar to Twilio but the buffer flush is on a different control message.
  • Bandwidth. Robust but the API surface is smaller; some teams have to manually manage the audio stream lifecycle.
  • Bring-your-own-SIP. Most flexible but requires you to implement the audio buffer management yourself.
  • WebRTC. Cleanest because you control the full audio pipeline; barge-in is essentially instant.

The takeaway: assume your barge-in implementation will need provider-specific tuning. Test on real calls, not just simulated ones.
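One way to contain the provider differences is to hide the buffer flush behind a small interface. The sketch below is an assumption-laden illustration: `OutboundAudio`, `TwilioStream`, `LocalQueue`, and the `send` callback are hypothetical names. The Twilio variant sends the `clear` event that Twilio's Media Streams documentation describes for bidirectional streams; other hosted providers would need their own adapter built from their own documentation.

```python
import json
from collections import deque
from typing import Callable, Deque, Protocol


class OutboundAudio(Protocol):
    """Anything that can drop audio already queued for the caller."""
    def flush(self) -> None: ...


class TwilioStream:
    """Twilio bidirectional Media Stream: ask Twilio to empty its buffer."""

    def __init__(self, stream_sid: str, send: Callable[[str], None]) -> None:
        self.stream_sid = stream_sid
        self.send = send  # hypothetical callback that writes text to the WebSocket

    def flush(self) -> None:
        self.send(json.dumps({"event": "clear", "streamSid": self.stream_sid}))


class LocalQueue:
    """Self-managed pipeline (your own SIP or WebRTC stack): you own the queue."""

    def __init__(self) -> None:
        self.frames: Deque[bytes] = deque()

    def flush(self) -> None:
        self.frames.clear()


def on_barge_in(sink: OutboundAudio) -> None:
    """Cleanup code calls one method and stays provider-agnostic."""
    sink.flush()
```

With this shape, the provider-specific tuning lives in one adapter per provider and the barge-in logic itself does not change between them.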

When barge-in should not fire

A good system distinguishes "the user is talking" from "there's noise on the line." False positives (the agent stopping when it shouldn't) are nearly as bad as false negatives.

Common cases where barge-in should be suppressed:

  • The user says "uh-huh" or "mm-hmm" as a backchannel to confirm they're listening. The agent should keep talking.
  • Brief environmental sounds (door close, dog bark).
  • The agent is in the middle of reading back a critical confirmation ("your order number is 4521..."). Some agents intentionally suppress barge-in for confirmations.

Most platforms expose a "barge-in sensitivity" setting for this. Tuning it for your channel and use case is a real piece of work.
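If your platform does not expose such a setting, the three suppression cases can be approximated with a small policy function. This is a sketch under assumptions: `SpeechEvent`, `should_barge_in`, the backchannel list, and the 150ms default are all illustrative choices, and it presumes a streaming speech recognizer that supplies partial text.

```python
from dataclasses import dataclass

# Backchannel phrases named in this article; extend for your own callers.
BACKCHANNELS = {"uh-huh", "mm-hmm"}


@dataclass
class SpeechEvent:
    """What is known about the caller's sound so far (hypothetical type)."""
    duration_ms: int
    partial_transcript: str  # streaming ASR text; may be empty


def should_barge_in(
    event: SpeechEvent,
    segment_interruptible: bool = True,
    min_duration_ms: int = 150,
) -> bool:
    """Decide whether a detected sound should actually stop the agent."""
    if not segment_interruptible:
        return False  # e.g. the agent is reading back a confirmation
    if event.duration_ms < min_duration_ms:
        return False  # too brief: door close, dog bark, cough
    phrase = event.partial_transcript.lower().strip(" .,!?")
    if phrase in BACKCHANNELS:
        return False  # the caller is signalling that they are listening
    return True
```

The transcript check costs latency, because it can only run once the recognizer has produced text. One option is to lower the agent's volume as soon as sound is detected and stop fully only when this function returns True.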

The acknowledgment trick

A subtle move some voice agents use: when barge-in fires, the agent emits a quick acknowledgment ("oh, sorry, go ahead") before listening. This makes the interruption feel intentional rather than glitchy. Used sparingly, it's nice. Overdone, it's annoying.

The right cadence is roughly: acknowledge on ~30% of barge-ins, especially the longer ones. For short corrections ("no, the first one"), no acknowledgment needed.
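That cadence can be sketched as a randomized decision. The text above does not say how "longer" is measured, so this sketch assumes it means the duration of the caller's speech so far. The `should_acknowledge` name, the `short_ms` cutoff, and the injectable `rng` are hypothetical.

```python
import random
from typing import Callable


def should_acknowledge(
    interruption_ms: int,
    rate: float = 0.3,
    short_ms: int = 1500,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Acknowledge only some longer interruptions; never short corrections."""
    if interruption_ms < short_ms:
        return False      # short correction: just listen
    return rng() < rate   # acknowledge roughly `rate` of the longer ones
```

Because short interruptions are always skipped, the overall acknowledgment rate lands below `rate`; raise it if you want about 30% across all barge-ins. Passing `rng` explicitly keeps the behavior reproducible in tests.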

FAQ

How fast should barge-in be? Sub-200ms is the bar. Above 300ms and callers complain.

Can I disable barge-in for sensitive confirmations? Yes. Most platforms let you suppress barge-in for specific TTS segments. Use this sparingly; users are quick to learn that the agent can be interrupted, and disabling it feels off.

Why does my agent keep getting barge-in from its own audio? Echo cancellation isn't working on the input stream. This is common on speakerphone calls. The fix is in the audio capture layer, not the agent logic.

What about backchannel sounds like "uh-huh"? Use a barge-in sensitivity that requires longer or louder utterances to fire. Most platforms expose this as a knob.

Does barge-in work in browser-based voice agents (no phone)? Yes. The implementation is actually cleaner over WebRTC because you control the full audio pipeline.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
