A voice agent that doesn't know when to shut up is one of the most annoying things in software. Even if every word is right, an agent that talks past the moment when the caller wanted to interject feels worse than no agent at all. This is about the surprisingly tricky engineering of "stop talking now."

TL;DR

A voice agent uses three signals to decide when to stop talking: the end of the generated response, barge-in detection, and explicit "wrap it up" cues from the caller.
Most failures come from one of three places: TTS that ignores cancellation, LLMs that keep generating after they should stop, and audio buffers that take 300–500ms to flush.
Sub-200ms response to barge-in is the bar.

When the agent should stop

Three cases:

1. The reply is finished. Naturally, the agent stops at the end of its generated text. This is the easy case.

2. The caller starts talking. Barge-in. The agent should stop within 200ms.

3. The caller signals "enough." "OK, got it." "Sure, sure." "Yeah." These are short utterances that don't add new info but signal the caller has heard enough.

The first is automatic; the second and third require active detection.

The barge-in case

Barge-in handling is the most-felt of the three. The mechanics:

1. While TTS is playing, a separate listener watches the input audio for VAD activity.
2. When VAD fires for >100ms, the system declares barge-in.
3. The TTS stream is cancelled.
4. The audio buffered at the telephony provider is flushed.
5. The LLM generation (if still running) is cancelled.
6. The conversation state is updated to reflect that the agent's turn was interrupted.

Each of these can fail. The most common failure is step 4: audio buffered at the telephony edge keeps playing for 300–500ms after the agent has "stopped." Fixing this requires provider-specific handling.
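As a rough illustration, the steps above can be sketched in a few lines of Python. This is a minimal sketch, not any platform's actual API: the 20ms frame size, the `BargeInDetector` class, and the four callback names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

FRAME_MS = 20  # assumed VAD frame size; real pipelines vary


@dataclass
class BargeInDetector:
    """Declares barge-in once VAD reports continuous speech for > min_speech_ms."""
    min_speech_ms: int = 100
    _speech_ms: int = field(default=0, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; True only on the frame that triggers."""
        if not is_speech:
            self._speech_ms = 0  # any silent frame resets the run
            return False
        previous = self._speech_ms
        self._speech_ms += FRAME_MS
        # Fire exactly once, on the first frame that crosses the threshold.
        return previous <= self.min_speech_ms < self._speech_ms


def handle_barge_in(
    cancel_tts: Callable[[], None],
    flush_provider_audio: Callable[[], None],
    cancel_llm: Callable[[], None],
    mark_interrupted: Callable[[], None],
) -> None:
    """Run steps 3 to 6 in order. All four callbacks are hypothetical hooks."""
    cancel_tts()            # step 3: stop synthesizing new audio
    flush_provider_audio()  # step 4: provider-specific flush of queued audio
    cancel_llm()            # step 5: stop generating text nobody will hear
    mark_interrupted()      # step 6: record that the agent's turn was cut short
```

In a real agent, `push` would be called from the input-audio loop while TTS is playing, and `handle_barge_in` would run as soon as it returns `True`.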

Full details in how voice agents handle interruptions gracefully.

The "got it" case

A subtle case: the caller is paying attention, signals understanding with a quick "uh-huh" or "right," but doesn't actually want to take the floor. A naive barge-in handler treats this as an interruption and stops the agent mid-thought. A smart system suppresses barge-in for these short backchannel utterances.

The signal is usually:

Utterance is under 0.5 seconds
Audio energy is moderate (not assertive)
Transcript is a common backchannel word ("yeah," "right," "mhm")

Tuning the barge-in sensitivity is one of the highest-leverage UX decisions.
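One way to combine the three backchannel signals listed above is a simple conjunctive check. This is a sketch under stated assumptions: the function name, the word set, and especially the -20 dBFS energy ceiling are illustrative placeholders, not tuned values.

```python
# Backchannel words taken from the examples in the text; extend per language.
BACKCHANNEL_WORDS = {"yeah", "right", "mhm", "uh-huh"}


def is_backchannel(
    duration_s: float,
    energy_dbfs: float,
    transcript: str,
    max_duration_s: float = 0.5,       # "under 0.5 seconds"
    max_energy_dbfs: float = -20.0,    # assumed ceiling for "moderate" energy
) -> bool:
    """Return True when an utterance looks like a backchannel, not a turn attempt."""
    tokens = [word.strip(".,!?") for word in transcript.lower().split()]
    only_backchannel_words = bool(tokens) and all(
        token in BACKCHANNEL_WORDS for token in tokens
    )
    return (
        duration_s < max_duration_s
        and energy_dbfs <= max_energy_dbfs
        and only_backchannel_words
    )
```

A barge-in handler could call this before cancelling TTS and keep speaking when it returns `True`. Requiring all three signals is the conservative choice: when in doubt, the utterance is treated as a real interruption.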

When the agent should NOT stop talking

The flip side is the set of cases where the agent should keep going even though something seemed to interrupt:

Caller coughs or sneezes
Background noise (door slamming, dog barking)
Brief overlap that's a backchannel, not a turn attempt
The agent is reading back a critical confirmation ("your order number is...") that you don't want interrupted

Most platforms expose a "barge-in sensitivity" or "minimum interrupt duration" setting. The right number for a phone agent is usually around 200ms; for browser-based agents with cleaner audio, you can go lower.
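A minimal sketch of such a gate, assuming a per-channel minimum interrupt duration plus a flag for segments that must not be interrupted. The names are hypothetical, and the 120ms browser value is an assumed example of "lower", not a recommendation from any platform.

```python
# Minimum continuous speech before an interruption counts, per channel.
# 200ms for phone follows the text; 120ms for browser is an assumption.
MIN_INTERRUPT_MS = {"phone": 200, "browser": 120}


def should_interrupt(channel: str, speech_ms: int, segment_protected: bool) -> bool:
    """Decide whether detected caller audio should stop the agent."""
    if segment_protected:
        # e.g. the agent is reading back an order number
        return False
    return speech_ms >= MIN_INTERRUPT_MS[channel]
```

For example, `should_interrupt("phone", 150, False)` is `False`, while the same 150ms of speech on the `"browser"` channel is enough to interrupt.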

The "let me finish this sentence" pattern

Some voice agents implement a soft barge-in: when the caller starts talking, the agent finishes the current word (or current syllable) rather than cutting off mid-sound. This sounds smoother but adds 100–300ms of perceived latency to barge-in.

The right choice depends on your audience. For impatient B2C customers, hard cutoff is better. For more formal contexts, the polite "let me finish" approach can feel more natural.
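The two behaviours can share one code path if the TTS engine reports word-end timestamps, which is an assumption here since not every engine does. The sketch below picks the playback position at which to stop; `soft_stop_time` and its parameters are hypothetical names.

```python
from bisect import bisect_left
from typing import Sequence


def soft_stop_time(
    word_end_ms: Sequence[int],  # sorted end time of each word in the utterance
    barge_in_ms: int,            # playback position when barge-in was declared
    max_extra_ms: int = 300,     # cap on added latency; 0 gives a hard cutoff
) -> int:
    """Return the playback position (ms) at which the agent should go silent."""
    i = bisect_left(word_end_ms, barge_in_ms)  # first word ending at or after now
    if i == len(word_end_ms):
        return barge_in_ms  # already past the last word
    boundary = word_end_ms[i]
    if boundary - barge_in_ms > max_extra_ms:
        return barge_in_ms  # finishing the word would take too long
    return boundary
```

With word ends at 400, 900, and 1500ms, a barge-in at 700ms stops playback at 900ms, while a barge-in at 950ms stops immediately because the next boundary is 550ms away.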

How to measure

Three metrics worth tracking on every agent:

Time-to-stop on barge-in. Median should be under 200ms; p99 under 400ms.
False barge-in rate. Percentage of barge-in events triggered by something other than caller speech (noise, echo, backchannel). Should be under 5%.
Missed barge-in rate. Percentage of caller interruptions the agent failed to respond to. Should be under 1%.

If any of these three is off, you have a turn-taking bug worth fixing.
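A small sketch of how these three metrics might be computed from call logs. The function names and input shapes are assumptions; the p99 uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math
from statistics import median
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def turn_taking_report(
    stop_latencies_ms: Sequence[float],  # one entry per handled barge-in
    barge_in_events: int,                # all barge-ins the agent declared
    false_barge_ins: int,                # ...of which were noise, echo, backchannel
    caller_interruptions: int,           # real interruptions (from labelled calls)
    missed_interruptions: int,           # ...that the agent did not respond to
) -> Dict[str, float]:
    return {
        "median_stop_ms": median(stop_latencies_ms),
        "p99_stop_ms": percentile(stop_latencies_ms, 99),
        "false_barge_in_rate": false_barge_ins / barge_in_events,
        "missed_barge_in_rate": missed_interruptions / caller_interruptions,
    }


def within_targets(report: Dict[str, float]) -> bool:
    """Check a report against the thresholds given in the text."""
    return (
        report["median_stop_ms"] < 200
        and report["p99_stop_ms"] < 400
        and report["false_barge_in_rate"] < 0.05
        and report["missed_barge_in_rate"] < 0.01
    )
```

Note that the false and missed rates need ground truth, so in practice they come from a labelled sample of calls rather than from live traffic.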

FAQ

Why is barge-in latency so much harder than other latency? Because the audio is already in flight when the cancel decision is made. You can stop generating new audio instantly, but audio already queued at the telephony provider's edge (Twilio's, for example) keeps playing until that queue drains or is flushed.
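To put numbers on it, here is a back-of-the-envelope sketch that assumes 8 kHz, 8-bit G.711 audio, the usual narrowband telephony encoding. The function name is hypothetical, and the queue depth is whatever your provider actually holds.

```python
SAMPLE_RATE_HZ = 8000  # G.711 narrowband telephony
BYTES_PER_SAMPLE = 1   # 8-bit mu-law or A-law


def queued_playback_ms(queued_bytes: int) -> float:
    """Milliseconds of audio that will still play if the queue is not flushed."""
    return queued_bytes * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

Under these assumptions, 4000 bytes sitting in the queue is half a second of speech the caller still hears after the agent has "stopped."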

Can I use the same approach for chat agents? Conceptually yes (when the user starts typing while the agent is "typing," the agent should stop). In practice, chat barge-in is much less of a UX issue than voice.

Do all telephony providers handle barge-in cleanup the same way? No. Twilio, Plivo, Bandwidth, and plain SIP trunks all have different cleanup mechanisms. Test on your specific provider.

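What's the relationship between barge-in and the endpointer? The endpointer decides when the caller is done; the barge-in handler decides when the agent should stop. Both listen to the caller's input audio and depend on robust VAD. The difference is when they matter: the endpointer while the caller holds the floor, the barge-in handler while the agent's output is playing.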

Is sub-100ms barge-in achievable? Yes, on WebRTC-based agents where you control the full pipeline. Over PSTN it's harder due to the buffered audio.

How Voice Agents Decide When to Stop Talking

