Two humans on a phone call don't take turns the way a tennis match does. They overlap. They interrupt. They finish each other's sentences. They leave 200ms gaps between turns and call it polite. A voice agent that can't do this — even if every word is correct — feels broken.

This piece is about the two least-discussed but most-felt parts of a voice agent: deciding when to talk, and shutting up when interrupted.

TL;DR

Turn-taking is the system that decides whose turn it is to speak next.
Barge-in is the specific case where the user starts talking while the agent is still talking.
Get these right and the agent feels alive. Get them wrong and it feels like a 1990s answering machine.
The hardest part isn't detecting silence — it's distinguishing "thinking pause" from "I'm done."

What turn-taking actually is

In a real conversation, there's a constant negotiation about who has the floor. Conversation analysts describe it in terms of turn-constructional units, the chunks of speech after which a handoff becomes possible. Two signals dominate:

Prosody: falling intonation usually signals a turn ending; rising or level intonation more often signals a continuation, though a rising question also hands the floor over.
Lexical completeness: a complete sentence is more likely to be a turn boundary than an incomplete one.

Humans use both, plus a dozen smaller cues like breath patterns, body language, and shared context. Voice agents get roughly those two main signals, plus voice activity detection (VAD).

The naive approach (and why it fails)

The simplest turn-taking strategy: count silence. After N milliseconds of no audio, assume the user is done and start replying.

In text chat the question never arises: the user presses send, and that marks the end of the turn. Voice has no send button, and a fixed silence threshold falls apart fast:

Too short (200–400ms): the agent jumps in mid-sentence whenever the user takes a breath. Feels rude. Caller has to repeat themselves constantly.
Too long (1–2s): every utterance ends with awkward dead air. The conversation feels slow even when latency is otherwise great.

The right number is "it depends." A confident, short utterance ("yes, that's right") needs almost no wait. A complex thought ("I called yesterday and they told me to try a different number, but actually the original number was working, so I'm a bit confused...") needs more.
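To make the trade-off concrete, here is a minimal sketch of the flat-threshold approach. It assumes a VAD that emits one boolean per 20ms frame; the frame size and the function name are illustrative and not taken from any particular library.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed VAD frame size


def fixed_threshold_endpoint(
    vad_frames: Iterable[bool], silence_ms: int, frame_ms: int = FRAME_MS
) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None.

    vad_frames holds one boolean per frame: True for speech, False for silence.
    """
    heard_speech = False
    silent_run_ms = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run_ms = 0
            continue
        if not heard_speech:
            continue  # silence before the user starts talking is not a turn end
        silent_run_ms += frame_ms
        if silent_run_ms >= silence_ms:
            return (i + 1) * frame_ms
    return None


# 600ms of speech, a 400ms breath, 600ms more speech, then 1.5s of silence.
frames = [True] * 30 + [False] * 20 + [True] * 30 + [False] * 75

print(fixed_threshold_endpoint(frames, silence_ms=300))   # 900: fires during the breath
print(fixed_threshold_endpoint(frames, silence_ms=1000))  # 2600: a full second after the user stopped
```

Neither setting is right for this utterance. The short threshold interrupts at the breath, and the long one adds a second of dead air after the user finishes at 1600ms.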

The endpointer

The real solution is an endpointer — a small model that takes VAD + prosodic features + lexical completeness and outputs "the speaker is probably done." Modern endpointers can hit 200–300ms median delay while staying robust to mid-thought pauses.

Some signals an endpointer uses:

Has the silence lasted past the floor (e.g., 250ms)?
Did the audio just before the silence end with a falling pitch contour?
Is the transcript so far a syntactically complete sentence?
Did the speaker use a "filler that signals continuation" like "um" or "so..."?

A good endpointer cuts the perceived latency of an agent by 200–400ms vs a flat silence threshold — without making the agent feel pushy.
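A real endpointer is a learned model, but the way the signals combine can be sketched with hand-written rules. The version below is a rule-based stand-in, not how any production endpointer works: the feature names, the filler list, and every weight are assumptions picked for illustration.

```python
from dataclasses import dataclass

# Illustrative list of words that suggest the speaker intends to continue.
CONTINUATION_FILLERS = ("um", "uh", "so", "and", "but")


@dataclass
class TurnFeatures:
    silence_ms: int               # silence since the last speech frame
    pitch_falling: bool           # pitch contour fell just before the silence
    syntactically_complete: bool  # transcript so far reads as a full sentence
    transcript: str


def required_silence_ms(
    f: TurnFeatures, floor_ms: int = 250, ceiling_ms: int = 1200
) -> int:
    """How long to wait before replying, given the cues (all weights assumed)."""
    wait = 700  # neutral starting point
    if f.pitch_falling:
        wait -= 200
    if f.syntactically_complete:
        wait -= 250
    words = f.transcript.lower().rstrip(" .,!?").split()
    if words and words[-1] in CONTINUATION_FILLERS:
        wait += 500  # trailing "um" or "so..." means more is coming
    return max(floor_ms, min(ceiling_ms, wait))


def is_end_of_turn(f: TurnFeatures) -> bool:
    return f.silence_ms >= required_silence_ms(f)


quick = TurnFeatures(260, True, True, "yes, that's right")
trailing = TurnFeatures(600, False, False, "I called yesterday, so...")

print(required_silence_ms(quick), is_end_of_turn(quick))        # 250 True
print(required_silence_ms(trailing), is_end_of_turn(trailing))  # 1200 False
```

The point of the sketch is the shape: the silence floor stays fixed, and the other cues move the wait up or down from there. A learned endpointer replaces the hand-picked weights with a model trained on real turn boundaries.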

For more on the diagnostic side, see voice activity detection in production voice agents.

Barge-in

Barge-in is when the user starts talking while the agent is talking. Two reasons it happens:

The user has more to say and the agent jumped in early.
The user has heard enough of the agent's reply and wants to redirect.

In both cases, the agent should immediately stop talking, flush its audio buffer, and start listening. This sounds easy. It's not. The hard parts:

The audio buffer. The agent's TTS may have already streamed 500ms of audio to the caller's phone, buffered at the telephony provider. Stopping the LLM doesn't stop the audio. You have to send a "drop buffered audio" command to the telephony layer, which not all providers support cleanly.

The cancellation. The LLM may still be generating tokens that would have been spoken. You need to cancel that generation; otherwise compute keeps running and may produce orphan output that confuses subsequent turns.

The resumption. After barge-in, the conversation state is messy. The agent should know that what it was about to say didn't get said. The transcript should reflect that the agent's last reply was interrupted, not completed. Some platforms handle this; many don't.
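The three steps can be sketched as one handler. This is a minimal asyncio sketch under stated assumptions: `FakeTelephony` and its `clear_buffer()` method are hypothetical stand-ins for whatever "drop buffered audio" command your telephony provider offers, and `slow_generation()` stands in for a streaming LLM call.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


class FakeTelephony:
    """Hypothetical telephony client; clear_buffer() is an assumed method."""

    def __init__(self) -> None:
        self.cleared = False

    async def clear_buffer(self) -> None:
        self.cleared = True


@dataclass
class AgentTurn:
    planned_text: str          # what the agent intended to say
    spoken_text: str = ""      # what the caller actually heard
    interrupted: bool = False


@dataclass
class CallSession:
    telephony: FakeTelephony
    transcript: List[AgentTurn] = field(default_factory=list)
    generation: Optional[asyncio.Task] = None

    async def on_barge_in(self, turn: AgentTurn, spoken_so_far: str) -> None:
        # 1. The audio buffer: drop audio already queued downstream.
        await self.telephony.clear_buffer()
        # 2. The cancellation: stop the in-flight generation.
        if self.generation is not None and not self.generation.done():
            self.generation.cancel()
            try:
                await self.generation
            except asyncio.CancelledError:
                pass  # expected: we cancelled it ourselves
        # 3. The resumption: record only what was heard, marked as interrupted.
        turn.spoken_text = spoken_so_far
        turn.interrupted = True
        self.transcript.append(turn)


async def slow_generation() -> None:
    await asyncio.sleep(10)  # pretend the LLM is still streaming tokens


async def demo() -> CallSession:
    session = CallSession(telephony=FakeTelephony())
    session.generation = asyncio.create_task(slow_generation())
    await asyncio.sleep(0)  # let the generation task start
    turn = AgentTurn("Your balance is 40 dollars and your next bill is due Friday.")
    await session.on_barge_in(turn, spoken_so_far="Your balance is 40 dollars")
    return session


session = asyncio.run(demo())
print(session.transcript[0])
```

Knowing `spoken_so_far` is the hard part in practice. It usually means mapping the playback position back to text, which depends on what timing information your TTS provider exposes.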

A well-tuned barge-in handler is one of the highest-leverage UX wins in voice AI. We have more in how voice agents handle interruptions gracefully.

The "let me check on that" pattern

A subtle but important turn-taking move: when the agent is about to take more than ~1.5 seconds to do something (a slow database lookup, a complex retrieval), it should say something first.

"Let me check on that for you." "One moment — pulling up your account." "Let me look — that's an unusual case."

These bridges keep the conversation alive while the agent works. Without them, the caller hears silence and assumes the line dropped.

The implementation is in the orchestration layer: when the LLM emits a function call that's likely slow, your code sends a quick "let me check" line to TTS first, then waits for the function result, then continues.
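A minimal sketch of that orchestration step, assuming you keep a table of expected tool latencies. The tool names, the latency numbers, and the `speak` callback are all hypothetical; `speak` stands in for whatever sends text to your TTS.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

BRIDGE_THRESHOLD_S = 1.5

# Hypothetical expected latencies per tool, e.g. measured from past calls.
EXPECTED_LATENCY_S: Dict[str, float] = {
    "lookup_account": 3.0,
    "get_store_hours": 0.2,
}


async def run_tool_with_bridge(
    name: str,
    tool: Callable[[], Awaitable[str]],
    speak: Callable[[str], Awaitable[None]],
    bridge_line: str = "Let me check on that for you.",
) -> str:
    """Run a tool call, speaking a bridge line first if it is likely slow."""
    if EXPECTED_LATENCY_S.get(name, 0.0) > BRIDGE_THRESHOLD_S:
        task = asyncio.create_task(tool())  # start the work right away
        await speak(bridge_line)            # talk while the tool runs
        return await task
    return await tool()


async def demo() -> List[str]:
    spoken: List[str] = []

    async def speak(line: str) -> None:
        spoken.append(line)

    async def lookup_account() -> str:
        await asyncio.sleep(0.05)  # shortened stand-in for a slow lookup
        return "account found"

    async def get_store_hours() -> str:
        return "9 to 5"

    await run_tool_with_bridge("lookup_account", lookup_account, speak)
    await run_tool_with_bridge("get_store_hours", get_store_hours, speak)
    return spoken


spoken = asyncio.run(demo())
print(spoken)  # ['Let me check on that for you.']
```

Starting the tool before speaking means the bridge line costs no extra time: the lookup runs while the caller hears the sentence. Fast tools skip the bridge so the agent doesn't narrate every trivial step.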

Common turn-taking failures

After watching many voice agent deployments, the failures cluster:

The agent doesn't wait for the user to finish. Endpointer is too aggressive. Caller has to repeat themselves.

The agent waits forever. Endpointer is too conservative. Every turn ends with awkward silence.

The agent talks over the user. No barge-in handling. User starts to interrupt, agent keeps droning on for 2 more seconds.

The agent cuts itself off mid-word. Barge-in fires, but playback is chopped abruptly instead of stopping at a natural boundary.

The agent doesn't bridge slow operations. Long silences during function calls. Caller assumes the line dropped.

Why this matters more than the model

Most teams obsess over which LLM to use and ignore turn-taking. This is backwards. A mediocre LLM with great turn-taking feels miles better than a great LLM with bad turn-taking. The conversational rhythm is more visceral than the word choices.

If your voice agent feels off and you can't pin down why, start by recording five real calls and listening to the gaps between turns. That's where most "feels off" lives.
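If your recordings come with speaker-labeled timestamps, the gaps can also be measured rather than only heard. The sketch below assumes a diarized segment list with start and end times in milliseconds; the `Segment` shape is an assumption, not a format any specific platform exports.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Segment:
    speaker: str   # "user" or "agent"
    start_ms: int
    end_ms: int


def response_gaps_ms(segments: List[Segment]) -> List[int]:
    """Gap between each user segment ending and the agent reply starting.

    A negative gap means the agent started before the user had finished.
    """
    ordered = sorted(segments, key=lambda s: s.start_ms)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gaps.append(cur.start_ms - prev.end_ms)
    return gaps


call = [
    Segment("user", 0, 1800),
    Segment("agent", 2100, 5000),
    Segment("user", 5400, 7000),
    Segment("agent", 8300, 9000),
    Segment("user", 9500, 11000),
    Segment("agent", 10800, 12000),
]

gaps = response_gaps_ms(call)
print(gaps)          # [300, 1300, -200]
print(median(gaps))  # 300
```

In this made-up call the median looks healthy, but the individual gaps tell the real story: one turn with 1.3 seconds of dead air and one where the agent talked over the user. Look at the spread, not just the middle.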

FAQ

What's a typical endpointer delay? The leaders are at 250–350ms median. Anything over 600ms feels slow. Anything under 200ms is too aggressive and starts cutting off users.

Can I tune the endpointer myself? Some platforms let you set a silence threshold or a "patience" parameter. The full learned endpointers are usually black boxes. If your platform exposes a knob, the right move is to A/B test on real call data.

Why is barge-in so hard? Because the audio is already in flight by the time you decide to cancel. The fix requires cooperation between the TTS provider and the telephony provider, plus state-management gymnastics in the orchestration layer.

Should the agent ever interrupt the user? Almost never. The one exception: if the user is rambling for 30+ seconds and the agent has detected a clear intent, a polite "got it — let me look that up" interruption can be appropriate. Otherwise, listen.

How do I test turn-taking? Record real calls and listen. Score on three dimensions: Did the agent wait for the user to finish? Did the agent stop when interrupted? Did the agent bridge slow operations? Run this every week.
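One way to keep that weekly review honest is to log a yes/no per dimension for each call and track the pass rates over time. This is a minimal sketch; the field names are made up to mirror the three questions above.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class CallScore:
    waited_for_user: bool
    stopped_when_interrupted: bool
    bridged_slow_operations: bool


def pass_rates(scores: List[CallScore]) -> Dict[str, float]:
    """Fraction of reviewed calls that passed each dimension."""
    if not scores:
        raise ValueError("need at least one scored call")
    return {
        f.name: sum(getattr(s, f.name) for s in scores) / len(scores)
        for f in fields(CallScore)
    }


this_week = [
    CallScore(True, True, True),
    CallScore(True, False, True),
    CallScore(False, True, True),
    CallScore(True, False, True),
]

print(pass_rates(this_week))
# {'waited_for_user': 0.75, 'stopped_when_interrupted': 0.5, 'bridged_slow_operations': 1.0}
```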

Turn-Taking and Barge-In: The Mechanics of Natural Conversation

More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support

Related reading

How Voice Agents Handle Interruptions Gracefully

How Voice Agents Decide When to Stop Talking

Is AI Too Slow for Real Phone Calls? Latency Engineering for Voice Agents
