🔊 Speech Technology

Echo Cancellation in Real-Time Voice AI

Tyler Weitzman
March 17, 2026 · 5 min read
Speechify

Echo in a voice agent call unfolds like this: the agent starts speaking, the caller's speaker plays the agent's voice, the caller's microphone picks it up, the audio flows back to the agent, the agent's STT transcribes its own speech, the agent gets confused, and the conversation breaks down. Acoustic echo cancellation (AEC) stops this cascade. Every production voice agent deployment depends on AEC working correctly, and when it doesn't, the symptoms are confusing and urgent.

TL;DR

  • Echo happens when agent speech plays through caller speakers and re-enters caller mic.
  • AEC subtracts expected playback from microphone input.
  • WebRTC has built-in AEC; on SIP/PSTN, echo is usually handled by the carrier or the endpoint.
  • When AEC fails: agent "hears itself," transcribes its own words, breaks flow.
  • Common on speakerphone, Bluetooth, and some mobile setups.

The problem

Voice call:

  1. Agent speaks: "Your appointment is Thursday."
  2. Audio plays through caller's speaker.
  3. Caller's microphone picks up the playback.
  4. Mic audio sent back to agent.
  5. Agent's STT transcribes: "Your appointment is Thursday."
  6. Agent's LLM now sees its own words as caller input.
  7. Agent responds based on echo → confusion spiral.

How AEC works

AEC is a digital signal processing (DSP) technique:

  • Know what is being played to the caller (the reference signal).
  • Predict the echo that playback produces and subtract it from the incoming mic audio.
  • What remains is the caller's actual speech.

Effective AEC attenuates echo by 20-40 dB, making it negligible.
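
The subtract-the-predicted-echo idea can be sketched with a normalized least-mean-squares (NLMS) adaptive filter, one common textbook approach and not necessarily what any particular device or browser ships. The function name, tap count, step size, and the synthetic "room" (a 3-sample delay at half amplitude plus faint mic noise) are all assumptions made for illustration.

```python
import math
import random


def nlms_echo_cancel(reference, mic, taps=16, mu=0.5, eps=1e-6):
    """Return mic audio with the predicted echo of `reference` subtracted (NLMS)."""
    weights = [0.0] * taps   # adaptive estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        predicted_echo = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - predicted_echo   # caller speech plus leftover echo
        power = sum(x * x for x in history) + eps
        step = mu * error / power             # step size normalized by input power
        weights = [w + step * x for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Synthetic demo (assumed echo path): agent audio returns 3 samples late at half level.
random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo = [0.0, 0.0, 0.0] + [0.5 * s for s in reference[:-3]]
mic = [e + random.gauss(0.0, 0.001) for e in echo]   # echo plus faint mic noise

cleaned = nlms_echo_cancel(reference, mic)

# Compare levels after the filter has had time to converge.
before = rms(mic[3000:])
after = rms(cleaned[3000:])
reduction_db = 20 * math.log10(before / after)
print(f"echo reduced by {reduction_db:.1f} dB")
```

On this idealized signal the reduction comfortably exceeds 20 dB. Real rooms have longer, time-varying echo paths and caller speech mixed in, so production cancellers add delay estimation, double-talk handling, and residual echo suppression on top of this core loop.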

Where AEC runs

Caller's device:

  • Smartphones have built-in AEC.
  • Landline handsets produce little acoustic echo; line (hybrid) echo is usually cancelled in the network.
  • Some cheap speakerphones don't.

Browser (WebRTC):

  • Built-in AEC in browsers.
  • Effective for most use cases.

Carrier / network:

  • Some AEC at carrier level.
  • Varies by provider.

Voice AI platform:

  • May apply additional AEC.
  • Server-side processing.

These layers stack; a single call often passes through several of them.

When AEC fails

Common scenarios:

Speakerphone with poor AEC. Budget speakerphones or old devices. Echo leaks.

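Bluetooth with latency issues. Extra and variable delay between playback and mic pickup can push the echo outside the window the canceller models.

Simultaneous talk. AEC struggles during double-talk.

Multiple audio devices. A laptop mic paired with an external speaker may leave the canceller without an accurate reference for what is actually played.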

Symptoms

  • Agent "hears itself": transcribes its own words as caller input.
  • Agent responds to its own statements ("You said Thursday... which is right?").
  • Conversation loops.
  • Delays and confusion.

Often caught by listening to calls; sometimes noticed only through strange transcripts.

Double-talk handling

When both parties speak simultaneously:

  • AEC can't cleanly separate echo from caller speech.
  • Mic picks up both.
  • STT hears both.

Mitigation:

  • Half-duplex mode (one speaker at a time), at the cost of natural feel.
  • Smart VAD that detects caller-talk-over-agent and prioritizes the caller.
  • Specialized AEC algorithms that handle double-talk.
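
As one illustration of double-talk detection, here is a minimal sketch of the classic Geigel detector: it flags double-talk whenever the mic sample is louder than the recent reference signal could plausibly explain as echo. The window length and the 0.5 threshold are assumptions that would need tuning per environment, and the signals are synthetic.

```python
from collections import deque


def geigel_double_talk(reference, mic, window=64, threshold=0.5):
    """Flag samples where the mic exceeds threshold * recent peak of the reference."""
    recent = deque([0.0] * window, maxlen=window)   # recent reference magnitudes
    flags = []
    for ref_sample, mic_sample in zip(reference, mic):
        recent.append(abs(ref_sample))
        flags.append(abs(mic_sample) > threshold * max(recent))
    return flags


# Synthetic demo: the agent plays a full-scale signal, the echo returns at 0.3x,
# and the caller talks over the agent during samples 100-149.
reference = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]
caller = [0.9 if 100 <= i < 150 else 0.0 for i in range(200)]
mic = [0.3 * r + c for r, c in zip(reference, caller)]

flags = geigel_double_talk(reference, mic)
print(sum(flags), "samples flagged as double-talk")
```

A canceller typically uses such a flag to freeze filter adaptation while both parties are talking, so caller speech does not corrupt its echo-path estimate.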

Common issue in voice AI; imperfectly solved.

The barge-in interaction

When the caller barges in:

  • VAD detects caller speech.
  • AEC is active, but the agent is still speaking, so the mic carries both echo and caller speech.
  • The agent's TTS needs to stop quickly.
  • Once playback stops, there is no new echo to cancel.
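
The barge-in sequence above can be sketched as a small controller that watches per-frame VAD decisions on the echo-cancelled audio and asks for TTS to stop after a few consecutive voiced frames. The class name, method names, and the three-frame trigger are hypothetical; requiring several frames is an assumed guard against residual echo briefly tripping the VAD.

```python
class BargeInController:
    """Decide when caller speech should interrupt agent playback."""

    def __init__(self, frames_to_trigger=3):
        self.frames_to_trigger = frames_to_trigger
        self.voiced_run = 0          # consecutive voiced frames seen so far
        self.agent_speaking = False

    def on_agent_speech(self, speaking):
        """Call when TTS playback starts (True) or ends (False)."""
        self.agent_speaking = speaking
        self.voiced_run = 0

    def on_frame(self, caller_voiced):
        """Feed one VAD decision; return True when TTS should be stopped."""
        if not self.agent_speaking:
            self.voiced_run = 0
            return False
        self.voiced_run = self.voiced_run + 1 if caller_voiced else 0
        if self.voiced_run >= self.frames_to_trigger:
            # Assumes the caller of this method stops playback right away.
            self.agent_speaking = False
            self.voiced_run = 0
            return True
        return False


controller = BargeInController(frames_to_trigger=3)
controller.on_agent_speech(True)
vad_frames = [True, False, True, True, True, True]
decisions = [controller.on_frame(voiced) for voiced in vad_frames]
print(decisions)
```

A lower trigger count stops the agent faster but is easier to fool with leftover echo; a higher one is more robust but makes interruptions feel sluggish.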

See turn-taking and barge-in: the mechanics of natural conversation.

WebRTC AEC

WebRTC includes reasonable AEC:

  • Chrome, Firefox, and Safari all implement it.
  • Can be toggled through media constraints; finer tuning depends on the implementation.
  • Usually good enough.

For voice agents delivered over WebRTC, AEC mostly "just works."

SIP/PSTN AEC

  • PSTN carriers apply AEC in the network.
  • Traditional telephony assumes decent AEC.
  • Modern SIP usually relies on endpoint AEC.

Mostly handled; occasional gaps.

Testing

Specific tests:

  • Speakerphone call. Known echo challenge.
  • Simultaneous speech. Double-talk.
  • Short turns. Rapid back-and-forth.
  • Long agent sentences. Extended playback.

Monitor for echo artifacts in transcripts.

Detecting echo in transcripts

Signs:

  • Agent's recent utterances appearing in "caller" transcript.
  • Unusual transcription patterns after agent speech.
  • Confusion in LLM responses ("why am I answering my own question?").

Flag for review.
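
A minimal sketch of the first sign, the agent's recent utterances showing up in the "caller" transcript, using word-level similarity from Python's standard `difflib`. The function names and the 0.8 threshold are assumptions; a caller who deliberately repeats the agent's sentence would also match, which is why the result is a flag for review, not an automatic discard.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()


def echo_score(caller_text, recent_agent_utterances):
    """Highest word-level similarity (0..1) between caller text and recent agent speech."""
    caller_words = normalize(caller_text)
    if not caller_words:
        return 0.0
    best = 0.0
    for utterance in recent_agent_utterances:
        matcher = SequenceMatcher(None, caller_words, normalize(utterance))
        best = max(best, matcher.ratio())
    return best


def looks_like_echo(caller_text, recent_agent_utterances, threshold=0.8):
    return echo_score(caller_text, recent_agent_utterances) >= threshold


agent_recent = ["Your appointment is Thursday."]
echoed = looks_like_echo("your appointment is thursday", agent_recent)
genuine = looks_like_echo("Yes, Thursday works", agent_recent)
print(echoed, genuine)
```

Comparing only against utterances from the last few seconds keeps false positives down, since echo arrives shortly after the agent speaks.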

AEC vs noise suppression

Different things:

  • AEC: removes playback-induced echo.
  • Noise suppression: removes background noise.

Both matter; both typically applied.

Audio path considerations

Echo is more likely:

  • Speakerphone (loudspeaker + open mic).
  • Conference room speakerphones.
  • Cheap Bluetooth headsets.
  • Laptop speakers with built-in mic.

Less likely:

  • Headsets (isolated).
  • Handset (held to ear).
  • Quality Bluetooth.

Voice AI platform responsibilities

Good platforms:

  • Apply AEC at ingress.
  • Detect echo patterns.
  • Log for debugging.
  • Tune for common environments.

Bad platforms:

  • Assume client AEC sufficient.
  • No monitoring.
  • Surprised by echo in production.

Common pitfalls

Assuming "it's fine." Echo issues stay silent until a caller complains.

No detection. Agent processes echo as caller input; weird responses.

No double-talk handling. Interruption = both voices, both transcribed.

Latency affecting AEC. Bluetooth latency confuses AEC.

No testing over speakerphone. Works with headsets; fails in car.

Real-time monitoring

  • Echo metric per call.
  • Alert on echo detection.
  • Drill into problematic calls.

Most platforms don't surface this; custom instrumentation needed.
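
One possible shape for that custom instrumentation is a per-call counter of caller turns flagged as echo, with a simple alert rule. The class, field names, and thresholds below are hypothetical and would be tuned against real call data.

```python
from dataclasses import dataclass


@dataclass
class CallEchoStats:
    """Per-call echo metric: share of caller turns flagged as echo."""

    call_id: str
    caller_turns: int = 0
    echo_turns: int = 0

    def record_turn(self, flagged_as_echo):
        self.caller_turns += 1
        if flagged_as_echo:
            self.echo_turns += 1

    @property
    def echo_rate(self):
        return self.echo_turns / self.caller_turns if self.caller_turns else 0.0

    def should_alert(self, min_turns=3, max_rate=0.25):
        # Require a few turns first so one odd transcript does not page anyone.
        return self.caller_turns >= min_turns and self.echo_rate > max_rate


stats = CallEchoStats(call_id="call-123")   # hypothetical call identifier
for flagged in [False, True, True, False]:
    stats.record_turn(flagged)
print(stats.echo_rate, stats.should_alert())
```

Emitting `echo_rate` as a per-call metric gives dashboards something to sort by when drilling into problematic calls.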

Mitigations

When echo detected:

  • Reduce agent volume (if controllable).
  • Ask caller: "Can you move the phone closer to your ear?"
  • Suggest headphones.
  • Fallback path for persistent echo.

The multi-device scenario

Caller has:

  • Laptop open with browser voice widget.
  • Also phone with SMS notifications.

If the agent triggers an SMS to the phone while the voice call is active, the notification sound can reach the laptop mic as audio the canceller has no reference for.

Edge case but real.

FAQ

Does AEC add latency? Minimal; AEC is real-time.

What if caller is on speakerphone in a car? Moderate echo challenge + road noise. Double trouble. Suggest alternative.

Can we disable AEC? Generally no reason to. Leave it on.

Why does echo still happen with modern devices? Edge cases: old gear, Bluetooth issues, double-talk.

How do we test AEC effectiveness? Play a tone through the agent's audio path, record what returns on the mic path, and measure the echo return level.
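
The echo return measurement in the last answer comes down to comparing RMS levels in decibels. A minimal sketch, assuming the played and returned audio are already captured as lists of samples; the synthetic tone and the 1% return level are illustrative only.

```python
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def echo_return_loss_db(played, returned):
    """How many dB quieter the returned echo is than the played signal."""
    returned_level = rms(returned)
    if returned_level == 0.0:
        return math.inf   # nothing came back at all
    return 20 * math.log10(rms(played) / returned_level)


def sine(freq_hz, seconds, sample_rate=8000, amplitude=1.0):
    count = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(count)]


# Synthetic stand-ins: a 1 kHz test tone, and the same tone coming back at 1% amplitude.
tone = sine(1000, 0.5)
returned = sine(1000, 0.5, amplitude=0.01)
loss_db = echo_return_loss_db(tone, returned)
print(f"echo return loss: {loss_db:.1f} dB")
```

A 1% return corresponds to 40 dB, the upper end of the 20-40 dB range quoted earlier. Running the same measurement with AEC on and off shows how much of the loss the canceller itself contributes.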

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.