Echo Cancellation in Real-Time Voice AI
Echo in a voice agent call unfolds like this: the agent starts speaking, the caller's speaker plays the agent's voice, the caller's microphone picks it up, the audio flows back to the agent, the agent's STT transcribes its own speech, the agent gets confused, and the conversation breaks down. Acoustic echo cancellation (AEC) stops this cascade. Every production voice agent deployment depends on AEC working correctly, and when it doesn't, the symptoms are confusing and urgent.
TL;DR
- Echo happens when agent speech plays through caller speakers and re-enters caller mic.
- AEC subtracts expected playback from microphone input.
- WebRTC has built-in AEC; SIP/PSTN usually handled by carrier.
- When AEC fails: agent "hears itself," transcribes its own words, breaks flow.
- Common on speakerphone, Bluetooth, and some mobile setups.
The problem
Voice call:
- Agent speaks: "Your appointment is Thursday."
- Audio plays through caller's speaker.
- Caller's microphone picks up the playback.
- Mic audio sent back to agent.
- Agent's STT transcribes: "Your appointment is Thursday."
- Agent's LLM now sees its own words as caller input.
- Agent responds based on echo: confusion spiral.
How AEC works
AEC is a DSP technique:
- Know what's being played to caller (reference signal).
- Subtract (predicted) echo from incoming mic audio.
- Remaining signal is caller's actual speech.
Effective AEC attenuates echo by roughly 20-40 dB, making it negligible.
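The three steps above can be sketched with a normalized least-mean-squares (NLMS) adaptive filter, one common textbook way to predict echo from the reference signal. This is a toy illustration on synthetic noise, not the algorithm any particular device or platform uses; the function name, tap count, and step size are assumptions chosen for the example.

```python
import math
import random

def nlms_echo_cancel(reference, mic, taps=32, mu=0.5, eps=1e-6):
    """Subtract the predicted echo from the mic signal, sample by sample."""
    weights = [0.0] * taps   # adaptive estimate of the speaker-to-mic echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        predicted_echo = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - predicted_echo   # what remains after subtraction
        cleaned.append(error)
        energy = sum(x * x for x in history) + eps
        step = mu * error / energy            # normalized update step
        weights = [w + step * x for w, x in zip(weights, history)]
    return cleaned

def power(samples):
    return sum(s * s for s in samples) / len(samples)

# Synthetic setup (assumption): the mic hears a delayed, quieter copy of the
# reference plus a little unrelated near-end noise.
random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(4000)]
delay, gain = 5, 0.6
echo = [0.0] * delay + [gain * s for s in reference[:-delay]]
mic = [e + 0.01 * random.uniform(-1.0, 1.0) for e in echo]

cleaned = nlms_echo_cancel(reference, mic)
tail = slice(2000, None)   # measure after the filter has had time to adapt
reduction_db = 10 * math.log10(power(mic[tail]) / power(cleaned[tail]))
print(f"echo reduced by about {reduction_db:.1f} dB on this synthetic signal")
```

Real speech, room reverberation, and nonlinear speakers make the problem much harder than this example suggests, which is why production AEC adds further stages on top of the adaptive filter.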
Where AEC runs
Caller's device:
- Smartphones have built-in AEC.
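- Landline handsets rarely need AEC because the earpiece sits against the ear; line echo is typically handled by echo cancellers in the network.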
- Some cheap speakerphones don't.
Browser (WebRTC):
- Built-in AEC in browsers.
- Effective for most use cases.
Carrier / network:
- Some AEC at carrier level.
- Varies by provider.
Voice AI platform:
- May apply additional AEC.
- Server-side processing.
Multiple layers; often all applied.
When AEC fails
Common scenarios:
Speakerphone with poor AEC. Budget speakerphones or old devices. Echo leaks.
Bluetooth with latency issues. Audio delay confuses AEC.
Simultaneous talk. AEC struggles during double-talk.
Multiple audio devices. Laptop mic + external speaker mismatched.
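One way to see why audio delay confuses AEC: an adaptive echo filter only models a limited window of past reference samples, so echo that arrives later than that window cannot be predicted. The sketch below estimates the delay by brute-force cross-correlation and checks it against a filter length. The signal is synthetic, and `estimate_delay`, the 120-sample delay, and the 64-tap filter are illustrative assumptions.

```python
import random

def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) at which mic best correlates with reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Compare reference[i] with mic[i + lag] across the whole signal.
        score = sum(r * m for r, m in zip(reference, mic[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 120   # assumption: extra buffering somewhere in the audio path
mic = [0.0] * true_delay + [0.5 * s for s in reference[:-true_delay]]

lag = estimate_delay(reference, mic, max_lag=300)
filter_taps = 64   # assumption: length of the AEC's adaptive filter
echo_within_filter = lag < filter_taps
print(f"estimated delay: {lag} samples, within filter window: {echo_within_filter}")
```

When the estimated delay exceeds the filter window, a common remedy is to delay the reference by the estimated amount before feeding it to the canceller.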
Symptoms
- Agent "hears itself": transcribes its own words as caller input.
- Agent responds to its own statements ("You said Thursday... which is right?").
- Conversation loops.
- Delays and confusion.
Often caught by listening to calls; sometimes only noticed by strange transcripts.
Double-talk handling
When both parties speak simultaneously:
- AEC can't cleanly separate.
- Mic picks up both.
- STT hears both.
Mitigation:
- Half-duplex mode (one speaker at a time), which loses natural feel.
- Smart VAD that detects caller-talk-over-agent and prioritizes.
- Specialized AEC algorithms that handle double-talk.
Common issue in voice AI; imperfectly solved.
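As one illustration of double-talk detection, the classic Geigel detector flags a sample when the mic is louder than echo alone could plausibly explain. The sketch below is a simplified version: real detectors add a hold-over period and tune the threshold per environment, and the window size, threshold, and test signal here are assumptions.

```python
def geigel_double_talk(reference, mic, window=64, threshold=0.5, floor=1e-4):
    """Flag samples where mic level exceeds what echo of the reference explains."""
    flags = []
    for n, mic_sample in enumerate(mic):
        recent = reference[max(0, n - window + 1): n + 1]
        recent_peak = max(abs(x) for x in recent)
        agent_active = recent_peak > floor
        # Double-talk: agent audio is playing AND the mic is louder than the
        # assumed worst-case echo (threshold * recent reference peak).
        flags.append(agent_active and abs(mic_sample) > threshold * recent_peak)
    return flags

# Synthetic example: agent audio at full scale, echo at 20% of it,
# caller speaks over the agent during samples 100-149.
reference = [1.0 if n % 2 == 0 else -1.0 for n in range(300)]
echo = [0.2 * x for x in reference]
caller = [0.8 if 100 <= n < 150 else 0.0 for n in range(300)]
mic = [e + c for e, c in zip(echo, caller)]

flags = geigel_double_talk(reference, mic)
print(f"double-talk flagged on {sum(flags)} of {len(flags)} samples")
```

A detector like this is typically used to freeze adaptive filter updates during double-talk, so the caller's speech does not corrupt the echo estimate.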
The barge-in interaction
When caller barges in:
- VAD detects caller speech.
- AEC is active, but the agent is also speaking, so this is a double-talk moment.
- Need to quickly stop agent TTS.
- Once playback stops, there is no new echo to cancel, so the AEC's job gets easier.
See turn-taking and barge-in: the mechanics of natural conversation.
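A minimal sketch of the barge-in sequence, assuming a per-frame VAD decision and a hypothetical `stop_tts` callback supplied by the playback layer. Requiring a few consecutive speech frames is one way to avoid firing on a brief burst of residual echo; the class name and the threshold of three frames are illustrative choices.

```python
class BargeInController:
    """Stops agent playback once caller speech persists for a few frames."""

    def __init__(self, stop_tts, frames_required=3):
        self.stop_tts = stop_tts              # hypothetical callback that halts TTS
        self.frames_required = frames_required
        self.consecutive = 0

    def on_frame(self, caller_speaking, agent_speaking):
        """Feed one VAD decision per audio frame; returns True when barge-in fires."""
        if not agent_speaking or not caller_speaking:
            self.consecutive = 0
            return False
        self.consecutive += 1
        if self.consecutive >= self.frames_required:
            self.consecutive = 0
            self.stop_tts()
            return True
        return False


stopped = []
controller = BargeInController(stop_tts=lambda: stopped.append(True))
vad_frames = [False, True, False, True, True, True]   # one blip, then sustained speech
fired = [controller.on_frame(v, agent_speaking=True) for v in vad_frames]
print(fired)
```

The trade-off is responsiveness: each required frame adds one frame of delay before the agent stops talking.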
WebRTC AEC
WebRTC includes reasonable AEC:
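- Chrome, Firefox, and Safari all implement it.
- Toggled per audio track with the `echoCancellation` constraint; finer tuning of aggressiveness is typically available only in native WebRTC builds.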
- Usually good enough.
For voice agents delivered over WebRTC, AEC mostly "just works."
SIP/PSTN AEC
- PSTN carriers apply AEC in the network.
- Traditional telephony assumes decent AEC.
- Modern SIP usually relies on endpoint AEC.
Mostly handled; occasional gaps.
Testing
Specific tests:
- Speakerphone call. Known echo challenge.
- Simultaneous speech. Double-talk.
- Short turns. Rapid back-and-forth.
- Long agent sentences. Extended playback.
Monitor for echo artifacts in transcripts.
Detecting echo in transcripts
Signs:
- Agent's recent utterances appearing in "caller" transcript.
- Unusual transcription patterns after agent speech.
- Confusion in LLM responses ("why am I answering my own question?").
Flag for review.
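The first sign above can be checked automatically by comparing each "caller" transcript against the agent's recent utterances. The sketch below uses word-level matching from Python's standard `difflib`. The function names, the 0.8 threshold, and the three-word minimum are assumptions, and a match is only a hint, since callers sometimes legitimately repeat what the agent said.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()

def looks_like_echo(caller_text, recent_agent_utterances, threshold=0.8):
    """True if most of the caller transcript is a contiguous run of agent words."""
    caller_words = normalize(caller_text)
    if len(caller_words) < 3:   # too short to judge, e.g. "yes" or "Thursday"
        return False
    for agent_text in recent_agent_utterances:
        agent_words = normalize(agent_text)
        matcher = SequenceMatcher(None, caller_words, agent_words)
        match = matcher.find_longest_match(0, len(caller_words), 0, len(agent_words))
        if match.size / len(caller_words) >= threshold:
            return True
    return False

agent_history = ["Your appointment is Thursday."]
echo_hit = looks_like_echo("your appointment is thursday", agent_history)
real_reply = looks_like_echo("Can we move it to Friday instead?", agent_history)
print(echo_hit, real_reply)
```

Flagged turns are better routed to review or down-weighted than silently dropped, because STT errors and genuine repetition both produce near-matches.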
AEC vs noise suppression
Different things:
- AEC: removes playback-induced echo.
- Noise suppression: removes background noise.
Both matter; both typically applied.
Audio path considerations
Echo is more likely:
- Speakerphone (loudspeaker + open mic).
- Conference room speakerphones.
- Cheap Bluetooth headsets.
- Laptop speakers with built-in mic.
Less likely:
- Headsets (isolated).
- Handset (held to ear).
- Quality Bluetooth.
Voice AI platform responsibilities
Good platforms:
- Apply AEC at ingress.
- Detect echo patterns.
- Log for debugging.
- Tune for common environments.
Bad platforms:
- Assume client AEC sufficient.
- No monitoring.
- Surprised by echo in production.
Common pitfalls
Assuming "it's fine." Echo issues silent until caller complains.
No detection. Agent processes echo as caller input; weird responses.
No double-talk handling. Interruption = both voices, both transcribed.
Latency affecting AEC. Bluetooth latency confuses AEC.
No testing over speakerphone. Works with headsets; fails in car.
Real-time monitoring
- Echo metric per call.
- Alert on echo detection.
- Drill into problematic calls.
Most platforms don't surface this; custom instrumentation needed.
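A sketch of what such custom instrumentation might look like: count how many caller turns per call were flagged as likely echo, and surface calls above a threshold. The class, field names, and thresholds are hypothetical, and the per-turn flag is assumed to come from whatever echo detector is in use.

```python
from dataclasses import dataclass

@dataclass
class CallEchoStats:
    call_id: str
    caller_turns: int = 0
    echo_turns: int = 0

    def record_turn(self, flagged_as_echo):
        """Record one caller turn and whether it was flagged as likely echo."""
        self.caller_turns += 1
        if flagged_as_echo:
            self.echo_turns += 1

    @property
    def echo_rate(self):
        return self.echo_turns / self.caller_turns if self.caller_turns else 0.0

def calls_to_review(all_stats, min_turns=4, max_echo_rate=0.25):
    """Return call IDs with enough turns and an echo rate above the threshold."""
    return [s.call_id for s in all_stats
            if s.caller_turns >= min_turns and s.echo_rate > max_echo_rate]

clean = CallEchoStats("call-001")
for flagged in [False, False, False, False, False]:
    clean.record_turn(flagged)

noisy = CallEchoStats("call-002")
for flagged in [True, False, True, True, False]:
    noisy.record_turn(flagged)

flagged_calls = calls_to_review([clean, noisy])
print(flagged_calls)
```

The minimum-turn guard keeps very short calls from triggering alerts on a single flagged turn.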
Mitigations
When echo detected:
- Reduce agent volume (if controllable).
- Ask caller: "Can you move the phone closer to your ear?"
- Suggest headphones.
- Fallback path for persistent echo.
The multi-device scenario
Caller has:
- Laptop open with browser voice widget.
- Also phone with SMS notifications.
If the agent triggers an SMS to the phone while the voice call is active, the phone's notification sound can reach the laptop mic as unexpected extra audio.
Edge case but real.
Related reading
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
- The Engineering Behind Sub-Second Voice Agents
- Text-to-Speech in 2026: The State of the Art
- How to Benchmark a Voice Agent's End-to-End Latency
FAQ
Does AEC add latency? Minimal; AEC is real-time.
What if caller is on speakerphone in a car? Moderate echo challenge + road noise. Double trouble. Suggest alternative.
Can we disable AEC? Generally no reason to. Leave it on.
Why does echo still happen with modern devices? Edge cases: old gear, Bluetooth issues, double-talk.
How do we test AEC effectiveness? Play tone into agent's audio; listen to mic input; measure echo return level.
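The tone test in the last answer comes down to comparing two levels in decibels. A minimal sketch, with a synthetic 1 kHz tone standing in for the played audio and a scaled copy standing in for the recorded mic return (both assumptions; a real test would use captured audio):

```python
import math

def rms_db(samples, floor=1e-12):
    """Root-mean-square level in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, floor))   # floor avoids log of zero on silence

def echo_return_loss_db(played, returned):
    """How many dB quieter the returned echo is than the played signal."""
    return rms_db(played) - rms_db(returned)

sample_rate = 8000
played = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(sample_rate)]
returned = [0.1 * s for s in played]   # stand-in for the echo captured at the mic

loss = echo_return_loss_db(played, returned)
print(f"echo return loss: {loss:.1f} dB")
```

Running the same measurement with AEC off and then on gives the improvement attributable to the canceller, which can be compared against the 20-40 dB range mentioned earlier.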

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
