How Background Noise Affects Voice Agent Accuracy
Production voice agents live in noisy environments. People call from cars, offices, restaurants, kitchens with running faucets, grocery stores with loud music, and outdoor job sites. Real audio has sirens, barking dogs, other conversations, and TV in the background. How your voice agent handles noise determines whether it works for 90% of callers or 30%. The engineering isn't exotic (noise-robust STT, good VAD tuning, and graceful degradation), but it's often overlooked.
TL;DR
- Noise degrades STT (higher WER) and VAD (false endpointing).
- Lab audio is clean; real audio is noisy. Test with real.
- Modern STT is robust to moderate noise but poor in severe noise.
- Noise suppression at source or transit helps.
- Design for graceful degradation when noise dominates.
Noise impact on STT
Clean audio WER: 4-6%. Moderate noise: 8-12%. Heavy noise: 15-30%+.
Specific noise types:
- Other voices: highly disruptive. STT may transcribe the wrong speaker.
- Music: degrades significantly.
- Mechanical noise: less disruptive.
- Echoes: very disruptive.
- Wind/outdoor: moderate.
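For readers who want to reproduce WER figures like these on their own recordings, here is a minimal sketch of the metric: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name is hypothetical, and lowercasing plus whitespace splitting is an assumed normalization; real evaluations usually also normalize punctuation and numerals.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the recognizer hallucinates words over noise.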
Noise impact on VAD
- Mechanical hum: VAD may classify it as speech and never endpoint.
- Music: VAD mistakes it for speech.
- Other voices: false positives, fragmented endpoints.
Bad VAD in noisy environments = broken call flow.
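One common guard against the "never endpoint" failure is a hard cap on utterance length alongside the usual trailing-silence rule. The sketch below assumes per-frame speech/non-speech flags from whatever VAD is in use; the function name and the default thresholds are hypothetical and would need tuning per deployment.

```python
from typing import Iterable, Optional


def find_endpoint(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    silence_ms: int = 600,
    max_utterance_ms: int = 15000,
) -> Optional[int]:
    """Return the frame index where the turn should end, or None.

    The turn ends after `silence_ms` of continuous non-speech following
    speech, or once `max_utterance_ms` has elapsed since speech began,
    whichever comes first.
    """
    started_at: Optional[int] = None  # index of the first speech frame
    silent_run = 0                    # consecutive non-speech frames
    for i, is_speech in enumerate(speech_flags):
        if started_at is None:
            if is_speech:
                started_at = i
            continue
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run * frame_ms >= silence_ms:
            return i  # normal endpoint: enough trailing silence
        if (i - started_at + 1) * frame_ms >= max_utterance_ms:
            return i  # forced endpoint: VAD is likely stuck on noise
    return None
```

With a steady hum that the VAD labels as speech on every frame, the silence rule never fires, but the cap still hands the turn back to the agent.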
Noise suppression techniques
Client-side:
- Noise cancellation in smartphones.
- Headset with noise suppression.
- WebRTC built-in suppression.
Network-side:
- SBC or media proxy can apply suppression.
- Custom DSP in pipeline.
Server-side:
- STT with noise-robust training.
- Post-STT correction.
Modern STT noise resilience
2026 STT handles moderate noise well:
- Trained on noisy audio.
- Uses contextual cues.
- Acoustic modeling robust to interference.
Moderate = coffee shop, office, moving car with windows up.
Struggles:
- Severe noise (construction site).
- Close multi-speaker (family at dinner table).
- Low-quality mics.
Detection and response
Best practice: detect noise, then adapt:
- Measure SNR (signal-to-noise ratio). Per call or per segment.
- Flag low-SNR calls.
- Agent adapts: "I'm having trouble hearing. Can you move to a quieter spot?"
Rarely implemented; high-impact where it is.
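A rough way to get a per-segment SNR number without a separate noise reference is to compare loud frames against quiet frames within the same audio. This is a crude sketch, not a calibrated measurement: it assumes samples are floats in [-1, 1], that the quietest frames are noise only, and that the loudest frames are speech plus noise. The function names, percentile choices, and the 10 dB flag threshold are all hypothetical.

```python
import math
from typing import List, Sequence


def frame_powers(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean power of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def estimate_snr_db(
    samples: Sequence[float],
    frame_len: int = 160,      # 20 ms at 8 kHz
    noise_pct: float = 0.1,    # quiet frames approximate the noise floor
    speech_pct: float = 0.9,   # loud frames approximate speech + noise
) -> float:
    powers = sorted(frame_powers(samples, frame_len))
    if not powers:
        raise ValueError("need at least one full frame of audio")
    noise = powers[int(noise_pct * (len(powers) - 1))]
    loud = powers[int(speech_pct * (len(powers) - 1))]
    eps = 1e-12  # avoids log of zero on digital silence
    signal = max(loud - noise, eps)
    return 10 * math.log10(signal / max(noise, eps))


def is_low_snr(samples: Sequence[float], threshold_db: float = 10.0) -> bool:
    """Flag a segment whose estimated SNR falls below the threshold."""
    return estimate_snr_db(samples) < threshold_db
```

Audio with no level contrast at all (constant hum, or constant speech) produces a very low estimate, which is the conservative outcome for flagging purposes.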
Acoustic echo cancellation
Echo (agent's voice coming back through caller's mic) is special:
- Normally handled by WebRTC AEC or carrier-side.
- Failures cause agent to hear and process its own speech.
See echo cancellation in real-time voice AI.
Graceful degradation
When noise is bad:
- Acknowledge: "I'm having trouble hearing you."
- Suggest: "Can you move to a quieter spot?"
- Persistent issue: "Let me have someone call you back from a better line."
Don't silently struggle.
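The acknowledge, suggest, hand-off sequence above can be sketched as a small escalation ladder. The class name is hypothetical, and the choice to reset after one clean turn is an assumption; some deployments may prefer to remember earlier trouble for the rest of the call.

```python
from dataclasses import dataclass
from typing import Optional

PROMPTS = (
    "I'm having trouble hearing you.",
    "Can you move to a quieter spot?",
    "Let me have someone call you back from a better line.",
)


@dataclass
class DegradationLadder:
    step: int = 0  # how many consecutive noisy turns we have seen

    def on_turn(self, noisy: bool) -> Optional[str]:
        """Return the prompt to speak for this turn, or None if audio is fine."""
        if not noisy:
            self.step = 0  # assumption: one clean turn resets the ladder
            return None
        prompt = PROMPTS[min(self.step, len(PROMPTS) - 1)]
        self.step += 1
        return prompt
```

The `noisy` flag could come from an SNR check, low STT confidence, or both.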
Text-based alternatives
If audio is bad:
- Text fallback: "Can you text us the details?"
- Hang up and retry: "Let me call back; the line is bad."
Multi-modal options.
Vendor comparison for noise
Testing on noisy audio:
- Deepgram Nova-3: strong noise robustness.
- Whisper: decent.
- AssemblyAI: decent.
Differences are small but consistent. Test on your noise profile.
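To test vendors on your own noise profile, one approach is to mix recorded noise into clean utterances at controlled SNR levels and compare WER across vendors at each level. A minimal sketch, assuming both signals are lists of float samples at the same sample rate; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def mix_at_snr(
    speech: Sequence[float], noise: Sequence[float], snr_db: float
) -> List[float]:
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in tiled) / len(tiled)
    if p_noise == 0:
        raise ValueError("noise clip is digital silence")
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

Running the same utterances at, say, 20, 10, and 5 dB shows where each vendor's accuracy starts to fall off. The output is not clipped, so samples may exceed [-1, 1] at low SNR and should be rescaled before encoding.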
Phone-line specifics
PSTN adds:
- Narrowband audio (3.4 kHz cutoff).
- Compression artifacts.
- Line static, including static from landline cords.
These degrade STT even before environmental noise.
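To see how phone-line compression alone affects an STT model, test audio can be passed through a simulated codec before transcription. The sketch below uses the continuous mu-law companding formula with 8-bit quantization, assuming the call path uses G.711 mu-law style encoding. It is an approximation, not a bit-exact G.711 implementation, and it does not model the 3.4 kHz band limit. Function names are hypothetical.

```python
import math
from typing import List, Sequence

MU = 255  # companding constant used by mu-law telephony


def mulaw_round_trip(x: float) -> float:
    """Compress one sample in [-1, 1] to 8 bits with mu-law, then expand it."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    code = round((compressed + 1) / 2 * 255)  # 8-bit code in 0..255
    restored = code / 255 * 2 - 1
    return math.copysign(((1 + MU) ** abs(restored) - 1) / MU, restored)


def simulate_phone_line(samples: Sequence[float]) -> List[float]:
    """Apply the mu-law round trip to every sample."""
    return [mulaw_round_trip(s) for s in samples]
```

Comparing WER on the original and round-tripped audio gives a rough sense of how much accuracy is lost to the line before any environmental noise is added.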
Cellular noise
Mobile callers:
- Wind outside.
- Cellular handoff artifacts.
- Battery-affected mic quality.
- Dropped packet re-insertion.
Often worse than landline.
Recording and analysis
For noisy deployments:
- Sample noisy calls.
- Analyze failure modes.
- Identify patterns (time of day, caller demographics, environments).
- Deploy mitigations.
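The pattern-finding step can start as a simple failure-rate breakdown. This sketch assumes each sampled call has already been labeled with a bucket (environment, hour of day, or similar) and a failed/not-failed flag; how those labels are produced is outside its scope, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by_bucket(calls: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each bucket label to the fraction of its calls that failed."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for bucket, failed in calls:
        totals[bucket] += 1
        failures[bucket] += int(failed)
    return {bucket: failures[bucket] / totals[bucket] for bucket in totals}
```

For example, `failure_rate_by_bucket([("car", True), ("car", False), ("office", False)])` gives a 50% failure rate for cars and 0% for offices, which points at where to deploy mitigations first.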
Specific environments
Offices: Moderate noise. STT handles.
Cars: Road noise moderate; wind noise harder.
Restaurants: Loud. Challenging.
Construction sites: Often too loud. Fallback needed.
Kitchens: Water, appliances. Variable.
Outdoor: Wind is the killer.
Mitigation at the source
Encourage quiet environments:
- Pre-call message: "For the best experience, please call from a quiet location."
- Auto-detect and suggest during call.
- Scheduled callbacks when noise is bad.
Sample rate considerations
Narrowband audio (8 kHz sampling) captures less high-frequency noise than wideband, which sometimes actually helps with high-pitched noise.
Post-processing
Post-STT:
- Confidence filtering. Low-confidence transcripts: ask the caller to repeat.
- LLM context. Use context to infer unclear words.
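Confidence filtering can be sketched as a check on per-word scores before the transcript reaches the LLM. This assumes the STT returns a confidence in [0, 1] for each word, which many vendors do but in differing formats. The function name and both thresholds are hypothetical and should be calibrated against labeled calls.

```python
from typing import List, Tuple


def needs_repeat(
    words: List[Tuple[str, float]],
    min_average: float = 0.7,
    min_word: float = 0.4,
) -> bool:
    """Return True when the transcript is too uncertain to act on."""
    if not words:
        return True  # nothing recognized: ask again
    confidences = [confidence for _, confidence in words]
    average = sum(confidences) / len(confidences)
    # A single very uncertain word can change the meaning, so check both.
    return average < min_average or min(confidences) < min_word
```

When this returns True, the agent asks the caller to repeat instead of responding to a transcript that may be wrong.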
Common pitfalls
Testing with clean audio. Silent quality issues in production.
No noise handling. Agent struggles silently.
Over-confidence in STT. Noisy speech transcribed wrong; agent responds wrongly.
No degradation path. When noise is bad, agent keeps trying; user frustrated.
Ignoring low-confidence scores. STT confidence signals are useful.
Related reading
- How STT Handles Disfluencies and Filler Words
- Speech-to-Text Word Error Rate Explained
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
FAQ
Can we detect noise and adapt real-time? Yes, with SNR monitoring and prompt variation.
What about callers on speakerphone? Echo risks higher. Acoustic echo cancellation critical.
Do Bluetooth headsets help? Usually yes; noise suppression is built in.
Can AI filter noise in real-time? Some pipelines yes. Costs latency.
How do we know noise is the problem? Sample calls and measure. SNR metric per call.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.