Speech-to-Text Word Error Rate Explained
Word Error Rate — WER — is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number really means in production. A 5% WER sounds good; in practice it might leave callers frustrated every fifth sentence. Understanding WER — what it measures, what it doesn't, and how to use it — is foundational for voice agent quality engineering.
TL;DR
- WER = (substitutions + deletions + insertions) / total reference words.
- Lower is better. 4-8% for English phone calls is typical in 2026.
- WER on YOUR audio matters more than published benchmarks.
- WER counts every word error equally, so it misses how much an error changes the meaning.
- Domain vocabulary biasing reduces WER dramatically.
The formula
WER = (S + D + I) / N
Where:
- S = substitutions (word heard incorrectly).
- D = deletions (word not heard at all).
- I = insertions (extra word heard that wasn't said).
- N = total words in the reference (ground truth) transcript.
Example:
- Reference: "I'd like to book an appointment with Dr. Lee tomorrow at 10 AM."
- Transcribed: "I'd like to book an appointment with doctor lee tomorrow at 10 a m."
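- Differences (after lowercasing and stripping punctuation): "dr" → "doctor" (1 substitution); "am" → "a m" (1 substitution plus 1 insertion).
- WER = 3/13 ≈ 23%.
- All three are formatting differences, not recognition errors. Normalize both transcripts (abbreviations, casing, numbers) before scoring, or the number overstates the real error rate.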
Tight definition. Widely reported.
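The formula can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's scoring tool: it assumes whitespace tokenization and exact string matching, and the names `WerResult` and `word_error_rate` are made up for this example. It uses the standard word-level edit-distance alignment and keeps the S/D/I breakdown alongside the total.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N
        return (self.substitutions + self.deletions + self.insertions) / self.ref_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align two transcripts word by word and count S, D, I (hypothetical helper)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Each cell holds (total_errors, S, D, I) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty reference: all insertions
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]  # empty hypothesis: all deletions
        for j in range(1, len(hyp) + 1):
            t, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, ins)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = prev[j]
            delete = (t + 1, s, d + 1, ins)    # reference word missing from hypothesis
            t, s, d, ins = curr[j - 1]
            insert = (t + 1, s, d, ins + 1)    # extra word in hypothesis
            curr.append(min(diag, delete, insert, key=lambda cell: cell[0]))
        prev = curr

    _, s, d, ins = prev[-1]
    return WerResult(s, d, ins, len(ref))


result = word_error_rate(
    "please book an appointment for tomorrow morning",
    "please pick an appointment tomorrow morning okay",
)
print(result, round(result.wer, 3))
```

Here "book" → "pick" is a substitution, "for" is a deletion, and "okay" is an insertion, so the score is 3 errors over 7 reference words.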
Typical 2026 numbers
- Clean studio audio: 2-4% WER.
- Phone-quality (PSTN) audio: 4-8%.
- Noisy environments (call center background): 8-15%.
- Strong accents: 10-20% on baseline models; 5-10% on well-tuned.
- Technical vocabulary without biasing: 15-30%.
Your WER depends on YOUR audio.
Published benchmarks
Vendors publish WER on common datasets:
- LibriSpeech (clean, read speech).
- CommonVoice (crowdsourced).
- Industry-specific (medical, legal).
These benchmarks are useful for relative comparison but rarely match your production audio profile.
Always measure on your own audio.
WER's limitations
Semantic blindness. "I want to book an appointment with Dr. Lee" → "I want to pick an appointment with Dr. Lee." WER: 1/9 = 11%. But "pick" vs "book" — very different meaning.
Domain specificity. Medical transcription where "ibuprofen" is mistranscribed as "nuproven" matters more than generic word errors.
Call-ending errors. If STT mistakes a critical word, the whole call may fail regardless of overall WER.
WER is necessary but not sufficient.
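One way to complement WER is to track whether the words that matter survived transcription. The sketch below is an assumption-laden illustration: `critical_term_recall` is a hypothetical helper, the critical-term list is something you would curate yourself, and it only handles single-word terms.

```python
import re


def critical_term_recall(reference, hypothesis, critical_terms):
    """Share of critical terms spoken in the reference that also appear in the hypothesis.

    Returns None when none of the critical terms were spoken.
    """
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    ref_tokens = set(tokens(reference))
    hyp_tokens = set(tokens(hypothesis))
    spoken = [term.lower() for term in critical_terms if term.lower() in ref_tokens]
    if not spoken:
        return None
    hits = sum(1 for term in spoken if term in hyp_tokens)
    return hits / len(spoken)


recall = critical_term_recall(
    "I want to book an appointment with Dr. Lee",
    "I want to pick an appointment with Dr. Lee",
    critical_terms=["book", "lee"],
)
print(recall)  # one of the two critical terms was lost
```

The transcript above scores 11% WER but loses half of its critical terms, which is the gap a single aggregate number hides.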
The domain-biasing win
Custom vocabulary / hotwords biasing:
- Tell STT: "these words will probably appear."
- Examples: your company name, product names, common industry terms.
- Can reduce WER 30–50% for domain terms.
Biasing is often the highest-leverage STT optimization.
Streaming WER vs batch WER
Streaming (real-time) STT has slightly higher WER than batch (offline) because it can't see future context.
- Batch: best accuracy, seconds of lag.
- Streaming: near-real-time, slightly higher WER.
For voice agents, streaming is mandatory. Accept a roughly 10–20% relative increase in WER versus batch (for example, 5.0% batch becomes about 5.5–6.0% streaming).
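The arithmetic is easy to get wrong if a relative change is read as percentage points. The sketch below assumes the 10–20% figure is relative, which matches the "slightly higher" description; `apply_relative_increase` is a made-up helper name.

```python
def apply_relative_increase(batch_wer: float, relative_increase: float) -> float:
    """Scale a batch WER by a relative increase (0.10 means 10% worse, not +10 points)."""
    return batch_wer * (1 + relative_increase)


batch_wer = 0.05  # 5% WER in batch mode (illustrative value)
low = apply_relative_increase(batch_wer, 0.10)
high = apply_relative_increase(batch_wer, 0.20)
print(f"streaming WER estimate: {low:.1%} to {high:.1%}")
```

Under this reading, a 5% batch WER lands between 5.5% and 6.0% in streaming mode, not at 15–25%.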
Testing WER in production
Methodology:
- Sample real calls.
- Ground truth (human transcription).
- Compare to STT output.
- Calculate WER.
- Break down by audio conditions, caller types.
Do this monthly. Trends matter.
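The methodology above can be sketched as a small scoring script. It assumes each sampled call has already been labeled with an audio condition and paired with a human transcript. The function names and the tuple layout are hypothetical. It pools errors and reference words per condition instead of averaging per-call WERs, so short calls don't skew the result.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j - 1] + (r != h),  # match or substitution
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """samples: iterable of (condition, human_transcript, stt_output) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[condition] += edit_distance(ref, hyp)
        words[condition] += len(ref)
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


samples = [
    ("clean", "book a table for two", "book a table for two"),
    ("clean", "cancel my order", "cancel the order"),
    ("noisy", "what time do you close", "what time you close"),
]
for condition, wer in wer_by_condition(samples).items():
    print(f"{condition}: {wer:.1%}")
```

Storing the per-condition numbers from each monthly run makes the trend visible.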
Reducing WER
- Domain vocabulary biasing. Biggest win.
- Custom language model tuning (some vendors offer).
- Audio preprocessing (noise reduction, normalization).
- Model selection — some models better on phone audio than others.
- Accent-aware models.
Vendor differences
On phone-quality US English audio in 2026:
- Deepgram Nova-3: 5-7% typical.
- Whisper (OpenAI): 6-9% typical (streaming variants tend to score somewhat worse).
- AssemblyAI: 6-8% typical.
- Google Cloud Speech: 6-9% typical.
- Cartesia: 5-8% typical.
All close. Pick based on latency, cost, integration.
The 5% WER in practice
A 5% WER means 5 word errors per 100. For a 3-minute voice call with ~400 spoken words, that's 20 errors. Many will be minor (the, a, um). Some will matter. A few might break the flow.
5% is acceptable. 10% is problematic. 15% is frustrating.
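A rough back-of-the-envelope model makes these thresholds concrete. The sketch below treats WER as an independent per-word error probability. That is a simplifying assumption, since real errors cluster and insertions are not tied to reference words. Both function names are hypothetical.

```python
def expected_errors(wer: float, spoken_words: int) -> float:
    """Expected number of word errors in a call of the given length."""
    return wer * spoken_words


def p_sentence_has_error(wer: float, words_per_sentence: int) -> float:
    """Chance that a sentence contains at least one error, assuming independent errors."""
    return 1 - (1 - wer) ** words_per_sentence


for wer in (0.05, 0.10, 0.15):
    errors = expected_errors(wer, 400)
    p = p_sentence_has_error(wer, 10)
    print(f"WER {wer:.0%}: ~{errors:.0f} errors per 400 words, "
          f"{p:.0%} of 10-word sentences affected")
```

Under these assumptions, about 40% of 10-word sentences contain at least one error at 5% WER. That is why the per-call count matters more than the headline percentage.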
Multilingual WER
Non-English languages have higher WER on most vendors:
- Spanish: 6-12% typical.
- French: 8-14%.
- Mandarin: 8-15%.
- Less-common languages: highly variable.
Test specifically for your languages.
Accent and demographic considerations
Published WER is often on majority-accent speakers. Reality varies:
- Non-native English speakers. WER often 2-3x baseline.
- Regional dialects. Can double WER.
- Older speakers. Sometimes challenging.
- Children. Often challenging.
Test on representative demographics of your callers.
See how voice agents handle accents and dialects.
Beyond WER: semantic accuracy
More meaningful metrics:
- Intent classification accuracy from the transcript.
- Named entity extraction accuracy.
- Slot-filling accuracy.
5% WER with correct intent > 2% WER with wrong intent.
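A minimal sketch of scoring slot-filling accuracy, assuming your pipeline already turns each transcript into a dictionary of slots. The slot names, the values, and `slot_accuracy` itself are hypothetical.

```python
def slot_accuracy(expected: dict, extracted: dict) -> float:
    """Share of expected slots whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected slots must not be empty")
    correct = sum(1 for name, value in expected.items() if extracted.get(name) == value)
    return correct / len(expected)


expected = {"intent": "book_appointment", "provider": "dr lee", "time": "10:00"}

# Slots extracted from two illustrative transcripts of the same call.
from_noisier_transcript = {"intent": "book_appointment", "provider": "dr lee", "time": "10:00"}
from_cleaner_transcript = {"intent": "book_appointment", "provider": "dr lead", "time": "10:00"}

print(slot_accuracy(expected, from_noisier_transcript))
print(slot_accuracy(expected, from_cleaner_transcript))
```

A transcript with more word errors overall can still yield every slot correctly, while a cleaner transcript that misses one name fails the task.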
Noise robustness
Phone calls have noise:
- Background talk.
- Music.
- Traffic.
- Poor-quality lines.
- Echoes.
Good STT handles these conditions; weak STT fails on them. Test with noisy audio specifically.
See how background noise affects voice agent accuracy.
Sample-rate considerations
STT is tuned for specific sample rates:
- Phone audio: 8 kHz (narrowband).
- HD voice: 16 kHz.
- WebRTC / modern: 16 or 48 kHz.
Feed correct sample rate to STT. Downsampling is usually fine; upsampling doesn't help.
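A cheap guard is to check the sample rate of recorded audio before it reaches the STT model. The sketch below uses Python's standard `wave` module and assumes uncompressed WAV input. The helper names and the 8 kHz expectation are illustrative, not any vendor's requirement.

```python
import wave


def wav_sample_rate(wav_file) -> int:
    """Read the sample rate (Hz) from a WAV path or binary file-like object."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getframerate()


def check_sample_rate(wav_file, expected_hz: int) -> None:
    """Raise if the audio does not match the rate the STT model is configured for."""
    actual_hz = wav_sample_rate(wav_file)
    if actual_hz != expected_hz:
        raise ValueError(
            f"audio is {actual_hz} Hz but STT is configured for {expected_hz} Hz"
        )
```

For example, calling `check_sample_rate("call.wav", 8000)` on a narrowband recording (the file name is a placeholder) turns a silent accuracy drop into a visible error.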
Common pitfalls
Relying on published WER. Always measure your own.
Ignoring domain vocabulary. Leaving a 30–50% relative WER reduction on domain terms on the table.
Wrong sample rate. Degraded accuracy silently.
Ignoring accents. Under-serving diverse callers.
WER worship. Optimizing WER but ignoring semantic accuracy.
FAQ
What's a "good" WER target? Under 8% on phone-quality production audio. Lower for domain-heavy use cases.
Does WER vary by call length? Longer calls have similar WER per word. Sometimes slightly higher on clipped starts/ends.
How does WER affect voice agent quality? High WER → LLM gets garbled input → wrong responses. Cascades.
Can we use WER to compare vendors? On a consistent test set, yes. On generic benchmarks, no.
What about homophone errors? "Their / there / they're" — STT handles most by context but errors happen.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
