Word Error Rate (WER) is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number means in production. A 5% WER sounds good, but it still means one wrong word in every 20, enough to leave callers frustrated several times in a single call. Understanding WER (what it measures, what it doesn't, and how to use it) is foundational for voice agent quality engineering.

TL;DR

WER = (substitutions + deletions + insertions) / total reference words.
Lower is better. 4-8% for English phone calls is typical in 2026.
WER on YOUR audio matters more than published benchmarks.
WER doesn't capture semantic errors (right words, wrong meaning).
Domain vocabulary biasing can cut WER on domain terms by 30-50%.

The formula

WER = (S + D + I) / N

Where:

S = substitutions (word heard incorrectly).
D = deletions (word not heard at all).
I = insertions (extra word heard that wasn't said).
N = total words in the reference (ground truth) transcript.

Example:

Reference: "I'd like to book an appointment with Dr. Lee tomorrow at 10 AM."
Transcribed: "I'd like to book an appointment with doctor lee tomorrow at 10 a m."
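Counted on raw tokens: "Dr." → "doctor" and "Lee" → "lee" (2 subs), "AM" → "a" (1 sub), plus an extra "m" (1 insertion).
WER = 4/13 = 31%.
All four are formatting differences, not recognition errors. Normalize case, punctuation, and abbreviations in both transcripts before scoring, or the number overstates real errors.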

Tight definition. Widely reported.
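
A minimal sketch of the formula in Python, assuming a simple normalization step (lowercase, punctuation stripped) that is an illustration and not something any vendor specifies. The names `WerResult`, `normalize`, and `word_error_rate` are hypothetical. It aligns the two transcripts with word-level edit distance and counts S, D, and I; when several alignments tie, the split between the three counts is one valid choice among several.

```python
import re
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # cost[i][j] = (total edits, S, D, I) needed to turn ref[:i] into hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # match, no new error
                continue
            e, s, d, n = cost[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            cost[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = cost[-1][-1]
    return WerResult(s, d, n, len(ref))


result = word_error_rate(
    "I want to book an appointment with Dr. Lee",
    "I want to pick an appointment with Dr. Lee",
)
print(result, f"WER = {result.wer:.1%}")  # 1 substitution over 9 words: 11.1%
```

For production scoring, an established scoring tool with a documented normalization pipeline is usually a safer choice than a hand-rolled function; the point here is only to show what the number counts.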

Typical 2026 numbers

Clean studio audio: 2-4% WER.
Phone-quality (PSTN) audio: 4-8%.
Noisy environments (call center background): 8-15%.
Strong accents: 10-20% on baseline models; 5-10% on well-tuned models.
Technical vocabulary without biasing: 15-30%.

Your WER depends on YOUR audio.

Published benchmarks

Vendors publish WER on common datasets:

LibriSpeech (clean, read speech).
CommonVoice (crowdsourced).
Industry-specific (medical, legal).

These benchmarks are useful for relative comparison but rarely match your production audio profile.

Always measure on your own audio.

WER's limitations

Semantic blindness. "I want to book an appointment with Dr. Lee" → "I want to pick an appointment with Dr. Lee." WER: 1/9 = 11%. But "pick" vs "book" — very different meaning.

Domain specificity. Medical transcription where "ibuprofen" is mistranscribed as "nuproven" matters more than generic word errors.

Call-ending errors. If STT mistakes a critical word, the whole call may fail regardless of overall WER.

WER is necessary but not sufficient.

The domain-biasing win

Custom vocabulary / hotwords biasing:

Tell STT: "these words will probably appear."
Examples: your company name, product names, common industry terms.
Can reduce WER 30–50% for domain terms.

Biasing is often the highest-leverage STT optimization.
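
Biasing APIs differ by vendor, so none is shown here. What can be sketched is how to check whether biasing paid off: score only the domain terms, before and after. This is a rough sketch assuming single-word terms and already-normalized transcripts; `term_error_rate`, `relative_reduction`, and the sample transcripts are all made up for illustration.

```python
from collections import Counter


def term_error_rate(pairs, domain_terms):
    """Share of reference occurrences of domain terms missing from the hypothesis.

    pairs: iterable of (reference, hypothesis) strings, lowercase, no punctuation.
    domain_terms: set of single lowercase words to track.
    """
    expected = missed = 0
    for reference, hypothesis in pairs:
        ref_counts = Counter(w for w in reference.split() if w in domain_terms)
        hyp_counts = Counter(w for w in hypothesis.split() if w in domain_terms)
        for term, count in ref_counts.items():
            expected += count
            missed += max(0, count - hyp_counts[term])
    if expected == 0:
        raise ValueError("no domain terms found in the references")
    return missed / expected


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement, e.g. 0.4 -> 0.2 is a 50% reduction."""
    return (before - after) / before


terms = {"ibuprofen", "acme"}
references = [
    "take ibuprofen twice daily",
    "welcome to acme support",
    "acme ships ibuprofen",
    "refill my ibuprofen",
]
# Invented STT outputs for the same four utterances.
baseline = [
    "take nuproven twice daily",
    "welcome to acme support",
    "acme ships ibuprofen",
    "refill my i be profen",
]
biased = [
    "take ibuprofen twice daily",
    "welcome to acme support",
    "acme ships ibuprofen",
    "refill my eye buprofen",
]

before = term_error_rate(zip(references, baseline), terms)
after = term_error_rate(zip(references, biased), terms)
print(f"before={before:.0%} after={after:.0%} "
      f"reduction={relative_reduction(before, after):.0%}")
```

Tracking domain terms separately matters because they are a small share of all words: a large improvement on them can barely move overall WER while still fixing the errors callers notice.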

Streaming WER vs batch WER

Streaming (real-time) STT has slightly higher WER than batch (offline) because it can't see future context.

Batch: best accuracy, seconds of lag.
Streaming: near-real-time, slightly higher WER.

For voice agents, streaming is mandatory. Accept a WER that is 10-20% higher than batch in relative terms: if batch gives 5.0%, expect roughly 5.5-6.0% streaming.

Testing WER in production

Methodology:

Sample real calls.
Create ground truth (human transcription).
Compare to STT output.
Calculate WER.
Break down by audio conditions, caller types.

Do this monthly. Trends matter.
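
The methodology above can be sketched as a small scoring script. This assumes each sampled call segment has been tagged with an audio condition and transcribed by a human; the function names and the sample data are hypothetical. Errors are pooled over all reference words in a group, which is the usual way to report corpus-level WER, not averaged call by call.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (S + D + I) between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis) strings."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        errors[condition] += edit_distance(ref, hyp)
        words[condition] += len(ref)
    # Pool errors over all reference words in each group.
    return {c: errors[c] / words[c] for c in words if words[c]}


# Invented samples: (condition, human transcript, STT output)
samples = [
    ("quiet", "i need to reschedule my appointment", "i need to reschedule my appointment"),
    ("quiet", "can you confirm the time", "can you confirm a time"),
    ("noisy", "i need a refill", "i need refill"),
    ("noisy", "call me back after five", "call me back at her five"),
]

for condition, wer in sorted(wer_by_condition(samples).items()):
    print(f"{condition}: {wer:.1%}")
```

The same grouping key can hold caller type, accent, or line quality. Keeping the sample set fixed from month to month makes the trend easier to read.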

Reducing WER

Domain vocabulary biasing. Biggest win.
Custom language model tuning (offered by some vendors).
Audio preprocessing (noise reduction, normalization).
Model selection: some models are better on phone audio than others.
Accent-aware models.

Vendor differences

On phone-quality US English audio in 2026:

Deepgram Nova-3: 5-7% typical.
Whisper (OpenAI): 6-9% typical (streaming variants tend to be somewhat less accurate).
AssemblyAI: 6-8% typical.
Google Cloud Speech: 6-9% typical.
Cartesia: 5-8% typical.

All close. Pick based on latency, cost, integration.

The 5% WER in practice

A 5% WER means 5 word errors per 100. For a 3-minute voice call with ~400 spoken words, that's 20 errors. Many will be minor (the, a, um). Some will matter. A few might break the flow.

5% is acceptable. 10% is problematic. 15% is frustrating.
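
The arithmetic above, plus one extension, as a quick sketch. The second function treats WER as a per-word error probability and assumes errors are independent; both are simplifications (real errors cluster, and insertions are not tied to a reference word), so read the result as a rough guide. Function names are hypothetical.

```python
def expected_errors(spoken_words: int, wer: float) -> float:
    """Expected number of word errors in a call."""
    return spoken_words * wer


def chance_phrase_is_clean(phrase_words: int, wer: float) -> float:
    """Probability a phrase has no errors, assuming independent per-word errors."""
    return (1 - wer) ** phrase_words


print(expected_errors(400, 0.05))                    # about 20 errors in a 3-minute call
print(round(chance_phrase_is_clean(10, 0.05), 3))    # 0.599
print(round(chance_phrase_is_clean(10, 0.10), 3))    # 0.349
```

Under those assumptions, a 10-word critical phrase (a phone number read digit by digit, say) comes through clean about 60% of the time at 5% WER and about 35% of the time at 10%, which is one way to see why 10% already feels problematic.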

Multilingual WER

Non-English languages have higher WER on most vendors:

Spanish: 6-12% typical.
French: 8-14%.
Mandarin: 8-15%.
Less-common languages: highly variable.

Test specifically for your languages.

Accent and demographic considerations

Published WER is often on majority-accent speakers. Reality varies:

Non-native English speakers. WER often 2-3x baseline.
Regional dialects. Can double WER.
Older speakers. Sometimes challenging.
Children. Often challenging.

Test on representative demographics of your callers.

See how voice agents handle accents and dialects.

Beyond WER: semantic accuracy

More meaningful metric:

Intent classification accuracy from the transcript.
Named entity extraction accuracy.
Slot-filling accuracy.

5% WER with correct intent > 2% WER with wrong intent.
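
A sketch of how the first and third of these metrics might be computed, assuming you have gold intents and slot values labeled from the human transcripts and predictions produced from the STT transcripts. The label names, function names, and sample data are invented for illustration.

```python
def intent_accuracy(gold, predicted):
    """Fraction of calls where the predicted intent equals the gold intent."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def slot_accuracy(gold_slots, predicted_slots):
    """Fraction of gold slot values reproduced exactly in the predictions.

    Each argument is a list of dicts, one dict of slot name -> value per call.
    """
    total = correct = 0
    for gold, predicted in zip(gold_slots, predicted_slots):
        for name, value in gold.items():
            total += 1
            correct += predicted.get(name) == value
    if total == 0:
        raise ValueError("no gold slots to score")
    return correct / total


# Invented labels for four calls.
gold_intents = ["book_appointment", "cancel_appointment", "book_appointment", "refill"]
pred_intents = ["book_appointment", "cancel_appointment", "other", "refill"]

gold_slots = [{"doctor": "lee", "time": "10 am"}, {"doctor": "lee"}, {}, {"drug": "ibuprofen"}]
pred_slots = [{"doctor": "lee", "time": "10 am"}, {"doctor": "li"}, {}, {"drug": "nuproven"}]

print(f"intent accuracy: {intent_accuracy(gold_intents, pred_intents):.0%}")  # 75%
print(f"slot accuracy:   {slot_accuracy(gold_slots, pred_slots):.0%}")        # 50%
```

Reporting these next to WER on the same sampled calls shows whether a WER change actually moved the outcomes the agent depends on.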

Noise robustness

Phone calls have noise:

Background talk.
Music.
Traffic.
Poor-quality lines.
Echoes.

Good STT handles this noise; bad STT fails on it. Test with noisy audio specifically.

See how background noise affects voice agent accuracy.

Sample-rate considerations

STT is tuned for specific sample rates:

Phone audio: 8 kHz (narrowband).
HD voice: 16 kHz.
WebRTC / modern: 16 or 48 kHz.

Feed the STT model the sample rate it expects. Downsampling is usually fine; upsampling doesn't help, because it cannot restore frequency content the original recording never captured.
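
A small guard against the silent mismatch, sketched with Python's standard `wave` module. It reads the sample rate from a WAV header and compares it with the rate the STT model expects. The `resample_plan` policy and its return strings are an assumption for illustration, and `make_silent_wav` exists only to produce test audio.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate (Hz) from a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def resample_plan(source_hz: int, model_hz: int) -> str:
    """Hypothetical policy: decide what to do before sending audio to STT."""
    if source_hz == model_hz:
        return "match"
    if source_hz > model_hz:
        return "downsample"
    # Source is lower than the model rate: upsampling adds no information,
    # so prefer a model or setting that accepts the native rate.
    return "prefer-native-rate"


def make_silent_wav(sample_rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV in memory (test fixture only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))
    return buffer.getvalue()


phone_audio = make_silent_wav(8000)
rate = wav_sample_rate(phone_audio)
print(rate, resample_plan(rate, 16000))   # 8000 prefer-native-rate
print(resample_plan(48000, 16000))        # downsample
```

Logging the detected rate alongside each transcription request makes a wrong-rate regression visible instead of showing up only as a slow WER drift.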

Common pitfalls

Relying on published WER. Always measure your own.

Ignoring domain vocabulary. Leaves a 30-50% WER reduction on domain terms on the table.

Wrong sample rate. Silently degrades accuracy.

Ignoring accents. Under-serving diverse callers.

WER worship. Optimizing WER but ignoring semantic accuracy.

FAQ

What's a "good" WER target? Under 8% on phone-quality production audio. Lower for domain-heavy use cases.

Does WER vary by call length? Longer calls have similar WER per word. Sometimes slightly higher on clipped starts/ends.

How does WER affect voice agent quality? High WER → LLM gets garbled input → wrong responses. Cascades.

Can we use WER to compare vendors? On a consistent test set, yes. On generic benchmarks, no.

What about homophone errors? "Their / there / they're" — STT handles most by context but errors happen.

Speech-to-Text Word Error Rate Explained
