Speech-to-Text Word Error Rate Explained

Tyler Weitzman
March 9, 2026 · 5 min read
Speechify

Word Error Rate (WER) is the dominant quality metric for speech-to-text. Every STT vendor reports it, and every evaluation report ranks models by it. Most voice agent engineers know the term but have at best a fuzzy sense of what the number means in production. A 5% WER sounds good, but it means one word in twenty is wrong, which can leave callers repeating themselves several times in a single call. Understanding WER (what it measures, what it doesn't, and how to use it) is foundational for voice agent quality engineering.

TL;DR

  • WER = (substitutions + deletions + insertions) / total reference words.
  • Lower is better. 4-8% for English phone calls is typical in 2026.
  • WER on YOUR audio matters more than published benchmarks.
  • WER doesn't capture semantic errors (right words, wrong meaning).
  • Domain vocabulary biasing can cut WER on domain terms by 30–50%.

The formula

WER = (S + D + I) / N

Where:

  • S = substitutions (word heard incorrectly).
  • D = deletions (word not heard at all).
  • I = insertions (extra word heard that wasn't said).
  • N = total words in the reference (ground truth) transcript.

Example:

  • Reference: "I'd like to book an appointment with Dr. Lee tomorrow at 10 AM."
  • Transcribed: "I'd like to book an appointment with doctor lee tomorrow at 10 a m."
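  • Substitutions: "Dr." → "doctor", "Lee" → "lee", "AM" → "a" (3 subs).
  • Insertions: "m" (1 insertion).
  • WER = (3 + 0 + 1) / 13 = 4/13 ≈ 31%.

All four errors are formatting differences, not mishearings. Scoring pipelines usually normalize both transcripts first (lowercasing, stripping punctuation, expanding abbreviations), which removes most or all of them.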

Tight definition. Widely reported.
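The formula can be computed with a word-level edit-distance alignment. The following is a minimal sketch, assuming whitespace tokenization and exact string matching, so case and punctuation differences count as errors. The function name `wer_counts` and the sample sentences are illustrative, not taken from any vendor's tooling.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Count substitutions, deletions and insertions, then compute WER."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty, so N = 0")

    # dp[i][j] holds (total, S, D, I) for the cheapest alignment of
    # ref[:i] against hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)   # hypothesis exhausted: all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)   # reference exhausted: all insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # words match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total errors first, so min() keeps the
            # cheapest alignment.
            dp[i][j] = min(diagonal, deletion, insertion)

    total, subs, dels, ins = dp[-1][-1]
    return {"S": subs, "D": dels, "I": ins, "N": len(ref), "WER": total / len(ref)}


result = wer_counts("the quick brown fox jumps",
                    "the quick brown box jumps high")
print(result)  # {'S': 1, 'D': 0, 'I': 1, 'N': 5, 'WER': 0.4}
```

Because insertions are counted against the length of the reference, WER can exceed 100% when the hypothesis is much longer than the reference.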

Typical 2026 numbers

  • Clean studio audio: 2-4% WER.
  • Phone-quality (PSTN) audio: 4-8%.
  • Noisy environments (call center background): 8-15%.
  • Strong accents: 10-20% on baseline models; 5-10% on well-tuned.
  • Technical vocabulary without biasing: 15-30%.

Your WER depends on YOUR audio.

Published benchmarks

Vendors publish WER on common datasets:

  • LibriSpeech (clean, read speech).
  • CommonVoice (crowdsourced).
  • Industry-specific (medical, legal).

These benchmarks are useful for relative comparison but rarely match your production audio profile.

Always measure on your own audio.

WER's limitations

Semantic blindness. "I want to book an appointment with Dr. Lee" → "I want to pick an appointment with Dr. Lee." WER: 1/9 = 11%. But "pick" vs "book" — very different meaning.

Domain specificity. Medical transcription where "ibuprofen" is mistranscribed as "nuproven" matters more than generic word errors.

Call-ending errors. If STT mistakes a critical word, the whole call may fail regardless of overall WER.

WER is necessary but not sufficient.

The domain-biasing win

Custom vocabulary / hotwords biasing:

  • Tell STT: "these words will probably appear."
  • Examples: your company name, product names, common industry terms.
  • Can reduce WER on domain terms by 30–50% in relative terms (for example, from 20% to 10–14%).

Biasing is often the highest-leverage STT optimization.
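One way to check whether biasing is paying off is to score domain terms separately from overall WER. The sketch below assumes single-word terms and simple lowercase matching. The function name `term_recall` and the sample transcripts are made up for illustration, and the code does not call any vendor's biasing API.

```python
import re


def term_recall(pairs, domain_terms):
    """Fraction of domain-term occurrences in the references that also
    appear in the matching STT output (case-insensitive, whole words)."""
    expected = found = 0
    for reference, hypothesis in pairs:
        ref_words = re.findall(r"[a-z0-9']+", reference.lower())
        hyp_words = re.findall(r"[a-z0-9']+", hypothesis.lower())
        for term in domain_terms:
            n_ref = ref_words.count(term)
            n_hyp = hyp_words.count(term)
            expected += n_ref
            found += min(n_ref, n_hyp)  # cannot recover more than were said
    return found / expected if expected else None


# Hypothetical transcripts, for illustration only.
references = ["take ibuprofen twice daily", "is ibuprofen safe with food"]
without_biasing = ["take nuproven twice daily", "is ibuprofen safe with food"]
with_biasing = ["take ibuprofen twice daily", "is ibuprofen safe with food"]

terms = ["ibuprofen"]
print(term_recall(list(zip(references, without_biasing)), terms))  # 0.5
print(term_recall(list(zip(references, with_biasing)), terms))     # 1.0
```

Running the same term list against transcripts produced with and without biasing gives a before/after number that overall WER would mostly hide.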

Streaming WER vs batch WER

Streaming (real-time) STT has slightly higher WER than batch (offline) because it can't see future context.

  • Batch: best accuracy, seconds of lag.
  • Streaming: near-real-time, slightly higher WER.

For voice agents, streaming is mandatory. Accept a WER roughly 10–20% higher than batch in relative terms (for example, 5.5–6% instead of 5%).

Testing WER in production

Methodology:

  • Sample real calls.
  • Produce ground truth (human transcription).
  • Compare to STT output.
  • Calculate WER.
  • Break down by audio conditions, caller types.

Do this monthly. Trends matter.
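The methodology above can be sketched as a small scoring loop. This is an illustrative outline, assuming each sampled call has already been human-transcribed and tagged with a condition label. The tuple layout, the labels, and the sample data are assumptions.

```python
from collections import defaultdict


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance, i.e. S + D + I."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,              # deletion
                            curr[j - 1] + 1))         # insertion
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[condition] += word_errors(ref, hyp)
        words[condition] += len(ref)
    # Pool errors and words per condition instead of averaging per-call WER.
    return {c: errors[c] / words[c] for c in words if words[c]}


# Hypothetical sample, for illustration only.
samples = [
    ("clean", "please cancel my order", "please cancel my order"),
    ("clean", "what is my balance", "what is my balance"),
    ("noisy", "please cancel my order", "please cancel order"),
    ("noisy", "what is my balance", "what is the balance"),
]
print(wer_by_condition(samples))  # {'clean': 0.0, 'noisy': 0.25}
```

Pooling errors and reference words within each condition keeps short calls from dominating the result, which averaging per-call WER would allow.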

Reducing WER

  • Domain vocabulary biasing. Biggest win.
  • Custom language model tuning (offered by some vendors).
  • Audio preprocessing (noise reduction, normalization).
  • Model selection: some models are better on phone audio than others.
  • Accent-aware models.

Vendor differences

On phone-quality US English audio in 2026:

  • Deepgram Nova-3: 5-7% typical.
  • Whisper (OpenAI): 6-9% typical (streaming variants tend to score worse).
  • AssemblyAI: 6-8% typical.
  • Google Cloud Speech: 6-9% typical.
  • Cartesia: 5-8% typical.

All close. Pick based on latency, cost, integration.

The 5% WER in practice

A 5% WER means 5 word errors per 100. For a 3-minute voice call with ~400 spoken words, that's 20 errors. Many will be minor (the, a, um). Some will matter. A few might break the flow.

5% is acceptable. 10% is problematic. 15% is frustrating.
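The arithmetic is easy to rerun for other call lengths and error rates. The sketch below also estimates how often a single utterance contains at least one error, under the simplifying assumption that word errors are independent. Real errors tend to cluster, so treat that second figure as a rough guide only. Both function names are illustrative.

```python
def expected_errors(wer: float, words: int) -> float:
    """Expected number of word errors in a call of the given length."""
    return wer * words


def utterance_error_probability(wer: float, words_per_utterance: int) -> float:
    """Chance that an utterance contains at least one word error,
    assuming each word fails independently with probability `wer`."""
    return 1 - (1 - wer) ** words_per_utterance


print(round(expected_errors(0.05, 400), 1))              # 20.0
print(round(utterance_error_probability(0.05, 10), 2))   # 0.4
print(round(utterance_error_probability(0.10, 10), 2))   # 0.65
```

Under that assumption, roughly four in ten ten-word utterances contain an error at 5% WER, and about two in three do at 10%.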

Multilingual WER

Non-English languages have higher WER with most vendors:

  • Spanish: 6-12% typical.
  • French: 8-14%.
  • Mandarin: 8-15%.
  • Less-common languages: highly variable.

Test specifically for your languages.

Accent and demographic considerations

Published WER is often measured on majority-accent speakers. Reality varies:

  • Non-native English speakers. WER often 2-3x baseline.
  • Regional dialects. Can double WER.
  • Older speakers. Sometimes challenging.
  • Children. Often challenging.

Test on representative demographics of your callers.

See how voice agents handle accents and dialects.

Beyond WER: semantic accuracy

More meaningful metrics:

  • Intent classification accuracy from the transcript.
  • Named entity extraction accuracy.
  • Slot-filling accuracy.

5% WER with correct intent > 2% WER with wrong intent.
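Intent accuracy can be measured by running the same classifier on the reference transcript and on the STT output, then checking whether they agree. The sketch below uses a toy keyword classifier as a stand-in for a real NLU model or LLM. The intent names, keyword rules, and second transcript pair are assumptions for illustration.

```python
def classify_intent(transcript: str) -> str:
    """Toy keyword classifier; a stand-in for a real NLU model or LLM."""
    words = transcript.lower().split()
    if "book" in words or "schedule" in words:
        return "book_appointment"
    if "cancel" in words:
        return "cancel_appointment"
    return "unknown"


def intent_accuracy(pairs) -> float:
    """Share of calls where the STT transcript yields the same intent
    as the human reference transcript."""
    pairs = list(pairs)
    hits = sum(classify_intent(ref) == classify_intent(hyp) for ref, hyp in pairs)
    return hits / len(pairs)


pairs = [
    # One wrong word, but the intent is lost.
    ("I want to book an appointment with Dr. Lee",
     "I want to pick an appointment with Dr. Lee"),
    # More wrong words, but the intent survives.
    ("I want to cancel my appointment",
     "I wanna cancel the appointment"),
]
print(intent_accuracy(pairs))  # 0.5
```

The first pair has the lower WER and the wrong intent, while the second has the higher WER and the right one.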

Noise robustness

Phone calls have noise:

  • Background talk.
  • Music.
  • Traffic.
  • Poor-quality lines.
  • Echoes.

Good STT handles these; bad STT fails. Test with noisy audio specifically.

See how background noise affects voice agent accuracy.

Sample-rate considerations

STT is tuned for specific sample rates:

  • Phone audio: 8 kHz (narrowband).
  • HD voice: 16 kHz.
  • WebRTC / modern: 16 or 48 kHz.

Feed the correct sample rate to STT. Downsampling is usually fine; upsampling doesn't help, because it cannot restore frequencies that were never captured.
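A cheap guard is to read the sample rate from the audio header before sending it to STT. The sketch below uses Python's standard `wave` module and assumes WAV input. The helper names and the in-memory test file are illustrative, and the expected rate depends on how your STT endpoint is configured.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def check_sample_rate(wav_bytes: bytes, expected_hz: int) -> None:
    """Raise if the audio does not match the rate the STT is configured for."""
    actual = wav_sample_rate(wav_bytes)
    if actual != expected_hz:
        raise ValueError(
            f"audio is {actual} Hz but STT is configured for {expected_hz} Hz"
        )


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    return buf.getvalue()


phone_audio = make_silent_wav(8000)
print(wav_sample_rate(phone_audio))   # 8000
check_sample_rate(phone_audio, 8000)  # passes silently
```

Failing loudly on a mismatch turns a silent accuracy drop into an error that shows up in logs.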

Common pitfalls

Relying on published WER. Always measure your own.

Ignoring domain vocabulary. Leaves a 30–50% WER reduction on domain terms on the table.

Wrong sample rate. Silently degrades accuracy.

Ignoring accents. Under-serving diverse callers.

WER worship. Optimizing WER but ignoring semantic accuracy.

FAQ

What's a "good" WER target? Under 8% on phone-quality production audio. Lower for domain-heavy use cases.

Does WER vary by call length? Per-word WER is similar across call lengths. It is sometimes slightly higher where the start or end of a call is clipped.

How does WER affect voice agent quality? High WER → LLM gets garbled input → wrong responses. Cascades.

Can we use WER to compare vendors? On a consistent test set, yes. On generic benchmarks, no.

What about homophone errors? "Their / there / they're" — STT handles most by context but errors happen.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
