How STT Handles Disfluencies and Filler Words
Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct.
Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct. Modern STT has to handle all of this and produce transcripts that are useful to downstream systems โ whether that means keeping the disfluencies in for authenticity or stripping them out for clarity. How your STT handles disfluencies affects LLM input quality and ultimately agent response quality.
TL;DR
- STT has two modes: verbatim (keep disfluencies) and clean (strip them).
- Voice agents usually want clean for LLM input.
- Handling false starts and mid-phrase corrections is harder than filler removal.
- Whisper and modern STT strip most disfluencies; verbatim STT keeps them.
- Test with real, messy speech โ not clean demos.
Types of disfluencies
Filler words. "um," "uh," "er," "like," "you know," "I mean."
Hesitations. Long pauses within speech.
False starts. "I want to- actually, can you just-"
Repetitions. "I I I wanted to ask about..."
Self-corrections. "Ship it to 42-45... er, 4245 Main Street."
Incomplete sentences. Sentences that trail off.
How STT handles them
Modern STT generally:
- Strips filler words by default ("um," "uh" removed).
- Collapses repetitions ("I I wanted" โ "I wanted").
- Keeps false starts (usually; requires downstream handling).
- Keeps corrections (both the wrong version and the correction).
Behavior varies by vendor and model.
Verbatim vs clean modes
- Clean (default): filler stripped, repetitions collapsed.
- Verbatim: all speech kept including disfluencies.
For voice agents feeding LLMs, clean is usually what you want. Clutter hurts LLM performance.
Some vendors let you toggle.
False start handling
Hardest case. Example:
"Can you ship it to 42- actually, 45 Main Street?"
STT transcribes as:
- "Can you ship it to 42 actually 45 Main Street"
The LLM needs to figure out that "42" was replaced by "45". Usually manageable.
Self-correction patterns
Common patterns:
- "actually" โ usually signals correction.
- "no wait" โ strong correction signal.
- "I mean" โ often clarification.
- "let me rephrase" โ explicit correction.
LLM prompt should handle: "if user says something then corrects themselves, act on the correction, not the original."
Repeated words
"I I I want..." Usually STT collapses; if it doesn't, LLM handles naturally.
Hesitation and pauses
Mid-sentence pauses:
- Short (under 500ms): typically part of the utterance; STT continues.
- Long (over 800ms): may trigger endpointing; caller mid-thought gets cut off.
Tune endpointing to balance responsiveness and patience.
See voice activity detection in production voice agents.
Back-channels
"Mm-hmm," "yeah," "right" from the caller during agent's speech.
Not really disfluencies โ they're affirmations. STT may transcribe; agent should recognize and not derail.
Speaker characteristics
Disfluency rate varies:
- Native speakers: moderate filler usage.
- Second-language speakers: more filler, slower.
- Nervous callers: more filler.
- Fast talkers: compressed speech, less filler.
- Elderly: sometimes more pauses.
STT should handle all โ but quality varies.
LLM robustness
Good system prompts help:
The caller's speech may include:
- Filler words (um, uh)
- False starts ("I want to- actually...")
- Self-corrections
Treat the caller's final-stated intent as the real one.
Ignore filler and false starts.
LLM + good STT = robust handling.
Testing with real data
Demo audio is clean. Real audio is messy. Test with:
- Actual production calls.
- Diverse speakers.
- Background noise.
- Strong accents.
- Elderly callers, children.
Expect a lot of disfluencies. Verify pipeline handles.
Vendor differences
Deepgram: clean mode default; verbatim available.
Whisper: strips most filler; sometimes aggressively.
AssemblyAI: offers both modes.
Google Cloud Speech: configurable.
Punctuation insertion
STT adds punctuation:
- Periods on completed sentences.
- Commas on pauses.
- Question marks on rising intonation.
Good punctuation helps LLM parse. Bad punctuation confuses.
Disfluency in different languages
Disfluency patterns vary by language:
- English: um, uh, like.
- Spanish: este, pues, o sea.
- Japanese: eto, ano.
- French: euh, bah.
Language-specific STT handles its filler set.
Impact on WER
Verbatim STT has higher WER (more words to get right). Clean STT has lower WER (filler is stripped, fewer words).
Use WER consistently โ same mode when comparing.
See speech-to-text word error rate explained.
Transcription for later use
If you need verbatim transcripts for compliance or analytics:
- Run verbatim STT separately.
- Or store clean + raw audio; re-transcribe later if needed.
Voice agent runtime uses clean; compliance archive uses verbatim.
Common pitfalls
Testing with clean audio. Works in lab; fails with real callers.
LLM confused by false starts. Prompt handling needed.
Aggressive filler stripping. Loses some semantic content ("I mean" sometimes matters).
Ignoring hesitations. Caller thinks; agent interrupts.
Disfluency patterns by demographic. Some callers are disadvantaged by STT quirks.
Related reading
- How Background Noise Affects Voice Agent Accuracy
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
FAQ
Can STT distinguish between filler and real words? Usually. "Like" as filler vs "I like it" โ context disambiguates.
Does disfluency removal affect LLM quality? Usually helps. Cleaner input, better output.
What about stuttering? STT handles varied speech patterns. Severe stuttering may need accessibility alternatives.
Can we customize filler words to remove? Some vendors yes. Usually defaults are good.
How does STT handle laughing, crying, or other non-speech? Sometimes transcribed as "[laughter]" or similar; sometimes ignored.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems โ text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
More from Tyler Weitzman
View all โOpen-Source vs Proprietary Voice Agent Stacks
The open-source voice AI stack in 2026 is genuinely good. Whisper and its derivatives handle STT. Open-weight LLMs like Llama 3/4, Qwen, Mistral handle the reasoning. Open-source TTS (XTTS, StyleTTS, Orpheus-class) handles output.
Build vs Buy: When to Build Your Own Voice Agent
Build-vs-buy for voice agents in 2026 is a different conversation than it was two years ago. Then, the open-source stack was rough and most serious deployments ended up building.
Voice Agents for Developer Support
Developer support is a strange category. Developers don't generally want to call anyone. They want Stack Overflow, they want clear docs, they want an LLM that can read their code.
Related reading
How Background Noise Affects Voice Agent Accuracy
Production voice agents live in noisy environments. Callers call from cars, offices, restaurants, kitchens with running faucets, grocery stores with loud music, outdoor job sites. Real audio has sirens, barking dogs, other conversations, and TV in the background.
Speech-to-Text Word Error Rate Explained
Word Error Rate โ WER โ is the dominant quality metric for speech-to-text. Every STT vendor reports WER. Every evaluation report ranks models by WER. Most voice agent engineers know the term but have at best a fuzzy sense of what the number really means in production.
Streaming Audio Over WebRTC for Voice Agents
WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport โ lower latency than webhooks, built-in encryption, native NAT traversal, cross-platform.
Voice AI, twice a month.
Get the best of the SIMBA resources hub โ new articles, trend notes, and operator guides. No spam.
