๐Ÿ”Š Speech Technology

How STT Handles Disfluencies and Filler Words

Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct.

Tyler Weitzman
Tyler Weitzman
March 14, 2026 ยท 5 min read
Speechify

Real speech is messy. People say "um," "uh," "like," and "you know" constantly. They start sentences and abandon them. They repeat themselves. They mumble and correct. Modern STT has to handle all of this and produce transcripts that are useful to downstream systems โ€” whether that means keeping the disfluencies in for authenticity or stripping them out for clarity. How your STT handles disfluencies affects LLM input quality and ultimately agent response quality.

TL;DR

  • STT has two modes: verbatim (keep disfluencies) and clean (strip them).
  • Voice agents usually want clean for LLM input.
  • Handling false starts and mid-phrase corrections is harder than filler removal.
  • Whisper and modern STT strip most disfluencies; verbatim STT keeps them.
  • Test with real, messy speech โ€” not clean demos.

Types of disfluencies

Filler words. "um," "uh," "er," "like," "you know," "I mean."

Hesitations. Long pauses within speech.

False starts. "I want to- actually, can you just-"

Repetitions. "I I I wanted to ask about..."

Self-corrections. "Ship it to 42-45... er, 4245 Main Street."

Incomplete sentences. Sentences that trail off.

How STT handles them

Modern STT generally:

  • Strips filler words by default ("um," "uh" removed).
  • Collapses repetitions ("I I wanted" โ†’ "I wanted").
  • Keeps false starts (usually; requires downstream handling).
  • Keeps corrections (both the wrong version and the correction).

Behavior varies by vendor and model.

Verbatim vs clean modes

  • Clean (default): filler stripped, repetitions collapsed.
  • Verbatim: all speech kept including disfluencies.

For voice agents feeding LLMs, clean is usually what you want. Clutter hurts LLM performance.

Some vendors let you toggle.

False start handling

Hardest case. Example:

"Can you ship it to 42- actually, 45 Main Street?"

STT transcribes as:

  • "Can you ship it to 42 actually 45 Main Street"

The LLM needs to figure out that "42" was replaced by "45". Usually manageable.

Self-correction patterns

Common patterns:

  • "actually" โ€” usually signals correction.
  • "no wait" โ€” strong correction signal.
  • "I mean" โ€” often clarification.
  • "let me rephrase" โ€” explicit correction.

LLM prompt should handle: "if user says something then corrects themselves, act on the correction, not the original."

Repeated words

"I I I want..." Usually STT collapses; if it doesn't, LLM handles naturally.

Hesitation and pauses

Mid-sentence pauses:

  • Short (under 500ms): typically part of the utterance; STT continues.
  • Long (over 800ms): may trigger endpointing; caller mid-thought gets cut off.

Tune endpointing to balance responsiveness and patience.

See voice activity detection in production voice agents.

Back-channels

"Mm-hmm," "yeah," "right" from the caller during agent's speech.

Not really disfluencies โ€” they're affirmations. STT may transcribe; agent should recognize and not derail.

Speaker characteristics

Disfluency rate varies:

  • Native speakers: moderate filler usage.
  • Second-language speakers: more filler, slower.
  • Nervous callers: more filler.
  • Fast talkers: compressed speech, less filler.
  • Elderly: sometimes more pauses.

STT should handle all โ€” but quality varies.

LLM robustness

Good system prompts help:

The caller's speech may include:
- Filler words (um, uh)
- False starts ("I want to- actually...")
- Self-corrections

Treat the caller's final-stated intent as the real one.
Ignore filler and false starts.

LLM + good STT = robust handling.

Testing with real data

Demo audio is clean. Real audio is messy. Test with:

  • Actual production calls.
  • Diverse speakers.
  • Background noise.
  • Strong accents.
  • Elderly callers, children.

Expect a lot of disfluencies. Verify pipeline handles.

Vendor differences

Deepgram: clean mode default; verbatim available.

Whisper: strips most filler; sometimes aggressively.

AssemblyAI: offers both modes.

Google Cloud Speech: configurable.

Punctuation insertion

STT adds punctuation:

  • Periods on completed sentences.
  • Commas on pauses.
  • Question marks on rising intonation.

Good punctuation helps LLM parse. Bad punctuation confuses.

Disfluency in different languages

Disfluency patterns vary by language:

  • English: um, uh, like.
  • Spanish: este, pues, o sea.
  • Japanese: eto, ano.
  • French: euh, bah.

Language-specific STT handles its filler set.

Impact on WER

Verbatim STT has higher WER (more words to get right). Clean STT has lower WER (filler is stripped, fewer words).

Use WER consistently โ€” same mode when comparing.

See speech-to-text word error rate explained.

Transcription for later use

If you need verbatim transcripts for compliance or analytics:

  • Run verbatim STT separately.
  • Or store clean + raw audio; re-transcribe later if needed.

Voice agent runtime uses clean; compliance archive uses verbatim.

Common pitfalls

Testing with clean audio. Works in lab; fails with real callers.

LLM confused by false starts. Prompt handling needed.

Aggressive filler stripping. Loses some semantic content ("I mean" sometimes matters).

Ignoring hesitations. Caller thinks; agent interrupts.

Disfluency patterns by demographic. Some callers are disadvantaged by STT quirks.

FAQ

Can STT distinguish between filler and real words? Usually. "Like" as filler vs "I like it" โ€” context disambiguates.

Does disfluency removal affect LLM quality? Usually helps. Cleaner input, better output.

What about stuttering? STT handles varied speech patterns. Severe stuttering may need accessibility alternatives.

Can we customize filler words to remove? Some vendors yes. Usually defaults are good.

How does STT handle laughing, crying, or other non-speech? Sometimes transcribed as "[laughter]" or similar; sometimes ignored.

Tyler Weitzman
Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems โ€” text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.

More from Tyler Weitzman

View all โ†’

Related reading

Voice AI, twice a month.

Get the best of the SIMBA resources hub โ€” new articles, trend notes, and operator guides. No spam.