Streaming STT: How to Cut Recognition Latency
Non-streaming speech-to-text works for transcription — you submit audio, wait, get a transcript. That pattern is fine for batch use cases but fatal for voice agents. In a voice conversation, every millisecond between when the caller finishes speaking and when the agent responds matters. Streaming STT — emitting partial transcripts continuously during speech — is the foundation of low-latency voice AI. This piece covers how it works, what to tune, and the tradeoffs.
TL;DR
- Streaming STT emits partial transcripts during speech and finalizes after endpoint.
- Final transcript latency after endpoint: 50-150ms target.
- Partials let downstream stages (LLM) start processing before caller finishes.
- VAD and endpointing are the latency bottleneck after raw STT.
- Pick vendor based on latency + WER + language + cost.
Streaming vs batch
Batch:
- Submit complete audio.
- Wait for response.
- Higher accuracy (full context).
- High latency — unusable for voice agents.
Streaming:
- Send audio chunks continuously.
- Receive partial transcripts (updated as audio arrives).
- Receive final transcript when speech ends.
- Marginally lower accuracy than batch.
- Low latency — required for voice agents.
The phases
Audio streaming. Client sends 20-100ms audio chunks.
Partial transcripts. Server emits ongoing best-guess transcripts:
- After 200ms: "I'd like"
- After 400ms: "I'd like to book"
- After 600ms: "I'd like to book an appointment"
Each update refines the prior.
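As a minimal sketch of how a client might track these updates, assuming the provider delivers each partial as the full best-guess text so far (the class and method names here are illustrative, not any vendor's API):

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptState:
    """Holds committed finals plus the current, still-revisable partial."""
    finals: list = field(default_factory=list)
    partial: str = ""

    def on_partial(self, text: str) -> None:
        # A partial replaces the previous partial; it is never appended.
        self.partial = text

    def on_final(self, text: str) -> None:
        # The final supersedes whatever the last partial said.
        self.finals.append(text)
        self.partial = ""

    def display(self) -> str:
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


state = TranscriptState()
for text in ["I'd like", "I'd like to book", "I'd like to book an appointment"]:
    state.on_partial(text)
state.on_final("I'd like to book an appointment.")
```

The key design point is that partials overwrite each other, so nothing downstream ever sees a stale guess stitched onto a newer one.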
Endpoint detection. Server (or client-side VAD) detects end of speech.
Final transcript. Server finalizes the transcript, which often differs from the last partial (typically cleaner).
The latency target
- Partials: an update every 100-200ms during speech. This is the update latency.
- Final after endpoint: 50-150ms. This is the important one.
Users don't notice partial latency; they notice the gap between finishing speech and agent responding.
VAD + endpointing
The biggest source of latency after STT itself:
Voice activity detection (VAD): Identifies speech vs silence.
Endpointing: Decides when the caller is definitely done speaking.
Two approaches:
- STT-driven endpointing. Server emits "I think you're done" signal. Simple.
- Client-side VAD. More control; faster but requires local integration.
Tuning:
- Short silence threshold (300ms): fast response, risk of cutting off thinking.
- Long silence threshold (800ms): safe but sluggish.
Typical: 500-700ms silence threshold.
See voice activity detection in production voice agents.
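To make the silence threshold concrete, here is a rough sketch of a client-side endpointer. It assumes 20ms frames of 16-bit little-endian mono PCM and uses a simple energy threshold as a stand-in for a real VAD model; the `Endpointer` class, the `energy_threshold` value, and the frame size are all assumptions for illustration.

```python
import math
import struct

FRAME_MS = 20  # assumed frame duration


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class Endpointer:
    def __init__(self, silence_ms: int = 600, energy_threshold: float = 500.0):
        self.silence_ms = silence_ms
        self.energy_threshold = energy_threshold  # assumed value; tune on real audio
        self.heard_speech = False
        self.silent_ms = 0

    def push(self, frame: bytes) -> bool:
        """Feed one frame; return True once trailing silence reaches the threshold."""
        if rms(frame) >= self.energy_threshold:
            self.heard_speech = True
            self.silent_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence never triggers an endpoint
        self.silent_ms += FRAME_MS
        return self.silent_ms >= self.silence_ms


speech = struct.pack("<160h", *([3000] * 160))  # 20 ms at 8 kHz, loud
silence = bytes(320)                            # 20 ms of digital silence

ep = Endpointer(silence_ms=600)
fired_at = None
for i, frame in enumerate([speech] * 10 + [silence] * 40):
    if ep.push(frame):
        fired_at = i
        break
```

Lowering `silence_ms` makes the endpoint fire earlier at the cost of more cut-offs, which is the tradeoff described above.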
Using partials
Partials let you start downstream work before the caller finishes:
- LLM preprocessing on the partial transcript.
- Intent classification early.
- Function call prefetch if you know what they'll likely need.
Most sophisticated implementations use partials for LLM warmup and run full inference on the final transcript.
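One way to picture this split is a handler that does cheap, reversible work on partials and leaves the real decision to the final. The keyword table, the `PartialPrefetcher` class, and the `prefetch` callback below are hypothetical; a production system would likely use a proper classifier rather than keyword matching.

```python
from typing import Callable, Optional

# Hypothetical intent keywords, for illustration only.
INTENT_KEYWORDS = {
    "book_appointment": ("book", "appointment", "schedule"),
    "cancel": ("cancel",),
}


def guess_intent(text: str) -> Optional[str]:
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & set(keywords):
            return intent
    return None


class PartialPrefetcher:
    def __init__(self, prefetch: Callable[[str], None]):
        self.prefetch = prefetch  # warms a cache; must have no user-visible effect
        self.warmed = set()

    def on_partial(self, text: str) -> None:
        intent = guess_intent(text)
        if intent and intent not in self.warmed:
            self.warmed.add(intent)
            self.prefetch(intent)

    def on_final(self, text: str) -> Optional[str]:
        # Only the final transcript drives the real decision.
        return guess_intent(text)


prefetched = []
handler = PartialPrefetcher(prefetch=prefetched.append)
for partial in ["I'd like", "I'd like to book", "I'd like to book an appointment"]:
    handler.on_partial(partial)
intent = handler.on_final("I'd like to book an appointment.")
```

The prefetch fires once, on the second partial, well before the caller has finished the sentence.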
Integration pattern
Client (voice agent):
- Establish a WebSocket to the STT provider.
- Stream audio chunks continuously.
- Receive partial transcripts.
- On endpoint signal: the final transcript arrives in ~100ms; pass the final to the LLM.
- LLM processes → TTS → audio.
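The shape of this loop can be sketched with `asyncio`, using an in-process fake in place of a real provider connection. Everything here is an assumption for illustration: `fake_stt` stands in for the WebSocket, "chunks" are words rather than audio bytes, and `None` is used as an end-of-stream sentinel.

```python
import asyncio


async def fake_stt(audio_q: asyncio.Queue, events_q: asyncio.Queue) -> None:
    """Stand-in for a provider: one partial per chunk, then a final at end of stream."""
    words = []
    while True:
        chunk = await audio_q.get()
        if chunk is None:  # end-of-stream sentinel (plays the role of the endpoint)
            await events_q.put(("final", " ".join(words)))
            return
        words.append(chunk)
        await events_q.put(("partial", " ".join(words)))


async def run_turn(chunks):
    audio_q, events_q = asyncio.Queue(), asyncio.Queue()
    stt = asyncio.create_task(fake_stt(audio_q, events_q))

    async def send():
        for chunk in chunks:
            await audio_q.put(chunk)
            await asyncio.sleep(0)  # yield, as a real sender would between chunks
        await audio_q.put(None)

    sender = asyncio.create_task(send())
    partials = []
    while True:
        kind, text = await events_q.get()
        if kind == "final":
            break  # this is the point where the final would go to the LLM
        partials.append(text)
    await asyncio.gather(stt, sender)
    return partials, text


partials, final = asyncio.run(run_turn(["book", "an", "appointment"]))
```

Sending and receiving run as separate tasks so that partials can arrive while audio is still being streamed, which is the property a real client needs as well.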
Sample rate and format
- 8 kHz: phone (PSTN) narrowband.
- 16 kHz: wideband (HD voice), a common STT input rate.
- 48 kHz: fullband, high quality (the native rate of Opus as used in WebRTC).
STT engines typically resample internally. Send at phone sample rate for phone calls; no need to upsample.
Bandwidth
Streaming STT requires continuous bandwidth:
- 8 kHz 16-bit PCM: 128 kbps (16 kB/s) uncompressed; 64 kbps as 8-bit G.711.
- Compressed (Opus): 8-32 kbps.
Manageable.
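The uncompressed figure follows directly from sample rate × bit depth × channels, and the same arithmetic gives the size of each chunk on the wire. A small sketch (the function names are illustrative):

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Raw PCM bitrate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def chunk_bytes(sample_rate_hz: int, chunk_ms: int, bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of one audio chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * (bits_per_sample // 8) * channels


narrowband_kbps = pcm_bitrate_kbps(8000)   # phone audio, 16-bit mono
wideband_kbps = pcm_bitrate_kbps(16000)    # wideband, 16-bit mono
frame_20ms = chunk_bytes(8000, 20)         # one 20 ms phone-audio chunk
```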
Provider comparison
Rough 2026 snapshot on US English phone audio:
Deepgram Nova-3: fastest, ~40-80ms to final. WER: 5-7%.
AssemblyAI Nano: fast, ~60-100ms. WER: 6-8%.
Google Cloud Speech: moderate, ~80-150ms. WER: 6-9%.
OpenAI Realtime (integrated): ~60-120ms. WER: varies.
Cartesia: fast. WER: 5-8%.
Whisper (cloud) streaming variants: 80-150ms.
Pick based on latency + WER + cost + language.
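Vendor WER figures are measured on the vendor's test audio, so it is worth computing WER on your own recordings. A minimal sketch using word-level edit distance, assuming simple lowercase whitespace tokenization (real evaluations usually also normalize punctuation and numerals):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard edit-distance dynamic programme, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


score = wer("i want to book an appointment", "i want to book an apartment")
```

One substituted word out of six reference words gives a WER of about 17%, which shows how a single domain-critical error can dominate a short utterance.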
Domain vocabulary biasing
Pass hotwords / vocabulary to STT:
- Company names.
- Product names.
- Industry terms.
- Proper nouns common in your domain.
Reduces WER dramatically on domain-specific terms.
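How the list is passed differs by vendor, so the sketch below only covers the vendor-neutral part: assembling a clean, de-duplicated term list. The `build_hotwords` helper, the example terms, and the `max_terms` cap are assumptions; check your provider's documentation for its actual parameter name and limits.

```python
def build_hotwords(*groups, max_terms: int = 100) -> list:
    """Merge term groups, normalize whitespace, drop case-insensitive duplicates."""
    seen = set()
    terms = []
    for group in groups:
        for term in group:
            cleaned = " ".join(term.split())
            key = cleaned.casefold()
            if cleaned and key not in seen:
                seen.add(key)
                terms.append(cleaned)
    return terms[:max_terms]  # assumed cap; many APIs limit list size


company_names = ["Acme Dental", "acme dental "]    # hypothetical terms
industry_terms = ["periodontal", "Invisalign", ""]
hotwords = build_hotwords(company_names, industry_terms)
```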
Multilingual streaming
- Some vendors: specify language upfront.
- Others: auto-detect from first utterance.
- Switch mid-call: harder, usually requires restart.
Consider language detection complexity vs explicit config per call.
Handling false starts
Callers often false-start:
- "I'd like to... actually, let me..."
- STT captures both.
- Downstream handling: LLM interprets; correction handled in context.
No special STT treatment needed typically.
Endpoint tuning
For different contexts:
- Fast conversational: 400-500ms silence threshold.
- Informational (data capture): 700-900ms — let caller think.
- Noisy environments: longer threshold to avoid false endpoints.
Tune per deployment.
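Per-deployment tuning can be as simple as a lookup keyed by context. In this sketch the profile values are midpoints of the ranges above, and the extra margin for noisy lines is an assumed number to be replaced with one measured on your own traffic.

```python
# Midpoints of the ranges above; the profile names are illustrative.
ENDPOINT_PROFILES_MS = {
    "conversational": 450,
    "data_capture": 800,
}


def silence_threshold_ms(context: str, noisy: bool = False, noise_margin_ms: int = 200) -> int:
    """Pick a silence threshold for a deployment context (raises KeyError if unknown)."""
    base = ENDPOINT_PROFILES_MS[context]
    # noise_margin_ms is an assumed value, not a recommendation.
    return base + noise_margin_ms if noisy else base


fast = silence_threshold_ms("conversational")
careful = silence_threshold_ms("data_capture", noisy=True)
```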
Mid-utterance stops
STT sometimes stops mid-phrase due to:
- Brief pause (user breathing).
- Noise (detected as speech end).
- STT engine quirk.
Mitigation: add a short grace period after the "endpoint" signal and confirm that speech has not resumed before acting, or use confidence thresholds.
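The grace-period idea can be sketched as a small state holder: an endpoint starts a timer, resumed speech cancels it, and the turn is only handed downstream once the timer expires. The class name, the 150ms default, and the millisecond timestamps passed in by the caller are assumptions for illustration.

```python
class GracePeriodConfirmer:
    def __init__(self, grace_ms: int = 150):
        self.grace_ms = grace_ms  # assumed default; adds directly to response latency
        self.pending_since = None

    def on_endpoint(self, now_ms: int) -> None:
        # Tentative endpoint: start the grace timer instead of acting immediately.
        self.pending_since = now_ms

    def on_speech(self, now_ms: int) -> None:
        # Caller kept talking, so the endpoint was a false alarm.
        self.pending_since = None

    def poll(self, now_ms: int) -> bool:
        """Return True once an endpoint has survived the full grace period."""
        if self.pending_since is None:
            return False
        if now_ms - self.pending_since >= self.grace_ms:
            self.pending_since = None
            return True
        return False


confirmer = GracePeriodConfirmer(grace_ms=150)
confirmer.on_endpoint(1000)
too_early = confirmer.poll(1100)   # still inside the grace period
confirmer.on_speech(1120)          # breath pause ended, speech resumed
cancelled = confirmer.poll(1300)   # nothing pending any more
confirmer.on_endpoint(2000)
confirmed = confirmer.poll(2150)   # grace period elapsed with no speech
```

The grace period is paid on every turn, so it trades a fixed latency cost for fewer premature cut-offs.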
Partial vs final accuracy
Partials can be wrong and get corrected:
- Partial: "I want to book an apartment."
- Final: "I want to book an appointment."
Don't commit to actions on partials — wait for final.
Confidence scores
STT provides per-word confidence:
- High confidence: act on it.
- Low confidence: maybe ask for clarification.
Use for quality flagging, not outright rejection (usually).
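A minimal sketch of flagging rather than rejecting, assuming the engine returns `(word, confidence)` pairs with confidence in the range 0 to 1. The 0.6 threshold and the function names are illustrative, and confidence scales are not comparable across vendors.

```python
def low_confidence_words(words, threshold: float = 0.6) -> list:
    """Return the words whose confidence falls below the threshold."""
    return [word for word, confidence in words if confidence < threshold]


def needs_clarification(words, critical_terms, threshold: float = 0.6) -> bool:
    """Ask the caller to repeat only when a low-confidence word is one that matters."""
    critical = {term.lower() for term in critical_terms}
    return any(word.lower() in critical for word in low_confidence_words(words, threshold))


result = [("book", 0.95), ("an", 0.98), ("appointment", 0.41), ("tomorrow", 0.88)]
flagged = low_confidence_words(result)
ask_again = needs_clarification(result, critical_terms=["appointment"])
```

Restricting clarification to critical terms keeps the agent from interrupting over low-confidence filler words.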
Common pitfalls
Non-streaming STT. Kills latency. Don't.
Acting on partials. Brittle. Wait for final.
Default endpointing. Often too conservative. Tune per use case.
Wrong sample rate. Degraded WER.
No vocabulary biasing. Missing major accuracy gains.
Related reading
- Latency Engineering for Real-Time Voice Agents
- How to Benchmark a Voice Agent's End-to-End Latency
- Streaming Audio Over WebRTC for Voice Agents
- Echo Cancellation in Real-Time Voice AI
- How Background Noise Affects Voice Agent Accuracy
FAQ
Can we switch STT vendors mid-call? Not practical. Pick one per deployment.
Does streaming STT support recording? Most do. Some have separate recording API.
What about multilingual detection in-call? Limited. Usually language set at start.
Can we get more context-aware partials? Some engines offer context rescoring. Usually small win.
How do we handle microphone / line quality issues? Acoustic preprocessing (noise suppression, AGC) at client side.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.