Non-streaming speech-to-text works for transcription — you submit audio, wait, get a transcript. That pattern is fine for batch use cases but fatal for voice agents. In a voice conversation, every millisecond between when the caller finishes speaking and when the agent responds matters. Streaming STT — emitting partial transcripts continuously during speech — is the foundation of low-latency voice AI. This piece covers how it works, what to tune, and the tradeoffs.

TL;DR

Streaming STT emits partial transcripts during speech and finalizes after endpoint.
Final transcript latency after endpoint: 50-150ms target.
Partials let downstream stages (LLM) start processing before caller finishes.
VAD and endpointing are the latency bottleneck after raw STT.
Pick a vendor based on latency, WER, language coverage, and cost.

Streaming vs batch

Batch:

Submit complete audio.
Wait for response.
Higher accuracy (full context).
High latency — unusable for voice agents.

Streaming:

Send audio chunks continuously.
Receive partial transcripts (updated as audio arrives).
Receive final transcript when speech ends.
Marginally lower accuracy than batch.
Low latency — required for voice agents.

The phases

Audio streaming. Client sends 20-100ms audio chunks.

Partial transcripts. Server emits ongoing best-guess transcripts:

After 200ms: "I'd like"
After 400ms: "I'd like to book"
After 600ms: "I'd like to book an appointment"

Each update refines the prior.

Endpoint detection. Server (or client-side VAD) detects end of speech.

Final transcript. Server finalizes the transcript, often different from last partial (cleaner).
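The partial and final phases above can be sketched as a small state holder. This is a minimal illustration, not any provider's client: the `TranscriptState` class and the `(text, is_final)` event shape are assumptions made for the example. The point is that each partial replaces the previous one, and the final replaces the last partial.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptState:
    """Tracks one conversation's text as STT events arrive (hypothetical helper)."""
    committed: List[str] = field(default_factory=list)  # finalized utterances
    pending: str = ""                                   # latest partial, may still change

    def on_event(self, text: str, is_final: bool) -> None:
        if is_final:
            # The final supersedes whatever the last partial said.
            self.committed.append(text)
            self.pending = ""
        else:
            # Each partial replaces the previous one; partials are never appended.
            self.pending = text

    def display(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


state = TranscriptState()
for text, is_final in [
    ("I'd like", False),
    ("I'd like to book", False),
    ("I'd like to book an apartment", False),      # wrong partial
    ("I'd like to book an appointment.", True),    # corrected final
]:
    state.on_event(text, is_final)

print(state.display())  # I'd like to book an appointment.
```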

The latency target

Partials: an update every 100-200ms during speech. This is the update latency.
Final after endpoint: 50-150ms. This is the important one.

Users don't notice partial latency; they notice the gap between finishing speech and agent responding.
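To see where that gap comes from, it helps to add up the stages between the caller's last word and the agent's first audio. The sketch below is simple arithmetic. The silence threshold and STT figures come from this article; the LLM and TTS figures are placeholder assumptions, so substitute your own measurements.

```python
def response_gap_ms(silence_threshold_ms, stt_final_ms,
                    llm_first_token_ms, tts_first_audio_ms):
    """Time from the caller's last word to the agent's first audio, in ms."""
    return (silence_threshold_ms + stt_final_ms
            + llm_first_token_ms + tts_first_audio_ms)


budget = {
    "silence_threshold_ms": 600,  # from the typical 500-700ms endpointing range
    "stt_final_ms": 100,          # from the 50-150ms final-transcript target
    "llm_first_token_ms": 300,    # assumption, not a figure from this article
    "tts_first_audio_ms": 150,    # assumption, not a figure from this article
}

total = response_gap_ms(**budget)
share = budget["silence_threshold_ms"] / total
print(total, f"{share:.0%}")  # 1150 52%
```

Under these assumed numbers, waiting for silence accounts for about half of the gap, which is why endpoint tuning gets its own sections later on.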

VAD + endpointing

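After STT itself, these two stages add the most latency. With typical silence thresholds, endpointing can cost more than STT finalization: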

Voice activity detection (VAD): Identifies speech vs silence.

Endpointing: Decides when the caller is definitely done speaking.

Two approaches:

STT-driven endpointing. Server emits "I think you're done" signal. Simple.
Client-side VAD. More control; faster but requires local integration.

Tuning:

Short silence threshold (300ms): fast response, risk of cutting off thinking.
Long silence threshold (800ms): safe but sluggish.

Typical: 500-700ms silence threshold.
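As a rough illustration of how a silence threshold turns into an endpoint, here is a minimal client-side sketch. It uses a plain energy threshold as the VAD, which is an assumption made for brevity; production systems generally use a trained VAD model. The `SilenceEndpointer` class, the 20ms frame size, and the energy threshold of 500 are all illustrative.

```python
import math
import struct

FRAME_MS = 20  # each frame passed to process() is assumed to hold 20ms of audio


def rms(frame: bytes) -> float:
    """Root-mean-square level of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class SilenceEndpointer:
    def __init__(self, silence_threshold_ms=600, energy_threshold=500.0):
        self.silence_threshold_ms = silence_threshold_ms
        self.energy_threshold = energy_threshold
        self.heard_speech = False
        self.silence_ms = 0

    def process(self, frame: bytes) -> bool:
        """Feed one frame; return True on the frame where the endpoint fires."""
        if rms(frame) >= self.energy_threshold:
            self.heard_speech = True
            self.silence_ms = 0          # any speech resets the silence timer
            return False
        if not self.heard_speech:
            return False                 # leading silence never ends a turn
        self.silence_ms += FRAME_MS
        if self.silence_ms >= self.silence_threshold_ms:
            self.heard_speech = False    # reset for the next utterance
            self.silence_ms = 0
            return True
        return False


loud = struct.pack("<160h", *([2000] * 160))  # 20ms of 8 kHz "speech"
quiet = bytes(320)                            # 20ms of digital silence

endpointer = SilenceEndpointer(silence_threshold_ms=600)
for _ in range(10):
    endpointer.process(loud)
frames_until_endpoint = next(
    i for i in range(1, 100) if endpointer.process(quiet)
)
print(frames_until_endpoint * FRAME_MS)  # 600
```

Lowering `silence_threshold_ms` to 300 makes the endpoint fire after 15 quiet frames instead of 30, which is the fast-but-risky end of the tradeoff.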

See voice activity detection in production voice agents.

Using partials

Partials let you start downstream work before the caller finishes:

LLM preprocessing on the partial transcript.
Intent classification early.
Function call prefetch if you know what they'll likely need.

Most sophisticated implementations use partials only to warm up the LLM, then run full inference on the final transcript.
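One way to picture early work on partials is speculative prefetch: start a read-only lookup as soon as a partial suggests an intent, then keep or discard the result once the final arrives. The sketch below assumes a toy keyword matcher (`guess_intent`) and a fake lookup (`prefetch`); both are hypothetical stand-ins, and nothing irreversible happens before the final.

```python
import asyncio
from typing import Optional


def guess_intent(text: str) -> Optional[str]:
    """Toy keyword matcher standing in for a real intent classifier."""
    lowered = text.lower()
    if "book" in lowered or "appointment" in lowered:
        return "booking"
    if "cancel" in lowered:
        return "cancellation"
    return None


async def prefetch(intent: str) -> str:
    """Stand-in for a read-only lookup, such as loading open calendar slots."""
    await asyncio.sleep(0.01)
    return f"data for {intent}"


async def handle_utterance(events):
    """Consume (text, is_final) events; return (final_text, prefetched_data)."""
    task, task_intent = None, None
    for text, is_final in events:
        intent = guess_intent(text)
        if not is_final:
            # Speculate only when the guess changes, and drop the stale guess.
            if intent and intent != task_intent:
                if task:
                    task.cancel()
                task, task_intent = asyncio.create_task(prefetch(intent)), intent
            continue
        if task and intent == task_intent:
            return text, await task      # speculation paid off
        if task:
            task.cancel()                # speculation was wrong; discard it
        return text, (await prefetch(intent) if intent else None)
    if task:
        task.cancel()                    # stream ended without a final
    return None, None


events = [
    ("I'd like", False),
    ("I'd like to book", False),
    ("I'd like to book an appointment", True),
]
print(asyncio.run(handle_utterance(events)))
```

The lookup starts on the second partial, so by the time the final arrives the data is already in flight.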

Integration pattern

Client (voice agent):
  Establish WebSocket to STT provider.
  Stream audio chunks continuously.
  Receive partial transcripts.
  On endpoint signal:
    Final transcript in ~100ms.
    Pass final to LLM.
    LLM processes → TTS → audio.
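The outline above might look like this in asyncio. To keep the sketch runnable without a vendor account, the STT provider is replaced by `fake_stt`, an in-process stand-in that treats each "audio chunk" as one already-decoded word; the queue layout, event dictionaries, and `respond` stub are assumptions, not any provider's protocol. A real client would put a WebSocket where the two queues are.

```python
import asyncio


async def send_audio(chunks, to_stt: asyncio.Queue) -> None:
    """Stream chunks as they are captured; None marks the end of speech."""
    for chunk in chunks:
        await to_stt.put(chunk)
    await to_stt.put(None)


async def fake_stt(to_stt: asyncio.Queue, events: asyncio.Queue) -> None:
    """Hypothetical stand-in for a provider session: partials, then one final."""
    words = []
    while True:
        chunk = await to_stt.get()
        if chunk is None:
            await events.put({"type": "final", "text": " ".join(words) + "."})
            return
        words.append(chunk)  # pretend each chunk decodes to exactly one word
        await events.put({"type": "partial", "text": " ".join(words)})


async def respond(final_text: str) -> str:
    """Stand-in for the LLM and TTS stages."""
    return f"agent heard: {final_text}"


async def run_turn(chunks, respond_fn) -> str:
    to_stt, events = asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(send_audio(chunks, to_stt)),
        asyncio.create_task(fake_stt(to_stt, events)),
    ]
    try:
        while True:
            event = await events.get()
            if event["type"] == "final":
                # Only the final transcript is handed to the LLM.
                return await respond_fn(event["text"])
            # Partials could drive a live caption or LLM warmup here.
    finally:
        for task in tasks:
            task.cancel()


print(asyncio.run(run_turn(["book", "an", "appointment"], respond)))
# agent heard: book an appointment.
```

Sending and receiving run as separate tasks so that audio keeps flowing while transcripts come back, which is the property that matters when you swap in a real connection.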

Sample rate and format

8 kHz: phone (PSTN) narrowband.
16 kHz: wideband (HD voice); the rate most STT models expect.
48 kHz: fullband; the native rate of Opus, the default WebRTC audio codec.

STT engines typically resample internally. Send at phone sample rate for phone calls; no need to upsample.
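Sample rate also determines how many bytes make up each chunk you send. A small sketch of the arithmetic, assuming 16-bit mono PCM (the helper names are illustrative):

```python
from typing import Iterator


def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one chunk of raw PCM audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Split a PCM buffer into fixed-size chunks (the last may be shorter)."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


size = chunk_bytes(8000, 20)           # 20ms of phone audio
one_second = bytes(8000 * 2)           # 1s of 8 kHz 16-bit silence
chunks = list(iter_chunks(one_second, size))
print(size, len(chunks))               # 320 50
print(chunk_bytes(16000, 20))          # 640
print(chunk_bytes(48000, 20))          # 1920
```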

Bandwidth

Streaming STT requires continuous bandwidth:

8 kHz 16-bit: 128 kbps (16 kB/s) uncompressed.
Compressed (Opus): 8-32 kbps.

Manageable.
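Uncompressed PCM bitrate is sample rate times bit depth times channel count, so the numbers are easy to check for any format. The function name below is illustrative:

```python
def pcm_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


print(pcm_kbps(8000))                     # 128.0  (8 kHz, 16-bit linear PCM)
print(pcm_kbps(8000, bits_per_sample=8))  # 64.0   (8 kHz, 8-bit, as in G.711)
print(pcm_kbps(16000))                    # 256.0  (16 kHz, 16-bit linear PCM)
```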

Provider comparison

Rough 2026 snapshot on US English phone audio:

Deepgram Nova-3: fastest, ~40-80ms to final. WER: 5-7%.

AssemblyAI Nano: fast, ~60-100ms. WER: 6-8%.

Google Cloud Speech: moderate, ~80-150ms. WER: 6-9%.

OpenAI Realtime (integrated): ~60-120ms. WER: varies.

Cartesia: fast. WER: 5-8%.

Whisper (cloud) streaming variants: 80-150ms.

Pick based on latency + WER + cost + language.
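Published WER figures rarely match your own callers and line conditions, so it is worth scoring candidate vendors on your own recordings. Below is a minimal sketch of word error rate as word-level edit distance divided by reference length. It lowercases and splits on whitespace only; real evaluations usually also normalize punctuation and numbers, which this sketch does not attempt.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


score = wer("I want to book an appointment", "I want to book an apartment")
print(round(score, 3))  # 0.167
```

One substituted word out of six gives a WER of about 17%, which shows how much a single domain term can move the number on short utterances.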

Domain vocabulary biasing

Pass hotwords / vocabulary to STT:

Company names.
Product names.
Industry terms.
Proper nouns common in your domain.

This can substantially reduce errors on domain-specific terms, which general-purpose models tend to mishear.
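How the list is passed differs by vendor, so the sketch below covers only the vendor-neutral part: cleaning and deduplicating terms before they go into a session config. The config keys (`language`, `sample_rate_hz`, `bias_phrases`) and the cap of 100 terms are hypothetical; check your provider's documentation for its actual parameter names and limits.

```python
def build_bias_list(terms, max_terms=100):
    """Normalize whitespace, drop empties and case-insensitive duplicates, cap length."""
    seen, cleaned_terms = set(), []
    for term in terms:
        cleaned = " ".join(term.split())
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            cleaned_terms.append(cleaned)
    return cleaned_terms[:max_terms]


# Hypothetical session config; real field names depend on the vendor.
session_config = {
    "language": "en-US",
    "sample_rate_hz": 8000,
    "bias_phrases": build_bias_list(
        ["Acme  Health", "acme health", "", "Rx refill", "copay"]
    ),
}
print(session_config["bias_phrases"])  # ['Acme Health', 'Rx refill', 'copay']
```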

Multilingual streaming

Some vendors: specify language upfront.
Others: auto-detect from first utterance.
Switch mid-call: harder, usually requires restart.

Weigh the complexity of automatic language detection against explicit per-call configuration.

Handling false starts

Callers often false-start:

"I'd like to... actually, let me..."
STT captures both.
Downstream handling: LLM interprets; correction handled in context.

No special STT treatment needed typically.

Endpoint tuning

For different contexts:

Fast conversational: 400-500ms silence threshold.
Informational (data capture): 700-900ms — let caller think.
Noisy environments: longer threshold to avoid false endpoints.

Tune per deployment.
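Per-deployment tuning can be as simple as a lookup table. In this sketch the base values sit in the middle of the ranges above, and the extra 200ms for noisy lines is an assumed starting point rather than a recommendation:

```python
# Midpoints of the ranges listed above.
SILENCE_THRESHOLD_MS = {
    "conversational": 450,  # 400-500ms
    "data_capture": 800,    # 700-900ms
}
NOISY_EXTRA_MS = 200        # assumption: tune against real call recordings


def silence_threshold_ms(context: str, noisy: bool = False) -> int:
    """Pick a silence threshold for a deployment context."""
    base = SILENCE_THRESHOLD_MS[context]
    return base + NOISY_EXTRA_MS if noisy else base


print(silence_threshold_ms("conversational"))            # 450
print(silence_threshold_ms("data_capture", noisy=True))  # 1000
```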

Mid-utterance stops

STT sometimes stops mid-phrase due to:

Brief pause (user breathing).
Noise (detected as speech end).
STT engine quirk.

Mitigation: hold the endpoint for a short grace period and cancel it if speech resumes, or use confidence thresholds to decide whether to accept it.
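A grace period can be sketched as a tentative endpoint that is confirmed only if no speech arrives within a short window. The class below is illustrative, and the 150ms default is an assumption; timestamps are passed in explicitly so the logic can be tested without a real clock.

```python
class GracePeriodEndpoint:
    """Confirms an endpoint only if speech does not resume within grace_ms."""

    def __init__(self, grace_ms: int = 150):
        self.grace_ms = grace_ms
        self.pending_since = None  # time the tentative endpoint was raised

    def on_endpoint(self, now_ms: int) -> None:
        # Keep the earliest tentative endpoint if one is already pending.
        if self.pending_since is None:
            self.pending_since = now_ms

    def on_speech(self, now_ms: int) -> None:
        # Caller kept talking: the endpoint was a false alarm.
        self.pending_since = None

    def confirmed(self, now_ms: int) -> bool:
        """Poll regularly; True once when the grace period has fully elapsed."""
        if self.pending_since is None:
            return False
        if now_ms - self.pending_since >= self.grace_ms:
            self.pending_since = None
            return True
        return False


gate = GracePeriodEndpoint(grace_ms=150)

gate.on_endpoint(1000)       # brief pause detected
gate.on_speech(1080)         # caller resumes 80ms later
print(gate.confirmed(1200))  # False

gate.on_endpoint(2000)       # real end of turn
print(gate.confirmed(2100))  # False
print(gate.confirmed(2150))  # True
```

The cost is that every confirmed endpoint is delayed by the grace period, so the window should stay small relative to the silence threshold.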

Partial vs final accuracy

Partials can be wrong and get corrected:

Partial: "I want to book an apartment."
Final: "I want to book an appointment."

Don't commit to actions on partials — wait for final.

Confidence scores

Most STT engines provide per-word confidence:

High confidence: act on it.
Low confidence: maybe ask for clarification.

Use for quality flagging, not outright rejection (usually).
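A minimal sketch of flagging rather than rejecting: collect low-confidence words for review, and ask for clarification only when a large share of the utterance is doubtful. The word dictionary shape and both thresholds (0.6 and 0.3) are assumptions; field names and sensible cutoffs vary by engine.

```python
def low_confidence_words(words, threshold=0.6):
    """Return the words whose confidence falls below the threshold."""
    return [w["word"] for w in words if w["confidence"] < threshold]


def next_action(words, threshold=0.6, max_low_ratio=0.3):
    """'clarify' if too much of the utterance is doubtful, else 'proceed'."""
    if not words:
        return "clarify"
    low = low_confidence_words(words, threshold)
    return "clarify" if len(low) / len(words) > max_low_ratio else "proceed"


words = [
    {"word": "book", "confidence": 0.95},
    {"word": "an", "confidence": 0.90},
    {"word": "appointment", "confidence": 0.55},
    {"word": "tomorrow", "confidence": 0.92},
]
print(low_confidence_words(words))  # ['appointment']
print(next_action(words))           # proceed
```

Here one doubtful word out of four is logged for quality review, but the turn still proceeds.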

Common pitfalls

Non-streaming STT. Kills latency. Don't.

Acting on partials. Brittle. Wait for final.

Default endpointing. Often too conservative. Tune per use case.

Wrong sample rate. Degraded WER.

No vocabulary biasing. Missing major accuracy gains.

FAQ

Can we switch STT vendors mid-call? Not practical. Pick one per deployment.

Does streaming STT support recording? Most do. Some have a separate recording API.

What about multilingual detection in-call? Limited. Usually language set at start.

Can we get more context-aware partials? Some engines offer context rescoring. Usually small win.

How do we handle microphone / line quality issues? Acoustic preprocessing (noise suppression, AGC) at client side.

Streaming STT: How to Cut Recognition Latency
