How Sample Rate Affects Voice Agent Quality
Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about it. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz. Somewhere in the stack, audio gets resampled, and the resampling may or may not preserve quality. Understanding sample rate well enough to debug quality issues is foundational voice engineering.
TL;DR
- Sample rate = audio samples per second. Higher = more detail captured.
- Phone audio is 8 kHz narrowband; WebRTC/HD voice is 16 kHz; studio is 48 kHz.
- Match STT and TTS sample rates to actual audio; don't upsample blindly.
- Downsampling is fine; upsampling adds nothing.
- Inconsistencies cause subtle quality issues.
Sample rate basics
Audio is digitized by sampling the analog waveform. Sample rate determines:
- Frequency range captured. Max captured frequency = half the sample rate (the Nyquist limit).
- Audio bandwidth: the width of that captured range.
For human voice:
- 8 kHz sample rate: captures up to 4 kHz. Cuts off at 3.4 kHz in practice (telephony). Voice intelligible, some detail lost.
- 16 kHz: captures up to 8 kHz. Much better voice quality.
- 48 kHz: captures up to 24 kHz. Music-quality; overkill for voice.
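The half-the-sample-rate rule can be checked numerically. The sketch below is illustrative only (the helper names are made up for this example): it prints the Nyquist limit for the three rates above, then shows that a 5 kHz tone sampled at 8 kHz yields the same samples as a 3 kHz tone, which is what "not captured" means in practice.

```python
import math


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz: float, sample_rate_hz: int, n_samples: int) -> list[float]:
    """Sample a cosine tone of freq_hz at sample_rate_hz."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling captures up to {nyquist_hz(rate):.0f} Hz")

# 5 kHz is above the 4 kHz limit of 8 kHz sampling, so its samples are
# indistinguishable from those of a 3 kHz tone (8000 - 5000 = 3000).
above_limit = sample_tone(5_000, 8_000, 16)
alias = sample_tone(3_000, 8_000, 16)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(above_limit, alias))
print("5 kHz tone aliases to 3 kHz at 8 kHz sampling:", same)
```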
Why phone is 8 kHz
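The PSTN's digital encoding dates to the 1960s and was designed for voice, not music. Most of what makes speech intelligible lies below about 3.4 kHz, and 8 kHz sampling captures that band with some headroom under the 4 kHz Nyquist limit. It was also bandwidth-efficient for the networks of that era.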
Modern networks could carry more, but PSTN infrastructure is stuck at 8 kHz for compatibility.
HD voice
16 kHz sampling = HD voice. Much better quality:
- Consonants clearer.
- Sibilants (s, sh) preserved.
- Overall more natural.
Delivered via:
- G.722 codec on SIP/VoIP.
- Opus wideband on WebRTC.
- Modern mobile networks (LTE/5G).
Sample rates in the voice pipeline
Typical paths:
Phone (PSTN) call:
- 8 kHz throughout.
- STT expects 8 kHz.
- TTS outputs 24 kHz, downsampled to 8 kHz for delivery.
WebRTC call:
- 16 kHz wideband throughout.
- STT at 16 kHz.
- TTS at 24 kHz, downsampled to 16 kHz.
Match sample rates; don't fight them.
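One way to make these paths explicit is to write each one down as a list of stages and let code report where resampling happens. This is a minimal sketch: the `Stage` type and the stage names are hypothetical, and the rates are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    rate_hz: int


def resampling_steps(stages: list[Stage]) -> list[str]:
    """Describe every rate change between consecutive pipeline stages."""
    steps = []
    for src, dst in zip(stages, stages[1:]):
        if src.rate_hz == dst.rate_hz:
            continue  # rates already match, nothing to resample
        kind = "downsample" if src.rate_hz > dst.rate_hz else "upsample"
        steps.append(
            f"{kind} {src.rate_hz} -> {dst.rate_hz} Hz "
            f"between {src.name} and {dst.name}"
        )
    return steps


# Hypothetical paths using the rates from the lists above.
pstn_inbound = [Stage("pstn_leg", 8_000), Stage("stt", 8_000)]
pstn_outbound = [Stage("tts", 24_000), Stage("pstn_leg", 8_000)]
webrtc_outbound = [Stage("tts", 24_000), Stage("webrtc_leg", 16_000)]

for path in (pstn_inbound, pstn_outbound, webrtc_outbound):
    print(resampling_steps(path) or "no resampling needed")
```

An empty result means the path is rate-matched. Any "upsample" entry is worth a second look.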
Resampling
Converting between sample rates:
- Downsampling (higher to lower): removes high-frequency content. Usually fine, provided the resampler low-pass filters first to avoid aliasing.
- Upsampling (lower to higher): interpolates. Doesn't add real information. Sometimes required for STT models but doesn't improve quality.
Downsample: fine. Upsample: required sometimes, not beneficial.
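The claim that upsampling adds no information can be shown with a toy example. The sketch below uses plain linear interpolation, which is a simplification chosen for readability; real resamplers (ffmpeg, sox) use band-limited filters. Every new sample is computed from its neighbors, and dropping the new samples returns the original signal exactly.

```python
def upsample_linear(samples: list[float], factor: int) -> list[float]:
    """Upsample by an integer factor using linear interpolation (toy example)."""
    if not samples:
        return []
    out = []
    for cur, nxt in zip(samples, samples[1:]):
        for k in range(factor):
            # k == 0 reproduces the original sample; the rest are in-between guesses.
            out.append(cur + (nxt - cur) * k / factor)
    out.append(samples[-1])
    return out


narrowband = [0.0, 1.0, 0.0, -1.0]          # stand-in for 8 kHz audio
wideband = upsample_linear(narrowband, 2)   # stand-in for 16 kHz audio

print(wideband)                      # twice the rate, same underlying content
print(wideband[::2] == narrowband)   # the original samples are all that is real
```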
STT sample rate
STT models are trained on specific rates:
- Some models are trained on 16 kHz audio; they handle upsampled 8 kHz input, but with slightly lower accuracy.
- Some models trained on 8 kHz; handle phone audio natively.
- Some support multiple rates.
For phone calls, prefer STT trained on 8 kHz: better accuracy.
TTS sample rate
TTS models output audio at their native rate (usually 22.05 kHz or 24 kHz). Downstream:
- Phone: downsampled to 8 kHz. Quality loss but unavoidable.
- HD voice: downsampled to 16 kHz. Much less loss.
Format consistency
Audio pipeline needs consistent format:
- Sample rate.
- Bit depth (usually 16-bit).
- Channels (usually mono for voice).
- Encoding (linear PCM typical).
Mismatches cause:
- Silent glitches.
- Wrong speed playback.
- Distortion.
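A format check between two stages, plus the arithmetic behind wrong-speed playback, might look like the sketch below. `AudioFormat` and its field names are hypothetical, and the `encoding` string is only a label for comparison, not tied to any particular library.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AudioFormat:
    sample_rate_hz: int
    bit_depth: int = 16
    channels: int = 1
    encoding: str = "linear_pcm"  # label only, used for comparison


def format_mismatches(producer: AudioFormat, consumer: AudioFormat) -> list[str]:
    """Names of the fields on which two stages disagree."""
    return [
        f.name
        for f in fields(AudioFormat)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


def playback_speed(actual_rate_hz: int, assumed_rate_hz: int) -> float:
    """Speed factor when audio at actual_rate_hz is played as assumed_rate_hz."""
    return assumed_rate_hz / actual_rate_hz


tts_out = AudioFormat(sample_rate_hz=24_000)
phone_leg = AudioFormat(sample_rate_hz=8_000)

print(format_mismatches(tts_out, phone_leg))
# 24 kHz audio interpreted as 8 kHz plays at one third speed, and lower in pitch.
print(playback_speed(actual_rate_hz=24_000, assumed_rate_hz=8_000))
```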
PSTN bandwidth limitation
Even with high-quality TTS, over PSTN:
- Audio capped at 3.4 kHz.
- Subtle TTS quality details lost.
- "Phone quality" sound.
Unavoidable when talking to regular phones.
Subjective quality
- 8 kHz PSTN: recognizable as "phone call." OK.
- 16 kHz HD: "in-person call quality." Clearly better.
- 48 kHz: indistinguishable from in-person for most listeners.
Voice agents on PSTN: 8 kHz. Voice agents in browser/app: 16 kHz easily achievable.
Mobile considerations
Modern mobile:
- Cellular codec may be AMR-WB (16 kHz).
- Transcoded to 8 kHz at PSTN gateway.
- If both parties are mobile, can preserve 16 kHz.
Depends on carrier and path.
WebRTC advantages
WebRTC calls can stay 16 kHz end-to-end:
- Modern browsers.
- Modern networks.
- Opus at wideband.
For voice-in-app scenarios, WebRTC delivers quality advantages.
Measuring sample rate in production
Audio file headers specify sample rate. Spot checks:
- Log sample rate per call.
- Verify STT and TTS configs match.
- Check for unnecessary transcoding.
Tools
- ffmpeg / sox: resample audio files.
- WAV file inspection: header contains sample rate.
- Audio editors: visualize spectrum.
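For the WAV header check, Python's standard-library `wave` module is enough. The sketch below writes a short silent file in memory so that it is self-contained, then reads the header back; in practice you would open a recorded file instead. The helper names are made up for this example.

```python
import io
import wave


def write_silence(sample_rate_hz: int, n_frames: int) -> bytes:
    """Build a mono 16-bit WAV of silence in memory (test fixture only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # bytes per sample, so 16-bit
        w.setframerate(sample_rate_hz)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read sample rate, bit depth, channels and duration from a WAV header."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return {
            "sample_rate_hz": r.getframerate(),
            "bit_depth": r.getsampwidth() * 8,
            "channels": r.getnchannels(),
            "duration_s": r.getnframes() / r.getframerate(),
        }


info = describe_wav(write_silence(sample_rate_hz=8_000, n_frames=800))
print(info)
```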
Debugging audio quality
If audio sounds wrong:
- Check sample rates through the pipeline.
- Check for unnecessary resampling.
- Listen to samples at each stage.
Common pitfalls
Mismatched STT rate. STT expects 16 kHz; you feed 8 kHz. May work (with upsampling) but not optimal.
Unnecessary upsampling. 8 kHz phone audio upsampled to 48 kHz before STT. Wasteful.
Bit depth mismatch. One stage produces 16-bit audio and the next expects 8-bit, or the reverse. The result is audibly degraded.
Channel confusion. Stereo mic mixed incorrectly to mono.
Format conversion loss. Multiple format conversions compound loss.
When sample rate matters less
- STT quality plateaus above 16 kHz for voice.
- TTS voice quality plateaus above 22 kHz.
- For phone calls, you're 8 kHz anyway.
When it matters most
- HD voice deployments.
- Music or complex audio in the pipeline.
- Multi-rate pipelines.
Related reading
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- How to Benchmark a Voice Agent's End-to-End Latency
- Streaming Audio Over WebRTC for Voice Agents
FAQ
Should I use 16 kHz STT for phone calls? If your STT supports 8 kHz well, use that. If it only supports 16 kHz, upsample; that is acceptable.
Does TTS sample rate matter for phone delivery? Slightly. High-rate TTS downsampled may preserve more detail than low-rate TTS. Marginal difference.
How do I know what sample rate my pipeline uses? Inspect packets (SIP SDP negotiation) or log config.
Can we upgrade from 8 kHz to 16 kHz mid-call? Not typically. Rate set at call setup.
What about 48 kHz "studio" TTS? Overkill for voice agents. 22-24 kHz is the sweet spot.
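For the SDP inspection mentioned above, the negotiated codecs appear on `a=rtpmap` lines. The sketch below pulls out the codec name and RTP clock rate from a hypothetical SDP fragment. One caveat: the RTP clock rate is not always the audio sample rate. G.722 is advertised as 8000 even though it samples at 16 kHz (RFC 3551), and Opus is always advertised as 48000 (RFC 7587), so map the codec name to its real audio bandwidth rather than trusting the number alone.

```python
import re

# a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
RTPMAP = re.compile(
    r"^a=rtpmap:(\d+) ([^/\s]+)/(\d+)(?:/(\d+))?[ \t\r]*$",
    re.MULTILINE,
)


def rtp_clock_rates(sdp: str) -> dict[str, int]:
    """Map each codec name in an SDP body to its advertised RTP clock rate."""
    return {m.group(2): int(m.group(3)) for m in RTPMAP.finditer(sdp)}


# Hypothetical SDP fragment for illustration.
SAMPLE_SDP = """\
v=0
m=audio 49170 RTP/AVP 0 9 111
a=rtpmap:0 PCMU/8000
a=rtpmap:9 G722/8000
a=rtpmap:111 opus/48000/2
"""

rates = rtp_clock_rates(SAMPLE_SDP)
print(rates)
```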

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.