How Sample Rate Affects Voice Agent Quality

Tyler Weitzman
March 18, 2026 · 5 min read
Speechify

Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz. Somewhere in the stack, audio gets resampled, and the resampling may or may not preserve quality. Understanding sample rate well enough to debug quality issues is foundational voice engineering.

TL;DR

  • Sample rate = audio samples per second. Higher = more detail captured.
  • Phone audio is 8 kHz narrowband; WebRTC/HD voice is 16 kHz; studio is 48 kHz.
  • Match STT and TTS sample rates to actual audio; don't upsample blindly.
  • Downsampling is fine; upsampling adds nothing.
  • Inconsistencies cause subtle quality issues.

Sample rate basics

Audio is digitized by sampling the analog waveform. Sample rate determines:

  • Frequency range captured. Max captured frequency = half sample rate (Nyquist).
  • Audio bandwidth.

For human voice:

  • 8 kHz sample rate: captures up to 4 kHz. Cuts off at 3.4 kHz in practice (telephony). Voice intelligible, some detail lost.
  • 16 kHz: captures up to 8 kHz. Much better voice quality.
  • 48 kHz: captures up to 24 kHz. Music-quality; overkill for voice.
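
The Nyquist relationship above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the aliasing formula assumes a pure tone sampled with no anti-alias filter in front of the converter.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def alias_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency at which a pure tone appears after unfiltered sampling."""
    # Fold the tone into the range [0, sample_rate / 2].
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")

# A 5 kHz tone is above the 4 kHz limit of 8 kHz sampling,
# so it folds back and shows up as a 3 kHz tone.
print(alias_frequency_hz(5_000, 8_000))
```

This is why anything above half the sample rate has to be filtered out before sampling or downsampling: it does not disappear, it lands somewhere inside the voice band.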

Why phone is 8 kHz

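The PSTN's digital voice channel dates from the 1960s and was designed for speech, not music. Telephony passes roughly 300 Hz to 3.4 kHz, which is enough for intelligible speech even though consonants carry energy above that band. 8 kHz sampling covers that band with headroom below the 4 kHz Nyquist limit, and it was bandwidth-efficient for the networks of that era.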

Modern networks could carry more, but PSTN infrastructure is stuck at 8 kHz for compatibility.

HD voice

16 kHz sampling = HD voice. Much better quality:

  • Consonants clearer.
  • Sibilants (s, sh) preserved.
  • Overall more natural.

Delivered via:

  • G.722 codec on SIP/VoIP.
  • Opus wideband on WebRTC.
  • Modern mobile networks (LTE/5G).

Sample rates in the voice pipeline

Typical paths:

Phone (PSTN) call:

  • 8 kHz on the telephony leg.
  • STT expects 8 kHz.
  • TTS outputs 24 kHz, downsampled to 8 kHz for delivery.

WebRTC call:

  • 16 kHz wideband on the call leg.
  • STT at 16 kHz.
  • TTS at 24 kHz, downsampled to 16 kHz.

Match sample rates; don't fight them.
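
One way to make these paths explicit is to write each hop down with its rate and list where a conversion happens. This is a minimal sketch: the stage names and the `resampling_steps` helper are hypothetical, and the rates are the typical values quoted above rather than anything read from a real system.

```python
from typing import List, Tuple

Hop = Tuple[str, int]  # (stage name, sample rate in Hz)


def resampling_steps(path: List[Hop]) -> List[str]:
    """Describe each rate change between consecutive stages of a path."""
    steps = []
    for (src_name, src_rate), (dst_name, dst_rate) in zip(path, path[1:]):
        if src_rate == dst_rate:
            continue  # rates match, nothing to convert
        kind = "downsample" if dst_rate < src_rate else "upsample"
        steps.append(
            f"{kind} {src_rate} -> {dst_rate} Hz between {src_name} and {dst_name}"
        )
    return steps


# Hypothetical paths using the typical rates from the text.
pstn_inbound = [("pstn_leg", 8_000), ("stt", 8_000)]
pstn_outbound = [("tts", 24_000), ("pstn_leg", 8_000)]
webrtc_outbound = [("tts", 24_000), ("webrtc_leg", 16_000)]

for name, path in [("pstn in", pstn_inbound), ("pstn out", pstn_outbound),
                   ("webrtc out", webrtc_outbound)]:
    print(name, resampling_steps(path) or "no resampling")
```

A path with more conversions than expected is usually the first place to look when quality drops.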

Resampling

Converting between sample rates:

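  • Downsampling (higher to lower): removes high-frequency content. Usually fine, provided the resampler low-pass filters below the new Nyquist limit first; otherwise that content aliases into the voice band.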
  • Upsampling (lower to higher): interpolates. Doesn't add real information. Sometimes required for STT models but doesn't improve quality.

Downsample: fine. Upsample: required sometimes, not beneficial.
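
To illustrate both directions, here is a pure-Python sketch of a 2:1 downsampler and a 2x upsampler. It is not a production resampler: the 63-tap Hamming-windowed sinc filter, the cutoff at 90% of the new Nyquist limit, and the linear-interpolation upsampler are all choices made for this example. In practice a library or tool such as ffmpeg or sox would do this work.

```python
import math
from typing import List


def lowpass_taps(cutoff_hz: float, sample_rate_hz: float, num_taps: int = 63) -> List[float]:
    """Hamming-windowed sinc low-pass filter, normalized to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def downsample_by_2(samples: List[float], sample_rate_hz: int) -> List[float]:
    """Low-pass below the new Nyquist limit, then keep every second sample."""
    new_nyquist_hz = sample_rate_hz / 4
    taps = lowpass_taps(0.9 * new_nyquist_hz, sample_rate_hz)
    n = len(taps)
    # Only compute outputs where the filter fully overlaps the input.
    filtered = [
        sum(taps[k] * samples[i + k] for k in range(n))
        for i in range(len(samples) - n + 1)
    ]
    return filtered[::2]


def upsample_by_2(samples: List[float]) -> List[float]:
    """Linear interpolation: every new sample is derived from existing ones."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.extend([a, (a + b) / 2])
    out.append(samples[-1])
    return out


def tone(freq_hz: float, sample_rate_hz: int, count: int) -> List[float]:
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(count)]


def rms(samples: List[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# 16 kHz -> 8 kHz: a 1 kHz tone is below the new 4 kHz limit and survives;
# a 6 kHz tone is above it and is filtered out instead of aliasing.
kept = downsample_by_2(tone(1_000, 16_000, 1_600), 16_000)
removed = downsample_by_2(tone(6_000, 16_000, 1_600), 16_000)
print(round(rms(kept), 3), round(rms(removed), 3))
```

The upsampler makes the "adds nothing" point concrete: each inserted sample is an average of its neighbors, so the result has more samples but no more information about the original sound.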

STT sample rate

STT models are trained on specific rates:

  • Some models trained on 16 kHz; handle 8 kHz input (upsampled) but slightly lower accuracy.
  • Some models trained on 8 kHz; handle phone audio natively.
  • Some support multiple rates.

For phone calls, prefer STT trained on 8 kHz; it gives better accuracy on telephony audio.
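
The selection logic can be sketched as a small helper. The policy here is an assumption made for illustration, not a rule from any STT vendor: use the native rate when the model supports it, otherwise upsample as little as possible, and downsample only when no supported rate is at or above the source. `choose_stt_rate` is a hypothetical name.

```python
from typing import Iterable


def choose_stt_rate(source_rate_hz: int, supported_rates_hz: Iterable[int]) -> int:
    """Pick the rate to send to STT, given what the model accepts."""
    supported = sorted(supported_rates_hz)
    if not supported:
        raise ValueError("STT model must support at least one sample rate")
    if source_rate_hz in supported:
        return source_rate_hz  # native match: no resampling at all
    higher = [r for r in supported if r > source_rate_hz]
    if higher:
        return min(higher)  # smallest upsample that keeps all captured content
    return max(supported)  # source exceeds every option: downsample to the top one


print(choose_stt_rate(8_000, [8_000, 16_000]))   # phone audio, model handles 8 kHz
print(choose_stt_rate(8_000, [16_000, 48_000]))  # phone audio, 16 kHz-only model
print(choose_stt_rate(48_000, [8_000, 16_000]))  # studio mic, voice-rate model
```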

TTS sample rate

TTS outputs at native rate (usually 22.05 kHz or 24 kHz). Downstream:

  • Phone: downsampled to 8 kHz. Quality loss but unavoidable.
  • HD voice: downsampled to 16 kHz. Much less loss.

Format consistency

Audio pipeline needs consistent format:

  • Sample rate.
  • Bit depth (usually 16-bit).
  • Channels (usually mono for voice).
  • Encoding (linear PCM typical).

Mismatches cause:

  • Silent glitches.
  • Wrong speed playback.
  • Distortion.
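
A simple guard is to describe the format at each boundary and compare the two sides before streaming. This is a sketch with hypothetical names (`AudioFormat`, `format_mismatches`, `playback_speed`); the default values and the encoding label are illustrative, not taken from any particular SDK.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AudioFormat:
    sample_rate_hz: int = 16_000
    bit_depth: int = 16
    channels: int = 1
    encoding: str = "pcm_s16le"  # illustrative label for 16-bit linear PCM


def format_mismatches(produced: AudioFormat, expected: AudioFormat) -> List[str]:
    """List every field where the producer and consumer disagree."""
    return [
        f"{f.name}: produced {getattr(produced, f.name)}, "
        f"expected {getattr(expected, f.name)}"
        for f in fields(AudioFormat)
        if getattr(produced, f.name) != getattr(expected, f.name)
    ]


def playback_speed(actual_rate_hz: int, assumed_rate_hz: int) -> float:
    """Speed factor when audio at one rate is played as if it were another."""
    return assumed_rate_hz / actual_rate_hz


print(format_mismatches(AudioFormat(sample_rate_hz=8_000), AudioFormat()))
# 8 kHz audio played back as 16 kHz runs at double speed.
print(playback_speed(8_000, 16_000))
```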

PSTN bandwidth limitation

Even with high-quality TTS, over PSTN:

  • Audio capped at 3.4 kHz.
  • Subtle TTS quality details lost.
  • "Phone quality" sound.

Unavoidable when talking to regular phones.

Subjective quality

  • 8 kHz PSTN: recognizable as "phone call." OK.
  • 16 kHz HD: "in-person call quality." Clearly better.
  • 48 kHz: indistinguishable from in-person for most listeners.

Voice agents on PSTN: 8 kHz. Voice agents in browser/app: 16 kHz easily achievable.

Mobile considerations

Modern mobile:

  • Cellular codec may be AMR-WB (16 kHz).
  • Transcoded to 8 kHz at PSTN gateway.
  • If both parties are mobile, can preserve 16 kHz.

Depends on carrier and path.

WebRTC advantages

WebRTC calls can stay 16 kHz end-to-end:

  • Modern browsers.
  • Modern networks.
  • Opus at wideband.

For voice-in-app scenarios, WebRTC delivers quality advantages.

Measuring sample rate in production

Audio file headers specify sample rate. Spot checks:

  • Log sample rate per call.
  • Verify STT and TTS configs match.
  • Check for unnecessary transcoding.
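
For WAV recordings, the header check can be done with Python's standard `wave` module. The sketch below builds a small in-memory file so it runs without any recording on disk; `wav_format` is a hypothetical helper, and in practice you would pass it a path to a captured call.

```python
import io
import wave
from typing import Dict


def wav_format(wav_file) -> Dict[str, int]:
    """Read sample rate, bit depth, and channel count from a WAV header."""
    with wave.open(wav_file, "rb") as reader:
        return {
            "sample_rate_hz": reader.getframerate(),
            "bit_depth": reader.getsampwidth() * 8,
            "channels": reader.getnchannels(),
        }


# Build a tiny in-memory WAV (0.1 s of silence at 8 kHz) to demonstrate.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)  # 2 bytes per sample = 16-bit
    writer.setframerate(8_000)
    writer.writeframes(b"\x00\x00" * 800)
buffer.seek(0)

info = wav_format(buffer)
print(info)
```

Logging this dictionary per call makes it easy to spot a leg that is running at a different rate than the STT or TTS configuration assumes.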

Tools

  • ffmpeg / sox: resample audio files.
  • WAV file inspection: header contains sample rate.
  • Audio editors: visualize spectrum.

Debugging audio quality

If audio sounds wrong:

  • Check sample rates through the pipeline.
  • Check for unnecessary resampling.
  • Listen to samples at each stage.

Common pitfalls

Mismatched STT rate. STT expects 16 kHz; you feed 8 kHz. May work (with upsampling) but not optimal.

Unnecessary upsampling. 8 kHz phone audio upsampled to 48 kHz before STT. Wasteful.

Bit depth mismatch. 16-bit vs 8-bit. Quality difference audible.

Channel confusion. Stereo mic mixed incorrectly to mono.

Format conversion loss. Multiple format conversions compound loss.
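
The cost of the unnecessary-upsampling pitfall is easy to quantify for uncompressed PCM. The helper name below is hypothetical; the arithmetic is just rate times bytes per sample times channels.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> int:
    """Data rate of uncompressed linear PCM."""
    return sample_rate_hz * (bit_depth // 8) * channels


native = pcm_bytes_per_second(8_000)      # phone audio as received
upsampled = pcm_bytes_per_second(48_000)  # same audio after upsampling
print(native, upsampled, upsampled / native)
```

Upsampling 8 kHz phone audio to 48 kHz sends six times the data to STT while carrying the same 4 kHz of content.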

When sample rate matters less

  • STT quality plateaus above 16 kHz for voice.
  • TTS voice quality plateaus above 22 kHz.
  • For phone calls, you're 8 kHz anyway.

When it matters most

  • HD voice deployments.
  • Music or complex audio in the pipeline.
  • Multi-rate pipelines.

FAQ

Should I use 16 kHz STT for phone calls? If STT supports 8 kHz well, use that. If only 16 kHz, upsample; that is acceptable.

Does TTS sample rate matter for phone delivery? Slightly. High-rate TTS downsampled may preserve more detail than low-rate TTS. Marginal difference.

How do I know what sample rate my pipeline uses? Inspect packets (SIP SDP negotiation) or log config.

Can we upgrade from 8 kHz to 16 kHz mid-call? Not typically. Rate set at call setup.

What about 48 kHz "studio" TTS? Overkill for voice agents. 22-24 kHz is the sweet spot.
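
For the question about finding your pipeline's rate from SDP, the relevant lines are the `a=rtpmap` attributes. The sketch below pulls the codec name and RTP clock rate out of an SDP body; the SDP text and the `rtpmap_clock_rates` helper are made up for illustration. One caveat: the RTP clock rate is not always the audio sample rate. G.722 is advertised with a clock rate of 8000 even though it samples at 16 kHz, and Opus is always advertised as 48000 regardless of the bandwidth actually in use.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "a=rtpmap:0 PCMU/8000" or "a=rtpmap:111 opus/48000/2".
RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/\s]+)/(\d+)", re.MULTILINE)


def rtpmap_clock_rates(sdp: str) -> Dict[int, Tuple[str, int]]:
    """Map each RTP payload type to (codec name, advertised clock rate)."""
    return {int(pt): (codec, int(rate)) for pt, codec, rate in RTPMAP.findall(sdp)}


# Hypothetical SDP fragment for illustration.
example_sdp = """\
m=audio 49170 RTP/AVP 0 9 111
a=rtpmap:0 PCMU/8000
a=rtpmap:9 G722/8000
a=rtpmap:111 opus/48000/2
"""

for payload_type, (codec, clock_rate) in rtpmap_clock_rates(example_sdp).items():
    print(payload_type, codec, clock_rate)
```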

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
