Sample rate is one of those low-level audio details that voice agent builders often inherit without thinking about. The STT config says 16 kHz; the TTS outputs 24 kHz; the PSTN leg is 8 kHz. Somewhere in the stack, audio gets resampled, and the resampling may or may not preserve quality. Understanding sample rate well enough to debug quality issues is foundational voice engineering.

TL;DR

Sample rate = audio samples per second. Higher = more detail captured.
Phone audio is 8 kHz narrowband; WebRTC/HD voice is 16 kHz; studio is 48 kHz.
Match STT and TTS sample rates to actual audio; don't upsample blindly.
Downsampling is fine; upsampling adds nothing.
Inconsistencies cause subtle quality issues.

Sample rate basics

Audio is digitized by sampling the analog waveform. Sample rate determines:

Frequency range captured. Max captured frequency = half the sample rate (the Nyquist limit).
Audio bandwidth, and with it the raw data rate (sample rate × bit depth × channels).

For human voice:

8 kHz sample rate: captures up to 4 kHz in theory. Telephony band-limits to roughly 300 Hz to 3.4 kHz in practice. Voice intelligible, some detail lost.
16 kHz: captures up to 8 kHz. Much better voice quality.
48 kHz: captures up to 24 kHz. Music-quality; overkill for voice.
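The Nyquist relationship above can be sketched in a few lines. This is a minimal illustration, not production code; the function names `nyquist_hz` and `alias_hz` are made up for this example. It also shows what happens to content above the limit when nothing filters it out first: it folds back into the captured band.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0


def alias_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency at which a pure tone shows up after sampling with no anti-alias filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")

# A 5 kHz component sampled at 8 kHz without filtering folds back to 3 kHz,
# while a 3 kHz component (below the 4 kHz limit) is left where it is.
print(alias_hz(5_000, 8_000), alias_hz(3_000, 8_000))
```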

Why phone is 8 kHz

The digital PSTN was designed in the 1960s for voice, not music. Most of what makes speech intelligible sits below about 3.4 kHz, and 8 kHz sampling captures that with headroom. That kept bandwidth manageable for the networks of that era.

Modern networks could carry more, but PSTN infrastructure is stuck at 8 kHz for compatibility.

HD voice

16 kHz sampling = HD voice. Much better quality:

Consonants clearer.
Sibilants (s, sh) preserved.
Overall more natural.

Delivered via:

G.722 codec on SIP/VoIP.
Opus wideband on WebRTC.
Modern mobile networks (LTE/5G).

Sample rates in the voice pipeline

Typical paths:

Phone (PSTN) call:

8 kHz on the wire throughout.
STT expects 8 kHz.
TTS outputs 24 kHz, downsampled to 8 kHz for delivery.

WebRTC call:

16 kHz wideband on the wire throughout.
STT at 16 kHz.
TTS at 24 kHz, downsampled to 16 kHz.

Match sample rates; don't fight them.
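One way to keep these paths explicit is to write them down as data and derive the resampling ratio from it. The sketch below is an assumption about how such a config could look; `CallPlan` and its field names are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class CallPlan:
    transport: str
    line_rate_hz: int    # rate negotiated on the call leg
    stt_rate_hz: int     # rate the STT is fed
    tts_native_hz: int   # rate the TTS engine emits

    @property
    def tts_ratio(self) -> Fraction:
        """Output samples per input sample when converting TTS audio to the line rate."""
        return Fraction(self.line_rate_hz, self.tts_native_hz)

    @property
    def stt_needs_resample(self) -> bool:
        """True when inbound audio must be resampled before it reaches the STT."""
        return self.stt_rate_hz != self.line_rate_hz


PLANS = [
    CallPlan("pstn", 8_000, 8_000, 24_000),
    CallPlan("webrtc", 16_000, 16_000, 24_000),
]

for plan in PLANS:
    print(plan.transport, plan.tts_ratio, plan.stt_needs_resample)
```

A ratio such as 1/3 or 2/3 is cheap to implement; a 22.05 kHz TTS feeding an 8 kHz leg gives 160/441, which is a hint that the resampler will be doing more work.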

Resampling

Converting between sample rates:

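Downsampling (higher to lower): removes high-frequency content. Usually fine, provided the resampler low-pass filters before dropping samples; otherwise content above the new Nyquist limit aliases into the voice band.
Upsampling (lower to higher): interpolates between existing samples. Doesn't add real information. Sometimes required because an STT model only accepts a higher rate, but it doesn't improve quality.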

Downsample: fine. Upsample: required sometimes, not beneficial.
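To make the downsampling step concrete, here is a minimal pure-Python sketch of integer-factor decimation: a Hamming-windowed sinc low-pass followed by keeping every Nth sample. It is an illustration under simple assumptions (63 taps, cutoff at 90% of the new Nyquist limit, float samples in a list), and the helper names are invented for this example. In production you would normally rely on ffmpeg, sox, or a DSP library instead.

```python
import math


def lowpass_taps(cutoff: float, num_taps: int = 63) -> list[float]:
    """Hamming-windowed sinc low-pass. `cutoff` is a fraction of the input rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def downsample(samples: list[float], factor: int) -> list[float]:
    """Low-pass just below the new Nyquist limit, then keep every `factor`-th sample."""
    taps = lowpass_taps(cutoff=0.45 / factor)
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):  # treat audio outside the buffer as silence
                acc += tap * samples[j]
        out.append(acc)
    return out


def tone(freq_hz: float, rate_hz: int, seconds: float = 0.1) -> list[float]:
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(int(rate_hz * seconds))]


def rms(samples: list[float]) -> float:
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# 24 kHz -> 8 kHz: a 1 kHz tone should survive, a 6 kHz tone should be filtered out.
kept = downsample(tone(1_000, 24_000), 3)
removed = downsample(tone(6_000, 24_000), 3)
# Skip the first and last 20 samples, where the filter runs off the buffer edges.
print(len(kept), rms(kept[20:-20]), rms(removed[20:-20]))
```

Dropping samples without the low-pass step would leave the 6 kHz tone in the output as a spurious 2 kHz tone, which is the aliasing case a proper resampler is there to prevent.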

STT sample rate

STT models are trained on specific rates:

Some models are trained on 16 kHz audio; they accept 8 kHz input once it is upsampled, but accuracy is often slightly lower.
Some models trained on 8 kHz; handle phone audio natively.
Some support multiple rates.

For phone calls, prefer an STT model trained on 8 kHz telephony audio; it generally gives better accuracy than a 16 kHz model fed upsampled audio.

TTS sample rate

TTS outputs at native rate (usually 22.05 kHz or 24 kHz). Downstream:

Phone: downsampled to 8 kHz. Quality loss but unavoidable.
HD voice: downsampled to 16 kHz. Much less loss.

Format consistency

Audio pipeline needs consistent format:

Sample rate.
Bit depth (usually 16-bit).
Channels (usually mono for voice).
Encoding (linear PCM typical).

Mismatches cause:

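Silent failures (audio plays as noise or garbled sound, with no error raised).
Wrong speed and pitch on playback (audio at one rate interpreted as another).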
Distortion.
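A format check between adjacent stages can catch these mismatches before anyone has to hear them. The sketch below assumes each stage can describe its audio with the four fields listed above; `AudioFormat`, `mismatches`, and `playback_speed` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    sample_rate_hz: int
    bit_depth: int
    channels: int
    encoding: str  # e.g. "pcm_s16le" or "mulaw"; labels are illustrative


def mismatches(produced: AudioFormat, expected: AudioFormat) -> list[str]:
    """Names of the fields that differ between what a stage emits and what the next expects."""
    return [
        name
        for name in ("sample_rate_hz", "bit_depth", "channels", "encoding")
        if getattr(produced, name) != getattr(expected, name)
    ]


def playback_speed(actual_rate_hz: int, assumed_rate_hz: int) -> float:
    """Speed factor heard when audio at one rate is played as if it were another."""
    return assumed_rate_hz / actual_rate_hz


tts_out = AudioFormat(24_000, 16, 1, "pcm_s16le")
phone_leg = AudioFormat(8_000, 16, 1, "pcm_s16le")

print(mismatches(tts_out, phone_leg))
# 24 kHz audio written straight to an 8 kHz leg plays at one third speed.
print(playback_speed(24_000, 8_000))
```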

PSTN bandwidth limitation

Even with high-quality TTS, over PSTN:

Audio capped at 3.4 kHz.
Subtle TTS quality details lost.
"Phone quality" sound.

Unavoidable when talking to regular phones.

Subjective quality

8 kHz PSTN: recognizable as "phone call." OK.
16 kHz HD: "in-person call quality." Clearly better.
48 kHz: indistinguishable from in-person for most listeners.

Voice agents on PSTN: 8 kHz. Voice agents in browser/app: 16 kHz easily achievable.

Mobile considerations

Modern mobile:

Cellular codec may be AMR-WB (16 kHz).
Transcoded to 8 kHz at PSTN gateway.
If both parties are mobile, can preserve 16 kHz.

Depends on carrier and path.

WebRTC advantages

WebRTC calls can stay 16 kHz end-to-end:

Modern browsers.
Modern networks.
Opus at wideband.

For voice-in-app scenarios, WebRTC delivers quality advantages.

Measuring sample rate in production

Audio file headers specify sample rate. Raw PCM streams carry no header, so their rate has to come from config or call signaling. Spot checks:

Log sample rate per call.
Verify STT and TTS configs match.
Check for unnecessary transcoding.
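For recordings stored as WAV, the header check can be done with Python's standard `wave` module. This is a minimal sketch; the `wav_format` helper and the shape of the dictionary it returns are assumptions for illustration. It builds a short in-memory file so that it runs without any recordings on disk.

```python
import io
import wave


def wav_format(source) -> dict:
    """Read format fields from a WAV header. `source` is a path string or a binary file object."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


# Build one second of 8 kHz, 16-bit mono silence in memory to exercise the reader.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # bytes per sample
    wav.setframerate(8_000)
    wav.writeframes(b"\x00\x00" * 8_000)
buffer.seek(0)

info = wav_format(buffer)
print(info)
```

Logging this dictionary per call recording gives a simple record to compare against the STT and TTS configs.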

Tools

ffmpeg / sox: resample audio files.
WAV file inspection: header contains sample rate.
Audio editors: visualize spectrum.

Debugging audio quality

If audio sounds wrong:

Check sample rates through the pipeline.
Check for unnecessary resampling.
Listen to samples at each stage.
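The first two checks can be partly automated if each stage's sample rate is known. The sketch below walks an ordered list of stages and flags every rate change; the stage names and the `audit_hops` function are hypothetical, and the listening step still has to be done by a person.

```python
def audit_hops(stages: list[tuple[str, int]]) -> list[str]:
    """Flag resampling hops in an ordered list of (stage name, sample rate in Hz)."""
    findings = []
    lowest_so_far = stages[0][1]
    for (_, prev_rate), (name, rate) in zip(stages, stages[1:]):
        if rate > prev_rate:
            findings.append(f"upsample {prev_rate}->{rate} at {name}: adds no detail")
        elif rate < prev_rate:
            findings.append(f"downsample {prev_rate}->{rate} at {name}")
        lowest_so_far = min(lowest_so_far, rate)
    # The narrowest stage caps the bandwidth of everything after it.
    findings.append(f"effective bandwidth <= {lowest_so_far // 2} Hz")
    return findings


inbound = [("pstn_in", 8_000), ("media_server", 48_000), ("stt", 16_000)]
for finding in audit_hops(inbound):
    print(finding)
```

In this made-up inbound path the audio is upsampled to 48 kHz and then downsampled to 16 kHz, yet nothing above 4 kHz was ever present, so both hops are candidates for removal.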

Common pitfalls

Mismatched STT rate. STT expects 16 kHz; you feed 8 kHz. May work (with upsampling) but not optimal.

Unnecessary upsampling. 8 kHz phone audio upsampled to 48 kHz before STT. Wasteful.

Bit depth or encoding mismatch. 16-bit linear PCM read as 8-bit, or 8-bit G.711 μ-law bytes read as linear PCM. The result is audible noise or distortion.

Channel confusion. Stereo mic mixed incorrectly to mono.

Format conversion loss. Multiple format conversions compound loss.

When sample rate matters less

STT quality plateaus above 16 kHz for voice.
TTS voice quality plateaus above 22 kHz.
For phone calls, you're 8 kHz anyway.

When it matters most

HD voice deployments.
Music or complex audio in the pipeline.
Multi-rate pipelines.

FAQ

Should I use 16 kHz STT for phone calls? If STT supports 8 kHz well, use that. If only 16 kHz, upsample — acceptable.

Does TTS sample rate matter for phone delivery? Slightly. High-rate TTS downsampled may preserve more detail than low-rate TTS. Marginal difference.

How do I know what sample rate my pipeline uses? Inspect packets (SIP SDP negotiation) or log config.

Can we upgrade from 8 kHz to 16 kHz mid-call? Not typically. Rate set at call setup.

What about 48 kHz "studio" TTS? Overkill for voice agents. 22-24 kHz is the sweet spot.
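On the question of inspecting SDP: the rate is advertised in `a=rtpmap` lines, and a small parser is enough to log it per call. The sketch below is a minimal example with a made-up SDP body. Two caveats are encoded in it: G.722 advertises an RTP clock rate of 8000 even though the audio is sampled at 16 kHz (RFC 3551), and Opus always advertises 48000 regardless of the bandwidth actually encoded (RFC 7587). Static payload types such as PCMU may also be offered without any rtpmap line, which this sketch does not handle.

```python
import re

# Codecs whose RTP clock rate is not the audio sample rate.
CLOCK_RATE_CAVEATS = {
    "G722": "RTP clock says 8000, audio is sampled at 16000 Hz (RFC 3551)",
    "OPUS": "RTP clock is always 48000; actual bandwidth is set by the encoder (RFC 7587)",
}

RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/\s]+)/(\d+)", re.MULTILINE)


def codecs_from_sdp(sdp: str) -> list[dict]:
    """Extract payload type, codec name, and RTP clock rate from each rtpmap line."""
    return [
        {
            "payload_type": int(pt),
            "codec": name,
            "clock_rate_hz": int(rate),
            "note": CLOCK_RATE_CAVEATS.get(name.upper()),
        }
        for pt, name, rate in RTPMAP.findall(sdp)
    ]


# Illustrative SDP body; addresses and ports are placeholders.
SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=- 0 0 IN IP4 192.0.2.1",
    "s=-",
    "c=IN IP4 192.0.2.1",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0 9 111",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:9 G722/8000",
    "a=rtpmap:111 opus/48000/2",
])

codecs = codecs_from_sdp(SAMPLE_SDP)
for codec in codecs:
    print(codec)
```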

How Sample Rate Affects Voice Agent Quality

