Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform. Most voice agent builders don't think about codecs until something sounds broken — which is usually too late. This piece covers the practical codec landscape for voice agents.

TL;DR

G.711 (μ-law / A-law) is the PSTN standard: companded 8-bit PCM at 64 kbps, the telephony default.
Opus is the modern default for WebRTC and increasingly for SIP: compressed, adaptive, high quality.
G.722 offers HD (wideband) voice at the same 64 kbps as G.711.
G.729 is a legacy low-bitrate codec (8 kbps).
Pick based on connection type, available bandwidth, and quality needs.

The codecs

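G.711 (PCMU / PCMA). PCMU is μ-law (North America, Japan); PCMA is A-law (Europe and most other regions). Companded PCM at 64 kbps, often loosely called uncompressed. Virtually every phone system handles it.

Opus. Modern compressed codec. Adaptive from 6 to 510 kbps. Mandatory to implement in WebRTC.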

G.722. HD voice codec. 64 kbps with better audio quality than G.711. Common in enterprise.

G.729. Compressed 8 kbps codec. Older, efficient, slight quality loss.

AMR, AMR-WB. Mobile codecs. GSM/LTE voice.

SILK. Speech codec developed by Skype; a modified version now forms the speech layer of Opus.

Sample rates

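Sample rate sets the highest audio frequency that can be captured: half the sample rate (the Nyquist limit).

8 kHz (narrowband): PSTN default. Speech is intelligible but "phone quality."
16 kHz (wideband / HD voice): Much better quality. Common for HD voice and for STT models.
48 kHz (fullband): Studio quality. Opus in WebRTC typically runs here; more than STT needs.

STT models are tuned for specific sample rates. Feed them audio at the rate they expect.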
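As a quick check on the figures above, the usable audio bandwidth for each sample rate is just half of it. A minimal sketch; the function name is illustrative only:

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest audio frequency a given sample rate can represent."""
    return sample_rate_hz / 2


for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling -> up to {nyquist_hz(rate):.0f} Hz of audio")
```

In practice, telephony filters narrow the 8 kHz case further, so the usable band is somewhat below the theoretical 4 kHz.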

G.711 details

Sample rate: 8 kHz.
Bitrate: 64 kbps (8,000 samples per second × 8 bits, companded rather than compressed).
Latency: Minimal encoding overhead.
Quality: Audio is band-limited to roughly 300-3,400 Hz (phone bandwidth).

Universal compatibility. Default for PSTN calls.
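To see why 8 bits per sample is enough for speech, here is a sketch of μ-law companding using the continuous formula with μ = 255. This is an illustration, not a bit-exact G.711 encoder (the real codec uses a segmented approximation of this curve), and the function names are assumptions made for the example.

```python
import math

MU = 255  # mu-law parameter used by G.711 PCMU


def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(y: float) -> int:
    """Quantize a companded value in [-1, 1] to one of 256 levels (0..255)."""
    return min(255, int((y + 1) / 2 * 256))


# A quiet sample at 1% of full scale still lands well away from zero
# after companding, so it keeps useful resolution once quantized.
print(mulaw_compress(0.01), mulaw_compress(0.5))
```

The curve spends most of the 256 levels on quiet samples, which is where speech spends most of its time.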

Opus details

Sample rate: 8-48 kHz supported.
Bitrate: 6-510 kbps, adaptive.
Latency: Low. Frame sizes range from 2.5 to 60 ms; the common 20 ms frame gives about 26.5 ms of algorithmic delay.
Quality: Near-transparent at high bitrates.

Modern VoIP and WebRTC default. Flexible.
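Frame size is the main knob behind both the latency and the packet rate. The sketch below shows the trade-off for a given target bitrate. Opus is variable-bitrate by default, so the byte counts are averages, and the helper names are illustrative.

```python
def payload_bytes_per_frame(bitrate_bps: int, frame_ms: float) -> float:
    """Average encoded bytes per frame at a target bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8


def packets_per_second(frame_ms: float) -> float:
    """Packets per second when each packet carries exactly one frame."""
    return 1000 / frame_ms


for frame_ms in (2.5, 10, 20, 60):
    size = payload_bytes_per_frame(24_000, frame_ms)
    rate = packets_per_second(frame_ms)
    print(f"{frame_ms} ms frames: ~{size:.1f} bytes each, {rate:.0f} packets/s")
```

Smaller frames cut delay but multiply the packet count, and every packet carries its own headers. That is one reason 20 ms is such a common choice.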

G.722 details

Sample rate: 16 kHz (wideband).
Bitrate: 64 kbps.
Latency: Low.
Quality: Noticeably better than G.711.

"HD voice." Common in enterprise.

G.729 details

Sample rate: 8 kHz.
Bitrate: 8 kbps.
Latency: Slight (15 ms algorithmic: 10 ms frame plus 5 ms look-ahead).
Quality: Lower than G.711.

Used where bandwidth is precious.

Which codec for what

PSTN calls (phone network): G.711 at the PSTN gateway. Opus or G.722 may be used internally and transcoded at the edge.

SIP calls (enterprise voice): Opus preferred for modern systems. G.711 for legacy interop.

WebRTC calls (browser voice): Opus default.

Cellular (mobile): Mobile codecs underneath; transcoded to G.711 at the gateway.

STT considerations

STT models are trained on specific sample rates:

Model trained on 8 kHz handles 8 kHz natively.
Upsampling 8 kHz to 16 kHz doesn't add information.
Downsampling 48 kHz to 16 kHz works fine.

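Match the STT model to your actual audio bandwidth: a narrowband model for PSTN audio, a wideband model only when the audio really is wideband.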
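The asymmetry between downsampling and upsampling is easy to see in code. This is a deliberately crude sketch: production pipelines should use a proper resampler with a real anti-aliasing filter, and both function names are made up for the example.

```python
def downsample_48k_to_16k(samples: list[float]) -> list[float]:
    """Crude 3:1 decimation: average each group of three samples.

    The averaging acts as a rough low-pass filter before samples are dropped.
    """
    usable = len(samples) - len(samples) % 3
    return [sum(samples[i:i + 3]) / 3 for i in range(0, usable, 3)]


def upsample_8k_to_16k(samples: list[float]) -> list[float]:
    """Naive 1:2 upsampling by linear interpolation between neighbours.

    Every new sample is computed from existing ones, so the result has a
    higher sample rate but no additional audio information.
    """
    out = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.extend([current, (current + following) / 2])
    return out


print(downsample_48k_to_16k([3.0, 6.0, 9.0, 1.0, 1.0, 1.0]))
print(upsample_8k_to_16k([0.0, 2.0, 4.0]))
```

Downsampling throws away bandwidth the STT model never needed. Upsampling only interpolates, so a wideband model fed upsampled phone audio still sees narrowband speech.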

TTS considerations

TTS models generate at a specific sample rate:

Most modern TTS outputs 16 kHz or 24 kHz.
Over PSTN, that output is downsampled to 8 kHz and some detail is lost.

For phone calls, this is unavoidable; accept it.

Transcoding

Many calls involve codec transcoding:

WebRTC (Opus 16 kHz) → SIP (G.711 8 kHz) → PSTN.
Each transcoding step has latency and quality cost.

Minimize transcoding where possible.
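One way to keep an eye on this is to write down the codec on each leg of a call and count the changes. A small sketch, assuming you can obtain a per-leg codec list from your own call records; the function and the example paths are hypothetical.

```python
def count_transcodes(legs: list[str]) -> int:
    """Count codec changes between consecutive call legs."""
    normalized = [codec.lower() for codec in legs]
    return sum(1 for a, b in zip(normalized, normalized[1:]) if a != b)


# WebRTC -> SIP -> PSTN, as in the example above: one transcode.
print(count_transcodes(["opus", "PCMU", "PCMU"]))

# Opus -> G.711 -> Opus inside a single call: two transcodes.
print(count_transcodes(["opus", "PCMU", "opus"]))
```

A count above one on a call that starts and ends on the same codec is usually a sign that a hop in the middle could be renegotiated.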

Bandwidth math

For 100 concurrent calls, one direction, codec payload only (IP/UDP/RTP headers add more on top):

G.711: 6.4 Mbps.
Opus at 24 kbps: 2.4 Mbps.
G.729: 800 kbps (0.8 Mbps).

At scale, codec matters.
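The payload figures understate what actually crosses the network, because every RTP packet also carries IP, UDP, and RTP headers. The sketch below adds that overhead, assuming IPv4, no RTP header extensions, 20 ms packets, and one direction only. Link-layer framing is ignored, and the names are illustrative.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no extensions


def on_wire_kbps(codec_kbps: float, frame_ms: float = 20.0) -> float:
    """Per-call bandwidth in one direction, including packet headers."""
    packets_per_sec = 1000 / frame_ms
    header_kbps = IP_UDP_RTP_HEADER_BYTES * 8 * packets_per_sec / 1000
    return codec_kbps + header_kbps


for name, kbps in [("G.711", 64), ("Opus", 24), ("G.729", 8)]:
    per_call = on_wire_kbps(kbps)
    total_mbps = per_call * 100 / 1000
    print(f"{name}: {per_call:.0f} kbps per call, {total_mbps:.1f} Mbps for 100 calls")
```

Under these assumptions the headers add 16 kbps per call, so the totals for 100 calls come to about 8.0, 4.0, and 2.4 Mbps. Header overhead matters most for the low-bitrate codecs: it triples G.729's footprint.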

Quality considerations

Subjective quality:

G.711: "phone quality." OK.
G.722: "HD voice." Noticeably cleaner.
Opus at 24 kbps: comparable to G.722 at well under half the bitrate.
Opus at 64 kbps: essentially transparent for speech.

TTS voice quality over codecs

Premium TTS over G.711 (narrowband) loses detail:

High-frequency components dropped.
Subtle tonal variations compressed.
Sounds "flatter."

Opus at wideband preserves more. Where possible, use wideband end-to-end.

Latency considerations

G.711: minimal (codec adds under 1ms).
Opus: low (frame sizes from 2.5 to 60 ms; 20 ms is typical).
G.729: moderate (15 ms).

Rarely a practical bottleneck.

Packet loss handling

Some codecs handle packet loss better:

G.711: no concealment in the base codec; gaps are noticeable.
Opus: has PLC (Packet Loss Concealment) and optional in-band FEC, so it fills gaps intelligently.
G.729: has some PLC.

For lossy networks, Opus wins.
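To make the idea of concealment concrete, here is the simplest possible strategy: repeat the last good frame and fade it out. This is not how Opus does it (its concealment is far more sophisticated); it only illustrates what "filling the gap" means. All names are hypothetical.

```python
def conceal_losses(frames, fade=0.5):
    """Replace lost frames (None) with a faded copy of the last good frame.

    `frames` is a list where each item is a list of samples, or None if the
    packet carrying that frame never arrived.
    """
    concealed = []
    last_good = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good, gain = frame, 1.0
            concealed.append(list(frame))
        elif last_good is not None:
            gain *= fade  # fade further with each consecutive loss
            concealed.append([sample * gain for sample in last_good])
        else:
            concealed.append(None)  # nothing received yet, nothing to repeat
    return concealed


received = [[0.4, 0.8], None, None, [0.2, 0.2]]
print(conceal_losses(received))
```

A codec without any concealment would play silence or a click for the two missing frames. Even this naive fade is less jarring.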

VoIP quality metrics

Beyond codec:

Jitter: variable packet timing.
Packet loss: lost frames.
Latency: one-way delay.
MOS (Mean Opinion Score): subjective quality on a 1-5 scale.

Codec is one factor among several.
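Jitter in particular has a standard definition worth knowing. The sketch below follows the running interarrival jitter estimate described in RFC 3550, applied to plain millisecond timestamps rather than RTP timestamp units. The function name and sample data are made up for the example.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    Both lists hold per-packet timestamps in the same unit (ms here).
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Difference between how far apart two packets arrived
        # and how far apart they were sent.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


sent = [0, 20, 40, 60]
steady = [5, 25, 45, 65]   # constant network delay: no jitter
bumpy = [5, 25, 55, 65]    # third packet arrives 10 ms late

print(interarrival_jitter(sent, steady))
print(interarrival_jitter(sent, bumpy))
```

The division by 16 smooths the estimate so that a single late packet nudges the value rather than spiking it.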

Configuration

Most voice AI platforms abstract codec choice:

Twilio: G.711 on PSTN legs; its WebRTC client SDKs support both Opus and PCMU, so check which one is preferred in your setup.
Vendor voice AI: often negotiates per call.
Custom setup: explicit SDP negotiation.

You usually don't need to touch it, but know what's running.
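If you do end up reading SDP by hand, the codec list lives on the `m=audio` line, in preference order, with `a=rtpmap` lines naming the dynamic payload types. A minimal sketch that handles a single audio section; the static payload type table covers only the codecs discussed here, and the sample SDP is invented for illustration.

```python
# Static RTP payload types for the codecs in this article (RFC 3551).
# G.722 is advertised with an 8000 Hz clock rate for historical reasons.
STATIC_PAYLOAD_TYPES = {0: "PCMU/8000", 8: "PCMA/8000", 9: "G722/8000", 18: "G729/8000"}


def offered_audio_codecs(sdp: str) -> list[str]:
    """Return the audio codecs in an SDP body, most preferred first."""
    payload_types: list[int] = []
    rtpmap: dict[int, str] = {}
    in_audio = False
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if line.startswith("m="):
            in_audio = line.startswith("m=audio")
            if in_audio:
                # m=audio <port> <proto> <payload types...>
                payload_types = [int(pt) for pt in line.split()[3:]]
        elif in_audio and line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            rtpmap[int(pt)] = encoding.strip()
    return [
        rtpmap.get(pt) or STATIC_PAYLOAD_TYPES.get(pt, f"unknown({pt})")
        for pt in payload_types
    ]


EXAMPLE_SDP = """\
v=0
o=- 0 0 IN IP4 192.0.2.1
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 49170 RTP/AVP 111 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
"""

print(offered_audio_codecs(EXAMPLE_SDP))
```

Here the offer prefers Opus and falls back to PCMU, then PCMA. Comparing the offer with the answer tells you which codec the call actually settled on.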

Debugging audio quality

When callers complain about audio:

Check codec in use.
Check jitter, packet loss.
Listen to actual audio.
Verify sample rate consistency.

Audio issues are often codec/network, not "voice AI."
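For the sample rate check, a call recording is often the quickest evidence. The sketch below reads the format of a WAV file with Python's standard `wave` module, which handles PCM WAV files; recordings stored in other encodings would need a different tool. The helper name and the in-memory demo recording are illustrative.

```python
import io
import wave


def describe_wav(wav_file) -> dict:
    """Report the basic format of a PCM WAV recording (path or file object)."""
    with wave.open(wav_file, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


# Build one second of 8 kHz, 16-bit mono silence in memory for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
if info["sample_rate_hz"] != 16000:
    print("Recording is not 16 kHz; check what the STT model expects.")
```

If the recording says 8 kHz and the STT model was configured for 16 kHz, you have found a mismatch without touching the network.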

Common pitfalls

Mismatched sample rates. For example, STT running at 8 kHz while TTS outputs 16 kHz. It works, but every mismatch costs a resampling step.

Unnecessary transcoding. Opus → G.711 → Opus within a single call. Avoid.

Low bitrate Opus. 6 kbps Opus sounds worse than 64 kbps G.711. Don't push too low.

Ignoring HD voice options. G.722 is often available, yet deployments default to G.711. Take advantage where both ends support it.

No testing over the actual network. Audio that works in the lab can degrade over real VoIP.

FAQ

Does codec matter for AI voice agents specifically? Yes — STT and TTS quality depend on it.

Can we force HD voice for PSTN calls? No. The PSTN is fundamentally narrowband; end-to-end HD requires a non-PSTN path.

What about mobile callers? Cellular codecs vary; transcoded at the gateway to G.711 or Opus.

Is Opus always better than G.711? At equivalent bitrates, yes. On PSTN legs, where G.711 is mandatory, the question is moot.

How do we monitor codec usage? Twilio Voice Insights and similar tools show codec per call.

Audio Codecs for Voice Agents: Opus, PCMU, and More
