Audio Codecs for Voice Agents: Opus, PCMU, and More

Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform.

Tyler Weitzman
March 16, 2026 · 5 min read
Speechify

Most voice agent builders don't think about codecs until something sounds broken, which is usually too late. This piece covers the practical codec landscape for voice agents.

TL;DR

  • G.711 (μ-law / A-law) is the PSTN standard: 8 kHz companded PCM with no further compression, 64 kbps, the telephony default.
  • Opus is the modern default for WebRTC and SIP: compressed, adaptive, high quality.
  • G.722 offers HD voice at moderate bandwidth.
  • G.729 is a legacy low-bitrate codec (8 kbps).
  • Pick based on connection type, available bandwidth, and quality needs.

The codecs

G.711 (PCMU / PCMA). PCMU is μ-law (North America and Japan); PCMA is A-law (Europe and most other regions). Companded 8-bit PCM with no further compression, 64 kbps. Every phone system handles it.

Opus. Modern compressed codec. 6-510 kbps adaptive. WebRTC standard.

G.722. HD voice codec. 64 kbps with better audio quality than G.711. Common in enterprise.

G.729. Compressed 8 kbps codec. Older, efficient, slight quality loss.

AMR, AMR-WB. Mobile codecs. GSM/LTE voice.

SILK. Speech codec developed by Skype; a modified version now forms the speech layer of Opus.

Sample rates

Sample rate determines maximum audio frequency. By the Nyquist limit, a stream can only represent frequencies up to half its sample rate:

  • 8 kHz (narrowband): PSTN default. Human voice intelligible but "phone quality."
  • 16 kHz (wideband / HD voice): Much better quality. WebRTC default.
  • 48 kHz: Studio quality. Overkill for voice.

STT models are tuned for specific sample rates. Match the model to the audio you actually receive.
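As a quick check on the rates above, here is a minimal sketch of the Nyquist arithmetic. The function name is illustrative, not from any library.

```python
def max_audio_frequency_hz(sample_rate_hz: int) -> float:
    """Nyquist limit: the highest frequency a sample rate can represent."""
    return sample_rate_hz / 2


for rate in (8_000, 16_000, 48_000):
    limit = max_audio_frequency_hz(rate)
    print(f"{rate} Hz sampling -> audio up to {limit:.0f} Hz")
```

The 8 kHz case gives a 4 kHz ceiling, which is why narrowband telephony sounds muffled next to wideband audio.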

G.711 details

  • Sample rate: 8 kHz.
  • Bitrate: 64 kbps uncompressed.
  • Latency: Minimal encoding overhead.
  • Quality: Audio cuts off at 3.4 kHz (phone bandwidth).

Universal compatibility. Default for PSTN calls.
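The 64 kbps figure is just 8,000 samples per second times 8 bits per sample. The sketch below shows that arithmetic plus the continuous μ-law curve that gives quiet samples more resolution. It is an illustration only: real G.711 uses a piecewise-linear 8-bit approximation of this curve, and the function names here are made up for the example.

```python
import math

MU = 255  # mu-law parameter used in 8-bit telephony companding


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Raw PCM bitrate: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample / 1000


print(pcm_bitrate_kbps(8_000, 8))  # G.711: 8 kHz x 8 bits
print(mulaw_compress(0.01))        # a quiet sample gets stretched upward
```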

Opus details

  • Sample rate: 8-48 kHz supported.
  • Bitrate: 6-510 kbps adaptive.
  • Latency: Low (frame sizes from 2.5 to 60 ms; about 26.5 ms algorithmic delay at the default 20 ms frame).
  • Quality: Near-transparent at high bitrates.

Modern VoIP and WebRTC default. Flexible.

G.722 details

  • Sample rate: 16 kHz (wideband).
  • Bitrate: 64 kbps.
  • Latency: Low.
  • Quality: Noticeably better than G.711.

"HD voice." Common in enterprise.

G.729 details

  • Sample rate: 8 kHz.
  • Bitrate: 8 kbps.
  • Latency: Slight (15 ms algorithmic: a 10 ms frame plus 5 ms look-ahead).
  • Quality: Lower than G.711.

Used where bandwidth is precious.

Which codec for what

PSTN calls (phone network): G.711 is what the gateway speaks. Opus or G.722 may be used internally and transcoded.

SIP calls (enterprise voice): Opus preferred for modern systems. G.711 for legacy interop.

WebRTC calls (browser voice): Opus default.

Cellular (mobile): Mobile codecs underneath; transcoded to G.711 at the gateway.

STT considerations

STT models are trained on specific sample rates:

  • Model trained on 8 kHz handles 8 kHz natively.
  • Upsampling 8 kHz to 16 kHz doesn't add information.
  • Downsampling 48 kHz to 16 kHz works fine.

Match STT to your actual audio bandwidth.
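To see why upsampling adds nothing, here is a toy sketch: it doubles the rate with linear interpolation, then throws away every other sample and gets the original back. Real resamplers use band-limited filters rather than linear interpolation, so treat this as an illustration of the principle, not a production resampler.

```python
def upsample_2x_linear(samples: list[float]) -> list[float]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out: list[float] = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) / 2)
    if samples:
        out.append(samples[-1])
    return out


def decimate_2x(samples: list[float]) -> list[float]:
    """Halve the sample rate by keeping every other sample (no filtering)."""
    return samples[::2]


narrowband = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]  # pretend this is 8 kHz audio
wideband = upsample_2x_linear(narrowband)      # nominally 16 kHz now

# Every "new" sample is derived from its neighbours, so the original
# stream is fully recoverable and no information was gained.
print(decimate_2x(wideband) == narrowband)
```

The reverse direction is different: going from 48 kHz to 16 kHz discards real content above 8 kHz, which is fine for speech but does need a low-pass filter first to avoid aliasing.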

TTS considerations

TTS models generate at a specific sample rate:

  • Most modern TTS outputs 16 kHz or 24 kHz.
  • Over PSTN, that output gets downsampled to 8 kHz.
  • Some detail is lost in the process.

For phone calls, this is unavoidable; accept it.

Transcoding

Many calls involve codec transcoding:

  • WebRTC (Opus 16 kHz) → SIP (G.711 8 kHz) → PSTN.
  • Each transcoding step has latency and quality cost.

Minimize transcoding where possible.
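One way to audit a call path is to list the codec on each leg and count where it changes. This is a minimal sketch; the codec labels and the `count_transcodes` helper are hypothetical, not part of any platform API.

```python
def count_transcodes(path: list[str]) -> int:
    """Count the points along a call path where the codec changes."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


web_to_phone = ["opus", "pcmu", "pcmu"]  # WebRTC leg, SIP leg, PSTN leg
round_trip = ["opus", "pcmu", "opus"]    # the pattern to avoid

print(count_transcodes(web_to_phone))
print(count_transcodes(round_trip))
```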

Bandwidth math

For 100 concurrent calls, counting codec payload only (per direction, before packet headers):

  • G.711: 6.4 Mbps.
  • Opus at 24 kbps: 2.4 Mbps.
  • G.729: 800 kbps.

At scale, codec matters.
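The figures above are codec payload only. On the wire, every RTP packet also carries IP, UDP, and RTP headers. The sketch below adds that overhead, assuming IPv4, no header compression, and 20 ms packets; link-layer framing and SRTP tags are ignored, and the function names are illustrative.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, uncompressed


def payload_mbps(codec_kbps: float, calls: int) -> float:
    """Aggregate codec payload bandwidth, one direction."""
    return codec_kbps * calls / 1000


def on_wire_mbps(codec_kbps: float, calls: int, ptime_ms: int = 20) -> float:
    """Aggregate bandwidth including per-packet IP/UDP/RTP headers."""
    packets_per_second = 1000 / ptime_ms
    header_kbps = IP_UDP_RTP_HEADER_BYTES * 8 * packets_per_second / 1000
    return (codec_kbps + header_kbps) * calls / 1000


for name, kbps in (("G.711", 64), ("Opus", 24), ("G.729", 8)):
    print(name, payload_mbps(kbps, 100), on_wire_mbps(kbps, 100))
```

Under these assumptions the headers cost a flat 16 kbps per call, so low-bitrate codecs save less in practice than the payload numbers suggest.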

Quality considerations

Subjective quality:

  • G.711: "phone quality." OK.
  • G.722: "HD voice." Noticeably cleaner.
  • Opus at 24 kbps: equivalent to G.722.
  • Opus at 64 kbps: essentially transparent.

TTS voice quality over codecs

Premium TTS over G.711 (narrowband) loses detail:

  • High-frequency components dropped.
  • Subtle tonal variations compressed.
  • Sounds "flatter."

Opus at wideband preserves more. Where possible, use wideband end-to-end.

Latency considerations

  • G.711: minimal (codec adds under 1 ms).
  • Opus: low (2.5-60 ms frames; 20 ms is typical).
  • G.729: moderate (15 ms).

Rarely a practical bottleneck.

Packet loss handling

Some codecs handle packet loss better:

  • G.711: sensitive; gaps noticeable.
  • Opus: has PLC (Packet Loss Concealment) and optional in-band FEC, which fill gaps intelligently.
  • G.729: has some PLC.

For lossy networks, Opus wins.
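To make the idea of concealment concrete, here is the simplest possible scheme: when a frame is missing, replay the previous output at reduced gain. This is a toy sketch and not how Opus does it; Opus's built-in PLC is considerably more sophisticated. The `conceal_losses` name and the `decay` parameter are assumptions for the example.

```python
from typing import Optional


def conceal_losses(
    frames: list[Optional[list[float]]],
    frame_size: int,
    decay: float = 0.5,
) -> list[list[float]]:
    """Replace lost frames (None) with a faded copy of the previous output."""
    output: list[list[float]] = []
    previous = [0.0] * frame_size  # start from silence
    for frame in frames:
        if frame is None:
            # Lost packet: repeat the last output, attenuated, so
            # consecutive losses fade out instead of looping forever.
            frame = [sample * decay for sample in previous]
        output.append(frame)
        previous = frame
    return output


received = [[1.0, -1.0], None, None, [0.5, 0.5]]
print(conceal_losses(received, frame_size=2))
```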

VoIP quality metrics

Beyond codec:

  • Jitter: variable packet timing.
  • Packet loss: lost frames.
  • Latency: one-way delay.
  • MOS (Mean Opinion Score): subjective quality.

Codec is one factor among several.
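Jitter in RTP reports is usually the running estimate defined in RFC 3550, where each new transit-time difference D nudges the estimate J by (|D| - J) / 16. A minimal sketch, assuming transit times (arrival time minus RTP timestamp) have already been converted to milliseconds; the RFC itself works in timestamp units.

```python
def interarrival_jitter(transit_times_ms: list[float]) -> float:
    """Running jitter estimate per RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    for previous, current in zip(transit_times_ms, transit_times_ms[1:]):
        delta = abs(current - previous)
        jitter += (delta - jitter) / 16
    return jitter


steady = [50.0, 50.0, 50.0, 50.0]  # constant delay, so no jitter
bursty = [50.0, 66.0, 50.0, 82.0]  # same network, uneven arrivals

print(interarrival_jitter(steady))
print(interarrival_jitter(bursty))
```

Note that a constant delay, however large, produces zero jitter: the metric measures variation, not latency.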

Configuration

Most voice AI platforms abstract codec choice:

  • Twilio: defaults to G.711 for PSTN, Opus for WebRTC.
  • Vendor voice AI: often negotiates per call.
  • Custom setup: explicit SDP negotiation.

You usually don't need to touch it, but know what's running.
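To know what's running on a custom setup, read the SDP. The sketch below pulls the offered audio codecs, in preference order, out of an SDP body. It is a simplified parser for illustration: it handles a single audio section only, the example SDP is made up, and payload type 111 for Opus is just a commonly seen dynamic value, not a fixed assignment.

```python
import re

# Static RTP payload types from RFC 3551; these may appear in the m= line
# without an accompanying a=rtpmap line.
STATIC_PAYLOAD_TYPES = {
    0: "PCMU/8000",
    8: "PCMA/8000",
    9: "G722/8000",
    18: "G729/8000",
}


def offered_codecs(sdp: str) -> list[str]:
    """Return the audio codecs in an SDP body, in preference order."""
    rtpmap = {
        int(pt): name
        for pt, name in re.findall(r"^a=rtpmap:(\d+) (\S+)", sdp, re.M)
    }
    media = re.search(r"^m=audio \d+ \S+ (.+)$", sdp, re.M)
    if media is None:
        return []
    payload_types = [int(pt) for pt in media.group(1).split()]
    return [
        rtpmap.get(pt) or STATIC_PAYLOAD_TYPES.get(pt, f"unknown({pt})")
        for pt in payload_types
    ]


EXAMPLE_SDP = """\
m=audio 49170 RTP/AVP 111 9 0
a=rtpmap:111 opus/48000/2
a=rtpmap:9 G722/8000
"""

print(offered_codecs(EXAMPLE_SDP))
```

Two quirks worth knowing when reading the output: Opus is always signalled as `opus/48000/2` regardless of the actual audio bandwidth, and G.722 is signalled with a clock rate of 8000 even though it samples at 16 kHz.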

Debugging audio quality

When callers complain about audio:

  • Check codec in use.
  • Check jitter, packet loss.
  • Listen to actual audio.
  • Verify sample rate consistency.

Audio issues are often codec/network, not "voice AI."
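For the sample-rate check, a recorded WAV file states its own rate in the header. Here is a small sketch using Python's standard `wave` module; the helper names and the silent test clip are made up for the example, and it assumes your recordings are plain PCM WAV.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate from a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_silent_wav(sample_rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV, standing in for a recording."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))
    return buffer.getvalue()


recording = make_silent_wav(8_000)
expected_by_stt = 16_000  # hypothetical rate the STT model was built for

if wav_sample_rate(recording) != expected_by_stt:
    print("Sample rate mismatch: resample or switch STT model")
```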

Common pitfalls

Mismatched sample rates. STT at 8 kHz; TTS output at 16 kHz. Possible but inefficient.

Unnecessary transcoding. Opus → G.711 → Opus within a single call. Avoid.

Low bitrate Opus. 6 kbps Opus sounds worse than 64 kbps G.711. Don't push too low.

Ignoring HD voice options. G.722 is available, but the system defaults to G.711. Take advantage where both ends support it.

No testing over actual network. Works in lab; degrades over real VoIP.

FAQ

Does codec matter for AI voice agents specifically? Yes. STT and TTS quality depend on it.

Can we force HD voice for PSTN calls? No. PSTN is fundamentally narrowband. End-to-end HD requires a non-PSTN path.

What about mobile callers? Cellular codecs vary; transcoded at the gateway to G.711 or Opus.

Is Opus always better than G.711? At equivalent bitrates, yes. Where the PSTN mandates G.711, the question is moot.

How do we monitor codec usage? Twilio Voice Insights and similar tools show codec per call.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
