Audio Codecs for Voice Agents: Opus, PCMU, and More
Audio codecs determine the quality, bandwidth, and latency of every voice agent call. The choice between G.711, Opus, G.722, and others affects how your audio sounds over the line, how much bandwidth you consume, and how well STT and TTS perform. Most voice agent builders don't think about codecs until something sounds broken, which is usually too late. This piece covers the practical codec landscape for voice agents.
TL;DR
- G.711 (μ-law / A-law) is the PSTN standard: 8-bit companded PCM at 64 kbps with no further compression, and the telephony default.
- Opus is the modern default for WebRTC and newer SIP systems: compressed, adaptive, high quality.
- G.722 offers HD voice at moderate bandwidth.
- G.729 is a legacy low-bitrate codec (8 kbps).
- Pick based on the connection type, available bandwidth, and quality needs.
The codecs
G.711 (PCMU / PCMA). Also called μ-law (PCMU, used in North America and Japan) or A-law (PCMA, used in Europe and most other regions). 8-bit companded PCM at 64 kbps, with no further compression. Every phone system handles it.
Opus. Modern compressed codec. 6-510 kbps adaptive. WebRTC standard.
G.722. HD voice codec. 64 kbps with better audio quality than G.711. Common in enterprise.
G.729. Compressed 8 kbps codec. Older, efficient, slight quality loss.
AMR, AMR-WB. Mobile codecs. GSM/LTE voice.
SILK. Skype's speech codec; a modified version of it now forms the speech-oriented layer of Opus.
Sample rates
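Sample rate sets the maximum audio frequency a stream can carry, which is half the sample rate:
- 8 kHz (narrowband): PSTN default. Human voice is intelligible but "phone quality."
- 16 kHz (wideband / HD voice): Much better quality. A common rate for STT models and HD voice calls.
- 48 kHz (fullband): Studio quality, and the RTP clock rate Opus advertises in WebRTC. More than most voice pipelines need.
STT models are tuned for specific sample rates, so match the audio you send to the rate the model expects.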
G.711 details
- Sample rate: 8 kHz.
- Bitrate: 64 kbps (8,000 samples per second × 8 bits, companded but not otherwise compressed).
- Latency: Minimal encoding overhead.
- Quality: Passband is roughly 300 Hz to 3.4 kHz (phone bandwidth).
Universal compatibility. Default for PSTN calls.
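To make the "8 bits per sample, 64 kbps" figure concrete, here is a minimal sketch of μ-law companding using the continuous formula with μ = 255. It is an illustration only: the actual G.711 encoder uses a piecewise-linear approximation with its own bit layout, and the function names below are hypothetical.

```python
import math

MU = 255  # companding parameter for mu-law

def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to the companded domain [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(y: float) -> int:
    """Quantize a companded value to one of 256 codes (0..255)."""
    return int(round((y + 1.0) / 2.0 * 255))

def dequantize_8bit(code: int) -> float:
    """Map an 8-bit code back to the companded domain."""
    return code / 255 * 2.0 - 1.0

# 8 bits per sample at 8,000 samples per second
BITRATE_BPS = 8 * 8000  # 64,000 bps = 64 kbps
```

Quiet samples get a disproportionate share of the 256 codes: `mulaw_compress(0.01)` is about 0.23, which is why 8 bits are enough for intelligible speech.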
Opus details
- Sample rate: 8-48 kHz supported.
- Bitrate: 6-510 kbps adaptive.
- Latency: Low (frame sizes from 2.5 to 60 ms; 20 ms is the usual choice).
- Quality: Near-transparent at high bitrates.
Modern VoIP and WebRTC default. Flexible.
G.722 details
- Sample rate: 16 kHz (wideband).
- Bitrate: 64 kbps.
- Latency: Low.
- Quality: Noticeably better than G.711.
"HD voice." Common in enterprise.
G.729 details
- Sample rate: 8 kHz.
- Bitrate: 8 kbps.
- Latency: Moderate (about 15 ms algorithmic: a 10 ms frame plus 5 ms look-ahead).
- Quality: Lower than G.711.
Used where bandwidth is precious.
Which codec for what
PSTN calls (phone network): G.711 is the codec at the gateway. Opus or G.722 may be used internally and transcoded to G.711 at the edge.
SIP calls (enterprise voice): Opus preferred for modern systems. G.711 for legacy interop.
WebRTC calls (browser voice): Opus default.
Cellular (mobile): Mobile codecs underneath; transcoded to G.711 at the gateway.
STT considerations
STT models are trained on specific sample rates:
- A model trained on 8 kHz audio handles 8 kHz input natively.
- Upsampling 8 kHz to 16 kHz doesn't add information; it only satisfies a model's input format.
- Downsampling 48 kHz to 16 kHz works fine.
Match STT to your actual audio bandwidth.
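The matching rule above can be written down as a small decision helper. This is a sketch with hypothetical function names, not part of any STT SDK; it only encodes the three cases listed and the fact that usable audio bandwidth is capped at half the lower of the two rates.

```python
def resample_plan(source_hz: int, model_hz: int) -> str:
    """Describe what has to happen to call audio before it reaches the STT model."""
    if source_hz == model_hz:
        return "pass-through"
    if source_hz > model_hz:
        return "downsample"
    # Upsampling satisfies the input format but cannot restore lost frequencies.
    return "upsample (no new information)"

def usable_bandwidth_hz(source_hz: int, model_hz: int) -> float:
    """Highest audio frequency that survives the path (half the lower rate)."""
    return min(source_hz, model_hz) / 2

print(resample_plan(8000, 16000))        # upsample (no new information)
print(usable_bandwidth_hz(8000, 16000))  # 4000.0
```

A PSTN call fed to a 16 kHz model still carries at most 4 kHz of audio, so a model trained on telephony audio may be the better fit.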
TTS considerations
TTS models generate at a specific sample rate:
- Many modern TTS systems output 16 kHz, 24 kHz, or higher.
- Over PSTN, that output is downsampled to 8 kHz.
- Some detail is lost in the process.
For phone calls this is unavoidable; accept it.
Transcoding
Many calls involve codec transcoding:
- WebRTC (Opus 16 kHz) → SIP (G.711 8 kHz) → PSTN.
- Each transcoding step adds latency and costs some quality.
Minimize transcoding where possible.
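One way to keep an eye on this is to write the call path down as data and count the codec changes. The sketch below assumes you can list each leg and its codec yourself; the leg names and the `transcode_steps` helper are hypothetical.

```python
def transcode_steps(path):
    """Return the (from_codec, to_codec) changes along a call path.

    path is a list of (leg_name, codec) tuples in call order.
    """
    steps = []
    for (_, prev_codec), (_, next_codec) in zip(path, path[1:]):
        if prev_codec != next_codec:
            steps.append((prev_codec, next_codec))
    return steps

call_path = [
    ("browser", "opus"),
    ("sip trunk", "PCMU"),
    ("pstn", "PCMU"),
]
print(transcode_steps(call_path))  # [('opus', 'PCMU')]
```

A path that returns more than one step, such as Opus to G.711 and back to Opus, is a candidate for simplification.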
Bandwidth math
For 100 concurrent calls, counting codec payload only, per direction:
- G.711: 6.4 Mbps.
- Opus at 24 kbps: 2.4 Mbps.
- G.729: 800 kbps.
At scale, codec matters.
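The figures above are bitrate times call count. On the wire, each packet also carries RTP, UDP, and IP headers, which matters most for low-bitrate codecs. The sketch below assumes 20 ms packetization, IPv4, and no SRTP or link-layer overhead; all of those are assumptions, and the function names are illustrative.

```python
RTP_UDP_IPV4_HEADER_BYTES = 12 + 8 + 20  # RTP + UDP + IPv4, without options

def payload_kbps(codec_kbps: float, calls: int) -> float:
    """Codec payload only, one direction."""
    return codec_kbps * calls

def wire_kbps_per_stream(codec_kbps: float, frame_ms: float = 20) -> float:
    """Payload plus RTP/UDP/IPv4 headers for one stream, one direction."""
    packets_per_second = 1000 / frame_ms
    overhead_kbps = RTP_UDP_IPV4_HEADER_BYTES * 8 * packets_per_second / 1000
    return codec_kbps + overhead_kbps

for name, kbps in [("G.711", 64), ("Opus", 24), ("G.729", 8)]:
    print(name, payload_kbps(kbps, 100), wire_kbps_per_stream(kbps) * 100)
# G.711 6400 8000.0
# Opus 2400 4000.0
# G.729 800 2400.0
```

Under these assumptions headers add 16 kbps per stream, so G.729 triples on the wire while G.711 grows by a quarter.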
Quality considerations
Subjective quality:
- G.711: "phone quality." OK.
- G.722: "HD voice." Noticeably cleaner.
- Opus at 24 kbps: roughly comparable to G.722, at well under half the bitrate.
- Opus at 64 kbps: essentially transparent for speech.
TTS voice quality over codecs
Premium TTS over G.711 (narrowband) loses detail:
- High-frequency components dropped.
- Subtle tonal variations compressed.
- Sounds "flatter."
Opus at wideband preserves more. Where possible, use wideband end-to-end.
Latency considerations
- G.711: minimal (the codec itself adds under 1 ms).
- Opus: low (2.5 to 60 ms frames, depending on configuration).
- G.729: moderate (about 15 ms).
Codec latency is rarely the practical bottleneck in a voice agent.
Packet loss handling
Some codecs handle packet loss better:
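- G.711: sensitive; the base codec has no built-in concealment, so gaps are noticeable unless the endpoint adds its own.
- Opus: has PLC (Packet Loss Concealment), which fills gaps intelligently, plus optional in-band forward error correction.
- G.729: has built-in concealment for lost frames.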
For lossy networks, Opus wins.
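To show what "filling gaps" means at its simplest, here is a naive concealment sketch: repeat the last good frame at decaying volume. This is not how Opus or G.729 conceal loss (their methods are model-based and far more sophisticated); it is only an illustration, and the `conceal_losses` helper and its `decay` parameter are hypothetical.

```python
def conceal_losses(frames, frame_len, decay=0.5):
    """Replace lost frames (None) with an attenuated copy of the last good frame.

    frames: list of sample lists, with None marking a lost frame.
    frame_len: samples per frame, used if the very first frame is lost.
    """
    out = []
    last_good = [0] * frame_len  # silence until a real frame arrives
    gain = 1.0
    for frame in frames:
        if frame is None:
            gain *= decay  # fade further with each consecutive loss
            out.append([int(s * gain) for s in last_good])
        else:
            gain = 1.0
            last_good = frame
            out.append(list(frame))
    return out

received = [[100, -100], None, None, [50, 50]]
print(conceal_losses(received, frame_len=2))
# [[100, -100], [50, -50], [25, -25], [50, 50]]
```

Even this crude approach avoids the hard silence that makes lost G.711 packets so audible, which hints at why codecs with built-in concealment hold up better on lossy links.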
VoIP quality metrics
Beyond codec:
- Jitter: variable packet timing.
- Packet loss: lost frames.
- Latency: one-way delay.
- MOS (Mean Opinion Score): subjective quality rating on a 1 to 5 scale.
Codec is one factor among several.
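Jitter is the least intuitive of these metrics, so a worked example may help. The sketch below follows the running interarrival jitter estimator described in RFC 3550 (each new deviation is smoothed in with a gain of 1/16). It assumes you already have matching send and receive timestamps in the same unit; the function name is illustrative.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the same unit as the timestamps.

    send_times[i] and recv_times[i] belong to the same packet.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Difference between how far apart packets arrived and how far apart they were sent
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter

# Packets sent every 20 ms; the third one arrives 10 ms late.
print(interarrival_jitter([0, 20, 40], [5, 25, 55]))  # 0.625
```

A constant network delay produces zero jitter; only variation in delay moves the estimate.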
Configuration
Most voice AI platforms abstract codec choice:
- Twilio: defaults to G.711 for PSTN, Opus for WebRTC.
- Vendor voice AI: often negotiates per call.
- Custom setup: explicit SDP negotiation.
You usually don't need to touch it, but know what's running.
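If you do have access to the SDP exchanged for a call, the offered codecs are listed in its `a=rtpmap` lines. The sketch below pulls them out with a regular expression. The sample SDP is made up for illustration, and the helper name is hypothetical. Note that static payload types such as 0 (PCMU) and 8 (PCMA) may appear in the `m=audio` line without any `a=rtpmap` entry, which this sketch does not handle.

```python
import re

# a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/\s]+)/(\d+)(?:/(\d+))?\s*$")

SAMPLE_SDP = """\
m=audio 49170 RTP/AVP 111 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
"""

def codecs_in_sdp(sdp: str):
    """Return the codecs declared in a=rtpmap lines, in order of appearance."""
    codecs = []
    for line in sdp.splitlines():
        match = RTPMAP.match(line.strip())
        if match:
            payload_type, name, clock_rate, channels = match.groups()
            codecs.append({
                "payload_type": int(payload_type),
                "name": name,
                "clock_rate": int(clock_rate),
                "channels": int(channels) if channels else 1,
            })
    return codecs

for codec in codecs_in_sdp(SAMPLE_SDP):
    print(codec["payload_type"], codec["name"], codec["clock_rate"])
```

Preference order comes from the payload type list on the `m=audio` line, not from the order of the `a=rtpmap` lines.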
Debugging audio quality
When callers complain about audio:
- Check codec in use.
- Check jitter, packet loss.
- Listen to actual audio.
- Verify sample rate consistency.
Audio issues often come from the codec or the network, not the "voice AI" itself.
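For the sample-rate check, a recording of the call is often the quickest evidence. The sketch below reads the header of a WAV file with Python's standard `wave` module. It assumes the recording is linear PCM; as far as I know the module rejects WAV files that contain G.711-encoded audio, so those need decoding first. The `describe_wav` name is hypothetical.

```python
import wave

def describe_wav(source):
    """Report basic format facts for a PCM WAV file.

    source is a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate if rate else 0.0,
        }
```

A recording that reports 8000 Hz while your STT is configured for 16000 Hz (or the reverse) points to a resampling step you may not have known about.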
Common pitfalls
Mismatched sample rates. For example, STT running at 8 kHz while TTS outputs 16 kHz. This works, but it forces extra resampling.
Unnecessary transcoding. Opus → G.711 → Opus within a single call. Avoid it.
Low-bitrate Opus. Opus at 6 kbps sounds worse than G.711 at 64 kbps. Don't push the bitrate too low.
Ignoring HD voice options. G.722 is often available, but many systems default to G.711. Enable it where both ends support it.
No testing over the actual network. Audio that works in the lab can degrade over real VoIP.
Related reading
- Diarization: Knowing Who's Speaking in a Voice Conversation
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
FAQ
Does codec matter for AI voice agents specifically? Yes. STT and TTS quality both depend on it.
Can we force HD voice for PSTN calls? No. PSTN is fundamentally narrowband. End-to-end HD requires a non-PSTN path.
What about mobile callers? Cellular codecs vary; they are transcoded at the gateway to G.711 or Opus.
Is Opus always better than G.711? At equivalent bitrates, yes. Where the PSTN mandates G.711, the question is moot.
How do we monitor codec usage? Twilio Voice Insights and similar tools show codec per call.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
