Streaming Audio Over WebRTC for Voice Agents
WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it's often the best transport — lower latency than webhooks, built-in encryption, native NAT traversal, cross-platform. Understanding how WebRTC fits into voice agent architectures, and what tradeoffs come with it, is foundational for anyone building browser or in-app voice experiences.
TL;DR
- WebRTC provides real-time audio between browser/app and voice agent server.
- Lower latency than webhook-based telephony; better for in-app experiences.
- Built-in encryption (DTLS-SRTP), NAT traversal (ICE/STUN/TURN).
- Opus codec standard; 16 kHz or 48 kHz typical.
- Requires signaling server, STUN/TURN infrastructure, and WebRTC-aware voice agent.
WebRTC overview
Core components:
- Signaling: establishes the call (via WebSocket usually).
- ICE (Interactive Connectivity Establishment): NAT traversal.
- STUN: discover public IP through NAT.
- TURN: relay when direct P2P fails.
- DTLS-SRTP: encryption.
- Opus: audio codec standard.
Lots of moving parts, but the WebRTC library handles most of them.
Why WebRTC for voice agents
Lower latency. Direct browser-to-server audio. No HTTP webhook round-trip.
Browser-native. No additional plugins; works on Chrome, Firefox, Safari, Edge.
Mobile native. iOS/Android support via WebView or native WebRTC SDKs.
Encryption default. DTLS-SRTP built in.
NAT-friendly. ICE handles most NAT scenarios.
Peer-to-peer possible, though voice agents usually need a server.
The signaling server
Even "peer-to-peer" WebRTC needs signaling:
- Client says "I want a call."
- Server coordinates connection.
- Usually WebSocket-based.
Signaling server handles:
- SDP (Session Description Protocol) exchange.
- ICE candidate exchange.
- Session lifecycle.
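As a rough illustration of those three responsibilities, here is a minimal in-memory sketch of a relay that forwards offer, answer, and candidate messages to the other peer in a session. The class name `SignalingRelay`, the message shape, and the per-peer "outbox" lists are assumptions for illustration; a real server would hold WebSocket connections instead of lists.

```python
import json

# Message types the relay forwards between the two sides of a session.
RELAYED_TYPES = {"offer", "answer", "candidate"}


class SignalingRelay:
    """Hypothetical in-memory relay: forwards SDP and ICE messages to the other peer."""

    def __init__(self):
        # session_id -> {peer_id: outbox list}; a real server would store sockets here.
        self.sessions = {}

    def join(self, session_id, peer_id):
        self.sessions.setdefault(session_id, {})[peer_id] = []

    def leave(self, session_id, peer_id):
        peers = self.sessions.get(session_id, {})
        peers.pop(peer_id, None)
        if not peers:
            # Session lifecycle: drop the session once the last peer is gone.
            self.sessions.pop(session_id, None)

    def handle(self, session_id, sender_id, raw_message):
        """Parse one JSON message and queue it for every other peer in the session."""
        message = json.loads(raw_message)
        if message.get("type") not in RELAYED_TYPES:
            raise ValueError(f"unsupported message type: {message.get('type')!r}")
        delivered = 0
        for peer_id, outbox in self.sessions.get(session_id, {}).items():
            if peer_id != sender_id:
                outbox.append(message)
                delivered += 1
        return delivered
```

The relay never inspects the SDP or the candidates; it only routes them, which is why signaling can stay a thin layer in front of the media path.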
The audio server
Voice agents aren't truly peer-to-peer; the "peer" is your server.
Server:
- Accepts WebRTC connection.
- Receives audio from client.
- Passes to STT.
- Generates response via LLM.
- Synthesizes via TTS.
- Sends audio back via WebRTC.
Frameworks like Pipecat and LiveKit Agents handle this.
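To make the server's steps concrete, here is a simplified sketch of one conversational turn. The function `run_turn` and its three callable parameters are hypothetical stand-ins for whatever STT, LLM, and TTS clients you use; it is not how Pipecat or LiveKit Agents structure their pipelines, and it buffers the whole utterance where a production system would stream.

```python
from typing import Callable, Iterable, Iterator


def run_turn(
    audio_frames: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], Iterator[bytes]],
) -> Iterator[bytes]:
    """One turn: caller audio in, agent audio out (buffered, not streamed)."""
    utterance = b"".join(audio_frames)        # audio received from the client
    transcript = transcribe(utterance)        # STT
    reply_text = generate_reply(transcript)   # LLM
    yield from synthesize(reply_text)         # TTS frames to send back over WebRTC


# Usage with stub services (all hypothetical):
def fake_stt(audio: bytes) -> str:
    return f"{len(audio)} bytes"


def fake_llm(text: str) -> str:
    return "heard " + text


def fake_tts(text: str) -> Iterator[bytes]:
    for word in text.split():
        yield word.encode()


reply_frames = list(run_turn([b"ab", b"cd"], fake_stt, fake_llm, fake_tts))
```

Passing the services in as callables keeps the transport code independent of any particular vendor, which also makes the turn logic easy to test without a live connection.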
Architecture
Typical:
Browser/App (WebRTC client)
↔ Signaling Server (WebSocket)
↔ Voice Agent Server (WebRTC endpoint)
↔ STT / LLM / TTS services
All processing happens on the server.
Opus codec
WebRTC standard codec. Adaptive bitrate:
- 6 kbps: poor quality, conservative.
- 24 kbps: good quality.
- 64 kbps: near-transparent.
For voice agents, 24-32 kbps is usually sufficient.
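The codec bitrate is not the whole bandwidth story, because every packet also carries headers. The sketch below works through the arithmetic, assuming 20 ms Opus frames and counting only IPv4, UDP, and RTP headers (40 bytes); the SRTP authentication tag and any TURN framing are left out, so treat the results as a lower bound.

```python
FRAME_MS = 20                 # assumed Opus frame duration
HEADER_BYTES = 20 + 8 + 12    # IPv4 + UDP + RTP headers (SRTP tag not counted)


def opus_payload_bytes(bitrate_bps: int, frame_ms: int = FRAME_MS) -> float:
    """Average encoded payload size per frame at a given codec bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8


def on_wire_kbps(bitrate_bps: int, frame_ms: int = FRAME_MS) -> float:
    """Approximate network bitrate once per-packet headers are included."""
    packets_per_second = 1000 / frame_ms
    packet_bytes = opus_payload_bytes(bitrate_bps, frame_ms) + HEADER_BYTES
    return packet_bytes * 8 * packets_per_second / 1000


for bitrate in (6_000, 24_000, 64_000):
    print(bitrate, opus_payload_bytes(bitrate), on_wire_kbps(bitrate))
# 6000 15.0 22.0
# 24000 60.0 40.0
# 64000 160.0 80.0
```

At low bitrates the headers dominate: a 6 kbps stream still costs roughly 22 kbps on the wire under these assumptions, which is one reason dropping the codec bitrate further brings little benefit.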
See audio codecs for voice agents: Opus, PCMU, and more.
Sample rate
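WebRTC supports:
- 8 kHz: narrowband.
- 16 kHz: wideband.
- 48 kHz: full band.
16 kHz is the sweet spot for speech because it matches the input rate most STT models expect. Note that Opus is always signaled with a 48 kHz RTP clock regardless of the audio bandwidth actually encoded, so servers commonly decode and then resample to 16 kHz before STT.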
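Sample rate determines how much raw audio the server handles per frame once it is decoded. A small worked calculation, assuming 16-bit mono PCM and 20 ms frames (both assumptions, not requirements of WebRTC):

```python
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM, mono (assumed)


def pcm_frame(sample_rate_hz: int, frame_ms: int = 20) -> tuple[int, int]:
    """Return (samples, bytes) for one decoded frame at the given sample rate."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples, samples * BYTES_PER_SAMPLE


for rate in (8_000, 16_000, 48_000):
    print(rate, pcm_frame(rate))
# 8000 (160, 320)
# 16000 (320, 640)
# 48000 (960, 1920)
```

These sizes are useful when configuring buffers between the WebRTC endpoint and an STT client that expects fixed-size chunks.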
TURN servers
When direct P2P fails (NAT, firewalls):
- Client connects to TURN server.
- TURN relays traffic.
- Bandwidth and latency cost.
Need TURN for reliability. ~5-15% of connections require it.
Free public STUN servers (such as Google's) are fine for address discovery, but STUN does not relay media. Production needs a hosted TURN service.
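A signaling server typically hands the client its ICE server list, including short-lived TURN credentials. The sketch below assumes the TURN server uses shared-secret, time-limited credentials (for example coturn's `use-auth-secret` mode), where the username is `expiry:user_id` and the password is the base64 HMAC-SHA1 of that username. The hostname `turn.example.com`, the function names, and the one-hour TTL are placeholders.

```python
import base64
import hashlib
import hmac
import time


def turn_credentials(shared_secret: str, user_id: str, ttl_s: int = 3600, now=None):
    """Build a time-limited TURN username/password pair from a shared secret."""
    issued_at = int(now if now is not None else time.time())
    username = f"{issued_at + ttl_s}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return username, base64.b64encode(digest).decode()


def ice_servers(shared_secret: str, user_id: str, now=None):
    """ICE server list in the shape of the browser's RTCConfiguration.iceServers."""
    username, credential = turn_credentials(shared_secret, user_id, now=now)
    return [
        {"urls": "stun:stun.l.google.com:19302"},  # public STUN, discovery only
        {
            # Hypothetical TURN host; replace with your own deployment.
            "urls": [
                "turn:turn.example.com:3478?transport=udp",
                "turns:turn.example.com:5349?transport=tcp",
            ],
            "username": username,
            "credential": credential,
        },
    ]
```

Because the credential expires, a leaked value is only useful for a limited time, and the long-term secret never reaches the browser.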
Encryption
DTLS-SRTP:
- Mandatory in WebRTC.
- Encrypts audio between the two WebRTC endpoints.
- Intermediaries, including TURN relays, can't read the media.
For voice agents, your server sees decrypted audio — necessary for STT.
Latency
WebRTC audio transport:
- 20-80ms typical RTT.
- End-to-end voice agent: plus processing stack.
- Better than HTTP webhook (saves 50-150ms).
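A quick budget calculation shows how transport fits into the end-to-end number. The transport and webhook ranges come from the figures above; the per-stage processing values in `PROCESSING_MS` are hypothetical placeholders, so substitute measurements from your own stack. Transport is counted once as a full round trip (audio in plus audio out).

```python
TRANSPORT_RTT_MS = (20, 80)      # WebRTC transport range quoted above
WEBHOOK_PENALTY_MS = (50, 150)   # extra cost attributed to HTTP webhooks above

# Hypothetical processing budget; these numbers are illustrative only.
PROCESSING_MS = {"stt": 100, "llm_first_token": 150, "tts_first_audio": 80}


def end_to_end_ms(transport_ms: int, extra_ms: int = 0) -> int:
    """Transport plus any webhook penalty plus the assumed processing stack."""
    return transport_ms + extra_ms + sum(PROCESSING_MS.values())


webrtc_range = tuple(end_to_end_ms(t) for t in TRANSPORT_RTT_MS)
webhook_range = tuple(
    end_to_end_ms(t, p) for t, p in zip(TRANSPORT_RTT_MS, WEBHOOK_PENALTY_MS)
)
print(webrtc_range)   # (350, 410)
print(webhook_range)  # (400, 560)
```

With these assumed processing times, the transport choice alone moves the total by tens to hundreds of milliseconds, which matters when the processing stack already consumes most of the budget.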
Mobile considerations
iOS/Android:
- Native WebRTC SDKs.
- In-app voice agent experiences feasible.
- Battery and network impact moderate.
Integration with existing voice
Can bridge WebRTC ↔ SIP ↔ PSTN:
- Browser caller.
- WebRTC to gateway.
- Gateway to SIP trunk to PSTN.
Complex but possible. See SIP vs WebRTC for voice agents.
Frameworks
- LiveKit Agents. Opinionated, WebRTC-native, strong DX.
- Pipecat. Framework for voice agents, WebRTC transport supported.
- Custom. Build on WebRTC library (aiortc, webrtc-sdk).
Most builders use frameworks rather than raw WebRTC.
Signaling implementation
WebSocket-based:
Client → WS → Signaling Server
CONNECT → SDP_OFFER → SDP_ANSWER → ICE_CANDIDATES
→ DTLS handshake → AUDIO streams
Standard flow; abstracted by WebRTC libraries.
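The ordering in that flow can be enforced with a small state machine on the signaling server. This is a simplified sketch: the state names and the `SignalingSession` class are assumptions, and it requires candidates to arrive after the answer, whereas real deployments using trickle ICE may exchange candidates earlier.

```python
# (current_state, message_kind) -> next_state
TRANSITIONS = {
    ("new", "CONNECT"): "connected",
    ("connected", "SDP_OFFER"): "offered",
    ("offered", "SDP_ANSWER"): "answered",
    ("answered", "ICE_CANDIDATE"): "answered",  # candidates may repeat
}


class SignalingSession:
    """Tracks where one client is in the offer/answer exchange."""

    def __init__(self):
        self.state = "new"
        self.candidates = 0

    def on_message(self, kind: str) -> str:
        key = (self.state, kind)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {kind} in state {self.state}")
        self.state = TRANSITIONS[key]
        if kind == "ICE_CANDIDATE":
            self.candidates += 1
        return self.state


session = SignalingSession()
for kind in ("CONNECT", "SDP_OFFER", "SDP_ANSWER", "ICE_CANDIDATE", "ICE_CANDIDATE"):
    session.on_message(kind)
```

Rejecting out-of-order messages early makes signaling bugs show up as clear errors instead of calls that connect but never carry audio.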
Handling reconnects
Connection drops:
- WebRTC library detects.
- Signaling re-establishes.
- ICE restarts.
- Audio resumes.
Behavior depends on the framework: some recover gracefully, others drop the call.
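If you manage reconnection yourself, the usual pattern is to retry signaling with exponential backoff and a little jitter so that many clients don't reconnect in lockstep. A minimal sketch follows; the function name and the default base, cap, and jitter values are illustrative assumptions.

```python
import random


def reconnect_delays(max_attempts=5, base_s=0.5, cap_s=8.0, jitter=0.2, rng=None):
    """Yield wait times in seconds: exponential backoff, capped, with +/- jitter."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        delay = min(cap_s, base_s * 2 ** attempt)
        # Spread retries by up to +/- `jitter` as a fraction of the delay.
        yield delay * (1 + rng.uniform(-jitter, jitter))


# With jitter disabled the schedule is deterministic:
print(list(reconnect_delays(max_attempts=6, jitter=0.0)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

A caller would sleep for each yielded delay, attempt to re-establish signaling and restart ICE, and give up (ending the call cleanly) once the generator is exhausted.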
Observability
- Connection state logs.
- Audio stats (jitter, loss, bandwidth).
- Codec negotiation.
- Packet-level metrics if needed.
WebRTC has a built-in getStats API; use it.
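Most getStats counters are cumulative, so useful metrics come from the difference between two snapshots. The sketch below assumes each snapshot is a dict holding the inbound RTP fields `packetsReceived`, `packetsLost`, `bytesReceived`, `jitter` (seconds), and `timestamp` (milliseconds), as a client might report them to your server; the function name and the reporting path are assumptions.

```python
def interval_metrics(prev: dict, curr: dict) -> dict:
    """Loss, bitrate, and jitter for the interval between two stats snapshots."""
    received = curr["packetsReceived"] - prev["packetsReceived"]
    lost = curr["packetsLost"] - prev["packetsLost"]
    received_bytes = curr["bytesReceived"] - prev["bytesReceived"]
    elapsed_s = (curr["timestamp"] - prev["timestamp"]) / 1000
    expected = received + lost
    return {
        "loss_pct": 100 * lost / expected if expected > 0 else 0.0,
        "bitrate_kbps": received_bytes * 8 / 1000 / elapsed_s if elapsed_s > 0 else 0.0,
        "jitter_ms": curr["jitter"] * 1000,  # jitter is a gauge, not a counter
    }


# Two hypothetical snapshots taken 10 seconds apart:
prev = {"packetsReceived": 1000, "packetsLost": 10, "bytesReceived": 60_000,
        "jitter": 0.004, "timestamp": 0}
curr = {"packetsReceived": 1490, "packetsLost": 20, "bytesReceived": 90_000,
        "jitter": 0.012, "timestamp": 10_000}
metrics = interval_metrics(prev, curr)
```

Computing per-interval values rather than lifetime averages makes short bursts of loss visible, which is usually what correlates with a user hearing a glitch.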
Common pitfalls
No TURN server. 5-15% of connections fail silently.
Insufficient bandwidth. Opus adapts but bad experience at very low bandwidth.
Browser compatibility assumptions. Test all target browsers.
Mobile battery drain. A continuous WebRTC session uses steady bandwidth and power; mobile users notice.
Non-WebRTC-aware STT/TTS. STT services expect audio chunks in a fixed format; your WebRTC stream delivers decoded RTP frames. Bridge the two properly.
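For the STT bridging pitfall, one common approach is a small re-chunking buffer between the WebRTC receiver and the STT client. The sketch below assumes decoded PCM arrives in arbitrary-sized pieces and the STT client wants fixed-size chunks; `PcmChunker` and the 640-byte default (20 ms of 16 kHz, 16-bit mono) are illustrative choices.

```python
class PcmChunker:
    """Accumulates PCM bytes and emits fixed-size chunks for an STT client."""

    def __init__(self, chunk_bytes: int = 640):  # 20 ms at 16 kHz, 16-bit mono
        self.chunk_bytes = chunk_bytes
        self._buffer = bytearray()

    def push(self, pcm: bytes) -> list[bytes]:
        """Add decoded audio; return every complete chunk now available."""
        self._buffer.extend(pcm)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any trailing partial chunk, e.g. at end of utterance."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest


chunker = PcmChunker(chunk_bytes=4)
first = chunker.push(b"abc")       # not enough for a chunk yet
second = chunker.push(b"defghi")   # completes two chunks
tail = chunker.flush()
```

This handles chunk sizing only; sample-rate conversion and channel downmixing, if needed, would happen before the audio reaches the chunker.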
Security
- HTTPS/WSS (TLS) for signaling.
- DTLS-SRTP for media (automatic).
- Authentication at signaling layer.
- Rate limiting.
Standard web security practices apply.
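Rate limiting at the signaling layer can be as simple as a token bucket per client or per API key. A minimal sketch, with the class name, the injectable clock, and the example limits all chosen for illustration:

```python
import time


class TokenBucket:
    """Allows `burst` requests at once, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock so the behavior is deterministic:
fake_now = [0.0]
bucket = TokenBucket(rate_per_s=1, burst=2, clock=lambda: fake_now[0])
results = [bucket.allow(), bucket.allow(), bucket.allow()]  # third is rejected
fake_now[0] = 1.0                                           # one second later
results.append(bucket.allow())                              # one token refilled
```

Applying the check before creating a session keeps abusive clients from consuming TURN bandwidth and STT/LLM/TTS capacity, which are the expensive resources behind the signaling endpoint.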
When to prefer WebRTC
- Browser voice agent experiences.
- Mobile app embedded voice.
- Low latency critical (sub-400ms end-to-end).
- In-product voice (assistant inside SaaS app).
When to prefer SIP / webhook
- Phone calls (PSTN).
- Enterprise telephony integration.
- Legacy infrastructure.
Hybrid
Run both:
- WebRTC for browser/app users.
- SIP/PSTN for phone users.
- Same backend voice agent.
Common architecture in 2026.
Related reading
- Latency Engineering for Real-Time Voice Agents
- Echo Cancellation in Real-Time Voice AI
- The Engineering Behind Sub-Second Voice Agents
- Text-to-Speech in 2026: The State of the Art
- How to Benchmark a Voice Agent's End-to-End Latency
FAQ
Can WebRTC handle multi-party calls? Yes — designed for conferencing. Voice agents usually one-to-one though.
What about iOS Safari quirks? Well-supported now; was problematic in 2022.
Do we need our own TURN server? Usually yes for production. Free options unreliable.
Can we record WebRTC audio? Yes, on server side. Same compliance rules apply.
How does WebRTC compare to WebSocket audio? WebRTC is purpose-built for real-time audio. WebSocket is general; works but less optimized.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.