Streaming Audio Over WebRTC for Voice Agents

Tyler Weitzman
March 21, 2026 · 5 min read
Speechify

WebRTC is the browser-native way to stream real-time audio. For voice agents embedded in web or mobile apps, it is often the best transport: lower latency than webhook-based telephony, built-in encryption, native NAT traversal, and cross-platform support. Understanding how WebRTC fits into voice agent architectures, and what tradeoffs come with it, is foundational for anyone building browser or in-app voice experiences.

TL;DR

  • WebRTC provides real-time audio between browser/app and voice agent server.
  • Lower latency than webhook-based telephony; better for in-app experiences.
  • Built-in encryption (DTLS-SRTP), NAT traversal (ICE/STUN/TURN).
  • Opus codec standard; 16 kHz or 48 kHz typical.
  • Requires signaling server, STUN/TURN infrastructure, and WebRTC-aware voice agent.

WebRTC overview

Core components:

  • Signaling: establishes the call (via WebSocket usually).
  • ICE (Interactive Connectivity Establishment): NAT traversal.
  • STUN: discover public IP through NAT.
  • TURN: relay when direct P2P fails.
  • DTLS-SRTP: encryption.
  • Opus: audio codec standard.

Lots of moving parts, but the WebRTC library handles most of them.

Why WebRTC for voice agents

Lower latency. Direct browser-to-server audio. No HTTP webhook round-trip.

Browser-native. No additional plugins; works on Chrome, Firefox, Safari, Edge.

Mobile native. iOS/Android support via WebView or native WebRTC SDKs.

Encryption default. DTLS-SRTP built in.

NAT-friendly. ICE handles most NAT scenarios.

Peer-to-peer possible, though voice agents usually need a server.

The signaling server

Even "peer-to-peer" WebRTC needs signaling:

  • Client says "I want a call."
  • Server coordinates connection.
  • Usually WebSocket-based.

Signaling server handles:

  • SDP (Session Description Protocol) exchange.
  • ICE candidate exchange.
  • Session lifecycle.
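
The three responsibilities above can be sketched as a small in-memory relay. This is an illustration only: the message types (`offer`, `answer`, `candidate`), the `SignalingRelay` class, and the `send` callbacks are hypothetical names for this sketch, and a real server would sit behind a WebSocket library and add authentication.

```python
import json
from typing import Callable, Dict

# Hypothetical message types for this sketch; real deployments define their own.
RELAYED_TYPES = {"offer", "answer", "candidate"}

Send = Callable[[str], None]


class SignalingRelay:
    """Tracks sessions and forwards SDP/ICE messages to the other party."""

    def __init__(self) -> None:
        # session_id -> {peer_id: callback that delivers a message to that peer}
        self.sessions: Dict[str, Dict[str, Send]] = {}

    def join(self, session_id: str, peer_id: str, send: Send) -> None:
        self.sessions.setdefault(session_id, {})[peer_id] = send

    def leave(self, session_id: str, peer_id: str) -> None:
        peers = self.sessions.get(session_id, {})
        peers.pop(peer_id, None)
        if not peers:
            self.sessions.pop(session_id, None)  # session ends when empty

    def handle(self, session_id: str, sender_id: str, raw: str) -> int:
        """Forward one message; returns how many peers received it."""
        msg = json.loads(raw)
        if msg.get("type") not in RELAYED_TYPES:
            raise ValueError(f"unsupported message type: {msg.get('type')!r}")
        delivered = 0
        for peer_id, send in self.sessions.get(session_id, {}).items():
            if peer_id != sender_id:
                send(json.dumps({**msg, "from": sender_id}))
                delivered += 1
        return delivered
```

The relay never inspects the SDP or candidates; it only routes them, which is why signaling can stay lightweight even when the media path is complex.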

The audio server

Voice agents aren't truly peer-to-peer; the "peer" is your server.

Server:

  • Accepts WebRTC connection.
  • Receives audio from client.
  • Passes to STT.
  • Generates response via LLM.
  • Synthesizes via TTS.
  • Sends audio back via WebRTC.

Frameworks like Pipecat and LiveKit Agents handle this.
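
To make the server-side loop concrete, here is a minimal sketch of one conversational turn. The three stage functions are hypothetical stand-ins, not the API of Pipecat, LiveKit Agents, or any STT/LLM/TTS vendor; real pipelines also stream partial results rather than processing a whole turn at once.

```python
from typing import Callable


# All three stages are hypothetical stand-ins; a real agent would call
# actual STT, LLM, and TTS services here.
def fake_stt(pcm: bytes) -> str:
    return f"<{len(pcm)} bytes of speech>"


def fake_llm(transcript: str) -> str:
    return f"reply to {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend this is synthesized audio


def handle_turn(
    pcm_in: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """One conversational turn: audio in -> text -> reply -> audio out."""
    transcript = stt(pcm_in)
    reply = llm(transcript)
    return tts(reply)
```

Passing the stages in as callables keeps the WebRTC transport independent of whichever speech and language services sit behind it.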

Architecture

Typical:

Browser/App (WebRTC client) 
  ↔ Signaling Server (WebSocket)
  ↔ Voice Agent Server (WebRTC endpoint)
    ↔ STT / LLM / TTS services

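All processing happens on the server. The signaling server only carries SDP and ICE messages; once the connection is up, audio flows directly between the client and the voice agent server (or through a TURN relay).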

Opus codec

Opus is the standard WebRTC audio codec. Its bitrate is adjustable, and quality tracks it:

  • 6 kbps: poor quality, conservative.
  • 24 kbps: good quality.
  • 64 kbps: near-transparent.

For voice agents, 24-32 kbps is usually sufficient.
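
As a rough worked example, the codec bitrate is not the whole cost on the wire, because every packet also carries headers. The sketch below assumes 20 ms Opus frames and 40 bytes of IPv4 + UDP + RTP headers per packet; it ignores the SRTP authentication tag and any TURN framing, so treat the results as a lower bound.

```python
def opus_payload_bytes(bitrate_bps: int, frame_ms: int = 20) -> float:
    """Average encoded bytes per frame at a given codec bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8


def wire_bitrate_bps(
    bitrate_bps: int, frame_ms: int = 20, overhead_bytes: int = 40
) -> float:
    """Approximate on-the-wire bitrate, one frame per packet.

    overhead_bytes=40 assumes IPv4 (20) + UDP (8) + RTP (12) headers only.
    """
    packets_per_second = 1000 / frame_ms
    packet_bytes = opus_payload_bytes(bitrate_bps, frame_ms) + overhead_bytes
    return packet_bytes * 8 * packets_per_second


for kbps in (6, 24, 64):
    wire = wire_bitrate_bps(kbps * 1000) / 1000
    print(f"{kbps} kbps codec -> about {wire:.0f} kbps on the wire")
```

Under these assumptions the fixed header cost is 16 kbps regardless of codec bitrate, which is why very low Opus bitrates save less bandwidth than the numbers suggest.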

See audio codecs for voice agents: Opus, PCMU, and more.

Sample rate

WebRTC supports:

  • 8 kHz: narrowband.
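  • 16 kHz: wideband.
  • 48 kHz: fullband (the rate Opus is always signaled at in WebRTC).

16 kHz is the sweet spot for voice agents. It aligns with most STT, so server-side audio is often resampled to it.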
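
The sample rate determines how much raw PCM the server handles once audio is decoded. A quick calculation, assuming 20 ms frames of 16-bit mono audio (both are assumptions for this sketch, though common in practice):

```python
def pcm_frame(sample_rate_hz: int, frame_ms: int = 20, bytes_per_sample: int = 2):
    """Return (samples, bytes) in one mono PCM frame."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples, samples * bytes_per_sample


for rate in (8000, 16000, 48000):
    samples, size = pcm_frame(rate)
    print(f"{rate} Hz: {samples} samples, {size} bytes per 20 ms frame")
```

At 16 kHz a frame is a third the size of the same frame at 48 kHz, which is part of why 16 kHz is a comfortable fit for STT input.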

TURN servers

When direct P2P fails (NAT, firewalls):

  • Client connects to TURN server.
  • TURN relays traffic.
  • Bandwidth and latency cost.

TURN is needed for reliability: roughly 5-15% of connections require a relay.

Free public STUN servers (such as Google's) cover initial address discovery, but STUN does not relay media. Production needs hosted TURN.
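
A sketch of the ICE server list a signaling server might hand to clients, shaped like the `iceServers` array that `RTCPeerConnection` accepts. The TURN hostname and credentials are placeholders for illustration; ports 3478 and 5349 are the conventional TURN and TURN-over-TLS ports, but your provider may differ.

```python
import json


def build_ice_servers(turn_host: str, username: str, credential: str) -> list:
    """Build an iceServers list: public STUN for discovery, TURN for relay."""
    return [
        # Google's public STUN server: address discovery only, no relaying.
        {"urls": ["stun:stun.l.google.com:19302"]},
        {
            # turn_host is a placeholder for your own or a hosted TURN service.
            "urls": [
                f"turn:{turn_host}:3478?transport=udp",
                f"turn:{turn_host}:3478?transport=tcp",
                f"turns:{turn_host}:5349?transport=tcp",
            ],
            "username": username,
            "credential": credential,
        },
    ]


config = {"iceServers": build_ice_servers("turn.example.com", "user", "secret")}
print(json.dumps(config, indent=2))
```

Offering TCP and TLS variants alongside UDP gives clients behind restrictive firewalls a fallback path.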

Encryption

DTLS-SRTP:

  • Mandatory in WebRTC.
  • Encrypts audio between the two WebRTC endpoints (here, the client and your server).
  • On-path observers, including TURN relays, cannot read the media.

For voice agents, your server sees decrypted audio — necessary for STT.

Latency

WebRTC audio transport:

  • 20-80ms typical RTT.
  • End-to-end voice agent latency: transport plus the STT/LLM/TTS processing stack.
  • Faster than an HTTP webhook round-trip (saves 50-150ms).
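
A latency budget makes the tradeoff easier to reason about. In the sketch below, only the transport figure comes from the range above; the STT, LLM, and TTS numbers are invented for illustration, and the 400 ms target is the "sub-400ms end-to-end" threshold this article uses for latency-critical cases.

```python
BUDGET_MS = 400  # the sub-400ms end-to-end target for latency-critical cases

# Transport is within the 20-80ms range above; the other stage values are
# made-up illustrative numbers, not measurements.
stages_ms = {
    "webrtc_transport": 50,
    "stt_final": 120,
    "llm_first_token": 150,
    "tts_first_audio": 70,
}


def total_latency_ms(stages: dict) -> int:
    return sum(stages.values())


total = total_latency_ms(stages_ms)
print(f"WebRTC path: {total} ms, headroom {BUDGET_MS - total} ms")

# Adding a webhook round-trip (100 ms, mid-range of the 50-150ms above).
with_webhook = total + 100
print(f"Webhook path: {with_webhook} ms, over budget: {with_webhook > BUDGET_MS}")
```

With these example numbers the transport choice alone decides whether the agent meets the target, which is the practical meaning of the 50-150ms saving.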

Mobile considerations

iOS/Android:

  • Native WebRTC SDKs.
  • In-app voice agent experiences feasible.
  • Battery and network impact moderate.

Integration with existing voice

Can bridge WebRTC ↔ SIP ↔ PSTN:

  • Browser caller.
  • WebRTC to gateway.
  • Gateway to SIP trunk to PSTN.

Complex but possible. See SIP vs WebRTC for voice agents.

Frameworks

  • LiveKit Agents. Opinionated, WebRTC-native, strong DX.
  • Pipecat. Framework for voice agents, WebRTC transport supported.
  • Custom. Build on WebRTC library (aiortc, webrtc-sdk).

Most builders use frameworks rather than raw WebRTC.

Signaling implementation

WebSocket-based:

Client → WS → Signaling Server
  CONNECT → SDP_OFFER → SDP_ANSWER → ICE_CANDIDATES
  → DTLS handshake → AUDIO streams

Standard flow; abstracted by WebRTC libraries.
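
The ordering in the flow above can be enforced with a small state machine. This sketch reuses the message names from the diagram; the `SignalingSession` class is hypothetical, and it is stricter than real deployments, where trickle ICE often lets candidates arrive before the answer.

```python
ORDER = ["CONNECT", "SDP_OFFER", "SDP_ANSWER", "ICE_CANDIDATES"]


class SignalingSession:
    """Accepts signaling messages only in the expected order."""

    def __init__(self) -> None:
        self.stage = -1  # index in ORDER of the last accepted message

    def on_message(self, kind: str) -> None:
        if kind not in ORDER:
            raise ValueError(f"unknown message: {kind}")
        idx = ORDER.index(kind)
        is_next = idx == self.stage + 1
        # Candidates trickle in one at a time, so repeats are allowed.
        is_repeat_candidate = kind == "ICE_CANDIDATES" and idx == self.stage
        if not (is_next or is_repeat_candidate):
            raise ValueError(f"unexpected {kind} at stage {self.stage}")
        self.stage = idx

    @property
    def ready_for_media(self) -> bool:
        """True once candidates have been exchanged; DTLS and audio follow."""
        return self.stage == len(ORDER) - 1
```

Rejecting out-of-order messages early tends to surface client bugs at the signaling layer instead of as a silent media failure later.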

Handling reconnects

Connection drops:

  • WebRTC library detects.
  • Signaling re-establishes.
  • ICE restarts.
  • Audio resumes.

Behavior depends on the framework: some recover gracefully, others drop the call.
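
If you handle reconnects yourself, spacing out the attempts avoids hammering the signaling server during an outage. A minimal sketch of capped exponential backoff with optional jitter; the base delay, cap, and jitter fraction are illustrative defaults, not values from any WebRTC library.

```python
import random


def backoff_delays(attempts, base_s=0.5, cap_s=8.0, jitter=0.0, rng=None):
    """Delays (seconds) before each reconnect attempt.

    Each delay doubles up to cap_s; jitter adds up to that fraction on top
    so many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap_s, base_s * 2 ** n)
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays(6))              # deterministic schedule, no jitter
print(backoff_delays(6, jitter=0.2))  # same schedule with up to 20% added
```

A cap of a few seconds keeps the gap short enough that a user mid-conversation may still be there when audio resumes.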

Observability

  • Connection state logs.
  • Audio stats (jitter, loss, bandwidth).
  • Codec negotiation.
  • Packet-level metrics if needed.

WebRTC has a built-in getStats API (RTCPeerConnection.getStats()); use it.
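
Most getStats counters are cumulative, so useful metrics come from the difference between two snapshots. The sketch below assumes the client uploads `inbound-rtp` reports as dictionaries with the standard field names (`timestamp` in milliseconds, `packetsReceived`, `packetsLost`, `bytesReceived`, `jitter` in seconds); the sample values are invented.

```python
def interval_metrics(prev: dict, curr: dict) -> dict:
    """Loss, bitrate, and jitter over the interval between two snapshots."""
    dt_s = (curr["timestamp"] - prev["timestamp"]) / 1000  # timestamps are ms
    if dt_s <= 0:
        raise ValueError("snapshots must be in chronological order")
    received = curr["packetsReceived"] - prev["packetsReceived"]
    lost = curr["packetsLost"] - prev["packetsLost"]
    expected = received + lost
    return {
        "loss_pct": 100 * lost / expected if expected > 0 else 0.0,
        "bitrate_kbps": (curr["bytesReceived"] - prev["bytesReceived"]) * 8 / dt_s / 1000,
        "jitter_ms": curr["jitter"] * 1000,  # reported in seconds
    }


# Invented sample snapshots taken five seconds apart.
prev = {"timestamp": 0, "packetsReceived": 1000, "packetsLost": 2,
        "bytesReceived": 60000, "jitter": 0.010}
curr = {"timestamp": 5000, "packetsReceived": 1245, "packetsLost": 7,
        "bytesReceived": 75000, "jitter": 0.012}
print(interval_metrics(prev, curr))
```

Sampling every few seconds and logging the interval values makes it possible to line up audio quality dips with specific moments in a call.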

Common pitfalls

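No TURN server. The 5-15% of connections that need a relay fail to connect, often without a clear error.

Insufficient bandwidth. Opus adapts, but the experience degrades at very low bandwidth.

Browser compatibility assumptions. Test all target browsers.

Mobile battery drain. WebRTC keeps the microphone, encoder, and radio active for the whole session; mobile users notice.

Non-WebRTC-aware STT/TTS. Most STT APIs expect PCM chunks at a fixed sample rate, while WebRTC delivers a stream of small Opus packets. Decode, resample, and re-chunk between them.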
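
One piece of that bridge is re-chunking: collecting small decoded frames into the larger chunks an STT service wants. The sketch assumes the audio has already been decoded from Opus to 16-bit mono PCM at 16 kHz, and that the STT service prefers 100 ms chunks; the `PcmChunker` class and both values are assumptions for illustration.

```python
class PcmChunker:
    """Buffers small PCM frames and emits fixed-size chunks for STT."""

    def __init__(self, sample_rate_hz=16000, chunk_ms=100, bytes_per_sample=2):
        self.chunk_bytes = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
        self._buf = bytearray()

    def push(self, frame: bytes) -> list:
        """Add one decoded frame; return any chunks that are now complete."""
        self._buf.extend(frame)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[: self.chunk_bytes]))
            del self._buf[: self.chunk_bytes]
        return chunks


chunker = PcmChunker()
frame_20ms = b"\x00" * 640  # 320 samples of 16-bit audio at 16 kHz
for i in range(5):
    ready = chunker.push(frame_20ms)
    print(f"frame {i + 1}: {len(ready)} chunk(s) ready")
```

Larger chunks reduce per-request overhead but add buffering delay, so the chunk size is a direct latency tradeoff.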

Security

  • HTTPS/WSS for signaling.
  • DTLS-SRTP for media (automatic).
  • Authentication at signaling layer.
  • Rate limiting.

Standard web security practices apply.
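
One way to authenticate at the signaling layer is a short-lived signed token that your app backend issues and the signaling server verifies on connect. This is a minimal sketch using HMAC-SHA256 from the standard library; the token format and function names are invented for illustration, and many teams would use an established format such as JWT instead.

```python
import hashlib
import hmac
import time


def issue_token(secret: bytes, user_id: str, ttl_s: int = 60, now=None) -> str:
    """Create 'user_id:expiry:signature' valid for ttl_s seconds."""
    expires = int((time.time() if now is None else now) + ttl_s)
    payload = f"{user_id}:{expires}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"


def verify_token(secret: bytes, token: str, now=None):
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        user_id, expires, sig = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(secret, f"{user_id}:{expires}".encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    if (time.time() if now is None else now) >= expires_at:
        return None
    return user_id
```

Keeping the lifetime short limits the damage if a token leaks, since it only needs to survive long enough to open the signaling connection.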

When to prefer WebRTC

  • Browser voice agent experiences.
  • Mobile app embedded voice.
  • Low latency critical (sub-400ms end-to-end).
  • In-product voice (assistant inside SaaS app).

When to prefer SIP / webhook

  • Phone calls (PSTN).
  • Enterprise telephony integration.
  • Legacy infrastructure.

Hybrid

Run both:

  • WebRTC for browser/app users.
  • SIP/PSTN for phone users.
  • Same backend voice agent.

Common architecture in 2026.

FAQ

Can WebRTC handle multi-party calls? Yes — designed for conferencing. Voice agents usually one-to-one though.

What about iOS Safari quirks? Well-supported now; was problematic in 2022.

Do we need our own TURN server? Usually yes for production. Free options unreliable.

Can we record WebRTC audio? Yes, on server side. Same compliance rules apply.

How does WebRTC compare to WebSocket audio? WebRTC is purpose-built for real-time audio. WebSocket is general; works but less optimized.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
