🔌 Integrations & Telephony

SIP Trunking 101 for Voice Agent Builders

Tyler Weitzman
March 22, 2026 · 7 min read
Speechify

SIP trunking is the unsexy plumbing that makes voice agents work at scale. It's the protocol and infrastructure that lets calls move between the public phone network and your voice AI without relying on a telephony provider's proprietary APIs. If you're building a voice agent for an enterprise, or running at high volume with tight latency requirements, you'll end up with SIP somewhere in your stack. This piece is the working engineer's primer: what SIP is, what trunking means, how it interacts with voice AI, and the operational considerations that bite you if you skip them.

TL;DR

  • SIP (Session Initiation Protocol) is the standard for Voice-over-IP call signaling.
  • SIP trunking means carrying calls between your infrastructure and a carrier.
  • For voice AI: SIP integration gives you lower latency and more control than webhook-based telephony.
  • Setup involves SIP domains, trunk configuration, codec negotiation, and authentication.
  • Common gotchas: NAT traversal, codec mismatches, one-way audio, carrier certification.

What SIP is

SIP is the signaling protocol for starting, modifying, and ending real-time sessions, primarily voice and video. It's a text-based protocol, HTTP-like in structure, that handles the "hello, I want to call you" part of a call. The actual audio flows over RTP (Real-time Transport Protocol), with addresses, ports, and codecs negotiated through SDP bodies carried in the SIP messages during setup.

A simplified SIP call setup:

  1. Caller's device sends INVITE to callee.
  2. Callee responds 180 Ringing.
  3. Callee answers, responds 200 OK.
  4. Caller acknowledges with ACK.
  5. RTP media flows.
  6. Either party sends BYE to end.
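The six steps above can be sketched as a small state machine. This is an illustration only: the state names and the `advance` and `run_call` helpers are made up for this example, and a real SIP stack also handles retransmissions, timers, CANCEL, and error responses.

```python
# Transitions for the simplified flow: (current state, message) -> next state.
# State names are illustrative and not taken from any SIP library.
TRANSITIONS = {
    ("idle", "INVITE"): "calling",
    ("calling", "180 Ringing"): "ringing",
    ("calling", "200 OK"): "answered",   # a callee may answer without ringing first
    ("ringing", "200 OK"): "answered",
    ("answered", "ACK"): "in_call",      # RTP media flows while in this state
    ("in_call", "BYE"): "terminated",
}


def advance(state: str, message: str) -> str:
    """Return the next call state, or raise if the message is unexpected."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"unexpected {message!r} in state {state!r}") from None


def run_call(messages: list[str]) -> str:
    """Feed a sequence of SIP messages through the state machine."""
    state = "idle"
    for message in messages:
        state = advance(state, message)
    return state


if __name__ == "__main__":
    flow = ["INVITE", "180 Ringing", "200 OK", "ACK", "BYE"]
    print(run_call(flow))  # prints "terminated"
```

Rejecting unexpected messages rather than ignoring them makes out-of-order signaling show up early in logs.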

For voice AI, SIP is how you plug into the phone network without going through a vendor's webhooks.

What trunking means

A SIP trunk is a connection between your SIP infrastructure and a carrier's SIP infrastructure that carries multiple concurrent calls. Instead of one analog phone line per concurrent call, a SIP trunk carries dozens or hundreds.

Trunks are sized by "concurrent call capacity" (CCC). A 50-CCC trunk can handle 50 simultaneous calls.

Common SIP trunk providers:

  • Twilio (Elastic SIP Trunking)
  • Bandwidth
  • Telnyx
  • Vonage
  • RingCentral
  • Nexmo (acquired by Vonage)
  • SignalWire

Each has trade-offs on pricing, coverage, quality, and API friendliness.

Why voice AI teams care

Voice AI can interact with telephony in a few ways:

Webhook-based. Twilio-style. Call comes in, Twilio hits your webhook, you respond with TwiML or media streams. Simple. Higher latency because of the HTTP round-trip.

SIP-based. Your voice AI is itself a SIP endpoint. Calls route directly to it via SIP INVITE. Lower latency, more control.

Hybrid. Webhook for setup, SIP for media streaming. Common in modern architectures.

For production voice AI at scale, SIP integration often wins on latency. For development and mid-scale, webhook is simpler.

The components

SIP server / PBX. Handles SIP signaling. Open-source options: FreeSWITCH, Asterisk, Kamailio. Commercial: Cisco CUCM, Avaya, etc. Cloud-native: Twilio SIP Domain, Vonage, etc.

Media server. Handles RTP audio. Often integrated with SIP server. For voice AI, this is where STT/TTS plug in.

Session Border Controller (SBC). At the network edge, handles security, NAT, codec transcoding. Enterprise-scale requires this; small deployments can often skip.

Carrier. Your SIP trunk provider. Connects you to the PSTN.

Codecs

Codecs compress/decompress audio. SIP negotiates which codec to use during call setup.
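As a rough sketch of that negotiation, the snippet below reads the audio codecs from an SDP offer and picks the first one from a local preference list. The helper names and the preference order are assumptions for illustration. A production answerer would also compare clock rates and format parameters.

```python
import re

# Static RTP payload types (RFC 3551) that may appear without an rtpmap line.
STATIC_PAYLOADS = {0: "PCMU", 8: "PCMA", 9: "G722", 18: "G729"}


def offered_codecs(sdp: str) -> list[str]:
    """Return codec names from the first audio m= line, in the offerer's order."""
    names = dict(STATIC_PAYLOADS)
    # Dynamic payload types are declared as "a=rtpmap:<pt> <name>/<rate>[/<channels>]".
    for pt, name in re.findall(r"^a=rtpmap:(\d+) ([^/\s]+)", sdp, flags=re.M):
        names[int(pt)] = name
    media = re.search(r"^m=audio \d+ \S+ (.+)$", sdp, flags=re.M)
    if not media:
        return []
    payload_types = [int(pt) for pt in media.group(1).split()]
    return [names[pt].upper() for pt in payload_types if pt in names]


def choose_codec(sdp: str, preferred=("OPUS", "PCMU", "PCMA")):
    """Pick the first locally preferred codec that the offer contains, else None."""
    offered = offered_codecs(sdp)
    for codec in preferred:
        if codec in offered:
            return codec
    return None


if __name__ == "__main__":
    offer = (
        "v=0\r\n"
        "m=audio 49170 RTP/AVP 0 8 111\r\n"
        "a=rtpmap:111 opus/48000/2\r\n"
    )
    print(offered_codecs(offer))  # prints ['PCMU', 'PCMA', 'OPUS']
    print(choose_codec(offer))    # prints OPUS
```

When `choose_codec` returns `None`, the two sides share no codec. That is the codec-mismatch case, where a SIP stack would typically reject the call with 488 Not Acceptable Here.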

Common codecs:

  • G.711 (µ-law / A-law). Companded 8 kHz PCM at 64 kbps, with no perceptual compression. The PSTN baseline: µ-law in North America and Japan, A-law in most other regions.
  • Opus. Modern, compressed, 6–510 kbps. Popular for VoIP.
  • G.722. HD voice, compressed. Used in some enterprise.
  • G.729. Compressed, 8 kbps. Older but efficient.

For voice AI, G.711 and Opus are most common. Opus at 16-24 kbps typically sounds as good as or better than narrowband PSTN audio, at well under half of G.711's 64 kbps.
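The codec bitrate is only the payload. Each RTP packet also carries RTP, UDP, and IP headers, so the on-the-wire rate per call direction is higher. The estimate below assumes IPv4, 20 ms packetization, and no Ethernet framing or SRTP overhead. The function name is illustrative.

```python
RTP_HEADER_BYTES = 12
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20


def on_wire_kbps(codec_kbps: float, ptime_ms: int = 20) -> float:
    """Estimate IP-level bandwidth for one direction of one call."""
    packets_per_second = 1000 / ptime_ms
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    packet_bytes = payload_bytes + RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    print(on_wire_kbps(64))  # G.711: prints 80.0
    print(on_wire_kbps(24))  # Opus at 24 kbps: prints 40.0
    print(on_wire_kbps(8))   # G.729: prints 24.0
```

At low bitrates the fixed 40 bytes of headers dominate, which is why halving the codec rate does not halve the bandwidth.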

See audio codecs for voice agents: Opus, PCMU, and more.

Authentication

SIP trunks authenticate via:

  • IP whitelisting. Only calls from specified IPs accepted. Simple, less secure.
  • Digest authentication. Username/password per call. More secure.
  • TLS + Digest. Encrypted signaling plus auth. Standard for enterprise.

Use TLS for production. Unencrypted SIP is a security risk.
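For a sense of what digest authentication computes, here is a minimal sketch of the MD5 response calculation from the HTTP Digest scheme that SIP reuses. The function name and the SIP credentials in the usage example are hypothetical. Real stacks also handle nonce expiry, `auth-int`, and newer hash algorithms.

```python
import hashlib


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str,
                    nc: str = "", cnonce: str = "", qop: str = "") -> str:
    """Compute the Digest 'response' value for a challenge (MD5 algorithm)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")
    ha2 = _md5_hex(f"{method}:{uri}")
    if qop:
        # With qop, the nonce count and client nonce are mixed into the hash.
        return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


if __name__ == "__main__":
    # Hypothetical trunk credentials and nonce, for illustration only.
    print(digest_response(
        username="agent-trunk", realm="sip.example.com", password="s3cret",
        method="REGISTER", uri="sip:sip.example.com", nonce="abc123",
    ))
```

The password never crosses the wire, only the hash does. Because MD5 is weak by modern standards, digest is usually paired with TLS.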

NAT traversal

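A common source of operational pain. SIP embeds IP addresses and ports in its headers and SDP body, and both SIP and RTP usually run over UDP, so NAT rewrites the packet addresses but not the ones inside the messages. Symptoms: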

  • One-way audio (you hear them, they don't hear you, or vice versa).
  • Calls drop after 30 seconds.
  • No audio at all.

Solutions:

  • Static public IPs. No NAT, no problem.
  • SBC in front. Translates NAT-ed addresses.
  • STUN/TURN servers. STUN helps clients discover their public IP; TURN relays media when a direct path fails.
  • rtpengine or similar media proxies.

Don't deploy SIP from behind carrier-grade NAT without planning for this.
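A quick diagnostic for one-way audio is to check whether your side is advertising a private address in its SDP. The sketch below looks at the session-level `c=` line only. The helper names are illustrative, and the address list covers the RFC 1918 ranges plus the carrier-grade NAT range.

```python
import ipaddress
import re

# RFC 1918 private ranges plus the carrier-grade NAT shared range (RFC 6598).
NON_ROUTABLE = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def sdp_media_address(sdp: str):
    """Return the IPv4 address from the first c= line, or None if absent."""
    match = re.search(r"^c=IN IP4 (\S+)", sdp, flags=re.M)
    return match.group(1) if match else None


def advertises_private_address(sdp: str) -> bool:
    """True if the SDP tells the far end to send RTP to a non-routable address."""
    address = sdp_media_address(sdp)
    if address is None:
        return False
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # hostname rather than a literal IP; cannot judge here
    return any(ip in network for network in NON_ROUTABLE)


if __name__ == "__main__":
    sdp = "v=0\r\nc=IN IP4 192.168.1.20\r\nm=audio 40000 RTP/AVP 0\r\n"
    print(advertises_private_address(sdp))  # prints True
```

If this returns `True` for traffic leaving your network, the far end is being told to send audio to an address it cannot reach. An SBC or media proxy normally rewrites that address.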

Common SIP signaling issues

487 Request Terminated. Call canceled before answer. Usually caller hung up.

480 Temporarily Unavailable. The callee could not be reached right now (not registered, do-not-disturb, or similar). The exact cause varies.

503 Service Unavailable. Carrier or trunk issue.

4xx errors in general. Often your side's problem (bad request, auth failure), though some, such as 480 and 487, only reflect what the caller or callee did.

5xx errors. Carrier or upstream issue.

Call completes but no audio. RTP / NAT problem.

Your SIP logs are your friend.
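The rules of thumb above can be encoded as a first-pass triage helper for log analysis. This is a sketch: the `triage` function and its hint strings are illustrative, and real diagnosis still needs the full SIP trace.

```python
# Hints for the specific codes discussed above.
SPECIFIC_HINTS = {
    480: "callee temporarily unavailable; check registration and routing",
    487: "request terminated; usually the caller hung up before answer",
    503: "service unavailable; suspect the carrier or trunk",
}


def triage(status_code: int) -> str:
    """Map a final SIP response code to a first place to look."""
    if status_code in SPECIFIC_HINTS:
        return SPECIFIC_HINTS[status_code]
    if 200 <= status_code < 300:
        return "call set up; if there is no audio, look at RTP and NAT"
    if 400 <= status_code < 500:
        return "4xx: start with your own request and credentials"
    if 500 <= status_code < 600:
        return "5xx: start with the carrier or upstream"
    return "no hint for this response class"


if __name__ == "__main__":
    for code in (200, 401, 487, 503):
        print(code, "->", triage(code))
```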

Media handling for voice AI

Once the call is connected, audio flows over RTP. For voice AI:

  • Receive. Your AI receives RTP frames from the caller, feeds them to STT.
  • Send. TTS generates audio, encoded as RTP frames, sent back to the caller.
  • Sync. Turn-taking, barge-in, and silence detection are all based on RTP timing.

Low-level audio handling is often abstracted by voice AI frameworks (Pipecat, LiveKit Agents). Direct RTP handling is for deep infrastructure teams.
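For teams that do handle RTP directly, the fixed header is 12 bytes and simple to parse. The sketch below uses only the standard library. The `RtpHeader` type and `parse_rtp` function are illustrative names, and header extensions and padding are deliberately ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp(packet: bytes) -> tuple[RtpHeader, bytes]:
    """Split an RTP packet into its fixed header fields and payload."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte RTP header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    header_length = 12 + 4 * csrc_count  # skip any CSRC identifiers
    header = RtpHeader(
        version=first >> 6,
        payload_type=second & 0x7F,
        marker=bool(second & 0x80),
        sequence=sequence,
        timestamp=timestamp,
        ssrc=ssrc,
    )
    return header, packet[header_length:]


if __name__ == "__main__":
    # One 20 ms G.711 frame: version 2, payload type 0 (PCMU), 160 payload bytes.
    packet = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234) + b"\xff" * 160
    header, payload = parse_rtp(packet)
    print(header.payload_type, header.sequence, len(payload))  # prints 0 1 160
```

The sequence number and timestamp are what turn-taking and jitter handling build on. Gaps in the sequence indicate loss, and timestamp deltas give the media clock.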

Carrier certification

Many carriers require certification before accepting production SIP traffic:

  • Interop testing. Verify SIP signaling compatibility.
  • Codec testing. Ensure codecs negotiate correctly.
  • Load testing. Confirm you can handle the expected CCC.
  • Failover testing. Plan for carrier-side outages.

Larger carriers (Verizon, AT&T) have formal certification. Smaller carriers may be more relaxed.

Deployment considerations

Primary and backup trunks. Don't single-carrier yourself at scale.

Geographic routing. Route calls to the nearest voice AI region for latency.

Concurrent call capacity planning. Size trunks for peak + headroom.

Monitoring. Call quality, drop rate, setup failure rate.

SBC vs no SBC. Production usually wants an SBC. Small deployments can often skip.

When to use SIP vs webhook

Use SIP when:

  • Latency requirements are strict (sub-400ms end-to-end).
  • Volume is high enough for cost differences to matter.
  • Enterprise SIP infrastructure already exists.
  • You need features webhooks don't expose cleanly (complex call flows, transfer, conference).

Use webhooks when:

  • You're early in development.
  • Volume is modest.
  • You want simpler operations.
  • You're happy with Twilio-style abstractions.

For a related comparison, see SIP vs WebRTC for voice agents.

Operational metrics

Beyond general call-handling metrics, track these SIP-specific ones:

  • Call setup success rate. % of INVITEs that become answered calls.
  • Post-dial delay (PDD). Time from INVITE to ring.
  • Call drop rate. % of calls dropping mid-call.
  • Jitter and packet loss. Audio quality indicators.
  • MOS (Mean Opinion Score). Perceived call quality on a 1-5 scale, usually estimated from network measurements rather than collected from listeners.
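The first three metrics can be computed from per-call records. The sketch below assumes a hypothetical `CallRecord` shape. Your SIP server's call detail records will have different field names.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class CallRecord:
    """Hypothetical per-call record, one per INVITE."""
    answered: bool
    pdd_ms: Optional[float]  # INVITE to first ringing response; None if it never rang
    dropped: bool            # call ended abnormally after it was answered


def summarize(records: list[CallRecord]) -> dict:
    """Compute setup success rate, median post-dial delay, and drop rate."""
    total = len(records)
    answered = [r for r in records if r.answered]
    delays = [r.pdd_ms for r in records if r.pdd_ms is not None]
    return {
        "setup_success_rate": len(answered) / total if total else 0.0,
        "median_pdd_ms": median(delays) if delays else None,
        "drop_rate": sum(r.dropped for r in answered) / len(answered) if answered else 0.0,
    }


if __name__ == "__main__":
    records = [
        CallRecord(answered=True, pdd_ms=800, dropped=False),
        CallRecord(answered=True, pdd_ms=1200, dropped=True),
        CallRecord(answered=True, pdd_ms=1000, dropped=False),
        CallRecord(answered=False, pdd_ms=None, dropped=False),
    ]
    print(summarize(records))
```

The drop rate is taken over answered calls only, so failed setups do not dilute it.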

FAQ

Do I need to understand SIP to use voice AI? Not for managed deployments. For enterprise at scale, yes.

What's a reasonable CCC to start? Depends on call volume. At 10 calls/minute avg, with 3-minute avg duration, you need ~30 CCC. Plus 50% headroom.
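The arithmetic in that answer is arrival rate times average duration, plus headroom. As a minimal sketch (the function name is illustrative, and it sizes for average load, so feed it peak-hour rates if your traffic is bursty):

```python
import math


def required_ccc(calls_per_minute: float, avg_duration_minutes: float,
                 headroom: float = 0.5) -> int:
    """Average concurrent calls (rate x duration) scaled by a headroom factor."""
    average_concurrent = calls_per_minute * avg_duration_minutes
    return math.ceil(average_concurrent * (1 + headroom))


if __name__ == "__main__":
    print(required_ccc(10, 3, headroom=0))  # prints 30
    print(required_ccc(10, 3))              # prints 45
```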

Can I self-host SIP? Yes, with FreeSWITCH or Asterisk. Non-trivial operationally.

What about G.711 vs Opus for voice AI? G.711 is simpler (uncompressed). Opus is better quality at lower bandwidth. Modern voice AI works with either.

How do we handle call recording over SIP? Tap the RTP stream at the SBC or the voice AI layer. Compliance implications apply.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
