SIP Trunking 101 for Voice Agent Builders
SIP trunking is the unsexy plumbing that makes voice agents work at scale. It's the protocol and infrastructure that lets calls move between the public phone network and your voice AI without relying on a telephony provider's proprietary APIs. If you're building a voice agent for an enterprise, or running at high volume with tight latency requirements, you'll end up with SIP somewhere in your stack. This piece is the working engineer's primer: what SIP is, what trunking means, how it interacts with voice AI, and the operational considerations that bite you if you skip them.
TL;DR
- SIP (Session Initiation Protocol) is the standard for Voice-over-IP call signaling.
- SIP trunking means carrying calls between your infrastructure and a carrier.
- For voice AI: SIP integration gives you lower latency and more control than webhook-based telephony.
- Setup involves SIP domains, trunk configuration, codec negotiation, and authentication.
- Common gotchas: NAT traversal, codec mismatches, one-way audio, carrier certification.
What SIP is
SIP is the signaling protocol for starting, modifying, and ending real-time sessions, primarily voice and video. It's a text-based protocol, HTTP-like in structure, that handles the "hello, I want to call you" part of a call. The actual audio flows over RTP (Real-time Transport Protocol), with the media parameters (addresses, ports, codecs) typically negotiated via SDP during SIP setup.
A simplified SIP call setup:
- Caller's device sends INVITE to callee.
- Callee responds 180 Ringing.
- Callee answers, responds 200 OK.
- Caller acknowledges with ACK.
- RTP media flows.
- Either party sends BYE to end.
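The call setup sequence above can be sketched as a small state machine. This is an illustrative simplification, not a real SIP stack: the state names and the `run_call` helper are hypothetical, and real dialogs also handle retransmissions, CANCEL, re-INVITE, and error responses.

```python
# Hypothetical, simplified caller-side states for a single call.
TRANSITIONS = {
    ("idle", "INVITE"): "calling",
    ("calling", "180"): "ringing",
    ("calling", "200"): "answered",   # a callee may answer without ringing first
    ("ringing", "200"): "answered",
    ("answered", "ACK"): "in_call",   # RTP media flows while in this state
    ("in_call", "BYE"): "ended",
}


def run_call(events):
    """Walk a list of SIP methods / status codes and return the final state."""
    state = "idle"
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {event!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


if __name__ == "__main__":
    print(run_call(["INVITE", "180", "200", "ACK", "BYE"]))  # prints "ended"
```

An out-of-order sequence such as `["INVITE", "BYE"]` raises `ValueError`, which is roughly the kind of check a log analyzer might use to flag abnormal call flows.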
For voice AI, SIP is how you plug into the phone network without going through a vendor's webhooks.
What trunking means
A SIP trunk is a connection between your SIP infrastructure and a carrier's SIP infrastructure that carries multiple concurrent calls. Instead of one analog phone line per concurrent call, a SIP trunk carries dozens or hundreds.
Trunks are sized by "concurrent call capacity" (CCC). A 50-CCC trunk can handle 50 simultaneous calls.
Common SIP trunk providers:
- Twilio (Elastic SIP Trunking)
- Bandwidth
- Telnyx
- Vonage
- RingCentral
- Nexmo (acquired by Vonage)
- SignalWire
Each has trade-offs on pricing, coverage, quality, and API friendliness.
Why voice AI teams care
Voice AI can interact with telephony in a few ways:
Webhook-based. Twilio-style. Call comes in, Twilio hits your webhook, you respond with TwiML or media streams. Simple. Higher latency because of the HTTP round-trip.
SIP-based. Your voice AI is itself a SIP endpoint. Calls route directly to it via SIP INVITE. Lower latency, more control.
Hybrid. Webhook for setup, SIP for media streaming. Common in modern architectures.
For production voice AI at scale, SIP integration often wins on latency. For development and mid-scale, webhook is simpler.
The components
SIP server / PBX. Handles SIP signaling. Open-source options: FreeSWITCH, Asterisk, Kamailio. Commercial: Cisco CUCM, Avaya, etc. Cloud-native: Twilio SIP Domain, Vonage, etc.
Media server. Handles RTP audio. Often integrated with SIP server. For voice AI, this is where STT/TTS plug in.
Session Border Controller (SBC). At the network edge, handles security, NAT, codec transcoding. Enterprise-scale requires this; small deployments can often skip.
Carrier. Your SIP trunk provider. Connects you to the PSTN.
Codecs
Codecs compress/decompress audio. SIP negotiates which codec to use during call setup.
Common codecs:
- G.711 (µ-law / A-law). Companded 8-bit PCM, 64 kbps, narrowband (8 kHz sampling). The PSTN baseline: µ-law in North America and Japan, A-law in most other regions.
- Opus. Modern, compressed, 6-510 kbps. Popular for VoIP.
- G.722. HD voice, compressed. Used in some enterprise.
- G.729. Compressed, 8 kbps. Older but efficient.
For voice AI, G.711 and Opus are most common. Opus at 16-24 kbps matches or exceeds PSTN quality at much lower bandwidth.
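To make the bandwidth difference concrete, here is a rough per-call estimate. It is a sketch under stated assumptions: 20 ms packetization, IPv4, no SRTP or Ethernet framing overhead, and a constant bitrate (Opus is usually variable, so its figures are approximations). The function name `call_bandwidth_kbps` is illustrative.

```python
RTP_HEADER = 12    # bytes, fixed RTP header without extensions
UDP_HEADER = 8     # bytes
IPV4_HEADER = 20   # bytes, no IP options


def call_bandwidth_kbps(codec_kbps, ptime_ms=20):
    """Approximate one-way IP bandwidth for a single call, in kbps."""
    packets_per_second = 1000 / ptime_ms
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    packet_bytes = payload_bytes + RTP_HEADER + UDP_HEADER + IPV4_HEADER
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    print(call_bandwidth_kbps(64))  # G.711: 80.0 kbps on the wire
    print(call_bandwidth_kbps(24))  # Opus at 24 kbps: 40.0 kbps on the wire
```

Header overhead is fixed per packet, so it weighs more heavily on low-bitrate codecs: the 40 bytes of headers add 25% to a G.711 stream but roughly 67% to a 24 kbps Opus stream.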
See audio codecs for voice agents: Opus, PCMU, and more.
Authentication
SIP trunks authenticate via:
- IP whitelisting. Only calls from specified IPs accepted. Simple, less secure.
- Digest authentication. Username and password verified through a challenge-response exchange, so the password itself is never sent in clear text. More secure.
- TLS + Digest. Encrypted signaling plus auth. Standard for enterprise.
Use TLS for production. Unencrypted SIP is a security risk.
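For readers curious about what digest authentication computes, the sketch below follows the MD5 scheme SIP borrows from HTTP Digest (RFC 2617). The helper name `digest_response` and the example credentials are hypothetical, and a production SIP stack would handle this for you, including newer hash algorithms where the carrier supports them.

```python
import hashlib


def _md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce,
                    qop=None, nc=None, cnonce=None):
    """Compute the 'response' value for a Digest Authorization header."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # identity and secret
    ha2 = _md5_hex(f"{method}:{uri}")                 # the request being made
    if qop:
        # With qop=auth, the nonce count and client nonce are mixed in as well.
        return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


if __name__ == "__main__":
    # Hypothetical values; the nonce comes from the server's 401/407 challenge.
    print(digest_response("agent01", "sip.example.com", "s3cret",
                          "INVITE", "sip:+15550100@sip.example.com",
                          "abc123nonce"))
```

Because the server's nonce is part of the hash, a captured response can't simply be replayed against a fresh challenge. The signaling itself is still readable without TLS, which is why the two are usually combined.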
NAT traversal
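A common source of operational pain. SIP usually, and RTP almost always, runs over UDP, and SIP advertises IP addresses and ports inside its headers and SDP body, which NAT devices don't rewrite. Symptoms: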
- One-way audio (you hear them, they don't hear you, or vice versa).
- Calls drop after 30 seconds.
- No audio at all.
Solutions:
- Static public IPs. No NAT, no problem.
- SBC in front. Translates NAT-ed addresses.
- STUN/TURN servers. STUN helps clients discover their public IP; TURN relays media when no direct path works.
- rtpengine or similar media proxies.
Don't deploy SIP from behind carrier-grade NAT without planning for this.
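A quick diagnostic for many one-way-audio cases is to check whether the SDP your side sends advertises a private address the far end can't reach. The sketch below is a minimal illustration: it only looks at IPv4 `c=` lines, and the function names and sample SDP are hypothetical.

```python
import ipaddress

# RFC 1918 private ranges plus the carrier-grade NAT shared range.
NAT_RANGES = [ipaddress.ip_network(n) for n in
              ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")]


def sdp_connection_addresses(sdp):
    """Return the addresses found in 'c=IN IP4 <address>' lines of an SDP body."""
    addresses = []
    for line in sdp.splitlines():
        parts = line.strip().split()
        if len(parts) == 3 and parts[0] == "c=IN" and parts[1] == "IP4":
            addresses.append(parts[2])
    return addresses


def advertises_private_address(sdp):
    """True if any c= line points at an address behind NAT."""
    for addr in sdp_connection_addresses(sdp):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # hostname rather than a literal IP; skip it
        if any(ip in net for net in NAT_RANGES):
            return True
    return False


if __name__ == "__main__":
    sample = "v=0\r\no=- 0 0 IN IP4 10.0.0.5\r\ns=-\r\nc=IN IP4 10.0.0.5\r\nt=0 0\r\nm=audio 40000 RTP/AVP 0\r\n"
    print(advertises_private_address(sample))  # True: the far end can't send RTP to 10.0.0.5
```

If this returns True for traffic leaving your network, something in the path (an SBC, a media proxy, or the SIP server's external-address setting) needs to rewrite the address before the carrier sees it.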
Common SIP signaling issues
487 Request Terminated. Call canceled before answer. Usually caller hung up.
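480 Temporarily Unavailable. Callee's endpoint is currently unreachable or not accepting calls (unregistered, do-not-disturb, or similar). The cause varies.
503 Service Unavailable. Carrier or trunk issue.
Other 4xx errors. Usually your side's problem (bad request, auth failure).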
5xx errors. Carrier or upstream issue.
Call completes but no audio. RTP / NAT problem.
Your SIP logs are your friend.
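When scanning logs, it helps to bucket responses by class before digging into individual codes. The sketch below assumes you have already extracted SIP status lines from your logs; the function names are illustrative, and the class names follow the standard SIP response classes.

```python
from collections import Counter

RESPONSE_CLASSES = {
    1: "provisional",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
    6: "global failure",
}


def parse_status_line(line):
    """Split 'SIP/2.0 487 Request Terminated' into (487, 'Request Terminated')."""
    version, code, reason = line.strip().split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError(f"not a SIP status line: {line!r}")
    return int(code), reason


def classify(code):
    return RESPONSE_CLASSES.get(code // 100, "unknown")


def summarize(status_lines):
    """Count responses per class across a batch of status lines."""
    return Counter(classify(parse_status_line(l)[0]) for l in status_lines)


if __name__ == "__main__":
    lines = ["SIP/2.0 180 Ringing", "SIP/2.0 200 OK",
             "SIP/2.0 487 Request Terminated", "SIP/2.0 503 Service Unavailable"]
    print(summarize(lines))
```

A sudden shift in the ratio of server errors to successes is often the first visible sign of a carrier-side problem, before individual call complaints arrive.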
Media handling for voice AI
Once the call is connected, audio flows over RTP. For voice AI:
- Receive. Your AI receives RTP frames from the caller, feeds them to STT.
- Send. TTS generates audio, encoded as RTP frames, sent back to the caller.
- Sync. Turn-taking, barge-in, silence detection: all based on RTP timing.
Low-level audio handling is often abstracted by voice AI frameworks (Pipecat, LiveKit Agents). Direct RTP handling is for deep infrastructure teams.
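If you do end up handling RTP directly, the fixed 12-byte header is simple to decode. This is a minimal sketch of the standard header layout; it ignores CSRC lists and header extensions, and the function name is illustrative.

```python
import struct


def parse_rtp_header(packet):
    """Decode the fixed 12-byte RTP header from a raw UDP payload."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    # Network byte order: two single bytes, 16-bit sequence,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,   # 0 = PCMU, 8 = PCMA; Opus uses a dynamic type
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


if __name__ == "__main__":
    # A hand-built example packet: version 2, payload type 0, 160 bytes of audio.
    pkt = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234) + b"\xff" * 160
    header = parse_rtp_header(pkt)
    print(header["version"], header["payload_type"], header["sequence"])  # 2 0 1
```

The sequence number and timestamp are what turn-taking and silence detection ultimately rely on: gaps in sequence indicate loss, and the timestamp gives the position of each frame on the sender's media clock.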
Carrier certification
Many carriers require certification before accepting production SIP traffic:
- Interop testing. Verify SIP signaling compatibility.
- Codec testing. Ensure codecs negotiate correctly.
- Load testing. Confirm you can handle the expected CCC.
- Failover testing. Plan for carrier-side outages.
Larger carriers (Verizon, AT&T) have formal certification. Smaller carriers may be more relaxed.
Deployment considerations
Primary and backup trunks. Don't single-carrier yourself at scale.
Geographic routing. Route calls to the nearest voice AI region for latency.
Concurrent call capacity planning. Size trunks for peak + headroom.
Monitoring. Call quality, drop rate, setup failure rate.
SBC vs no SBC. Production usually wants an SBC. Small deployments can often skip.
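For the capacity planning item above, a first-pass estimate multiplies the call arrival rate by the average call duration (Little's law) and adds headroom. This is a rough sketch: the function name and the 50% default are illustrative, and you should feed it your peak-hour rate rather than the daily average. Traffic-engineering models such as Erlang B give a more rigorous answer when blocking probability matters.

```python
import math


def required_ccc(calls_per_minute, avg_duration_minutes, headroom=0.5):
    """Estimate concurrent call capacity from arrival rate and call length."""
    # Little's law: average calls in progress = arrival rate x average duration.
    concurrent = calls_per_minute * avg_duration_minutes
    return math.ceil(concurrent * (1 + headroom))


if __name__ == "__main__":
    print(required_ccc(10, 3))              # 45: 30 concurrent calls plus 50% headroom
    print(required_ccc(10, 3, headroom=0))  # 30: no headroom
```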
When to use SIP vs webhook
Use SIP when:
- Latency requirements are strict (sub-400ms end-to-end).
- Volume is high enough for cost differences to matter.
- Enterprise SIP infrastructure already exists.
- You need features webhooks don't expose cleanly (complex call flows, transfer, conference).
Use webhooks when:
- You're early in development.
- Volume is modest.
- You want simpler operations.
- You're happy with Twilio-style abstractions.
For the comparison, see SIP vs WebRTC for voice agents.
Operational metrics
Beyond general call-handling metrics, track these SIP-specific ones:
- Call setup success rate. % of INVITEs that become answered calls.
- Post-dial delay (PDD). Time from INVITE to ring.
- Call drop rate. % of calls dropping mid-call.
- Jitter and packet loss. Audio quality indicators.
- MOS (Mean Opinion Score). Subjective call quality.
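Jitter in particular has a standard definition. The sketch below uses the running interarrival jitter estimator from the RTP specification (RFC 3550), where each new sample moves the estimate by one sixteenth of the difference. The function name and input format are assumptions for illustration, and timestamp wraparound is not handled.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Estimate jitter in milliseconds.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) pairs,
    in arrival order. clock_rate is the codec's RTP clock (8000 for G.711).
    """
    jitter = 0.0          # running estimate, in RTP timestamp units
    prev_transit = None
    for arrival, rtp_ts in packets:
        # Relative transit time: arrival on the media clock minus send timestamp.
        transit = arrival * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16
        prev_transit = transit
    return jitter / clock_rate * 1000


if __name__ == "__main__":
    steady = [(0.00, 0), (0.02, 160), (0.04, 320)]
    late = [(0.00, 0), (0.02, 160), (0.05, 320)]   # third packet is 10 ms late
    print(interarrival_jitter(steady))
    print(interarrival_jitter(late))
```

The 1/16 gain smooths out single late packets, so a sustained rise in this value is a more reliable signal of network trouble than any one delayed frame.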
Related reading
- Twilio + Voice Agents: A Complete Guide
- How to Integrate Voice Agents with a Custom REST API
- Sending Voice Agent Transcripts to Slack
- Connecting Voice Agents to Snowflake or BigQuery
- How to Port a Phone Number to Your Voice Agent
FAQ
Do I need to understand SIP to use voice AI? Not for managed deployments. For enterprise at scale, yes.
What's a reasonable CCC to start? Depends on call volume. At an average of 10 calls per minute with a 3-minute average duration, you need about 30 CCC (10 × 3), or about 45 with 50% headroom.
Can I self-host SIP? Yes, with FreeSWITCH or Asterisk. Non-trivial operationally.
What about G.711 vs Opus for voice AI? G.711 is simpler (uncompressed). Opus is better quality at lower bandwidth. Modern voice AI works with either.
How do we handle call recording over SIP? Tap the RTP stream at the SBC or the voice AI layer. Compliance implications apply.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
