The Real Cost of a Voice Agent Conversation
The marketing pages will tell you a voice agent costs "fractions of a cent per minute." The reality is more interesting and more variable. Once you account for telephony, STT, LLM, TTS, and the long tail of operations, a typical 3-minute support call lands somewhere between $0.12 and $0.45. Below that there's a real floor; above that you're probably overpaying.
This piece is the actual cost breakdown, by component, with realistic numbers you can plan against in 2026.
TL;DR
- Per-minute voice agent cost in 2026: roughly $0.04–$0.15 in raw infrastructure.
- A 3-minute support call typically costs $0.12–$0.45 all-in.
- The biggest single line item is usually TTS, followed by LLM, then telephony.
- Per-minute cost is not the right number to optimize. Per-resolved-issue cost is.
The four cost layers
A voice agent's per-call cost has four components:
1. Telephony
Carrying the audio. Typical numbers:
- Twilio voice (US/CA): $0.0085/min inbound, $0.013/min outbound for a typical mix
- SIP trunking: $0.005–$0.015/min depending on volume and provider
- WebRTC (no PSTN): Effectively free aside from your bandwidth
For a 3-minute inbound call on Twilio, telephony is roughly $0.025.
2. Speech-to-text (STT)
Streaming STT pricing in 2026:
- Deepgram: $0.0043/min for streaming
- AssemblyAI: $0.005/min
- OpenAI Whisper API: $0.006/min
- Self-hosted: ~$0.001/min compute (but adds engineering overhead)
A 3-minute call: $0.013–$0.018.
3. Large language model (LLM)
This is the most variable layer. Cost depends on model choice and conversation complexity:
- GPT-4o-mini at typical voice loads: $0.01–$0.03/min
- Claude Haiku: $0.012–$0.025/min
- Gemini 2.0 Flash: $0.008–$0.02/min
- GPT-4o (frontier): $0.04–$0.10/min
- Self-hosted Llama 3.1 8B: $0.005–$0.015/min
A 3-minute call with a mid-sized model: $0.03–$0.09.
4. Text-to-speech (TTS)
Often the biggest single line item:
- Simba Flash: $0.10–$0.18 per 1,000 characters
- Simba Multilingual v2: $0.30 per 1,000 characters
- OpenAI TTS: $0.015 per 1,000 characters
- Cartesia Sonic: $0.05–$0.10 per 1,000 characters
- PlayHT: $0.04–$0.08 per 1,000 characters
A 3-minute call with the agent speaking ~50% of the time generates roughly 2,000–3,000 characters of TTS. Depending on provider: $0.03–$0.90.
The huge range reflects that premium TTS for brand voice cloning is significantly more expensive than commodity TTS.
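The per-call TTS range above is simple arithmetic on characters spoken and the per-1,000-character price. A minimal sketch, assuming the prices listed in this section and the 2,000–3,000 character estimate for a 3-minute call (the function and dictionary names are illustrative, not any provider's API):

```python
def tts_cost(characters: int, price_per_1k_chars: float) -> float:
    """Dollar cost of synthesizing `characters` at a per-1,000-character price."""
    return characters / 1000 * price_per_1k_chars


# (low, high) price per 1,000 characters, copied from the list above.
PRICES = {
    "OpenAI TTS": (0.015, 0.015),
    "PlayHT": (0.04, 0.08),
    "Cartesia Sonic": (0.05, 0.10),
    "Simba Flash": (0.10, 0.18),
    "Simba Multilingual v2": (0.30, 0.30),
}


def tts_cost_range(provider: str, min_chars: int = 2000, max_chars: int = 3000):
    """Cheapest and most expensive TTS bill for one call with this provider."""
    low_price, high_price = PRICES[provider]
    return tts_cost(min_chars, low_price), tts_cost(max_chars, high_price)


for provider in PRICES:
    low, high = tts_cost_range(provider)
    print(f"{provider}: ${low:.3f} to ${high:.3f} per call")
```

The low end of the overall range comes from the cheapest provider at 2,000 characters, and the high end from the most expensive one at 3,000 characters.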
Putting it together
A realistic per-call cost breakdown for a 3-minute inbound support call using mid-tier components:
| Component | Cost |
|---|---|
| Telephony (Twilio, US) | $0.025 |
| STT (Deepgram streaming) | $0.013 |
| LLM (GPT-4o-mini) | $0.045 |
| TTS (Simba Flash, 2,500 chars) | $0.30 |
| Total | ~$0.38 |
Premium build (frontier LLM, premium TTS): closer to $0.80–$1.20/call. Cost-optimized build (self-hosted everything): closer to $0.05–$0.15/call.
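To plan against your own stack, the table can be reproduced as a small calculator. This is a sketch under stated assumptions: the LLM rate of $0.015/min and the TTS rate of $0.12 per 1,000 characters are back-calculated from the table's $0.045 and $0.30 entries, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StackRates:
    telephony_per_min: float
    stt_per_min: float
    llm_per_min: float
    tts_per_1k_chars: float


def call_cost(rates: StackRates, minutes: float, tts_chars: int) -> dict:
    """Per-component and total cost of one call, in dollars."""
    breakdown = {
        "telephony": rates.telephony_per_min * minutes,
        "stt": rates.stt_per_min * minutes,
        "llm": rates.llm_per_min * minutes,
        "tts": rates.tts_per_1k_chars * tts_chars / 1000,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Mid-tier stack from the table. The LLM and TTS rates are inferred
# from the table's per-call figures, not quoted provider prices.
MID_TIER = StackRates(
    telephony_per_min=0.0085,  # Twilio inbound, US
    stt_per_min=0.0043,        # Deepgram streaming
    llm_per_min=0.015,         # assumed: $0.045 / 3 min
    tts_per_1k_chars=0.12,     # assumed: $0.30 / 2.5k chars
)

for component, cost in call_cost(MID_TIER, minutes=3, tts_chars=2500).items():
    print(f"{component:>10}: ${cost:.3f}")
```

Swapping in your own rates and call length shows quickly which layer dominates your bill.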
What's not in the breakdown
A few hidden costs that matter at scale:
Engineering time. Building, deploying, and maintaining the agent. For a single agent on a managed platform, expect ~0.25 FTE. For a complex multi-agent deployment, more.
Knowledge base hosting. RAG over a large doc set costs storage + embedding refreshes. Usually small but real.
Logging and analytics. Storing transcripts, tool-call logs, audit trails. Maybe $0.001/call. Negligible per call but adds up at million-call scale.
Monitoring. Pinging endpoints, watching error rates, paging on issues. Pennies per day per agent.
Compliance and legal. Disclosure language reviews, recording consent infrastructure, occasional audits. These costs arrive in spikes, but they are real.
Per-call vs per-resolution
The most common cost mistake: optimizing per-call cost without watching per-resolution cost.
A scenario:
- Agent A: $0.20/call, 50% resolution rate → $0.40 in agent cost per agent-resolved issue. The other 50% of calls still need a human ($5/call), so the all-in cost is $0.20 + 0.50 × $5 = $2.70 per issue.
- Agent B: $0.50/call, 80% resolution rate → $0.625 in agent cost per agent-resolved issue. Only 20% of calls need a human, so the all-in cost is $0.50 + 0.20 × $5 = $1.50 per issue.
Agent B costs 2.5x more per call but about 44% less per resolved issue. The cheap-per-call agent is actually more expensive in the system view.
This is the math you should be doing, not "per minute cost." More on this in how to calculate ROI for AI customer support.
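The Agent A versus Agent B comparison can be sketched in a few lines. The accounting here is an assumption: every call pays the agent cost, every unresolved call is handed to a human at a flat $5, and the human always resolves it. The function name is illustrative.

```python
def cost_per_issue(agent_cost_per_call: float,
                   resolution_rate: float,
                   human_cost_per_call: float = 5.00) -> float:
    """Expected all-in cost to get one issue resolved.

    Every call pays for the agent; the unresolved share of calls
    also pays for a human handoff (assumed to always resolve it).
    """
    return agent_cost_per_call + (1 - resolution_rate) * human_cost_per_call


agent_a = cost_per_issue(0.20, 0.50)
agent_b = cost_per_issue(0.50, 0.80)
savings = 1 - agent_b / agent_a

print(f"Agent A: ${agent_a:.2f} per issue")
print(f"Agent B: ${agent_b:.2f} per issue")
print(f"Agent B is {savings:.0%} cheaper per issue")
```

Because the human handoff cost dwarfs the agent cost, resolution rate moves the result far more than the per-call price does.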
How to bring costs down (in order of impact)
If your per-call cost is high and you want to bring it down:
1. Switch TTS provider. Premium TTS is often the largest line item. At the list prices above, Cartesia or PlayHT run roughly 3–7x cheaper than Simba Multilingual v2 for similar perceived quality on most voices.
2. Use a smaller LLM. GPT-4o-mini, Claude Haiku, and Gemini Flash all run voice workloads well at 30–50% of the cost of frontier models. The reasoning gap doesn't show up on most voice tasks.
3. Compress the system prompt. Shorter prompts = fewer input tokens per turn. Many production agents have 4,000-token system prompts that could be 1,500.
4. Cache aggressively. Prompt caching for the static system prompt cuts input cost 50–80%. Most major LLM providers support it; turn it on.
5. Self-host if volume justifies it. Above 100k minutes/month, self-hosted Llama on rented GPUs starts to beat hosted APIs on cost — at the price of operational complexity.
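Steps 3 and 4 can be estimated before you change anything. The sketch below is a simplified model, not any provider's billing logic: it assumes the system prompt is re-sent on every turn, that the first turn pays full price, and that later turns get a flat cache discount. The turn count (12) and the input price ($0.15 per million tokens) are placeholder assumptions; substitute your own.

```python
def system_prompt_cost(prompt_tokens: int,
                       turns: int,
                       price_per_million_input: float,
                       cache_discount: float = 0.0) -> float:
    """Dollar cost of sending the system prompt on every turn of one call.

    cache_discount is the fraction saved on turns after the first
    (0.5 to 0.8 per the range quoted above); 0.0 means no caching.
    """
    if turns <= 0:
        return 0.0
    per_turn = prompt_tokens * price_per_million_input / 1_000_000
    cached_turns = turns - 1  # first turn is assumed to pay full price
    return per_turn + cached_turns * per_turn * (1 - cache_discount)


TURNS = 12    # hypothetical turns in a 3-minute call
PRICE = 0.15  # hypothetical $ per million input tokens

baseline = system_prompt_cost(4000, TURNS, PRICE)
compressed = system_prompt_cost(1500, TURNS, PRICE)
compressed_cached = system_prompt_cost(1500, TURNS, PRICE, cache_discount=0.5)

print(f"4,000-token prompt:          ${baseline:.5f} per call")
print(f"1,500-token prompt:          ${compressed:.5f} per call")
print(f"1,500-token prompt + cache:  ${compressed_cached:.5f} per call")
```

The absolute numbers are small per call; the point is the ratio, which carries over to whatever model and volume you actually run.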
What costs aren't going down
A short list of things that aren't getting cheaper:
- Telephony per-minute. Twilio's pricing has been flat for years.
- Premium voice cloning. Brand-voice TTS is a premium product.
- Compliance overhead. Disclosure, recording consent, opt-out lists — all add fixed costs.
Cost vs quality
Be careful chasing the cheapest stack. The components interact:
- Cheaper TTS often has higher latency, which costs you in user experience.
- Smaller LLMs can fail on harder turns, which costs you in escalation rate.
- Self-hosted everything costs you in engineering time and operational risk.
The right cost target is "the cheapest stack that hits your quality bar," not "the cheapest stack."
Related reading
- What Is a Voice Agent? A 2026 Primer
- First-Time Builder's Guide to Voice Agents
- Why Voice AI Will Transform Phone Channels by 2030
- Voice Agent Use Cases: A Field Guide
- Synchronous vs Asynchronous Voice Agents
FAQ
Why is TTS often the biggest line item? Premium voices are expensive. The neural models that produce human-quality speech cost more to run than the LLM in many cases.
Can I skip TTS and just use OpenAI's audio mode? You can — and it's a single bill. The trade-off is less observability and harder customization. For a simple agent, it's fine; for an enterprise build, the per-component approach gives you more control.
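What about edge cases like a 30-second call vs a 30-minute call? Costs scale roughly linearly with duration. At the mid-tier rate in the breakdown above (about $0.13/min), expect a 30-second call to cost a few cents and a 30-minute call to cost ~$3–$5 all-in.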
How does outbound calling compare to inbound? Outbound is slightly more expensive due to higher per-minute telephony rates and more time spent on voicemail handling, dial attempts, etc.
Is per-minute pricing fair? Mostly yes. The components scale with audio time. The exception is LLM cost, which scales with token count (longer turns = more tokens = more cost), so a chatty agent costs more than a terse one even at the same duration.

Rohan Pavuluri builds SIMBA Voice Agents at Speechify. Previously, he founded and led Upsolve, the largest nonprofit in the United States serving low-income Americans through technology. He writes about real-world voice-agent deployments — customer support, outbound sales, AI receptionists — and the practical product, design, and operational lessons that actually move the needle.
