The Real Cost of a Voice Agent Conversation

Rohan Pavuluri
January 5, 2026 · 6 min read
Speechify

The marketing pages will tell you a voice agent costs "fractions of a cent per minute." The reality is more interesting and more variable. Once you account for telephony, STT, LLM, TTS, and the long tail of operations, a typical 3-minute support call lands somewhere between $0.12 and $0.45. Below that there's a real floor; above that you're probably overpaying.

This piece is the actual cost breakdown, by component, with realistic numbers you can plan against in 2026.

TL;DR

  • Per-minute voice agent cost in 2026: roughly $0.04–$0.15 in raw infrastructure.
  • A 3-minute support call typically costs $0.12–$0.45 all-in.
  • The biggest single line item is usually TTS, followed by LLM, then telephony.
  • Per-minute cost is not the right number to optimize. Per-resolved-issue cost is.

The four cost layers

A voice agent's per-call cost has four components:

1. Telephony

Carrying the audio. Typical numbers:

  • Twilio voice (US/CA): $0.0085/min inbound, $0.013/min outbound for a typical mix
  • SIP trunking: $0.005–$0.015/min depending on volume and provider
  • WebRTC (no PSTN): Effectively free aside from your bandwidth

For a 3-minute inbound call on Twilio, telephony is roughly $0.025.

2. Speech-to-text (STT)

Streaming STT pricing in 2026:

  • Deepgram: $0.0043/min for streaming
  • AssemblyAI: $0.005/min
  • OpenAI Whisper API: $0.006/min
  • Self-hosted: ~$0.001/min compute (but adds engineering overhead)

A 3-minute call: $0.013–$0.018.

3. Large language model (LLM)

This is the most variable. Depends on model choice and conversation complexity:

  • GPT-4o-mini at typical voice loads: $0.01–$0.03/min
  • Claude Haiku: $0.012–$0.025/min
  • Gemini 2.0 Flash: $0.008–$0.02/min
  • GPT-4o (frontier): $0.04–$0.10/min
  • Self-hosted Llama 3.1 8B: $0.005–$0.015/min

A 3-minute call with a mid-sized model: $0.03–$0.09.

4. Text-to-speech (TTS)

Often the biggest single line item:

  • Simba Flash: $0.10–$0.18 per 1,000 characters
  • Simba Multilingual v2: $0.30 per 1,000 characters
  • OpenAI TTS: $0.015 per 1,000 characters
  • Cartesia Sonic: $0.05–$0.10 per 1,000 characters
  • PlayHT: $0.04–$0.08 per 1,000 characters

A 3-minute call with the agent speaking ~50% of the time generates roughly 2,000–3,000 characters of TTS. Depending on provider: $0.03–$0.90.

The huge range reflects that premium TTS for brand voice cloning is significantly more expensive than commodity TTS.
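
To see where the $0.03–$0.90 spread comes from, here is a minimal sketch that multiplies the character range by each provider's per-1,000-character rate. The rates are copied from the list above; the function name and the dictionary layout are illustrative assumptions, not any provider's API.

```python
# Per-1,000-character TTS rates in USD as (low, high), copied from the list above.
TTS_RATES_PER_1K_CHARS = {
    "Simba Flash": (0.10, 0.18),
    "Simba Multilingual v2": (0.30, 0.30),
    "OpenAI TTS": (0.015, 0.015),
    "Cartesia Sonic": (0.05, 0.10),
    "PlayHT": (0.04, 0.08),
}


def tts_cost_range(chars_low: int, chars_high: int, provider: str) -> tuple[float, float]:
    """Return (min, max) TTS cost in USD for a call that synthesizes
    between chars_low and chars_high characters with the given provider."""
    rate_low, rate_high = TTS_RATES_PER_1K_CHARS[provider]
    return (chars_low / 1000 * rate_low, chars_high / 1000 * rate_high)


if __name__ == "__main__":
    # A 3-minute call with roughly 2,000-3,000 synthesized characters.
    for provider in TTS_RATES_PER_1K_CHARS:
        low, high = tts_cost_range(2_000, 3_000, provider)
        print(f"{provider:<24} ${low:.3f} to ${high:.3f}")
```

The cheapest low end (OpenAI TTS at 2,000 characters) and the most expensive high end (Simba Multilingual v2 at 3,000 characters) give the two ends of the range.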

Putting it together

A realistic per-call cost breakdown for a 3-minute inbound support call using mid-tier components:

Component                          Cost
Telephony (Twilio, US)             $0.025
STT (Deepgram streaming)           $0.013
LLM (GPT-4o-mini)                  $0.045
TTS (Simba Flash, 2,500 chars)     $0.30
Total                              ~$0.38

Premium build (frontier LLM, premium TTS): closer to $0.80–$1.20/call. Cost-optimized build (self-hosted everything): closer to $0.05–$0.15/call.
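
If you want to plug in your own rates, the four line items can be sketched as a small cost model. This is an illustration under assumptions: the class and field names are made up for the example, the LLM rate of $0.015/min is the rate implied by $0.045 over 3 minutes, and the TTS rate of $0.12 per 1,000 characters is the rate implied by $0.30 for 2,500 characters.

```python
from dataclasses import dataclass


@dataclass
class CallCostModel:
    telephony_per_min: float   # USD per minute of audio carried
    stt_per_min: float         # USD per minute transcribed
    llm_per_min: float         # USD per minute (an average, since LLMs bill per token)
    tts_per_1k_chars: float    # USD per 1,000 characters synthesized

    def breakdown(self, minutes: float, tts_chars: int) -> dict[str, float]:
        """Return each line item and the total, in USD, for one call."""
        parts = {
            "telephony": self.telephony_per_min * minutes,
            "stt": self.stt_per_min * minutes,
            "llm": self.llm_per_min * minutes,
            "tts": self.tts_per_1k_chars * tts_chars / 1000,
        }
        parts["total"] = sum(parts.values())
        return parts


# Mid-tier build: Twilio inbound, Deepgram streaming, GPT-4o-mini, Simba Flash.
MID_TIER = CallCostModel(
    telephony_per_min=0.0085,
    stt_per_min=0.0043,
    llm_per_min=0.015,      # assumed: implied by $0.045 over 3 minutes
    tts_per_1k_chars=0.12,  # assumed: implied by $0.30 for 2,500 characters
)

if __name__ == "__main__":
    for item, cost in MID_TIER.breakdown(minutes=3, tts_chars=2_500).items():
        print(f"{item:<10} ${cost:.3f}")
```

Swapping in a different TTS rate is the quickest way to see how much of the total that one line item controls.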

What's not in the breakdown

A few hidden costs that matter at scale:

Engineering time. Building, deploying, and maintaining the agent. For a single agent on a managed platform, expect ~0.25 FTE. For a complex multi-agent deployment, more.

Knowledge base hosting. RAG over a large doc set costs storage + embedding refreshes. Usually small but real.

Logging and analytics. Storing transcripts, tool-call logs, audit trails. Maybe $0.001/call. Negligible per call but adds up at million-call scale.

Monitoring. Pinging endpoints, watching error rates, paging on issues. Pennies per day per agent.

Compliance and legal. Disclosure language reviews, recording consent infrastructure, occasional audits. These costs arrive in spikes rather than per call, but they are real.

Per-call vs per-resolution

The most common cost mistake: optimizing per-call cost without watching per-resolution cost.

A scenario:

  • Agent A: $0.20/call, 50% resolution rate → 50% of calls still need a human ($5/call), so $0.20 + 0.50 × $5 = $2.70 per resolved issue
  • Agent B: $0.50/call, 80% resolution rate → 20% of calls still need a human, so $0.50 + 0.20 × $5 = $1.50 per resolved issue

Agent B costs 2.5x as much per call but about 44% less per resolved issue. The cheap-per-call agent is actually more expensive in the system view.
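
One way to make the per-resolution math explicit is to charge every issue the agent's call cost plus the human cost for the share of calls that escalate. The sketch below assumes every escalated call is eventually resolved by a human at a flat $5; the function and parameter names are illustrative.

```python
def blended_cost_per_issue(agent_cost_per_call: float,
                           resolution_rate: float,
                           human_cost_per_call: float) -> float:
    """Expected cost in USD to close one issue, assuming every call the
    agent fails to resolve is escalated to a human who then resolves it."""
    if not 0.0 <= resolution_rate <= 1.0:
        raise ValueError("resolution_rate must be between 0 and 1")
    escalation_rate = 1.0 - resolution_rate
    return agent_cost_per_call + escalation_rate * human_cost_per_call


if __name__ == "__main__":
    agent_a = blended_cost_per_issue(0.20, 0.50, human_cost_per_call=5.0)
    agent_b = blended_cost_per_issue(0.50, 0.80, human_cost_per_call=5.0)
    print(f"Agent A: ${agent_a:.2f} per issue")
    print(f"Agent B: ${agent_b:.2f} per issue")
```

Under this accounting Agent A comes to $2.70 per issue and Agent B to $1.50, so the human escalation term dominates long before the agent's own per-call cost does.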

This is the math you should be doing, not "per minute cost." More on this in how to calculate ROI for AI customer support.

How to bring costs down (in order of impact)

If your per-call cost is high and you want to bring it down:

1. Switch TTS provider. Premium TTS is often the largest line item. At the rates listed above, Cartesia or PlayHT can be roughly 3–7x cheaper than Simba Multilingual v2 for similar perceived quality on most voices.

2. Use a smaller LLM. GPT-4o-mini, Claude Haiku, and Gemini Flash all run voice workloads well at 30–50% of the cost of frontier models. The reasoning gap doesn't show up on most voice tasks.

3. Compress the system prompt. Shorter prompts = fewer input tokens per turn. Many production agents have 4,000-token system prompts that could be 1,500.

4. Cache aggressively. Prompt caching for the static system prompt cuts input cost 50–80%. Most major LLM providers support it; turn it on.

5. Self-host if volume justifies it. Above 100k minutes/month, self-hosted Llama on rented GPUs starts to beat hosted APIs on cost — at the price of operational complexity.
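
Steps 3 and 4 are easy to quantify, because the static system prompt is resent as input on every turn. The sketch below is a simplified model, not any provider's billing logic: it counts only system-prompt tokens (no conversation history or output tokens), assumes the first turn of a call pays full price and later turns get the cached rate, and uses a placeholder price of $0.15 per million input tokens and a placeholder 20 turns per call.

```python
def system_prompt_cost_per_call(system_prompt_tokens: int,
                                turns: int,
                                price_per_million_input_tokens: float,
                                cached_discount: float = 0.0) -> float:
    """Cost in USD of sending the static system prompt on every turn.

    cached_discount is the fraction taken off the input price for cached
    tokens: 0.0 means no caching, 0.5 means cached tokens cost half price.
    The first turn is billed at full price (cold cache, an assumption).
    """
    if turns <= 0:
        return 0.0
    price_per_token = price_per_million_input_tokens / 1_000_000
    full_price_turn = system_prompt_tokens * price_per_token
    cached_turn = full_price_turn * (1.0 - cached_discount)
    return full_price_turn + (turns - 1) * cached_turn


if __name__ == "__main__":
    PRICE = 0.15  # placeholder USD per million input tokens
    TURNS = 20    # placeholder number of LLM turns in one call

    baseline = system_prompt_cost_per_call(4_000, TURNS, PRICE)
    compressed = system_prompt_cost_per_call(1_500, TURNS, PRICE)
    cached = system_prompt_cost_per_call(4_000, TURNS, PRICE, cached_discount=0.5)

    print(f"4,000-token prompt, no cache:  ${baseline:.4f}")
    print(f"1,500-token prompt, no cache:  ${compressed:.4f}")
    print(f"4,000-token prompt, 50% cache: ${cached:.4f}")
```

The absolute numbers depend entirely on the placeholder price and turn count; the useful part is the ratio between the three cases, which you can recompute with your own provider's rates and caching terms.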

What costs aren't going down

A short list of things that aren't getting cheaper:

  • Telephony per-minute. Twilio's pricing has been flat for years.
  • Premium voice cloning. Brand-voice TTS is a premium product.
  • Compliance overhead. Disclosure, recording consent, opt-out lists — all add fixed costs.

Cost vs quality

Be careful chasing the cheapest stack. The components interact:

  • Cheaper TTS often has higher latency, which costs you in user experience.
  • Smaller LLMs can fail on harder turns, which costs you in escalation rate.
  • Self-hosted everything costs you in engineering time and operational risk.

The right cost target is "the cheapest stack that hits your quality bar," not "the cheapest stack."

FAQ

Why is TTS often the biggest line item? Premium voices are expensive. The neural models that produce human-quality speech cost more to run than the LLM in many cases.

Can I skip TTS and just use OpenAI's audio mode? You can — and it's a single bill. The trade-off is less observability and harder customization. For a simple agent, it's fine; for an enterprise build, the per-component approach gives you more control.

What about edge cases like a 30-second call vs a 30-minute call? Costs scale roughly linearly with duration. At the $0.04–$0.15 per minute range above, a 30-second call costs roughly $0.02–$0.08 and a 30-minute call roughly $1.20–$4.50.

How does outbound calling compare to inbound? Outbound is slightly more expensive due to higher per-minute telephony rates and more time spent on voicemail handling, dial attempts, etc.

Is per-minute pricing fair? Mostly yes. The components scale with audio time. The exception is LLM cost, which scales with token count (longer turns = more tokens = more cost), so a chatty agent costs more than a terse one even at the same duration.

Rohan Pavuluri
Building SIMBA Voice Agents

Rohan Pavuluri builds SIMBA Voice Agents at Speechify. Previously, he founded and led Upsolve, the largest nonprofit in the United States serving low-income Americans through technology. He writes about real-world voice-agent deployments — customer support, outbound sales, AI receptionists — and the practical product, design, and operational lessons that actually move the needle.
