LLM marketing has been all about context window expansion — 128K, 200K, 1M, 2M tokens. For voice agents, this race mostly doesn't matter. Voice conversations rarely exceed 5,000 tokens of meaningful context. The constraint isn't the window size; it's how you manage what's inside it.

TL;DR

A typical voice call fits in 2,000–5,000 tokens of conversation history.
Larger context windows don't make voice agents better; tighter prompts do.
Latency and cost both scale with input tokens — using a 1M-token window when you need 5K is just expensive.
The practical context budget for a voice agent: 1,500-token system prompt + sliding window of recent turns + retrieved RAG context.

How much context a voice call actually uses

Concrete numbers for a typical 5-minute support call:

System prompt: 1,000–1,500 tokens (set once)
Conversation transcript: ~150 tokens per turn × 15 turns = 2,250 tokens
Retrieved RAG context (when used): ~500–1,500 tokens
Function call results: ~100–400 tokens

Total: 4,000–5,500 tokens. Well within any modern LLM's window.

Even a 30-minute call with extensive RAG hits maybe 15K tokens. Still small.

Why bigger windows don't help

Three reasons:

1. Recency bias. LLMs attend more strongly to recent context. A constraint mentioned 50K tokens ago carries less weight than one mentioned 500 tokens ago. Padding the prompt with old context can actually hurt.

2. Latency cost. TTFT scales with input length. Doubling the context doubles the latency penalty per turn. For voice, where every 100ms matters, extra context is expensive.

3. Cost. Input tokens cost money. Most pricing is per-token; a 50K-token prompt costs 10x what a 5K-token prompt does, on every turn.

What actually helps in voice

Context-related improvements that move the needle:

Tighter system prompt. Most production prompts have 30–50% fluff. Compress.

Sliding window. Keep last N turns verbatim; summarize older turns into a concise memory line. Bounds the conversation transcript size.

Smarter RAG. Retrieve fewer chunks but better ones. 3 well-chosen chunks beat 10 mediocre ones.

Function-call efficiency. Don't dump big function results into context. Summarize before injecting.

Prompt caching. Cache the static portion (system prompt) so input tokens get cheaper on every turn after the first. Major LLM providers support this; voice agents should always have it on.

When you actually need a big window

A few use cases where 100K+ context becomes useful for voice:

Long discovery calls (60+ minutes). Sales discovery, complex troubleshooting, multi-issue support. Even here, summarization usually beats brute-force context.

Multi-call memory. When the agent remembers prior calls and surfaces them. The total accumulated context can grow.

Massive RAG. When you genuinely need to retrieve and reason over many documents. Rare for voice.

For most voice deployments, a 32K-token context window is plenty.

What to actually optimize

Stop worrying about window size. Worry about:

TTFT (time to first token). This is what the caller feels. Smaller models with prompt caching often beat bigger models with bigger windows.

Function-call accuracy. Did the agent pick the right tool with the right arguments?

Recovery quality. When something goes wrong mid-call, does the agent handle it gracefully?

Latency p99. Median latency is a vanity metric. The slow tail is what kills user experience.

These all benefit from prompt discipline, not from bigger context windows.

A practical context budget

For most voice agents:

Component	Budget
System prompt	1,000–1,500 tokens (cached)
Recent turns (sliding window of last 8)	1,200–2,000 tokens
Older turns summarized	100–300 tokens
RAG retrieved context (when needed)	500–1,500 tokens
Function call results	100–400 tokens
Per-turn input total	~3,000–6,000 tokens

If your agent is using more than this, audit. You probably have bloat.

When to consider a really long-context model

Three signals:

Your average call exceeds 30 minutes.
You're already doing RAG and tight prompts and still missing context.
Your specific use case requires reasoning across a large doc set per turn.

If you don't hit all three, stick with mid-context models. Save the cost.

FAQ

Is a 1M-token window useful for anything in voice? Edge cases. Long sales calls with extensive doc reference. Multi-call memory at scale. Most agents don't need it.

Why does the marketing focus so much on context size? It's an easy benchmark to compare. Real-world value is more nuanced.

Does prompt caching work with all providers? Most major ones in 2026: OpenAI, Anthropic, Google. Some self-hosted models support it via specific runtimes.

What about reasoning models that use lots of context internally? Reasoning chains add latency more than they add useful context. For voice, prefer non-reasoning models.

Should I worry about hitting context limits? For typical voice agents, no. For 60-minute calls, build in summarization.

Why Context Windows Matter Less Than You Think for Voice

TL;DR

How much context a voice call actually uses

Why bigger windows don't help

What actually helps in voice

When you actually need a big window

What to actually optimize

A practical context budget

When to consider a really long-context model

FAQ

More from Tyler Weitzman

Open-Source vs Proprietary Voice Agent Stacks

Build vs Buy: When to Build Your Own Voice Agent

Voice Agents for Developer Support

Related reading

Designing Voice Agents That Ask Better Questions

Open-Source vs Closed-Source LLMs for Voice Agents

How LLMs Decide What to Say Next in a Voice Conversation

Voice AI, twice a month.