Building a Conversation Memory Layer for Voice Agents
The model has no memory beyond what you put in its context window. For a 5-minute support call this is fine. For longer calls, multi-call interactions, or agents that need to remember preferences across sessions, you need an explicit memory layer.
The model has no memory beyond what you put in its context window. For a 5-minute support call this is fine. For longer calls, multi-call interactions, or agents that need to remember preferences across sessions, you need an explicit memory layer. The shape of that layer is more interesting than it first looks.
TL;DR
- Memory has three scopes: in-turn (the prompt), in-call (the running transcript), cross-call (persistent storage).
- For long calls, sliding window + periodic summarization beats trying to fit everything.
- For cross-call memory, key on the caller's stable identifier (phone number, account ID).
- Don't store everything. Store what's actionable.
The three scopes
In-turn. The contents of the LLM's prompt for a single turn. Includes system prompt, recent conversation history, retrieved RAG, and function results. Ephemeral.
In-call. State that persists across turns within a single call. Most platforms include this by default โ the running transcript flows into each turn's prompt.
Cross-call. State that persists across separate calls from the same caller. Requires a database keyed on caller identity.
The scopes are layered: in-turn pulls from in-call which pulls from cross-call.
In-call memory tactics
For most calls (under 15 turns, under 10 minutes), the default in-call memory is fine. The full transcript fits in the prompt; the model can reason over the whole thing.
For longer calls, three patterns:
Sliding window
Keep the last N turns verbatim. Drop or summarize older ones.
[System prompt]
[Older turns summarized: "Caller introduced themselves as
Sarah, asked about her order #4521. Confirmed shipping
delay. Discussed compensation options."]
[Last 6 turns verbatim]
[Current turn]
Pros: bounded prompt size; preserves recent detail. Cons: loses precise older context; older summary can drift.
Periodic re-summarization
Every N turns, the orchestration layer asks the model to re-summarize the call so far. The summary replaces the full older transcript.
Pros: summaries get refined as the call progresses. Cons: extra LLM cost; summary quality varies.
Structured slot tracking
Maintain explicit fields for important info captured during the call:
{
caller_name: "Sarah Chen",
account_id: "1976432",
intent: "shipping_delay_complaint",
resolution_status: "in_progress",
promised_action: "supervisor will call back tomorrow"
}
Inject into the prompt instead of relying on the model to remember.
Pros: precise; queryable; good for evals. Cons: requires schema design; manual upkeep.
In practice, most production agents combine sliding window + structured slots.
Cross-call memory
When the same caller comes back, ideally the agent picks up where they left off. Implementation:
Identify the caller. Phone number is the simplest key. Account ID if you have it.
Store call summaries. After each call, generate a 2โ3 sentence summary; persist keyed on caller ID.
Surface relevant memory at call start. Pull recent summaries; inject into the system prompt for the new call.
Update on new info. As the new call progresses, update preferences, resolved issues, and any ongoing threads.
For the deeper take, see how to give a voice agent long-term memory.
What to remember
Resist storing everything. Focus on:
- Caller identity (name, contact info)
- Recent intents (what they called about)
- Open commitments (things you promised to do)
- Strong preferences (time, channel, style)
- Resolved issues (so the agent doesn't ask again)
Don't store:
- Full transcripts of every call (too noisy; use summaries)
- Sensitive PII not needed for the task
- Agent's interpretation of caller's mood (creepy and often wrong)
The privacy angle
Cross-call memory is a privacy commitment. Best practices:
Disclose. Tell users you remember. "I see we talked last week about your prescription."
Allow opt-out. "Want me to forget our previous conversations?"
Honor deletion. When a user requests deletion, actually delete from the memory store, not just the transcript.
Limit retention. 90 days for most use cases. Longer for loyalty programs with explicit consent.
Implementation choices
A few real options:
Database-backed structured store. Postgres or similar. Best for structured slots, customer profiles, preferences.
Vector-backed memory. Embed past interactions; retrieve relevant ones at call start. Good for unstructured "things that came up before."
Hybrid. Structured fields in Postgres + vector recall for fuzzier patterns. Most scalable.
For first builds, start with database-backed. Add vector recall when you have enough interaction history to make it useful.
Eval for memory
Memory layers add complexity; they need their own evals.
Test cases:
- Caller calls back about the same issue. Does the agent recognize it?
- Caller calls about a new issue. Does the agent avoid bringing up irrelevant old context?
- Caller asks the agent to forget. Does it actually forget?
- Caller's preferences change. Does the memory update?
Run these on every memory layer change.
When skipping memory is right
A few cases where you should not build cross-call memory:
- One-off transactional calls (order status, password reset).
- High-volume B2C anonymous calls.
- Use cases where compliance makes long retention risky.
- First builds โ get the agent shipping; add memory later if needed.
Related reading
- The Role of Embeddings in Voice Agent Knowledge
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
- How LLMs Decide What to Say Next in a Voice Conversation
FAQ
How long should call summaries be? 2โ3 sentences. Captures the essentials without bloating future prompts.
Can the model write its own summary at call end? Yes โ most platforms do this automatically. Just prompt the model to "write a 2-sentence summary of this call: ..."
What if my use case has no caller identity? Skip cross-call memory. In-call memory is still useful.
Does memory hurt latency? Marginally โ adds 100โ200 tokens to the prompt. Negligible compared to other latency drivers.
What about adversarial prompt injection through memory? A real risk. Sanitize stored summaries; don't let user input flow into future prompts unfiltered.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems โ text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
More from Tyler Weitzman
View all โOpen-Source vs Proprietary Voice Agent Stacks
The open-source voice AI stack in 2026 is genuinely good. Whisper and its derivatives handle STT. Open-weight LLMs like Llama 3/4, Qwen, Mistral handle the reasoning. Open-source TTS (XTTS, StyleTTS, Orpheus-class) handles output.
Build vs Buy: When to Build Your Own Voice Agent
Build-vs-buy for voice agents in 2026 is a different conversation than it was two years ago. Then, the open-source stack was rough and most serious deployments ended up building.
Voice Agents for Developer Support
Developer support is a strange category. Developers don't generally want to call anyone. They want Stack Overflow, they want clear docs, they want an LLM that can read their code.
Related reading
The Role of Embeddings in Voice Agent Knowledge
Embeddings are the numerical representations of text that make retrieval-augmented generation work. Most voice agent builders never have to think about embeddings directly โ their platform handles them.
How to Give a Voice Agent Long-Term Memory
By default, voice agents have no memory beyond the current call. The caller hangs up, the agent forgets everything. For many use cases this is fine. For loyalty-driven businesses where the same caller comes back repeatedly, it's a missed opportunity.
Designing Voice Agents That Ask Better Questions
A voice agent that asks bad questions wastes the caller's time and produces bad data. Good questions feel natural and capture what you need in fewer turns.
Voice AI, twice a month.
Get the best of the SIMBA resources hub โ new articles, trend notes, and operator guides. No spam.
