Retrieval-Augmented Generation for Voice Agents
RAG — retrieval-augmented generation — is the standard pattern for grounding an LLM in a specific knowledge base. For voice agents, RAG works the same as for chatbots, with one crucial difference: every millisecond of retrieval latency shows up in the conversation. This is how to do it without making the agent feel slow.
TL;DR
- RAG retrieves relevant docs from your knowledge base and stuffs them into the LLM prompt.
- For voice, retrieval has to happen inside your latency budget — usually under 200ms.
- Don't retrieve on every turn. Gate retrieval on the LLM's own decision.
- Retrieval quality matters more than the number of docs retrieved.
What RAG does
The flow:
- Your knowledge base (docs, FAQs, policies, product info) gets chunked and embedded.
- At call time, the user's question (or the conversation so far) gets embedded.
- A vector similarity search returns the top N most relevant chunks.
- Those chunks get prepended to the LLM prompt.
- The LLM generates a reply grounded in the retrieved content.
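The first four steps of that flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: `toyEmbed` is a hypothetical stand-in for a real embedding model (it only counts words), the in-memory array stands in for a vector store, and the final LLM call is left out.

```javascript
// Hypothetical stand-in for an embedding model: a sparse bag-of-words vector.
// A real system would call an embedding API here.
function toyEmbed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((s, c) => s + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Step 1: embed the (already chunked) knowledge base ahead of time.
function buildIndex(chunks) {
  return chunks.map((content) => ({ content, embedding: toyEmbed(content) }));
}

// Steps 2-3: embed the question, return the top N chunks by similarity.
function retrieve(index, question, topN = 3) {
  const q = toyEmbed(question);
  return index
    .map((entry) => ({
      content: entry.content,
      score: cosineSimilarity(q, entry.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

// Step 4: prepend the retrieved chunks to the prompt.
function buildPrompt(retrieved, question) {
  const context = retrieved.map((r) => r.content).join("\n\n");
  return `Context:\n${context}\n\nCaller question: ${question}`;
}

const index = buildIndex([
  "Returns are accepted within 30 days of purchase with a receipt.",
  "The X300 is available in black, silver, and green.",
  "Support hours are 9am to 5pm on weekdays.",
]);
const question = "Does the X300 come in green?";
const prompt = buildPrompt(retrieve(index, question, 1), question);
```

Swapping `toyEmbed` for a real embedding call and the array for a vector store changes the quality and speed of each step, but not the shape of the pipeline.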
For voice agents, this lets the agent answer "what's your return policy?" or "does the X300 come in green?" without that info being baked into the system prompt.
The latency problem
A naive RAG implementation does retrieval on every turn:
- Embed the user's question: ~50ms
- Vector search: ~50–300ms
- Send to LLM with retrieved context: ~50ms
Total added latency: 150–400ms per turn, which is often the difference between a snappy agent and a sluggish one.
Three mitigations:
1. Gate retrieval on need. Don't retrieve unless the turn requires it. The LLM can decide via a function call: search_knowledge_base(query). Greetings, confirmations, and simple follow-ups skip retrieval entirely.
2. Use a fast vector store. Pinecone serverless, Qdrant, Weaviate, or pgvector with proper indexing all return in under 100ms for moderate-size indexes. If your retrieval is taking 300ms+, you have an indexing or scale problem.
3. Pre-filter by metadata. If your knowledge base has clear categories (product, policy, FAQ), filter by category before similarity search. Smaller search space, faster results.
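The third mitigation, metadata pre-filtering, might look like the sketch below. The `category` field, the `searchWithPrefilter` name, and the in-memory index are assumptions for illustration. Hosted vector stores typically offer their own filter option for the same purpose, each with its own syntax.

```javascript
// Dot product as the similarity score. Assumes embeddings are already
// normalized to unit length, so this equals cosine similarity.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Hypothetical in-memory index: each entry carries a category tag and an embedding.
function searchWithPrefilter(index, queryEmbedding, { category, limit = 4 } = {}) {
  // Narrow the candidate set first, then score only what is left.
  const candidates = category
    ? index.filter((entry) => entry.category === category)
    : index;
  return candidates
    .map((entry) => ({ ...entry, score: dot(queryEmbedding, entry.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const index = [
  { id: "returns", category: "policy", embedding: [1, 0] },
  { id: "warranty", category: "policy", embedding: [0.6, 0.8] },
  { id: "x300-colors", category: "product", embedding: [0, 1] },
];

// Only "policy" entries are scored; the product entry is never compared.
const results = searchWithPrefilter(index, [0, 1], { category: "policy", limit: 2 });
```

The trade-off is that a wrong category guess hides the right chunk entirely, so the filter should only be applied when the category is clear from the turn.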
What to retrieve
The retrieval target should be small, focused chunks — not full documents. Typical chunk size: 200–500 tokens. Larger chunks dilute the relevance score; smaller chunks lose context.
Number of chunks to retrieve: 3–5 for most voice use cases. More than that wastes prompt tokens without improving accuracy.
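A chunker for that size range could be sketched as follows. It approximates tokens with whitespace-separated words, which is an assumption (real tokenizers count differently), and the `overlap` parameter is an illustrative addition for carrying context across chunk boundaries, not something the guidance above prescribes.

```javascript
// Splits text into chunks of at most maxWords words, repeating `overlap`
// words between consecutive chunks. Words are a rough proxy for tokens.
function chunkText(text, maxWords = 300, overlap = 30) {
  if (overlap >= maxWords) {
    throw new Error("overlap must be smaller than maxWords");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    if (start + maxWords >= words.length) break; // this chunk reached the end
  }
  return chunks;
}

// Tiny demonstration with 10 words, 4 words per chunk, 1 word of overlap.
const sample = Array.from({ length: 10 }, (_, i) => `w${i}`).join(" ");
const chunks = chunkText(sample, 4, 1);
```

In practice, splitting on paragraph or heading boundaries first and only then enforcing a size cap tends to keep each chunk on a single topic.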
Embedding model choice
Embedding quality matters. Three reasonable defaults in 2026:
- OpenAI text-embedding-3-small — cheap, fast, good for English.
- OpenAI text-embedding-3-large — more expensive, better quality, useful for nuanced retrieval.
- Cohere embed-multilingual-v3 — best for multilingual.
Test on your specific corpus. The "best" embedding model depends on what your docs look like.
What goes wrong
Common RAG failures in voice agents:
Retrieving irrelevant chunks. The model uses them anyway and gives an off-target answer.
Retrieving outdated info. Your knowledge base hasn't been updated; the agent says yesterday's truth confidently.
Not retrieving when needed. The model didn't think to call search_knowledge_base. This is common when the trigger isn't an obvious question.
Retrieving on every turn. Added latency on greetings and chitchat. Fix: gating.
Long context dilution. Retrieved 10 chunks; model picks the wrong one. Fix: retrieve fewer chunks, and rerank only if your latency budget allows it.
Knowledge base hygiene
The most underrated lever in RAG quality. A small clean knowledge base usually beats a large messy one.
Practices:
Single source of truth. Don't have the same policy in 3 different docs with subtle variations. Pick one canonical version.
Date your docs. Include a line such as "Updated 2026-04-01" in each doc so the model can tell what's current.
Strip the noise. Remove repeated headers, navigation, "Related articles" lists. Embeddings work better on clean prose.
Test your retrieval. Ask 50 representative questions; verify the right chunks come back. Tune as needed.
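The "strip the noise" step can be partly automated. One possible heuristic, sketched below, drops any line that appears in a large share of documents, on the assumption that such lines are headers or navigation, not content. The function name and the `minShare` threshold are hypothetical, and the output should be spot-checked before re-indexing.

```javascript
// Removes lines that recur across many documents (likely headers, footers,
// or navigation). A line is dropped when it appears in at least
// max(2, ceil(docs.length * minShare)) documents.
function stripRepeatedLines(docs, minShare = 0.5) {
  const docCounts = new Map();
  for (const doc of docs) {
    // Count each distinct line once per document.
    const unique = new Set(doc.split("\n").map((l) => l.trim()).filter(Boolean));
    for (const line of unique) {
      docCounts.set(line, (docCounts.get(line) ?? 0) + 1);
    }
  }
  const threshold = Math.max(2, Math.ceil(docs.length * minShare));
  return docs.map((doc) =>
    doc
      .split("\n")
      .filter((line) => (docCounts.get(line.trim()) ?? 0) < threshold)
      .join("\n")
      .trim()
  );
}

const cleaned = stripRepeatedLines([
  "Acme Help Center\nReturns are accepted within 30 days.\nRelated articles",
  "Acme Help Center\nThe X300 comes in green.\nRelated articles",
  "Acme Help Center\nSupport hours are 9am to 5pm.",
]);
```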
Implementation patterns
A simple RAG implementation for voice:
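// Tool definition the LLM can call
const searchKnowledgeBaseTool = {
  name: "search_knowledge_base",
  description:
    "Search the company knowledge base for relevant policy or product " +
    "information. Use when the caller asks about product features, " +
    "return policy, hours, or other documented topics.",
  // Many tool-calling APIs expect a JSON Schema object for the arguments
  parameters: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search query" }
    },
    required: ["query"]
  }
};

// Implementation. embed() and vectorStore are placeholders for your
// embedding client and vector store client.
async function search_knowledge_base({ query }) {
  const embedding = await embed(query);
  const results = await vectorStore.search(embedding, { limit: 4 });
  // Join the top chunks into one string for the LLM to read
  return results.map(r => r.content).join("\n\n");
}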
The LLM decides when to call this. Most production agents see retrieval fire on roughly 30–50% of turns.
When RAG isn't the right tool
Three cases:
Highly structured queries. "What's my order status?" doesn't need RAG; it needs a function call to your order system.
Static facts in the system prompt. Your business name, hours, address — bake these into the prompt directly.
Real-time data. RAG is for relatively stable knowledge. For real-time data (current inventory, today's appointments), use direct API calls.
Evaluation
Retrieval quality eval:
- Pick 50 representative caller questions.
- For each, manually identify which chunks should be retrieved.
- Run your retrieval; compare top-K against the ground truth.
- Track precision@k and recall@k.
Run this whenever you change the embedding model, chunk size, or knowledge base.
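The two metrics are simple to compute once the ground truth exists. The sketch below assumes each test case is an object with `retrieved` (chunk IDs in ranked order) and `relevant` (the IDs a human marked as correct). Those field names are illustrative. Precision@k divides by k, following the usual convention, even when fewer than k chunks come back.

```javascript
// Fraction of the top k retrieved chunks that are relevant.
function precisionAtK(retrieved, relevant, k) {
  if (k <= 0) return 0;
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter((id) => relevantSet.has(id)).length;
  return hits / k;
}

// Fraction of the relevant chunks that show up in the top k.
function recallAtK(retrieved, relevant, k) {
  if (relevant.length === 0) return 0;
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter((id) => relevantSet.has(id)).length;
  return hits / relevantSet.size;
}

// Averages both metrics over a set of test cases.
function evaluate(cases, k) {
  const mean = (xs) => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0);
  return {
    precision: mean(cases.map((c) => precisionAtK(c.retrieved, c.relevant, k))),
    recall: mean(cases.map((c) => recallAtK(c.retrieved, c.relevant, k))),
  };
}

// Two hand-labeled cases, evaluated at k = 3.
const report = evaluate(
  [
    { retrieved: ["a", "b", "c", "d"], relevant: ["a", "c"] },
    { retrieved: ["x", "y", "z"], relevant: ["q", "x"] },
  ],
  3
);
```

Here the first case scores precision 2/3 and recall 1, the second scores precision 1/3 and recall 1/2, so the report averages to 0.5 precision and 0.75 recall. Tracking both matters: a change that raises recall by returning more chunks can lower precision at the same time.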
Related reading
- Building a Conversation Memory Layer for Voice Agents
- The Role of Embeddings in Voice Agent Knowledge
- How to Stop a Voice Agent from Hallucinating
- How to Give a Voice Agent Long-Term Memory
- How Large Language Models Power Voice Agents
FAQ
Do I need a vector database? For more than ~10,000 chunks, yes. Below that, you can do retrieval over a flat embedded array.
How fresh should my knowledge base be? Re-index after any significant doc update. For policy changes, immediately. For minor wording, weekly is fine.
Can RAG hallucinate? The retrieval doesn't hallucinate; the LLM still can. Mitigation: prompt the LLM to "only use the provided context to answer; if the context doesn't contain the answer, say so."
What's the cost of RAG? Embedding cost (one-time per doc) + retrieval cost (~$0.0001/query) + extra LLM input tokens (~50–500 per query). Usually negligible.
Should I rerank retrieved chunks? For voice, usually not — the latency hit isn't worth it. For chat agents with looser latency budgets, reranking can improve quality 10–20%.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
