Retrieval-Augmented Generation for Voice Agents

Tyler Weitzman
January 16, 2026 · 5 min read
Speechify

RAG — retrieval-augmented generation — is the standard pattern for grounding an LLM in a specific knowledge base. For voice agents, RAG works the same as for chatbots, with one crucial difference: every millisecond of retrieval latency shows up in the conversation. This is how to do it without making the agent feel slow.

TL;DR

  • RAG retrieves relevant docs from your knowledge base and stuffs them into the LLM prompt.
  • For voice, retrieval has to happen inside your latency budget — usually under 200ms.
  • Don't retrieve on every turn. Gate retrieval on the LLM's own decision.
  • Retrieval quality matters more than the number of chunks retrieved.

What RAG does

The flow:

  1. Your knowledge base (docs, FAQs, policies, product info) gets chunked and embedded.
  2. At call time, the user's question (or the conversation so far) gets embedded.
  3. A vector similarity search returns the top N most relevant chunks.
  4. Those chunks get prepended to the LLM prompt.
  5. The LLM generates a reply grounded in the retrieved content.
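
The query-time half of this flow (steps 2 to 5) can be sketched as a single function. This is a minimal sketch, not a definitive implementation: `answerWithRag` and its `embed`, `search`, and `generate` parameters are hypothetical names, standing in for whatever embedding API, vector store, and LLM client you use.

```javascript
// Sketch of steps 2-5. The three injected functions are assumptions:
//   embed(text)            -> Promise<number[]>
//   search(vector, topN)   -> Promise<Array<{ content: string }>>
//   generate(prompt)       -> Promise<string>
async function answerWithRag({ question, embed, search, generate, topN = 4 }) {
  const queryVector = await embed(question);       // step 2: embed the question
  const chunks = await search(queryVector, topN);  // step 3: top-N similar chunks
  const context = chunks.map((c) => c.content).join("\n\n");
  const prompt = `${context}\n\nCaller question: ${question}`; // step 4: prepend
  return generate(prompt);                         // step 5: grounded reply
}
```

Passing the dependencies in as arguments keeps the sketch independent of any particular vendor and makes each stage easy to time on its own.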

For voice agents, this lets the agent answer "what's your return policy?" or "does the X300 come in green?" without that info being baked into the system prompt.

The latency problem

A naive RAG implementation does retrieval on every turn:

  • Embed the user's question: ~50ms
  • Vector search: ~50–300ms
  • Extra LLM processing for the retrieved context: ~50ms

Total added latency: 150–400ms per turn. That is often the difference between a snappy agent and a sluggish one.
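
The arithmetic behind that range, as a small sketch. The stage figures are the estimates listed above; the 200ms budget is the figure from the TL;DR, and the variable names are illustrative only.

```javascript
// Per-stage latency estimates in milliseconds, taken from the list above.
const stages = [
  { name: "embed question", min: 50, max: 50 },
  { name: "vector search", min: 50, max: 300 },
  { name: "extra LLM processing", min: 50, max: 50 },
];

// Sum the best-case and worst-case figures across all stages.
function totalLatency(stageList) {
  return stageList.reduce(
    (sum, s) => ({ min: sum.min + s.min, max: sum.max + s.max }),
    { min: 0, max: 0 }
  );
}

const total = totalLatency(stages);
console.log(`${total.min}-${total.max}ms added per retrieving turn`);
// prints "150-400ms added per retrieving turn"

// Against a 200ms budget, only the fast end of the range fits.
const BUDGET_MS = 200;
console.log(total.min <= BUDGET_MS, total.max <= BUDGET_MS);
// prints "true false"
```

Vector search is the only stage with a wide range, which is why the mitigations focus on skipping it or making it faster.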

Three mitigations:

1. Gate retrieval on need. Don't retrieve unless the turn requires it. The LLM can decide via a function call: search_knowledge_base(query). Greetings, confirmations, and simple follow-ups skip retrieval entirely.

2. Use a fast vector store. Pinecone serverless, Qdrant, Weaviate, or pgvector with an approximate-nearest-neighbor index (HNSW or IVFFlat) typically return in under 100ms for moderate-size indexes. If your retrieval is taking 300ms+, you have an indexing or scale problem.

3. Pre-filter by metadata. If your knowledge base has clear categories (product, policy, FAQ), filter by category before similarity search. Smaller search space, faster results.
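
As an illustration of the third mitigation, here is a minimal in-memory sketch of filtering by category before scoring. The chunk shape, the toy 2-D vectors, and `filteredSearch` are assumptions made for the example. Hosted vector stores generally expose the same idea as a filter option on the query, with syntax that varies by product.

```javascript
// Each chunk carries a `category` tag next to its (toy, 2-D) embedding.
const chunks = [
  { id: "p1", category: "policy", vector: [1, 0], content: "Returns accepted within 30 days." },
  { id: "f1", category: "faq", vector: [0.9, 0.1], content: "How do I reset my password?" },
  { id: "p2", category: "policy", vector: [0, 1], content: "Warranty lasts one year." },
];

// Dot product; equals cosine similarity when vectors are unit length.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

function filteredSearch(allChunks, queryVector, category, limit) {
  return allChunks
    .filter((c) => c.category === category) // shrink the search space first
    .map((c) => ({ ...c, score: dot(c.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const hits = filteredSearch(chunks, [1, 0], "policy", 1);
console.log(hits.map((h) => h.id)); // prints [ 'p1' ]
```

The FAQ chunk would have scored well against this query, but it is never scored at all, which is where the time saving comes from.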

What to retrieve

The retrieval target should be small, focused chunks — not full documents. Typical chunk size: 200–500 tokens. Larger chunks dilute the relevance score; smaller chunks lose context.
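
A minimal chunking sketch follows. It makes two assumptions that are not from the text above: whitespace-separated words stand in for tokens (a real tokenizer will count differently), and consecutive chunks overlap by a few words so that a sentence cut at a boundary still appears whole in one chunk. `chunkText` is a hypothetical helper.

```javascript
// Split text into windows of `chunkSize` words, each sharing `overlap`
// words with the previous window. Words are a rough proxy for tokens.
function chunkText(text, chunkSize = 300, overlap = 50) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

Splitting on headings or paragraphs first, then applying a size cap like this one, usually gives more coherent chunks than a fixed window alone.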

Number of chunks to retrieve: 3–5 for most voice use cases. More than that wastes prompt tokens without improving accuracy.

Embedding model choice

Embedding quality matters. Three reasonable defaults in 2026:

  • OpenAI text-embedding-3-small — cheap, fast, good for English.
  • OpenAI text-embedding-3-large — more expensive, better quality, useful for nuanced retrieval.
  • Cohere embed-multilingual-v3 — best for multilingual.

Test on your specific corpus. The "best" embedding model depends on what your docs look like.

What goes wrong

Common RAG failures in voice agents:

Retrieving irrelevant chunks. The model uses them anyway and gives a wrong or off-topic answer.

Retrieving outdated info. Your knowledge base hasn't been updated; the agent confidently states yesterday's truth.

Not retrieving when needed. The model didn't think to call search_knowledge_base. Common when the trigger isn't an obvious question.

Retrieving on every turn. Added latency on greetings and chitchat. Fix: gating.

Long context dilution. Retrieved 10 chunks; model picks the wrong one. Fix: retrieve fewer chunks (3–5), and rerank only if your latency budget allows it.

Knowledge base hygiene

The most underrated lever in RAG quality. A small clean knowledge base usually beats a large messy one.

Practices:

Single source of truth. Don't have the same policy in 3 different docs with subtle variations. Pick one canonical version.

Date your docs. Put a line such as "Updated 2026-04-01" in each doc so the model can tell what's current.

Strip the noise. Remove repeated headers, navigation, "Related articles" lists. Embeddings work better on clean prose.

Test your retrieval. Ask 50 representative questions; verify the right chunks come back. Tune as needed.

Implementation patterns

A simple RAG implementation for voice:

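// Tool definition the LLM can call. Parameters are written as JSON Schema;
// the exact wrapper around this object varies by LLM provider.
const searchKnowledgeBaseTool = {
  name: "search_knowledge_base",
  description:
    "Search the company knowledge base for relevant policy or product " +
    "information. Use when the caller asks about product features, " +
    "return policy, hours, or other documented topics.",
  parameters: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search query" },
    },
    required: ["query"],
  },
};

// Implementation. `embed` and `vectorStore` are placeholders for your
// embedding client and vector store; they are not defined here.
async function search_knowledge_base({ query }) {
  const embedding = await embed(query);
  const results = await vectorStore.search(embedding, { limit: 4 });
  // Join the chunk texts into one block the LLM can read as context
  return results.map((r) => r.content).join("\n\n");
}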

The LLM decides when to call this. Most production agents see retrieval fire on roughly 30–50% of turns.
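
On the agent side, gating comes down to checking whether the model asked for the tool before doing any retrieval work. The sketch below assumes a simplified turn shape, either `{ text }` or `{ toolCall: { name, args } }`. Real SDKs expose tool calls in their own provider-specific structures, and `handleTurn` and the `tools` map are hypothetical.

```javascript
// Assumed turn shape: { text } to speak right away, or
// { toolCall: { name, args } } when the model requests retrieval.
async function handleTurn(llmTurn, tools) {
  if (!llmTurn.toolCall) {
    // Greeting, confirmation, chitchat: no retrieval latency added.
    return { speak: llmTurn.text, retrieved: false };
  }
  const tool = tools[llmTurn.toolCall.name];
  if (!tool) {
    throw new Error(`Unknown tool: ${llmTurn.toolCall.name}`);
  }
  const context = await tool(llmTurn.toolCall.args);
  // The caller would send `context` back to the LLM for the final reply.
  return { context, retrieved: true };
}
```

Logging the `retrieved` flag per turn gives you the share of turns that trigger retrieval, which is a useful number to watch when you change the tool description.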

When RAG isn't the right tool

Three cases:

Highly structured queries. "What's my order status?" doesn't need RAG; it needs a function call to your order system.

Static facts in the system prompt. Your business name, hours, address — bake these into the prompt directly.

Real-time data. RAG is for relatively stable knowledge. For real-time data (current inventory, today's appointments), use direct API calls.

Evaluation

Retrieval quality eval:

  1. Pick 50 representative caller questions.
  2. For each, manually identify which chunks should be retrieved.
  3. Run your retrieval; compare top-K against the ground truth.
  4. Track precision@k and recall@k.

Run this whenever you change the embedding model, chunk size, or knowledge base.
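
For one question, the two metrics can be computed as below. This is a sketch using the usual definitions (precision@k divides by k, recall@k divides by the number of relevant chunks); the chunk IDs are made up, and `retrievedIds` is assumed to contain no duplicates.

```javascript
// retrievedIds: ranked chunk IDs returned by retrieval (best first).
// relevantIds: the hand-labelled ground-truth chunk IDs for the question.
function precisionRecallAtK(retrievedIds, relevantIds, k) {
  if (k <= 0) throw new Error("k must be positive");
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.slice(0, k).filter((id) => relevant.has(id)).length;
  return {
    precision: hits / k,
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
  };
}

// Two of the top 4 results are relevant; two of the three relevant chunks were found.
const scores = precisionRecallAtK(["c1", "c7", "c3", "c9"], ["c1", "c3", "c5"], 4);
console.log(scores.precision); // prints 0.5
```

Averaging both numbers over the full question set gives one pair of figures to compare before and after each change.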

FAQ

Do I need a vector database? For more than ~10,000 chunks, yes. Below that, you can run a brute-force similarity search over a flat in-memory array of embeddings.
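
A flat-array search of that kind can be as small as the sketch below. The entry shape and the toy 3-D vectors are assumptions; real embeddings have hundreds or thousands of dimensions, and vectors are assumed to be non-zero.

```javascript
// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every entry against the query and keep the best `limit`.
function flatSearch(index, queryVector, limit = 4) {
  return index
    .map((entry) => ({ ...entry, score: cosineSimilarity(entry.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const index = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0, 1, 0] },
  { id: "c", vector: [0.7, 0.7, 0] },
];
console.log(flatSearch(index, [1, 0, 0], 2).map((r) => r.id)); // prints [ 'a', 'c' ]
```

The scan is linear in the number of chunks, so it is worth timing on your own corpus before deciding you need a dedicated store.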

How fresh should my knowledge base be? Re-index after any significant doc update. For policy changes, immediately. For minor wording, weekly is fine.

Can RAG hallucinate? The retrieval doesn't hallucinate; the LLM still can. Mitigation: prompt the LLM to "only use the provided context to answer; if the context doesn't contain the answer, say so."
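
One way to apply that instruction is to put it at the top of the prompt, ahead of the retrieved chunks. The sketch below is illustrative: `buildGroundedPrompt`, the numbered-chunk layout, and the "(no context retrieved)" placeholder are assumptions, and the instruction text is the one quoted above.

```javascript
const GROUNDING_INSTRUCTION =
  "Only use the provided context to answer; " +
  "if the context doesn't contain the answer, say so.";

// chunks: array of retrieved chunk texts; question: the caller's question.
function buildGroundedPrompt(chunks, question) {
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    GROUNDING_INSTRUCTION,
    `Context:\n${context || "(no context retrieved)"}`,
    `Caller question: ${question}`,
  ].join("\n\n");
}
```

Numbering the chunks also makes it easier to check afterwards which chunk an answer drew on.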

What's the cost of RAG? Embedding cost (one-time per doc) + retrieval cost (~$0.0001/query) + extra LLM input tokens (roughly 600–2,500 per retrieving turn, at 3–5 chunks of 200–500 tokens each). Usually negligible.

Should I rerank retrieved chunks? For voice, usually not — the latency hit isn't worth it. For chat agents with looser latency budgets, reranking can improve quality 10–20%.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
