The Role of Embeddings in Voice Agent Knowledge
Embeddings are the numerical representations of text that make retrieval-augmented generation work. Most voice agent builders never have to think about embeddings directly — their platform handles them. But understanding what's happening underneath helps you debug retrieval failures and pick the right tools when defaults break down.
TL;DR
- Embeddings turn text into vectors so similar text has similar vectors.
- They're the backbone of RAG: your knowledge base gets embedded; user queries get embedded; you retrieve the closest matches.
- Most voice agents in 2026 use an OpenAI text-embedding-3 model or Cohere's multilingual embeddings.
- Embedding model choice matters more than you'd think on edge cases (multilingual, technical jargon, code).
What embeddings are
A model takes a piece of text and outputs a vector of typically 768–3072 numbers. Texts with similar meaning produce similar vectors (small distance in vector space).
Example (illustrative, not real):
- "How do I cancel my subscription?" → [0.12, -0.34, 0.05, ...]
- "What's the cancellation process?" → [0.11, -0.33, 0.06, ...]
- "What's your return policy?" → [0.45, 0.21, -0.18, ...]
The first two questions are semantically close; their vectors are close. The third is a different topic; its vector is far.
This is what makes retrieval work. You don't search for keyword matches; you search for vector proximity.
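To make "vector proximity" concrete, here is a minimal sketch that scores the three illustrative vectors above with cosine similarity, one common proximity measure. The vectors are the made-up, truncated values from the example, not real model output, and `cosine_similarity` is a helper written for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up three-number vectors from the example above.
# Real embeddings have hundreds or thousands of dimensions.
cancel_subscription = [0.12, -0.34, 0.05]
cancellation_process = [0.11, -0.33, 0.06]
return_policy = [0.45, 0.21, -0.18]

close = cosine_similarity(cancel_subscription, cancellation_process)
far = cosine_similarity(cancel_subscription, return_policy)

print(f"cancel vs. cancellation: {close:.3f}")
print(f"cancel vs. returns:      {far:.3f}")
```

The two cancellation questions score close to 1, while the return-policy question scores much lower. Retrieval is this comparison repeated across every stored chunk.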
How they're used in voice agents
The flow:
At index time:
- Your knowledge base gets chunked into 200–500 token pieces.
- Each chunk gets embedded → vector.
- Vectors are stored in a vector database (Pinecone, Qdrant, pgvector, etc.).
At query time:
- The user's question (or a generated query) gets embedded.
- Vector database returns the K closest chunks.
- Those chunks go into the LLM prompt.
- The LLM answers grounded in the retrieved content.
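The index-time and query-time steps above can be sketched end to end in a few lines. This is a toy under stated assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model (it only captures shared words, not meaning), the "vector database" is a Python list, and every function name here is invented for illustration.

```python
import math
import re
import zlib

DIM = 1024  # toy vector size; chosen only to keep hash collisions unlikely

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(chunks):
    """Index time: embed every chunk and store (chunk, vector) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    """Query time: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

def build_prompt(question, retrieved):
    """Place the retrieved chunks in front of the question for the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "To cancel your subscription, open Settings and choose Cancel plan.",
    "Returns are accepted within 30 days of delivery.",
    "Support is available by phone on weekdays.",
]
index = build_index(chunks)
question = "How do I cancel my subscription?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment, `toy_embed` would be a call to your embedding model and `index` would live in a vector database, but the shape of the flow stays the same.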
For more, see retrieval-augmented generation for voice agents.
Picking an embedding model
Common defaults in 2026:
| Model | Dimensions | Cost / 1M tokens | Best for |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02 | English, general use |
| OpenAI text-embedding-3-large | 3072 | $0.13 | Higher quality English |
| Cohere embed-multilingual-v3 | 1024 | $0.10 | Multilingual |
| Voyage voyage-3 | 1024 | $0.06 | Code, technical content |
| BGE-large (open-source) | 1024 | self-hosted | Cost-sensitive English |
Default: OpenAI text-embedding-3-small. Switch if you have a specific reason.
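To put the price column in perspective, the sketch below estimates what a full embedding pass costs. The prices are copied from the table above. The knowledge-base size (10,000 chunks of 350 tokens) is a hypothetical example, and `embedding_cost` is a helper written for this sketch.

```python
# Prices per 1M tokens, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-multilingual-v3": 0.10,
    "voyage-3": 0.06,
}

def embedding_cost(num_chunks, tokens_per_chunk, model):
    """Estimated USD cost to embed every chunk once with the given model."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Hypothetical knowledge base: 10,000 chunks x 350 tokens = 3.5M tokens.
for model in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost(10_000, 350, model)
    print(f"{model}: ${cost:.2f} per full pass")
```

At this scale a full pass costs well under a dollar on every hosted model in the table, so quality and language coverage are usually the deciding factors rather than price.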
What goes wrong
Three common failure modes:
1. Bad chunks. If your chunks are too long, embeddings get diluted. Too short, you lose context. 200–500 tokens per chunk is the sweet spot.
2. Embedding model mismatch. Using an English-tuned model for multilingual content drops accuracy noticeably. Match the model to your content language.
3. Stale embeddings. Your knowledge base updated but you didn't re-embed. The vectors are out of date.
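As a rough illustration of the chunk-size target in point 1, here is a minimal fixed-size chunker. It assumes whitespace-separated words as a stand-in for tokens, which typically undercounts compared with a real tokenizer. `chunk_text` and `max_tokens` are names invented for this sketch.

```python
def chunk_text(text, max_tokens=400):
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Words are only a rough proxy for model tokens; swap in your model's
    tokenizer if you need exact counts.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]

# A synthetic 1,000-word document for demonstration.
sample = " ".join(f"word{i}" for i in range(1000))
sample_chunks = chunk_text(sample, max_tokens=400)
print([len(chunk.split()) for chunk in sample_chunks])
```

A production chunker would usually split on headings or paragraph boundaries first, so that a chunk doesn't cut a sentence in half, and fall back to a size cap like this one.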
For voice agents specifically, embedding latency matters too. Most embedding APIs return in 50–200ms, but cold starts can hit 500ms+. Keep the embedding endpoint warm so a live call never pays the cold-start cost.
Re-embedding cadence
When to re-embed:
- After significant doc updates (always).
- After embedding model upgrade (rare; do it in a controlled migration).
- After chunking strategy change (test on a small batch first).
Most teams re-embed nightly or weekly. Per-chunk embedding cost is small enough that this isn't expensive.
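One way to keep a nightly or weekly job cheap is to re-embed only the chunks whose text changed. The sketch below assumes you store a content hash next to each vector at embed time. The `{chunk_id: text}` layout and the function names are hypothetical choices for this example.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(current_chunks, stored_hashes):
    """Compare today's chunks with the hashes saved at the last embed.

    current_chunks: {chunk_id: text} from the current knowledge base
    stored_hashes:  {chunk_id: hash} saved alongside each stored vector
    Returns (ids to embed or re-embed, ids whose vectors should be deleted).
    """
    to_embed = [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [chunk_id for chunk_id in stored_hashes if chunk_id not in current_chunks]
    return to_embed, to_delete

stored = {
    "pricing": content_hash("Plans start at $10 per month."),
    "refunds": content_hash("Refunds take 5 business days."),
    "legacy": content_hash("This page was retired."),
}
current = {
    "pricing": "Plans start at $12 per month.",   # edited
    "refunds": "Refunds take 5 business days.",   # unchanged
    "hours": "Support is open on weekdays.",      # new
}
to_embed, to_delete = plan_reembedding(current, stored)
print("embed:", to_embed)
print("delete:", to_delete)
```

A model upgrade or chunking change still calls for a full re-embed, since every stored vector is affected in those cases.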
Hybrid retrieval
Pure vector search misses some queries. A common pattern: combine vector search with keyword search (BM25). For each query:
- Run vector search → top 10 candidates.
- Run keyword search → top 10 candidates.
- Merge and rerank.
Hybrid retrieval improves precision on queries with specific terms (product names, error codes) that pure embeddings sometimes miss.
For voice agents, hybrid retrieval adds latency. Use it only if pure vector search is showing measurable gaps.
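The merge step can be done several ways. The sketch below uses reciprocal rank fusion (RRF), a simple rank-based technique chosen here for illustration and not something this article prescribes. The document IDs are made up, and `k=60` is a conventional default for the RRF constant.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both searches rise to the top.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_n]

# Hypothetical results for a query that mentions an error code.
vector_hits = ["faq-12", "faq-07", "faq-31"]      # from vector search
keyword_hits = ["err-E4021", "faq-07", "faq-12"]  # from BM25 keyword search

merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(merged)
```

Documents found by both searches land ahead of documents found by only one, and the exact-match error-code page that vector search missed still makes the merged list.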
Reranking
After retrieval, a reranker can re-score the top K against the query for better precision. Common rerankers: Cohere Rerank, Voyage Rerank, BGE reranker.
A reranker typically adds 50–150ms of latency and improves precision by 10–20% in most setups.
For voice, the latency hit usually isn't worth it. For chat with looser latency budgets, rerankers earn their cost.
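The trade-off is easy to check with arithmetic. The sketch below adds up the latency ranges quoted in this article for embedding and reranking and compares the worst case against a retrieval budget. The 250ms budget is a hypothetical placeholder, vector search time itself is left out because no figure is given for it, and the helper names are invented for this example.

```python
# Latency ranges in milliseconds (best case, worst case), as quoted in this article.
EMBED_MS = (50, 200)
RERANK_MS = (50, 150)

# Hypothetical slice of the per-turn latency budget reserved for retrieval.
RETRIEVAL_BUDGET_MS = 250

def add_ranges(*ranges):
    """Sum (best, worst) latency ranges for stages that run one after another."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))

def fits_budget(latency_range, budget_ms):
    """True if even the worst case stays within the budget."""
    return latency_range[1] <= budget_ms

without_rerank = add_ranges(EMBED_MS)
with_rerank = add_ranges(EMBED_MS, RERANK_MS)

print("embed only:", without_rerank, fits_budget(without_rerank, RETRIEVAL_BUDGET_MS))
print("embed + rerank:", with_rerank, fits_budget(with_rerank, RETRIEVAL_BUDGET_MS))
```

With these numbers, embedding alone fits the placeholder budget while embedding plus reranking does not. Substitute your own measured latencies and budget before drawing conclusions.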
Domain-tuned embeddings
For specialized content (legal, medical, code), generic embedding models miss nuance. Two options:
Fine-tuned embeddings. Take an open-source model and fine-tune on your domain. Real lift on domain-specific queries; significant engineering effort.
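Domain-specific embedding models. Some vendors publish models specialized for a domain; Voyage, for example, offers law, finance, and code variants. Check current vendor catalogs for your vertical. Often the easier path.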
For most B2B voice agents, generic embeddings are fine. Domain tuning matters most for verticals with specialized vocabulary.
What about end-to-end audio embeddings?
A research direction worth knowing about: instead of audio → text → embedding, embed the audio directly. This skips the STT step.
This is currently more research than production. Quality is comparable to text embeddings, but the ecosystem is far less mature. It's worth watching over the next 1–2 years.
Related reading
- Building a Conversation Memory Layer for Voice Agents
- How to Give a Voice Agent Long-Term Memory
- How Large Language Models Power Voice Agents
- Designing Voice Agents That Ask Better Questions
- Open-Source vs Closed-Source LLMs for Voice Agents
FAQ
How many dimensions should my embeddings have? Use whatever your model produces. Lower-dimensional models are faster but lossy. Higher-dimensional ones are more precise but slower to search.
Can I mix embedding models? Within a single index, no — vectors must be from the same model to be comparable. Across indexes, fine.
What's the difference between embeddings and tokens? Tokens are what the LLM operates on (words/subwords); embeddings are the vector representation of text. Different layers.
Do I need a vector database? For 10,000+ chunks, yes. Below that, in-memory works.
Is RAG going away? No. Even with much larger context windows, RAG is more reliable and cheaper than dumping everything into the prompt.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
