The Role of Embeddings in Voice Agent Knowledge

Tyler Weitzman
January 20, 2026 · 5 min read
Speechify

Embeddings are the numerical representations of text that make retrieval-augmented generation work. Most voice agent builders never have to think about embeddings directly — their platform handles them. But understanding what's happening underneath helps you debug retrieval failures and pick the right tools when defaults break down.

TL;DR

  • Embeddings turn text into vectors so similar text has similar vectors.
  • They're the backbone of RAG: your knowledge base gets embedded; user queries get embedded; you retrieve the closest matches.
  • Most voice agents in 2026 use OpenAI text-embedding-3 or Cohere multilingual.
  • Embedding model choice matters more than you'd think on edge cases (multilingual, technical jargon, code).

What embeddings are

A model takes a piece of text and outputs a vector of typically 768–3072 numbers. Texts with similar meaning produce similar vectors (small distance in vector space).

Example (illustrative, not real):

  • "How do I cancel my subscription?" → [0.12, -0.34, 0.05, ...]
  • "What's the cancellation process?" → [0.11, -0.33, 0.06, ...]
  • "What's your return policy?" → [0.45, 0.21, -0.18, ...]

The first two questions are semantically close; their vectors are close. The third is a different topic; its vector is far.

This is what makes retrieval work. You don't search for keyword matches; you search for vector proximity.
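
To make "vector proximity" concrete, here is a minimal sketch that scores the three illustrative vectors above, truncated to the three components shown. The numbers are made up, as noted, and cosine similarity is an assumed metric; some setups use dot product or Euclidean distance instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-dimensional stand-ins for real embeddings (not real model output).
cancel_subscription = [0.12, -0.34, 0.05]
cancellation_process = [0.11, -0.33, 0.06]
return_policy = [0.45, 0.21, -0.18]

sim_close = cosine_similarity(cancel_subscription, cancellation_process)
sim_far = cosine_similarity(cancel_subscription, return_policy)
print(f"cancel vs. cancellation process: {sim_close:.3f}")
print(f"cancel vs. return policy: {sim_far:.3f}")
```

The two cancellation questions score very close to 1, while the return-policy question scores below 0. A vector database runs this same kind of comparison across every stored chunk.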

How they're used in voice agents

The flow:

At index time:

  1. Your knowledge base gets chunked into 200–500 token pieces.
  2. Each chunk gets embedded → vector.
  3. The vectors get stored in a vector database (Pinecone, Qdrant, pgvector, etc.).

At query time:

  1. The user's question (or a generated query) gets embedded.
  2. The vector database returns the K closest chunks.
  3. Those chunks go into the LLM prompt.
  4. The LLM answers grounded in the retrieved content.
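
The index-time and query-time steps above can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag of words, so it only captures word overlap, not meaning), a plain Python list stands in for the vector database, and the sample chunks are invented.

```python
import hashlib
import math

DIM = 512  # toy dimensionality, chosen only to keep hash collisions rare

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,?!")
        if not word:
            continue
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(chunks):
    """Index time: embed every chunk and store (vector, chunk) pairs."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]

def retrieve(index, query, k=2):
    """Query time: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

chunks = [
    "To cancel your subscription, open Settings and choose Cancel plan.",
    "Returns are accepted within 30 days of delivery.",
    "Our support line is open on weekdays from 9 to 5.",
]
index = build_index(chunks)
top_chunks = retrieve(index, "How do I cancel my subscription?", k=1)

# The retrieved chunks go into the LLM prompt as grounding context.
prompt = "Answer using only this context:\n" + "\n".join(top_chunks)
print(prompt)
```

In a real deployment the platform or vector database handles storage and nearest-neighbor search; the shape of the two phases is the part that carries over.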

For more, see retrieval-augmented generation for voice agents.

Picking an embedding model

Common defaults in 2026:

Model                         | Dimensions | Cost / 1M tokens | Best for
OpenAI text-embedding-3-small | 1536       | $0.02            | English, general use
OpenAI text-embedding-3-large | 3072       | $0.13            | Higher quality English
Cohere embed-multilingual-v3  | 1024       | $0.10            | Multilingual
Voyage voyage-3               | 1024       | $0.06            | Code, technical content
BGE-large (open-source)       | 1024       | self-hosted      | Cost-sensitive English

Default: OpenAI text-embedding-3-small. Switch if you have a specific reason.
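
To put the cost column in perspective, here is a rough estimate of what one full embed of a knowledge base would cost at the listed prices. The knowledge base size (10,000 chunks of about 350 tokens each) is a hypothetical example, and the prices are simply the ones from the table.

```python
# Prices per 1M tokens, copied from the table above.
PRICE_PER_MILLION_USD = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-multilingual-v3": 0.10,
    "voyage-3": 0.06,
}

def embedding_cost_usd(num_chunks, tokens_per_chunk, model):
    """Cost of embedding every chunk once with the given model."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]

# Hypothetical knowledge base: 10,000 chunks of about 350 tokens each (3.5M tokens).
for model in PRICE_PER_MILLION_USD:
    cost = embedding_cost_usd(10_000, 350, model)
    print(f"{model}: ${cost:.2f} per full embed")
```

At this size, a full embed comes to about $0.07 with text-embedding-3-small and stays under $0.50 with text-embedding-3-large, so for small and mid-sized knowledge bases price is rarely the deciding factor.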

What goes wrong

Three common failure modes:

1. Bad chunks. If your chunks are too long, embeddings get diluted. Too short, you lose context. 200–500 tokens per chunk is the sweet spot.
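
A minimal chunking sketch, assuming whitespace-separated words are a good enough proxy for tokens. Real tokenizers count differently, and production chunkers usually try to split on sentence or section boundaries, so treat `chunk_text` and its `max_tokens` default as illustrative only.

```python
def chunk_text(text, max_tokens=300):
    """Split text into consecutive chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]

# A synthetic 1,000-word document, just to show the resulting chunk sizes.
document = "word " * 1000
chunks = chunk_text(document, max_tokens=300)
print([len(chunk.split()) for chunk in chunks])  # [300, 300, 300, 100]
```

The short trailing chunk is worth watching for: a fixed-size splitter can leave a tail too small to carry useful context.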

2. Embedding model mismatch. Using an English-tuned model for multilingual content drops accuracy noticeably. Match the model to your content language.

3. Stale embeddings. Your knowledge base was updated but you didn't re-embed it, so the stored vectors no longer match the current content.

For voice agents specifically, embedding latency matters too. Most embedding APIs return in 50–200ms, but cold starts can exceed 500ms. Keep the embedding endpoint warm.

Re-embedding cadence

When to re-embed:

  • After significant doc updates (always).
  • After embedding model upgrade (rare; do it in a controlled migration).
  • After chunking strategy change (test on a small batch first).

Most teams re-embed nightly or weekly. Per-chunk embedding cost is small enough that this isn't expensive.
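
If a full nightly re-embed feels wasteful, one option is to re-embed only the chunks whose text changed. The sketch below is one possible approach rather than a prescribed one: it assumes you store a content hash next to each vector, and the chunk ids and texts are invented for illustration.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale_chunks(chunks, stored_hashes):
    """Return ids of chunks that are new or whose text changed since the last embed.

    chunks: dict of chunk_id -> current text
    stored_hashes: dict of chunk_id -> hash recorded when the chunk was last embedded
    """
    return [
        chunk_id
        for chunk_id, text in chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]

# Hashes recorded at the last embed (hypothetical data).
stored_hashes = {
    "refunds": content_hash("Refunds take 5 days."),
    "hours": content_hash("Open 9 to 5."),
}
current_chunks = {
    "refunds": "Refunds take 3 days.",  # edited since the last embed
    "hours": "Open 9 to 5.",            # unchanged
    "shipping": "Ships in 2 days.",     # new chunk
}
stale = find_stale_chunks(current_chunks, stored_hashes)
print(stale)  # ['refunds', 'shipping']
```

This sketch does not handle deleted chunks; ids that appear in `stored_hashes` but not in the current knowledge base would also need their vectors removed.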

Hybrid retrieval

Pure vector search misses some queries. A common pattern: combine vector search with keyword search (BM25). For each query:

  1. Run vector search → top 10 candidates.
  2. Run keyword search → top 10 candidates.
  3. Merge and rerank.

Hybrid retrieval improves precision on queries with specific terms (product names, error codes) that pure embeddings sometimes miss.
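
The merge step can be done several ways. The sketch below uses reciprocal rank fusion (RRF), a common merging technique that this article does not specifically prescribe; it needs only the rank positions from each search, not their raw scores. The document ids are invented, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; each id scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for a query that mentions a specific error code.
vector_hits = ["billing-faq", "cancel-steps", "refund-policy"]
keyword_hits = ["error-E4012", "cancel-steps", "billing-faq"]

merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(merged)
```

Chunks that appear in both lists rise to the top, while a chunk found only by keyword search (here the error-code page) still makes it into the merged candidates.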

For voice agents, hybrid retrieval adds latency. Use it only if pure vector search is showing measurable gaps.

Reranking

After retrieval, a reranker can re-score the top K against the query for better precision. Common rerankers: Cohere Rerank, Voyage Rerank, BGE reranker.

A reranker typically adds 50–150ms of latency and improves precision by roughly 10–20% in most setups.

For voice, the latency hit usually isn't worth it. For chat with looser latency budgets, rerankers earn their cost.
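
Structurally, reranking is a second scoring pass over the retrieved candidates. In the sketch below, `overlap_score` is a hypothetical stand-in for a real reranker model (hosted rerankers score each query and passage pair with a learned model, not word overlap), and the candidate passages are invented.

```python
def overlap_score(query, passage):
    """Hypothetical stand-in for a reranker: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-score retrieved candidates against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]

# Candidates in the order the first-stage retrieval returned them (hypothetical).
candidates = [
    "our return policy covers 30 days",
    "reset your router to fix error e4012",
    "error codes are listed in the manual",
]
best = rerank("how to fix error e4012", candidates, top_n=1)
print(best)  # ['reset your router to fix error e4012']
```

Because `score_fn` is a parameter, the same wrapper can time the call to a real reranker and fall back to the original order when the latency budget is tight.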

Domain-tuned embeddings

For specialized content (legal, medical, code), generic embedding models miss nuance. Two options:

Fine-tuned embeddings. Take an open-source model and fine-tune on your domain. Real lift on domain-specific queries; significant engineering effort.

Domain-specific embedding models. Voyage and Cohere both have legal/medical specialized models. Often the easier path.

For most B2B voice agents, generic embeddings are fine. Domain tuning matters most for verticals with specialized vocabulary.

What about end-to-end audio embeddings?

A research direction worth knowing about: instead of audio → text → embedding, embed the audio directly. This skips the STT step.

This is currently more research than production. Quality is comparable to text embeddings, but ecosystem maturity is much lower. It is worth watching over the next 1–2 years.

FAQ

How many dimensions should my embeddings have? Use whatever your model produces. Lower-dimensional models are faster but lossy. Higher-dimensional ones are more precise but slower to search.

Can I mix embedding models? Within a single index, no — vectors must be from the same model to be comparable. Across indexes, fine.

What's the difference between embeddings and tokens? Tokens are what the LLM operates on (words/subwords); embeddings are the vector representation of text. Different layers.

Do I need a vector database? For 10,000+ chunks, yes. Below that, in-memory works.

Is RAG going away? No. Even with much larger context windows, RAG is more reliable and cheaper than dumping everything into the prompt.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
