The Anatomy of a Voice Agent Pipeline
A practical, vendor-neutral guide for teams building or buying voice AI agents.
If you took every voice agent in production today and dissected them, you'd find roughly the same skeleton. The names change. The vendors change. The plumbing details vary. But the bones are the same: a telephony layer feeds audio into a streaming recognizer, the recognizer feeds text to an LLM, the LLM hands replies to a TTS engine, and the TTS engine pushes audio back out the way it came.
This piece is the reference diagram. If you're building, comparing vendors, or trying to figure out why your latency is bad, having the full pipeline in your head is worth more than any benchmark.
TL;DR
- A production voice agent has 8 distinct layers, each with its own failure mode.
- The pipeline is shaped less like a sequence and more like a flowing river: every stage is streaming, and most stages overlap in time.
- Three places eat most of your latency budget: the endpointer, the LLM time-to-first-token, and the TTS time-to-first-audio.
- The hidden complexity isn't in the AI models. It's in the orchestration, telephony glue, and post-call handling.
The eight layers
In order from "audio comes in" to "audio goes out and the call ends":
Layer 1 – Telephony / transport
The pipe. PSTN audio arrives via a telephony provider (Twilio, Plivo, Telnyx, Bandwidth) or a SIP trunk you operate yourself. Web audio comes via WebRTC. Either way, you get a continuous audio stream – typically 16-bit PCM at 8 kHz (phone) or 16 kHz (WebRTC) – and a control channel.
Done badly, this layer alone adds 100–300ms of latency. Done well, you carve it down to 30–80ms by terminating audio at a server in the same region as your STT and routing accordingly.
We have a separate piece on SIP trunking 101 for voice agent builders if you're picking telephony.
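To make the shape of this layer concrete, here's a sketch of decoding one inbound frame from a Twilio Media Streams WebSocket. Twilio sends 8 kHz G.711 mu-law audio, base64-encoded inside JSON messages; other providers differ in framing but not in spirit. The function names are mine; the mu-law decode follows the standard G.711 formula.

```python
import base64
import json
import struct

BIAS = 0x84  # G.711 mu-law bias

def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF
    t = ((u & 0x0F) << 3) + BIAS
    t <<= (u & 0x70) >> 4
    return (BIAS - t) if (u & 0x80) else (t - BIAS)

def decode_media_frame(message: str) -> bytes:
    """Turn one Media Streams 'media' message into little-endian PCM16.

    Twilio frames carry ~20ms of 8 kHz mu-law (160 bytes) per message.
    """
    msg = json.loads(message)
    if msg.get("event") != "media":
        return b""  # ignore 'start', 'stop', 'mark', etc.
    ulaw = base64.b64decode(msg["media"]["payload"])
    return struct.pack(f"<{len(ulaw)}h", *(ulaw_byte_to_pcm16(b) for b in ulaw))
```

Everything downstream (VAD, STT) consumes the PCM this step emits, so doing the decode close to the socket keeps the rest of the pipeline format-agnostic.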
Layer 2 – Voice activity detection (VAD)
The simplest signal in the pipeline: is the human speaking right now? Open-source models like Silero VAD or webrtc-vad ship out of the box. They run on CPU at a small frame size (10–30ms) and produce a binary "speech or not" stream.
VAD is your earliest signal that something is happening, and it's the input to the endpointer. It also triggers barge-in detection.
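In production you'd reach for Silero or webrtc-vad directly; the toy energy gate below (all names mine) just shows the shape of the output contract: one boolean per fixed-size frame.

```python
import struct

FRAME_MS = 20
SAMPLE_RATE = 8000
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples per 20ms frame

def frame_energy(pcm_frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(pcm_frame) // 2}h", pcm_frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def vad_stream(pcm: bytes, threshold: float = 500.0):
    """Yield (frame_index, is_speech) per 20ms frame.

    A real VAD model replaces the energy threshold with a learned
    classifier, but the output stream looks the same downstream.
    """
    frame_bytes = FRAME_SAMPLES * 2
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield i // frame_bytes, frame_energy(pcm[i:i + frame_bytes]) > threshold
```

The endpointer and barge-in handler both subscribe to exactly this stream of per-frame booleans.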
Layer 3 – Streaming STT / ASR
Audio in, text out, continuously. Streaming partials every 50–100ms; final transcript when the endpointer fires.
Pick based on three things: WER on your audio domain, latency to first partial, and language coverage. Custom vocabulary support is underrated β it's the difference between an agent that gets your product names right and one that doesn't.
Layer 4 – Endpointer
Watches VAD + transcript + audio prosody and decides "the caller is done speaking, now trigger the LLM." Get this wrong and the agent either jumps in mid-sentence (rude) or sits in 1.5 seconds of dead air after every caller utterance (slow).
Rule of thumb: a fixed silence threshold of 600–800ms is the lazy version. A learned endpointer that reads sentence completeness from the transcript is the right version. The difference between the two is often 300ms of perceived latency.
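A minimal sketch of the middle ground: a silence threshold that collapses when the partial transcript already looks like a finished sentence. Real endpointers use prosody and a trained completeness classifier; the punctuation check here is a crude stand-in, and all names are mine.

```python
TERMINAL_PUNCT = (".", "?", "!")

class Endpointer:
    """Fires when silence exceeds a threshold that shrinks once the
    partial transcript looks like a complete sentence."""

    def __init__(self, base_silence_ms: int = 800, complete_silence_ms: int = 300):
        self.base = base_silence_ms          # caller may still be mid-thought
        self.complete = complete_silence_ms  # sentence looks finished
        self.silence_ms = 0

    def update(self, frame_ms: int, is_speech: bool, partial: str) -> bool:
        """Feed one VAD frame plus the current STT partial; True means fire."""
        self.silence_ms = 0 if is_speech else self.silence_ms + frame_ms
        looks_done = partial.rstrip().endswith(TERMINAL_PUNCT)
        threshold = self.complete if looks_done else self.base
        return self.silence_ms >= threshold
```

Note the asymmetry: a complete-looking sentence triggers in 300ms of silence, an incomplete one waits the full 800ms, which is exactly where the ~300ms of perceived latency gets recovered.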
Layer 5 – LLM / orchestration
The brain. Inputs: system prompt, running transcript, tool schemas, retrieved knowledge. Outputs: text reply, optionally a function call.
What "orchestration" actually means in production:
- Function calling. When the LLM decides "I need to look up this caller," your code intercepts the function call, runs it against your CRM, and returns the result. The model then continues with the new context.
- Retrieval. Before sending the prompt, you may pull relevant docs from a knowledge base – a returns policy, a product spec – and stitch them into context.
- Guardrails. A second-pass check that the reply doesn't say something it shouldn't. (Examples: never quote a price the model invented, never agree to a refund above $X.)
- Streaming. Token-by-token output so TTS can start synthesizing before the model finishes thinking.
We have a deep piece on function calling for voice agents: a practical guide.
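The function-calling intercept loop, in miniature. Everything here is illustrative: the reply dict shape, the tool name, and the stubbed CRM lookup are assumptions for the sketch, not any particular vendor's API.

```python
import json

def lookup_caller(phone: str) -> dict:
    """Hypothetical CRM lookup stub; a real one hits Salesforce, HubSpot, etc."""
    return {"name": "Ada", "plan": "pro"}

TOOLS = {"lookup_caller": lookup_caller}

def run_turn(llm_reply: dict) -> str:
    """Dispatch one (hypothetical) LLM reply: plain text or a tool call.

    When the model emits a tool call, orchestration runs it and returns
    the result; in a real loop that result is fed back to the model for
    a second pass before anything is spoken.
    """
    if "tool_call" in llm_reply:
        call = llm_reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        return json.dumps(result)
    return llm_reply["text"]
```

This is also the natural place to bolt on retries and guardrails: wrap the `TOOLS[...]` invocation, not the model call.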
Layer 6 – TTS
Text in, audio out. Streaming if your TTS supports it. The output is audio chunks (Opus, MP3, or raw PCM) that get pushed back through the telephony layer.
The single most important property here is time to first audio chunk. Some TTS vendors hit 150ms. Others sit at 800ms. That delta is felt directly by the caller.
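The usual trick for hiding that delta is to cut the LLM token stream at sentence boundaries and ship each sentence to TTS as soon as it closes, instead of waiting for the full reply. A minimal sketch, with a deliberately crude boundary heuristic:

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")

def sentence_chunks(tokens: Iterable[str], min_len: int = 10) -> Iterator[str]:
    """Group an LLM token stream into sentence-sized chunks so TTS can
    start on the first sentence while generation continues."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_END) and len(buf.strip()) >= min_len:
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush whatever trails the last boundary
        yield buf.strip()
```

The `min_len` guard stops abbreviations and "Dr." style fragments from producing comically short TTS requests; production systems use a smarter segmenter, but the pipelining idea is the same.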
Layer 7 – Barge-in handler
Runs in parallel with TTS playback. Watches incoming VAD; if the caller starts talking while the agent is talking, you immediately stop TTS, flush the audio buffer at the telephony layer, cancel any pending LLM tokens that would have been spoken, and go back to listening.
The hard part is the flush. If you don't drop the audio buffered at Twilio's edge, the agent keeps talking for another 500ms after the caller starts. We cover this in how voice agents handle interruptions gracefully.
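A sketch of the cancel-and-flush step, assuming a Twilio-style bidirectional media stream (Twilio Media Streams accepts a "clear" event that drops audio buffered at its edge; other providers have equivalents). The websocket object and task wiring are illustrative.

```python
import asyncio
import json

async def handle_barge_in(ws, stream_sid: str, tts_task: asyncio.Task) -> None:
    """Caller spoke during playback: stop synthesizing and flush the edge.

    `ws` is any object with an async send(); the "clear" message format
    here is Twilio's. After this returns, control goes back to listening.
    """
    tts_task.cancel()  # stop generating audio we'd have to throw away
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
```

The ordering matters: cancel first so no new chunks land in the buffer you are about to flush.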
Layer 8 – Post-call
Often forgotten, often where the real product value lives:
- Transcript persistence – where the call gets logged for later review.
- Summary generation – a one-paragraph summary written by an LLM after hangup.
- CRM sync – a structured note posted to Salesforce, HubSpot, or wherever the system of record lives.
- Webhook fires – if the call resulted in a booked appointment, a follow-up SMS, or an escalation, the right systems get pinged.
- Analytics – call duration, intents detected, resolution status, sentiment.
The post-call layer is what turns a voice agent from a novelty into a business tool. Without it, you have a conversation and nothing else; no one above your team will ever look at the data.
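One way to see this layer is as a single structured record flowing out of every call, with one field per concern listed earlier. There is no standard schema; the one below is purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class PostCallRecord:
    """Illustrative schema for what the post-call layer emits per call."""
    call_id: str
    duration_s: int
    transcript: list            # [(speaker, text), ...] persisted for review
    summary: str                # one paragraph, written by an LLM after hangup
    resolution: str             # e.g. "resolved" | "escalated" | "voicemail"
    webhooks: list = field(default_factory=list)  # downstream events to fire

def to_crm_note(rec: PostCallRecord) -> str:
    """Render the record as a plain-text note for a CRM sync adapter."""
    return f"[{rec.call_id}] {rec.duration_s}s, {rec.resolution}: {rec.summary}"
```

Each CRM or analytics destination then gets its own small adapter over this record, rather than each one re-parsing raw transcripts.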
How the layers actually run (it's not a sequence)
Drawing the pipeline as a left-to-right pipe is misleading. It's actually closer to a set of overlapping conveyor belts:
```
Caller audio:      ████████████████████____________________________
VAD:               _███████████████████____________________________
STT partials:      ___█████████████████____________________________
Endpointer fires:                      ▲
LLM:                                   ██████████__________________
TTS:                                       ████████████____________
Agent audio out:                             ████████████████______
Caller listening:                            ████████████████______
```
Three things to notice:
- STT runs concurrently with the caller talking. It's not "wait for the caller to finish, then transcribe." Partial transcripts are flowing in real time.
- TTS starts before the LLM is done. The first sentence of the reply gets synthesized while the model is still generating the rest.
- The endpointer is the choke point. It's the only stage that cannot overlap with anything else: by definition, it's the moment we decide to stop listening and start replying.
That's why endpointer quality is the highest-leverage tuning parameter in the entire stack.
Where most builds go wrong
Pattern matching across dozens of voice agent builds, the failure modes cluster into a small number of bins:
- Treating each layer as independent. "Pick the best STT, pick the best LLM, pick the best TTS." Doesn't work: the interfaces between layers are where most of the latency lives.
- Ignoring the endpointer. Default 1-second silence threshold, no prosodic features, awkward pauses on every utterance.
- No streaming TTS. Wait for the LLM to finish, then synthesize. Adds 500ms+ of perceived latency for no reason.
- No barge-in. Caller has to wait politely for the agent to finish a sentence before they can interject. Feels robotic.
- No post-call layer. The call happens, but nothing useful flows out of it. CRM stays empty.
- Function calls without retries. A flaky CRM lookup → the agent hallucinates an answer → the caller gets wrong info.
The integration question: build vs buy each layer
For each layer, the build-vs-buy decision is different:
| Layer | Build it yourself? | Why |
|---|---|---|
| Telephony | No | Twilio/Plivo are good and cheap |
| VAD | No | Silero is free and works |
| STT | Almost never | Hosted models are 10x better than what you'd build |
| Endpointer | Sometimes | Off-the-shelf is OK; custom is better |
| LLM | Almost never | Use a hosted model; swap as needed |
| Orchestration | Yes (or use a platform) | This is where your business logic lives |
| TTS | Almost never | Hosted neural TTS is years ahead of open-source |
| Barge-in | Build (or use platform) | Glue code, but critical |
| Post-call | Build | Specific to your CRM and workflows |
For the lazy-but-correct path, most teams pick a voice agent platform that handles layers 1–7 and lets them write their own orchestration and post-call logic.
A complete reference architecture
Putting it all together, here's the architecture diagram for a production voice agent:
```
                 ┌────────────┐
         Caller →│    PSTN    │
                 └─────┬──────┘
                       │ audio
                 ┌─────▼──────┐
                 │ Telephony  │  Twilio / Plivo / SIP
                 │ (Layer 1)  │
                 └─────┬──────┘
                       │ PCM frames
          ┌────────────┼─────────────┐
          │            │             │
     ┌────▼────┐  ┌────▼────┐  ┌─────▼───────┐
     │   VAD   │  │   STT   │  │  Barge-in   │
     │  (L2)   │  │  (L3)   │  │    (L7)     │
     └────┬────┘  └────┬────┘  └─────────────┘
          │            │
        ┌─▼────────────▼─┐
        │   Endpointer   │
        │      (L4)      │
        └───────┬────────┘
                │ "caller done"
    ┌───────────▼──────────────────┐
    │     LLM + Orchestration      │
    │            (L5)              │
    │                              │
    │  ┌─────────┐  ┌───────────┐  │
    │  │ tools / │  │ retrieval,│  │
    │  │  funcs  │  │ guardrails│  │
    │  └─────────┘  └───────────┘  │
    └───────────┬──────────────────┘
                │ streaming text
           ┌────▼─────┐
           │   TTS    │
           │   (L6)   │
           └────┬─────┘
                │ audio frames
                ▼
      back to telephony → caller
```

After hangup:

```
  ┌────────────────┐
  │   Post-call    │
  │     (L8)       │
  │                │
  │ summary, CRM   │
  │ sync, webhooks,│
  │ analytics      │
  └────────────────┘
```
That's the whole picture. Every voice agent platform you can name fits this shape β they just package the layers differently.
FAQ
Do I need every layer? For a real production agent, yes. Cutting layers (no barge-in, no post-call) ships faster but creates obvious quality gaps that you'll feel within a week of going live.
What's the cheapest path to a working prototype? Pick a platform that gives you layers 1–7 out of the box. Wire your business logic into the orchestration layer. Don't try to assemble the stack from scratch on day one.
Where does multimodal input fit in (e.g., the caller sends a photo)? Multimodal voice agents are still rare in 2026 because most calls happen on PSTN, which has no image channel. WebRTC-based voice agents are starting to support image and screen-share inputs alongside audio.
Can the LLM live on the same server as STT and TTS? It can, and for very latency-sensitive deployments people do this. The downside is operational β running GPU inference for STT, LLM, and TTS on the same box is a lot to manage. Most teams use separate hosted services and accept the tiny network cost.
Where does sentiment analysis fit? Either in real time inside the orchestration layer (rare – it adds latency) or in the post-call layer (common). Real-time sentiment is mostly an "interesting but rarely useful" feature in production.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems: text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
Related reading
The Hidden Complexity of Numbers in Voice Agents
How a Conversational Voice Agent Actually Works (Under the Hood)
How Voice Agents Recover from Misunderstandings