Voice AI Glossary: 50 Terms You Need to Know
Voice AI uses a mix of telecom, machine learning, and contact-center jargon. If you're new to the space, the vocabulary alone is a barrier. This is a no-fluff glossary of the 50 terms that show up most often in real engineering and operations work.
TL;DR
- Voice AI borrows from telephony (SIP, DTMF, PSTN), speech (STT, TTS, ASR), AI (LLM, RAG, function calling), and contact center (AHT, FCR, CSAT) vocabularies.
- Once you know roughly 50 terms, almost every conversation in the field becomes parseable.
Telephony and audio
PSTN. Public Switched Telephone Network. The traditional phone network. Most calls to a voice agent come via PSTN.
SIP. Session Initiation Protocol. A standard for setting up voice calls over the internet. SIP trunks let businesses route calls without a traditional phone line.
SIP trunk. A virtual phone connection from a telephony provider to your voice agent infrastructure.
WebRTC. Web Real-Time Communication. The browser-native standard for real-time audio and video, used for in-app voice agents.
DTMF. Dual-Tone Multi-Frequency. The "press 1 for sales" tones. Voice agents can accept DTMF alongside speech.
Codec. Audio compression format. Common codecs for voice: PCMU (G.711), Opus, AAC.
Sample rate. The number of audio samples captured per second; higher rates preserve more detail. Phone audio is 8 kHz; high-quality audio is 16 kHz or higher.
Jitter. Variation in network packet timing. Causes audio glitches when high.
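To make the jitter entry concrete, here is a minimal sketch of a running jitter estimate in the style of the RTP specification (RFC 3550), which smooths the change in packet transit time with a gain of 1/16. The function name, the millisecond timestamps, and the sample data are illustrative assumptions, not taken from any particular telephony stack.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate, RFC 3550 style: J += (|D| - J) / 16.

    D is the change in transit time between consecutive packets.
    """
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Hypothetical data: packets sent every 20 ms.
send = [0, 20, 40, 60, 80]
steady = [50, 70, 90, 110, 130]   # constant 50 ms network delay
bursty = [50, 85, 92, 140, 131]   # delay varies packet to packet

print(interarrival_jitter(send, steady))  # 0.0
print(interarrival_jitter(send, bursty))  # greater than zero
```

A constant delay, however large, produces zero jitter. Only the variation counts, which is why a jitter buffer can smooth playback at the cost of some added latency.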
Speech recognition
STT. Speech-to-Text. Converts audio to text. Synonym for ASR.
ASR. Automatic Speech Recognition. Older term for STT.
WER. Word Error Rate. The share of words an STT system gets wrong, counting substitutions, deletions, and insertions against a reference transcript. Lower is better; under 5% is good.
Streaming STT. STT that returns partial transcripts as audio arrives, instead of waiting for the speaker to finish.
Endpointer. A model that decides when the speaker has finished a turn. Critical for natural conversation.
VAD. Voice Activity Detection. Detects whether someone is speaking right now.
Custom vocabulary. A list of domain-specific words biased into the STT model so it recognizes them correctly.
Diarization. Identifying which speaker said what in a multi-speaker conversation.
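The WER entry above is usually computed as a word-level edit distance divided by the number of words in the reference transcript. The sketch below shows one common way to do that. The lowercasing and whitespace tokenization are simplifying assumptions, since real evaluations normalize punctuation and numbers more carefully, and the sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "please transfer me to the billing department"
hypothesis = "please transfer me to billing apartment"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.1%}")  # 28.6%
```

Here one word is dropped and one is substituted, so two errors over seven reference words. Because insertions count as errors, WER can exceed 100% on very noisy output.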
Speech synthesis
TTS. Text-to-Speech. Converts text to audio.
Neural TTS. Modern TTS using deep learning. At its best, sounds close to human.
Streaming TTS. TTS that produces audio chunks as text arrives, instead of waiting for the full sentence.
Voice cloning. Creating a synthetic voice from a sample of someone speaking.
Time to first audio (TTFA). How long after sending text to TTS before the first audio chunk is available. The single most important TTS metric for voice agents.
Prosody. The rhythm, pitch, and intonation of speech. Modern TTS handles this well; older TTS sounds robotic.
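Time to first audio is straightforward to measure once TTS is streamed: start a timer when the text is sent and stop it when the first chunk arrives. The sketch below illustrates the idea. `fake_streaming_tts` is a hypothetical stand-in for a real streaming TTS client, and its startup delay and chunk size are assumptions chosen for the example.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, startup_delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical streaming TTS: waits, then yields one audio chunk per word."""
    time.sleep(startup_delay_s)  # simulated model and network warm-up
    for _ in text.split():
        yield b"\x00" * 320      # 20 ms of silence at 8 kHz, 16-bit mono


def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count)."""
    start = time.perf_counter()
    ttfa = None
    count = 0
    for _ in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start
        count += 1
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, count


ttfa_s, n_chunks = measure_ttfa(fake_streaming_tts("your order has shipped"))
print(f"TTFA: {ttfa_s * 1000:.0f} ms over {n_chunks} chunks")
```

The same timing pattern applies to time to first token on the LLM side: the timer stops at the first streamed item rather than at the end of the response.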
Large language models
LLM. Large Language Model. The reasoning brain inside a voice agent.
Time to first token (TTFT). How long after sending a prompt to the LLM before the first response token. Critical for voice latency.
Prompt caching. A technique where the LLM provider stores the static portion of your prompt and reuses computation across requests. Cuts cost and latency.
System prompt. The instructions that define the agent's role, tone, and rules.
Function calling. When the LLM emits a structured request to call a tool (e.g., lookup_account(id=123)).
Tool use. Synonym for function calling.
RAG. Retrieval-Augmented Generation. Pulling relevant docs into the prompt so the LLM can answer with grounded info.
Embedding. A numerical representation of text used for similarity search in RAG.
Hallucination. When the LLM makes something up. Bad in voice agents; mitigated with RAG and guardrails.
Guardrails. Rules and filters that prevent the agent from saying things it shouldn't.
Eval. A test that grades the agent's behavior on a representative input. Essential for production voice agents.
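As a rough illustration of the function calling entry, the sketch below dispatches a structured tool request to ordinary code. The JSON shape (`name` plus `arguments`) is an assumption for the example, since each LLM provider defines its own format, and `lookup_account` here is a hypothetical tool with a hard-coded result.

```python
import json


def lookup_account(id: int) -> dict:
    """Hypothetical tool; a real one would query a CRM or database."""
    return {"id": id, "status": "active"}


# Registry of tools the agent is allowed to call.
TOOLS = {"lookup_account": lookup_account}


def dispatch_tool_call(raw: str) -> dict:
    """Parse the model's structured request and run the matching tool."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        # Simple guardrail: refuse anything outside the registry.
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Assumed shape of what the LLM emits instead of free text.
llm_output = '{"name": "lookup_account", "arguments": {"id": 123}}'
result = dispatch_tool_call(llm_output)
print(result)  # {'id': 123, 'status': 'active'}
```

In a voice agent the tool result is typically sent back to the LLM, which then phrases the answer for the caller.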
Conversation mechanics
Turn-taking. The system that decides whose turn it is to speak.
Barge-in. When the user starts talking while the agent is talking.
Backchannel. Brief utterances ("uh-huh," "right") that signal listening without interrupting.
Latency. Total time between end-of-user-speech and start-of-agent-audio. Sub-500ms is the bar.
P99. The 99th-percentile latency: 99% of turns are at least this fast, and 1% are slower. Often much worse than the median, which is why an agent that is "fast" on average can still feel inconsistent.
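The gap between median and P99 latency is easy to see with a small calculation. This sketch uses the nearest-rank percentile method, which is one of several common definitions, and the latency samples are invented for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-turn latencies in milliseconds: mostly fast, a few slow outliers.
latencies_ms = [420] * 97 + [450, 1800, 2600]

print(percentile(latencies_ms, 50))  # 420
print(percentile(latencies_ms, 99))  # 1800
```

In this made-up sample the median clears a 500 ms bar comfortably, while the P99 is more than four times slower. Callers who hit those slow turns are the ones who notice.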
Telephony and call flow
IVR. Interactive Voice Response. Pre-AI menu systems ("press 1 for sales").
Inbound. Calls coming to the agent.
Outbound. Calls placed by the agent.
Transfer. Routing the call to a human or another agent.
Warm transfer. Transfer with conversation context handed over.
Cold transfer. Transfer with no context: the human starts from scratch.
Compliance and security
TCPA. Telephone Consumer Protection Act. U.S. law restricting outbound calling.
HIPAA. U.S. healthcare privacy law. Affects voice agents in healthcare.
SOC 2. Security audit standard. Often required by enterprise buyers.
PII. Personally Identifiable Information. Must be handled carefully.
A2P 10DLC. Application-to-Person 10-Digit Long Code. A U.S. SMS framework that affects SMS sent from voice agents.
Contact center metrics
AHT. Average Handle Time. How long a call takes end to end.
FCR. First Contact Resolution. Did the issue get resolved on the first call?
CSAT. Customer Satisfaction. Usually measured 1–5 after the call.
Deflection rate. Percentage of calls the AI handled without escalating to a human.
CCaaS. Contact Center as a Service. Hosted contact center platforms (Five9, Genesys, etc.).
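AHT, FCR, and deflection rate are all simple ratios over call records. The sketch below shows one way they might be computed. The `CallRecord` fields and the sample calls are hypothetical, and vendors define these metrics slightly differently, so treat the formulas as a starting point rather than a standard.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    handle_seconds: int            # total time spent on the call
    resolved_first_contact: bool   # issue resolved without a repeat call
    escalated_to_human: bool       # AI handed the call to a person


def summarize(calls: List[CallRecord]) -> Dict[str, float]:
    """Compute AHT, FCR rate, and deflection rate for a batch of calls."""
    n = len(calls)
    if n == 0:
        raise ValueError("need at least one call")
    return {
        "aht_seconds": sum(c.handle_seconds for c in calls) / n,
        "fcr_rate": sum(c.resolved_first_contact for c in calls) / n,
        "deflection_rate": sum(not c.escalated_to_human for c in calls) / n,
    }


# Made-up sample of four calls.
calls = [
    CallRecord(180, True, False),
    CallRecord(240, True, False),
    CallRecord(420, False, True),
    CallRecord(360, True, True),
]

metrics = summarize(calls)
print(metrics)  # {'aht_seconds': 300.0, 'fcr_rate': 0.75, 'deflection_rate': 0.5}
```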
That's the core 50. Reading the rest of this site, you'll see most of these used in context. For a deeper look at any single one, the linked articles go further.
Related reading
- What Is a Voice Agent? A 2026 Primer
- First-Time Builder's Guide to Voice Agents
- Why Voice AI Will Transform Phone Channels by 2030
- Voice Agent Use Cases: A Field Guide
- Synchronous vs Asynchronous Voice Agents
FAQ
Which terms are most important for non-engineers to know? Latency, deflection rate, AHT, FCR, CSAT, and escalation. Those six cover most operational conversations.
Which terms are most important for engineers? TTFT, TTFA, WER, function calling, endpointer, and barge-in. The others can wait.
Where can I learn the telephony side better? Twilio's docs are the standard reference. Even if you're not on Twilio, their explanations of SIP, DTMF, and codecs are clear.
Are these terms standardized? Mostly. Some terms (like "deflection") have slightly different meanings across vendors. Always check what your vendor specifically means.

Cliff Weitzman is the CEO and co-founder of Speechify, the world's leading text-to-speech app. As a Forbes 30 Under 30 honoree, Cliff has spent more than a decade building consumer and enterprise products that make voice technology accessible to everyone. He writes about the future of voice AI, how natural-sounding agents will reshape customer experience, and how teams should think about deploying conversational AI responsibly.
