Voice AI uses a mix of telecom, machine learning, and contact-center jargon. If you're new to the space, the vocabulary alone is a barrier. This is a no-fluff glossary of the 50 terms that show up most often in real engineering and operations work.

TL;DR

Voice AI borrows from telephony (SIP, DTMF, PSTN), speech (STT, TTS, ASR), AI (LLM, RAG, function calling), and contact center (AHT, FCR, CSAT) vocabularies.
Once you know roughly 50 terms, almost every conversation in the field becomes parseable.

Telephony and audio

PSTN. Public Switched Telephone Network. The traditional phone network. Most calls to a voice agent come via PSTN.

SIP. Session Initiation Protocol. A signaling standard for setting up, managing, and ending voice calls over IP networks. SIP trunks let businesses route calls without a traditional phone line.

SIP trunk. A virtual phone connection from a telephony provider to your voice agent infrastructure.

WebRTC. Web Real-Time Communication. The browser-based standard for real-time audio and video, used for in-app voice agents.

DTMF. Dual-Tone Multi-Frequency. The "press 1 for sales" tones. Voice agents can accept DTMF alongside speech.
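"Dual-tone" is literal: each key is the sum of one low-frequency row tone and one high-frequency column tone. A minimal sketch of the standard keypad mapping follows; the function name dtmf_frequencies is made up for illustration.

```python
# Standard DTMF keypad: each key combines one row tone and one column tone (Hz).
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> tuple[int, int]:
    """Return the (row, column) tone pair in Hz for a single keypad key."""
    if len(key) != 1:
        raise ValueError(f"expected a single key, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key)
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")
```

For example, dtmf_frequencies("1") gives the 697 Hz and 1209 Hz pair that a detector listens for when a caller presses 1.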

Codec. Short for coder-decoder: the scheme used to encode and compress audio. Common codecs for voice: PCMU (G.711 μ-law), Opus, AAC.

Sample rate. The number of audio samples captured per second; higher rates preserve more of the signal. Traditional phone audio is 8 kHz; high-quality audio is 16 kHz or higher.
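Sample rate and codec together determine how much data each chunk of audio carries. A rough sketch of the arithmetic for uncompressed or fixed-rate audio; the function names and the 20 ms frame size are illustrative assumptions, though 20 ms is a common packet size.

```python
def frame_bytes(sample_rate_hz: int, bits_per_sample: int, frame_ms: int = 20) -> int:
    """Bytes in one mono audio frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * bits_per_sample // 8


def bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bits per second for a mono stream at a fixed sample size."""
    return sample_rate_hz * bits_per_sample
```

With these, 8 kHz G.711 at 8 bits per sample works out to 64,000 bits per second and 160 bytes per 20 ms frame, while 16 kHz, 16-bit PCM is 640 bytes per frame.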

Jitter. Variation in network packet timing. Causes audio glitches when high.
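One common way to put a number on jitter is the running estimate defined for RTP in RFC 3550: compare the spacing between packets at the receiver with their spacing at the sender, and smooth the absolute difference. A minimal sketch, assuming you already have send and receive timestamps in milliseconds (the function name is hypothetical):

```python
def interarrival_jitter(send_ms: list[float], recv_ms: list[float]) -> float:
    """Smoothed interarrival jitter in the style of RFC 3550.

    For each consecutive packet pair, d is how much the receive spacing
    differs from the send spacing; the estimate moves 1/16 of the way
    toward |d| on every packet.
    """
    jitter = 0.0
    for i in range(1, min(len(send_ms), len(recv_ms))):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter
```

Packets that arrive with exactly the spacing they were sent with give a jitter of zero, no matter how long the one-way delay is.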

Speech recognition

STT. Speech-to-Text. Converts audio to text. Synonym for ASR.

ASR. Automatic Speech Recognition. Older term for STT.

WER. Word Error Rate. The number of substituted, deleted, and inserted words in a transcript, divided by the number of words in the reference. Lower is better; under 5% is good.
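WER is usually computed as a word-level edit distance between a reference transcript and the STT output. A minimal sketch follows; lowercasing and splitting on whitespace are simplifying assumptions, and real evaluations typically normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a five-word reference gives 0.2, or 20%. Because insertions count as errors, WER can exceed 100% when the transcript contains many extra words.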

Streaming STT. STT that returns partial transcripts as audio arrives, instead of waiting for the speaker to finish.

Endpointer. A model that decides when the speaker has finished a turn. Critical for natural conversation.

VAD. Voice Activity Detection. Detects whether someone is speaking right now.
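VAD and endpointing are related but distinct: VAD answers "is there speech in this frame?", while the endpointer answers "is the turn over?". Production systems generally use trained models for both. The sketch below is a deliberately simple stand-in, assuming an energy threshold for VAD and a fixed silence timeout for endpointing; the threshold, frame size, and timeout values are illustrative assumptions.

```python
import math


def frame_rms(samples: list[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: list[int], threshold: float = 500.0) -> bool:
    """Naive energy-based VAD: loud enough counts as speech."""
    return frame_rms(samples) >= threshold


def end_of_turn_index(speech_flags: list[bool], frame_ms: int = 20,
                      silence_ms: int = 600):
    """Index of the frame where the turn is declared over, or None.

    The turn ends once speech has been heard and is followed by
    silence_ms of continuous non-speech frames.
    """
    needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, flag in enumerate(speech_flags):
        if flag:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

The silence timeout is the trade-off to watch: a short one makes the agent cut people off mid-pause, a long one adds dead air to every turn.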

Custom vocabulary. A list of domain-specific words biased into the STT model so it recognizes them correctly.

Diarization. Identifying which speaker said what in a multi-speaker conversation.

Speech synthesis

TTS. Text-to-Speech. Converts text to audio.

Neural TTS. Modern TTS using deep learning. The best systems sound close to human.

Streaming TTS. TTS that produces audio chunks as text arrives, instead of waiting for the full sentence.

Voice cloning. Creating a synthetic voice from a sample of someone speaking.

Time to first audio (TTFA). How long after sending text to TTS before the first audio chunk is available. The single most important TTS metric for voice agents.

Prosody. The rhythm, pitch, and intonation of speech. Modern TTS handles this well; older TTS sounds robotic.

Large language models

LLM. Large Language Model. The reasoning brain inside a voice agent.

Time to first token (TTFT). How long after sending a prompt to the LLM before the first response token. Critical for voice latency.

Prompt caching. A technique where the LLM provider stores the static portion of your prompt and reuses computation across requests. Cuts cost and latency.

System prompt. The instructions that define the agent's role, tone, and rules.

Function calling. When the LLM emits a structured request to call a tool (e.g., lookup_account(id=123)).

Tool use. Synonym for function calling.
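In practice, function calling means your code receives a structured request, runs the matching function, and sends the result back to the LLM. A minimal sketch of that dispatch step follows. The JSON shape with "name" and "arguments" keys is an assumption (each provider has its own format), and this lookup_account body is a hypothetical stub.

```python
import json


def lookup_account(id: int) -> dict:
    # Hypothetical tool: a real one would query a CRM or database.
    return {"id": id, "status": "active"}


# Registry of tools the agent is allowed to call.
TOOLS = {"lookup_account": lookup_account}


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in the request and return its result as JSON."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Report the problem back to the model instead of raising.
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps(result)
```

Looking tools up in an explicit registry, rather than calling whatever name the model emits, keeps the agent limited to functions you chose to expose.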

RAG. Retrieval-Augmented Generation. Pulling relevant docs into the prompt so the LLM can answer with grounded info.

Embedding. A numerical representation of text used for similarity search in RAG.
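The retrieval step in RAG usually comes down to comparing the query's embedding against stored document embeddings and keeping the closest matches. A minimal sketch using cosine similarity follows; the function names are hypothetical, and real embeddings come from an embedding model with hundreds or thousands of dimensions, not hand-written vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], docs: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the text of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional vectors, made up purely for illustration.
DOCS = [
    ("Refund policy: refunds within 30 days", [0.9, 0.1, 0.0]),
    ("Store hours: 9 to 5", [0.0, 0.2, 0.9]),
    ("Billing disputes: how to file", [0.7, 0.6, 0.1]),
]
```

Calling top_k([1.0, 0.2, 0.0], DOCS) ranks the refund and billing entries ahead of store hours; those retrieved texts are what gets placed into the prompt.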

Hallucination. When the LLM makes something up. Bad in voice agents; mitigated with RAG and guardrails.

Guardrails. Rules and filters that prevent the agent from saying things it shouldn't.

Eval. A test that grades the agent's behavior on a representative input. Essential for production voice agents.

Conversation mechanics

Turn-taking. The system that decides whose turn it is to speak.

Barge-in. When the user starts talking while the agent is talking.

Backchannel. Brief utterances ("uh-huh," "right") that signal listening without interrupting.

Latency. Total time between the end of the user's speech and the start of the agent's audio. Under 500 ms is the bar.

P99. The 99th-percentile latency: 99% of responses are at least this fast. Often much worse than the median, which is why an agent that is fast on average can still feel inconsistent.
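To see why the median hides problems, compute both from the same set of turn latencies. A minimal sketch using the nearest-rank definition of a percentile (monitoring tools may interpolate differently); the sample numbers are invented.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it. pct is an integer 1-100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented sample: 98 quick turns and two slow ones, in milliseconds.
latencies_ms = [400] * 98 + [1500, 2500]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median is 400 ms, comfortably under the bar, while P99 is 1500 ms. At that rate, roughly one turn in a hundred takes well over a second to start.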

Telephony and call flow

IVR. Interactive Voice Response. Pre-AI menu systems ("press 1 for sales").

Inbound. Calls coming to the agent.

Outbound. Calls placed by the agent.

Transfer. Routing the call to a human or another agent.

Warm transfer. Transfer with conversation context handed over.

Cold transfer. Transfer with no context, so the human starts from scratch.

Compliance and security

TCPA. Telephone Consumer Protection Act. U.S. law restricting outbound calling.

HIPAA. Health Insurance Portability and Accountability Act. U.S. healthcare privacy law. Affects voice agents in healthcare.

SOC 2. System and Organization Controls 2. An audit standard from the AICPA covering security and related controls. Often required by enterprise buyers.

PII. Personally Identifiable Information. Must be handled carefully.

A2P 10DLC. Application-to-Person 10-Digit Long Code. A U.S. registration framework for business SMS that affects texts sent from voice agents.

Contact center metrics

AHT. Average Handle Time. The average length of a call end to end, typically including talk time, hold time, and after-call work.

FCR. First Contact Resolution. Did the issue get resolved on the first call?

CSAT. Customer Satisfaction score. Usually measured on a 1–5 scale after the call.

Deflection rate. Percentage of calls the AI handled without escalating to a human.
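AHT, FCR, and deflection rate are all simple ratios over call records. A minimal sketch, assuming a hypothetical Call record with the three fields shown; as with "deflection" generally, confirm how your vendor defines each field before comparing numbers.

```python
from dataclasses import dataclass


@dataclass
class Call:
    handle_seconds: int           # talk + hold + after-call work
    resolved_first_contact: bool  # issue closed without a repeat call
    escalated_to_human: bool      # AI handed the call to a person


def average_handle_time(calls: list[Call]) -> float:
    """AHT in seconds."""
    return sum(c.handle_seconds for c in calls) / len(calls)


def fcr_rate(calls: list[Call]) -> float:
    """Share of calls resolved on first contact."""
    return sum(c.resolved_first_contact for c in calls) / len(calls)


def deflection_rate(calls: list[Call]) -> float:
    """Share of calls the AI handled without escalating."""
    return sum(not c.escalated_to_human for c in calls) / len(calls)


# Invented sample data.
CALLS = [
    Call(120, True, False),
    Call(300, True, True),
    Call(180, False, True),
    Call(240, True, False),
]
```

For this sample, AHT is 210 seconds, FCR is 75%, and the deflection rate is 50%.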

CCaaS. Contact Center as a Service. Hosted contact center platforms (Five9, Genesys, etc.).

That's the core 50. Reading the rest of this site, you'll see most of these used in context. For a deeper look at any single one, the linked articles go further.

FAQ

Which terms are most important for non-engineers to know? Latency, deflection rate, AHT, FCR, CSAT, and escalation (handing the call to a human; see Transfer). Those six cover most operational conversations.

Which terms are most important for engineers? TTFT, TTFA, WER, function calling, endpointer, and barge-in. The others can wait.

Where can I learn the telephony side better? Twilio's docs are the standard reference. Even if you're not on Twilio, their explanations of SIP, DTMF, and codecs are clear.

Are these terms standardized? Mostly. Some terms (like "deflection") have slightly different meanings across vendors. Always check what your vendor specifically means.

Voice AI Glossary: 50 Terms You Need to Know

