AI Voice Agent Glossary

43 essential terms for understanding conversational AI, speech technology, voice agent architecture, telephony, compliance, and performance metrics — defined clearly and without jargon.

Core Concepts

AI Voice Agent

An AI-powered software system that conducts real-time phone or voice conversations with humans. Unlike simple chatbots, voice agents combine speech recognition, language understanding, and text-to-speech to hold natural, back-and-forth spoken dialogues. They are used for customer support, lead qualification, appointment booking, and outbound calling.

Conversational AI

The umbrella term for AI systems that can engage in human-like dialogue across text or voice channels. Conversational AI encompasses chatbots, virtual assistants, and voice agents. Modern conversational AI relies on large language models for understanding intent and generating contextually appropriate responses.

Natural Language Processing (NLP)

A branch of AI focused on enabling computers to understand, interpret, and generate human language. NLP powers everything from sentiment analysis to machine translation. In voice agents, NLP processes the transcribed text from speech recognition to extract meaning before generating a response.

Natural Language Understanding (NLU)

A subset of NLP focused specifically on comprehension — determining what a speaker means, not just what they said. NLU handles intent classification (what does the caller want?), entity extraction (which account number, which date?), and contextual disambiguation. It is the critical step between transcription and action in a voice agent pipeline.
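
As a rough illustration of intent classification and entity extraction, the sketch below uses keyword matching and a regular expression. The intent names, the keyword lists, and the assumption that account numbers are 6 to 10 digits long are all hypothetical; a production NLU layer would more likely use a trained classifier or an LLM.

```python
import re
from typing import Optional

# Hypothetical intents and trigger keywords, for illustration only.
INTENT_KEYWORDS = {
    "check_balance": ["balance", "how much do i owe"],
    "book_appointment": ["appointment", "schedule", "book"],
}


def classify_intent(utterance: str) -> str:
    """Return the first intent whose keywords appear in the utterance."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"


def extract_account_number(utterance: str) -> Optional[str]:
    """Pull out a 6 to 10 digit run, assumed here to be an account number."""
    match = re.search(r"\b\d{6,10}\b", utterance)
    return match.group(0) if match else None
```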

Large Language Model (LLM)

A neural network trained on massive text corpora that can generate, summarize, and reason about language. Models like GPT-4, Claude, and Gemini serve as the "brain" of modern voice agents, deciding what to say next based on the conversation history, system prompt, and any retrieved knowledge. LLM quality directly affects how natural and accurate a voice agent sounds.

Agentic AI

AI systems that can autonomously plan, use tools, and take multi-step actions to accomplish goals — not just answer questions. An agentic voice agent might look up a customer record, check inventory, place an order, and schedule a delivery in a single call, all without human intervention. The shift from "chatbot" to "agent" is defined by this ability to act, not just respond.

Multimodal AI

AI systems that can process and generate multiple types of data — text, audio, images, and video — simultaneously. In voice agents, multimodal capabilities allow the AI to interpret tone of voice, send visual confirmations via SMS mid-call, or reference images shared by the caller. OpenAI's GPT-4o and Google's Gemini are examples of natively multimodal models.

Speech Technology

Text-to-Speech (TTS)

Technology that converts written text into spoken audio. Modern neural TTS engines produce voices nearly indistinguishable from human speech, with control over pace, emotion, and emphasis. TTS is the final step in a voice agent pipeline — it turns the LLM's text response into the audio the caller hears.

Speech-to-Text (STT) / Automatic Speech Recognition (ASR)

Technology that converts spoken audio into written text in real time. Also called ASR (Automatic Speech Recognition). STT is the first step in a voice agent pipeline — it transcribes what the caller says so the LLM can process it. Accuracy, speed, and the ability to handle accents, background noise, and domain-specific vocabulary are key differentiators.

Voice Cloning

The process of creating a synthetic voice that mimics a specific person's speech patterns, timbre, and cadence. Typically requires a few minutes to a few hours of sample audio. Businesses use voice cloning to create branded agent voices or replicate a specific spokesperson. Ethical and legal considerations around consent and disclosure are significant.

Neural TTS

Text-to-speech systems powered by deep neural networks rather than older concatenative or parametric methods. Neural TTS produces far more natural-sounding speech with better prosody, intonation, and emotional range. Providers like ElevenLabs, Play.ht, and Cartesia have pushed neural TTS quality to near-human levels.

Streaming TTS

A TTS approach where audio is generated and delivered in small chunks as the text is produced, rather than waiting for the full response to be synthesized. Streaming TTS is critical for low-latency voice agents — it lets the agent start speaking within a few hundred milliseconds of the LLM producing its first tokens, rather than waiting seconds for a complete response.
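
One way to picture streaming is a buffer that forwards text to the TTS engine one sentence at a time while the LLM is still producing tokens. The sketch below is a minimal version of that idea; the function name and the choice of sentence-ending punctuation as the flush point are assumptions, and real systems often flush on shorter phrase boundaries.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a chunk of text each time the buffered tokens complete a sentence."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk could be handed to the TTS engine immediately, so audio for the first sentence can start playing while later sentences are still being generated.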

Voice Activity Detection (VAD)

An algorithm that determines when a speaker is talking versus when there is silence or background noise. VAD is essential for turn-taking in voice agents — it tells the system when the caller has finished speaking so the agent can respond, and when the caller is interrupting so the agent should stop. Poor VAD leads to awkward cross-talk or premature cutoffs.
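
The simplest form of VAD compares the energy of each short audio frame against a threshold. The sketch below assumes audio samples are floats between -1.0 and 1.0 and uses an arbitrary threshold of 0.02; production systems generally use trained models that cope far better with background noise.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def detect_speech(frames: Sequence[Sequence[float]], threshold: float = 0.02) -> List[bool]:
    """Mark each frame as speech (True) or silence (False) by its energy."""
    return [frame_rms(frame) >= threshold for frame in frames]
```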

Echo Cancellation

Signal processing that removes the agent's own audio from the microphone input, preventing feedback loops where the agent hears and responds to itself. Echo cancellation is a solved but non-trivial problem in telephony. Without it, voice agents can enter infinite loops or produce garbled audio.

Barge-in

The ability for a caller to interrupt the AI agent mid-sentence and immediately be heard. When a caller says "actually, never mind" while the agent is still talking, barge-in detection stops the agent's audio and processes the interruption. Good barge-in handling is one of the biggest differentiators between natural and robotic voice agents.

Architecture

Retrieval-Augmented Generation (RAG)

A technique where an LLM retrieves relevant documents or knowledge base entries before generating a response, grounding its answer in real data rather than relying solely on training knowledge. For voice agents, RAG means the agent can look up product specs, policy documents, or customer history mid-conversation and give accurate, specific answers instead of hallucinating.
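
The retrieve-then-generate flow can be sketched with a toy retriever that ranks documents by word overlap with the caller's question. This is only an illustration: the function names and prompt wording are made up, and real RAG systems normally rank with vector embeddings rather than shared words.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_terms = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, documents: List[str]) -> str:
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Caller question: {query}"
    )
```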

Function Calling / Tool Use

The ability of an LLM to invoke external functions or APIs during a conversation — checking a database, booking an appointment, processing a payment, or transferring a call. Function calling is what transforms a voice agent from a conversationalist into an agent that can actually do things. The LLM decides when to call which function and with what parameters.
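
On the application side, function calling usually comes down to mapping the tool name chosen by the LLM to real code. The sketch below assumes the LLM emits a JSON object with `name` and `arguments` fields; that shape, and both example tools, are hypothetical, since each provider defines its own format.

```python
import json


def book_appointment(date: str, time: str) -> dict:
    # Placeholder: a real handler would call a scheduling system.
    return {"status": "booked", "date": date, "time": time}


def transfer_call(department: str) -> dict:
    # Placeholder: a real handler would trigger a telephony transfer.
    return {"status": "transferring", "department": department}


# Registry of the tools the agent is allowed to use.
TOOLS = {"book_appointment": book_appointment, "transfer_call": transfer_call}


def dispatch_tool_call(tool_call_json: str) -> dict:
    """Run the tool the LLM asked for and return its result."""
    call = json.loads(tool_call_json)
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"status": "error", "reason": f"unknown tool {call['name']}"}
    return handler(**call["arguments"])
```

The returned dictionary would typically be fed back to the LLM so it can tell the caller what happened.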

Prompt Engineering

The practice of designing the instructions (system prompt) that govern how an AI agent behaves — its personality, rules, escalation logic, and response style. In voice agents, prompt engineering also covers turn-taking behavior, how to handle silence, when to ask clarifying questions, and how to gracefully end calls. Small prompt changes can dramatically affect agent quality.

Context Window

The maximum amount of text (measured in tokens) that an LLM can consider at once. For voice agents, the context window must hold the system prompt, conversation history, any RAG-retrieved documents, and function call results. Longer calls risk exceeding the window, causing the agent to "forget" earlier parts of the conversation. Modern models offer 128K–1M token windows.
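
A common way to keep a long call inside the window is to drop the oldest turns first. The sketch below assumes a crude estimate of about four characters per token, which is only a rule of thumb; a real deployment would use the model's own tokenizer and might summarize old turns instead of discarding them.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, turns: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent turns that fit alongside the system prompt."""
    budget = max_tokens - estimate_tokens(system_prompt)
    kept: List[str] = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))  # restore chronological order
```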

Guardrails

Safety mechanisms that constrain what an AI agent can say or do. Guardrails include topic restrictions (don't discuss competitors), factual grounding (only cite information from the knowledge base), tone enforcement (stay professional), and action limits (never issue refunds above $500 without human approval). They are the primary defense against hallucination and off-script behavior.
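
Some guardrails are easiest to enforce in ordinary code outside the LLM. The sketch below checks the $500 refund limit from the example above and screens a draft reply for a blocked topic; the function names and the single blocked keyword are illustrative assumptions.

```python
REFUND_LIMIT_USD = 500.0  # limit taken from the example in the text
BLOCKED_TOPICS = ("competitor",)  # hypothetical blocked keyword


def refund_decision(amount_usd: float) -> str:
    """Refunds above the limit are routed to a human for approval."""
    if amount_usd > REFUND_LIMIT_USD:
        return "needs_human_approval"
    return "allowed"


def violates_topic_rules(draft_reply: str) -> bool:
    """Flag a draft reply that mentions a blocked topic."""
    text = draft_reply.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)
```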

Conversation Memory

The system's ability to recall information from earlier in the current conversation or from previous interactions with the same caller. Within-call memory is handled by the context window. Cross-call memory requires storing and retrieving caller profiles or summaries. Memory enables personalization — "I see you called last week about the same issue."

Turn-Taking

The mechanism that manages when the agent speaks and when it listens. Natural turn-taking in human conversation involves subtle cues such as pauses, intonation changes, and filler words. Voice agents use VAD, silence detection, and endpoint detection to approximate this. Poor turn-taking is among the most common complaints about AI phone agents.
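
Endpoint detection can be approximated by waiting for a sustained run of silence after the caller has spoken. The sketch below works on per-frame speech flags such as a VAD would produce; the 20 ms frame size and 600 ms silence timeout are assumed values, and real systems tune them or use models that also weigh what was said.

```python
from typing import Optional, Sequence


def find_end_of_turn(
    speech_flags: Sequence[bool],
    frame_ms: int = 20,
    silence_ms: int = 600,
) -> Optional[int]:
    """Return the frame index where the caller's turn is judged to be over.

    The turn ends once speech has been heard and is followed by
    silence_ms of continuous silence. Returns None if that never happens.
    """
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # a short pause does not end the turn
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None
```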

Latency

The delay between when a caller finishes speaking and when the agent begins responding. In phone conversations, humans expect responses within 300–800 milliseconds. Latency in voice agents is roughly the sum of end-of-speech detection delay, STT processing time, LLM inference time, and TTS generation time, plus network transit between those components. Sub-second latency requires streaming architectures, edge deployment, and careful model selection.
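
The latency budget is simple addition, which the sketch below makes explicit. The component timings in the usage example are made-up numbers rather than measurements, and the optional endpointing and network terms are an assumption about what else a deployment may need to count.

```python
TARGET_MAX_MS = 800  # upper end of the 300-800 ms range cited above


def response_latency_ms(
    stt_ms: float,
    llm_ms: float,
    tts_ms: float,
    endpointing_ms: float = 0,
    network_ms: float = 0,
) -> float:
    """Add up the stages between the caller stopping and the agent starting."""
    return stt_ms + llm_ms + tts_ms + endpointing_ms + network_ms


def within_target(total_ms: float) -> bool:
    """True if the total stays inside the conversational target."""
    return total_ms <= TARGET_MAX_MS
```

For example, 150 ms of STT, 350 ms to the first LLM token, and 200 ms to the first TTS audio add up to 700 ms, which is inside the target; adding 200 ms of network transit pushes the total to 900 ms and outside it.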

Telephony & Integration

SIP (Session Initiation Protocol)

The standard protocol for initiating, maintaining, and terminating voice calls over IP networks. SIP is the backbone of modern telephony infrastructure. Voice agent platforms connect to the phone network via SIP trunks, which route calls between the PSTN (traditional phone network) and the internet. SIP configuration affects call quality, routing, and failover.

WebRTC

Web Real-Time Communication — an open standard for peer-to-peer audio, video, and data streaming directly in web browsers. Voice agents use WebRTC for browser-based voice interactions (widget on a website, in-app calling) without requiring phone numbers or telephony infrastructure. WebRTC typically offers lower latency than PSTN calls.

IVR (Interactive Voice Response)

The legacy system of pre-recorded prompts and touchtone menus ("Press 1 for billing, press 2 for support"). IVR has been the standard for automated phone handling since the 1980s. AI voice agents are rapidly replacing IVR because they understand natural speech, handle complex requests, and resolve calls instead of just routing them. IVR completion rates average below 30%; AI agents achieve 60–80%.

DTMF

Dual-Tone Multi-Frequency — the technical name for touchtone signals generated when pressing phone keys. Many legacy systems still require DTMF input for authentication or navigation. Voice agents need to detect and sometimes generate DTMF tones when interacting with other automated phone systems or when callers prefer keypad input.
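
Each key is encoded as one low-frequency and one high-frequency tone played together. The sketch below uses the standard DTMF keypad frequencies to look up a key's pair and synthesize the tone; the 8 kHz sample rate and 100 ms duration are assumed defaults, and the function names are illustrative.

```python
import math
from typing import List, Tuple

ROW_FREQS_HZ = (697, 770, 852, 941)       # low-frequency group, one per row
COL_FREQS_HZ = (1209, 1336, 1477, 1633)   # high-frequency group, one per column
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> Tuple[int, int]:
    """Return the (low, high) frequency pair for a single keypad character."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return ROW_FREQS_HZ[row], COL_FREQS_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key: str, duration_s: float = 0.1, sample_rate: int = 8000) -> List[float]:
    """Synthesize the tone for a key as floats in the range -1.0 to 1.0."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]
```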

Twilio

A cloud communications platform that provides APIs for phone calls, SMS, and video. Twilio is the most common telephony provider used by voice agent platforms to make and receive phone calls. It handles number provisioning, call routing, recording, and PSTN connectivity. Many voice agent deployments are "Bring Your Own Twilio" — customers use their existing Twilio account.

Call Transfer / Warm Handoff

The process of connecting a caller to a human agent when the AI cannot resolve the issue. A cold transfer simply forwards the call. A warm handoff transfers the call along with a summary of the conversation so far, so the human agent has full context. Warm handoff quality is critical — a bad transfer experience negates all the goodwill built by the AI portion of the call.

Branded Caller ID

Technology that displays a business name, logo, and call reason on the recipient's phone screen instead of just a phone number. Branded caller ID (via STIR/SHAKEN attestation and rich call data) dramatically improves answer rates for outbound AI calls — from roughly 15% for unknown numbers to 40%+ with branding. It also signals legitimacy and reduces spam flagging.

Business & Compliance

HIPAA

The Health Insurance Portability and Accountability Act — U.S. federal law governing the privacy and security of protected health information (PHI). Voice agents handling healthcare calls (appointment scheduling, prescription refills, symptom triage) must be HIPAA-compliant, which requires encryption, access controls, audit logging, and a Business Associate Agreement with every vendor in the chain.

TCPA

The Telephone Consumer Protection Act, a U.S. federal law regulating telemarketing calls, auto-dialed calls, and prerecorded messages. TCPA requires prior express written consent for marketing calls to mobile phones, limits calling hours, and mandates opt-out mechanisms. AI voice agents making outbound calls must comply with TCPA or face statutory damages of $500 per violation, rising to as much as $1,500 per violation when the violation is willful or knowing.
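
A calling-hours check is one of the easier TCPA rules to automate. The sketch below assumes the commonly cited window of 8 a.m. to 9 p.m. in the called party's local time, a detail not stated above that should be confirmed against current federal and state rules. It also assumes the caller's local time has already been worked out, and it conservatively treats 9:00 p.m. exactly as outside the window.

```python
from datetime import datetime, time

EARLIEST_CALL = time(8, 0)   # assumed window start, recipient's local time
LATEST_CALL = time(21, 0)    # assumed window end, recipient's local time


def within_calling_hours(recipient_local_time: datetime) -> bool:
    """True if an outbound call at this local time falls inside the window."""
    return EARLIEST_CALL <= recipient_local_time.time() < LATEST_CALL
```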

SOC 2

System and Organization Controls 2, an auditing framework from the AICPA that evaluates a service provider's controls for security, availability, processing integrity, confidentiality, and privacy. Enterprise buyers typically require a SOC 2 Type II report from their voice agent vendor. Strictly speaking, SOC 2 is an attestation by an independent auditor rather than a certification; a Type II report shows that security controls operated effectively over a sustained period, usually 6–12 months.

GDPR

The General Data Protection Regulation, the European Union's data protection law. GDPR affects voice agents that interact with EU residents, requiring a lawful basis for processing (for call recording, often the caller's consent), data minimization, and the right to erasure. GDPR also restricts transfers of personal data outside the EU, so voice agent platforms serving EU markets commonly store recordings and transcripts within the EU to meet data residency expectations.

Business Associate Agreement (BAA)

A legal contract required under HIPAA between a healthcare provider (covered entity) and any vendor that handles protected health information on their behalf. Before deploying a voice agent in healthcare, the platform vendor, LLM provider, TTS provider, and telephony provider must all sign BAAs. A missing BAA anywhere in the chain creates a compliance gap.

Do Not Call (DNC)

A registry maintained by the FTC where consumers can register phone numbers to opt out of telemarketing calls. AI voice agents making outbound calls must scrub their call lists against the National DNC Registry and any internal DNC lists. Calling a number on the DNC list can result in civil penalties of up to $51,744 per call under the Telemarketing Sales Rule; the maximum is adjusted annually for inflation, so check the current figure.
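
Scrubbing a call list amounts to normalizing every number to a common format and removing any that appear on either DNC list. The sketch below assumes US numbers and that both lists are already loaded into memory; how the National DNC Registry data is obtained is outside its scope, and the function names are illustrative.

```python
from typing import Iterable, List


def normalize(number: str) -> str:
    """Reduce a US phone number to its 10 digits."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits


def scrub(
    call_list: Iterable[str],
    national_dnc: Iterable[str],
    internal_dnc: Iterable[str],
) -> List[str]:
    """Return only the numbers that appear on neither DNC list."""
    blocked = {normalize(n) for n in national_dnc} | {normalize(n) for n in internal_dnc}
    return [n for n in call_list if normalize(n) not in blocked]
```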

Opt-in / Opt-out

Consent mechanisms for receiving automated calls or messages. Opt-in means the consumer explicitly agreed to be contacted (required for most marketing calls under TCPA). Opt-out means the consumer can revoke consent at any time, and the system must immediately honor that request. Voice agents must detect verbal opt-out requests ("take me off your list") and process them in real time.
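
A first-pass detector for verbal opt-outs can be as simple as phrase matching on the transcript. The phrase list below is a hypothetical starting point, not a complete or legally vetted one; a deployed agent would likely combine it with LLM-based intent detection to catch less predictable wording.

```python
# Hypothetical phrases that signal a caller wants no further contact.
OPT_OUT_PHRASES = (
    "take me off your list",
    "stop calling",
    "do not call",
    "don't call",
    "remove me",
    "unsubscribe",
)


def is_opt_out(utterance: str) -> bool:
    """True if the transcribed utterance contains an opt-out phrase."""
    text = utterance.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(phrase in text for phrase in OPT_OUT_PHRASES)
```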

Metrics & Analytics

Containment Rate

The percentage of calls fully resolved by the AI agent without transferring to a human. A containment rate of 70% means 7 out of 10 calls are handled end-to-end by AI. This is the single most important metric for measuring voice agent effectiveness. Well-configured agents in common use cases achieve 60–85% containment; poorly configured ones fall below 30%.
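
The calculation behind the metric is a single ratio, shown here as a small sketch with illustrative parameter names.

```python
def containment_rate(total_calls: int, transferred_calls: int) -> float:
    """Share of calls handled by the AI without a transfer to a human."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return (total_calls - transferred_calls) / total_calls
```

With 1,000 calls and 300 transfers, the function returns 0.7, that is, 70% containment.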

First Call Resolution (FCR)

The percentage of calls where the customer's issue is completely resolved during the first interaction, with no need for a callback or follow-up. FCR measures outcome quality, not just containment — a call can be "contained" by the AI but not truly resolved. High-performing voice agents achieve FCR rates comparable to experienced human agents (70–80%).
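
The gap between containment and resolution is easier to see with per-call records. In the sketch below, the `CallRecord` fields are hypothetical; a real system would derive them from call outcomes and follow-up contacts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallRecord:
    transferred_to_human: bool
    resolved: bool
    needed_follow_up: bool


def containment(calls: List[CallRecord]) -> float:
    """Share of calls that were never transferred to a human."""
    return sum(1 for c in calls if not c.transferred_to_human) / len(calls)


def first_call_resolution(calls: List[CallRecord]) -> float:
    """Share of calls resolved in the first interaction with no follow-up."""
    return sum(1 for c in calls if c.resolved and not c.needed_follow_up) / len(calls)
```

A call that stays with the AI but leaves the issue open raises containment without raising FCR, which is why the two numbers should be tracked together.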

Average Handle Time (AHT)

The average duration of a call from start to finish, including hold time and after-call work. AI voice agents typically reduce AHT by 30–60% compared to human agents because they don't need to search for information, they process requests in parallel, and they don't have after-call documentation overhead. Lower AHT means lower cost per interaction.
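
AHT is conventionally computed as total talk time plus hold time plus after-call work, divided by the number of calls handled. A minimal sketch, with all durations in seconds and illustrative parameter names:

```python
def average_handle_time(
    talk_seconds: float,
    hold_seconds: float,
    after_call_work_seconds: float,
    calls_handled: int,
) -> float:
    """Average handle time per call, in seconds."""
    if calls_handled <= 0:
        raise ValueError("calls_handled must be positive")
    return (talk_seconds + hold_seconds + after_call_work_seconds) / calls_handled
```

For 100 calls with 24,000 seconds of talk time, 3,000 seconds on hold, and 3,000 seconds of after-call work, AHT comes to 300 seconds per call.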

Sentiment Analysis

The automated detection of a caller's emotional state — frustrated, satisfied, confused, angry — from their speech patterns, word choice, and tone. Voice agents use real-time sentiment analysis to adapt their behavior: slowing down when a caller is confused, escalating to a human when frustration rises, or adjusting tone when the caller is upset. Post-call sentiment scoring helps identify systemic issues.

Conversation Intelligence

Analytics platforms that analyze voice agent conversations at scale to extract insights — common caller questions, resolution patterns, drop-off points, agent performance trends, and emerging issues. Conversation intelligence transforms call data from a cost center into a strategic asset, revealing what customers actually want and where the agent experience breaks down.

A/B Testing for CX

Running controlled experiments on voice agent configurations — different prompts, voices, conversation flows, or escalation thresholds — and measuring which variant produces better outcomes. A/B testing is how high-performing voice agent deployments continuously improve. Metrics compared typically include containment rate, customer satisfaction (CSAT), average handle time, and conversion rate.
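
To judge whether a difference between two variants is more than noise, one common choice is a two-proportion z-test on a rate metric such as containment. The sketch below is one such test using only the standard library; it is not the only valid approach, and it assumes calls were randomly assigned to variants and that sample sizes are large enough for the normal approximation.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, calls_a: int, successes_b: int, calls_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates."""
    rate_a = successes_a / calls_a
    rate_b = successes_b / calls_b
    pooled = (successes_a + successes_b) / (calls_a + calls_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / calls_a + 1 / calls_b))
    if std_err == 0:
        raise ValueError("rates are all 0% or 100%; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value (0.05 is a conventional cutoff) suggests the variants genuinely differ; identical rates give z = 0 and a p-value of 1.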
