Conversation tagging is what turns thousands of AI-handled calls into actionable insight. Every call should get tagged with intent, outcome, sentiment, and any anomalies — automatically, consistently, and in a way that supports both real-time routing and after-the-fact analytics. The tag schema is more important than most teams realize.

TL;DR

A good tagging schema captures intent, outcome, sentiment, and anomalies on every call.
Use a controlled vocabulary; don't let tags proliferate.
Auto-tag with the LLM; spot-check with humans.
Tags drive routing (real-time) and analytics (after the fact).

Why tag conversations

Three uses:

Routing. Tags trigger downstream actions: assign to a specialist queue, escalate, follow up.

Analytics. Aggregate tags reveal patterns (intent volume, escalation reasons, sentiment trends).

Quality. Tagged calls support targeted QA review (for example, sample every call tagged sentiment=negative).

Tags are the bridge between individual calls and operational insight.

What to tag

Six categories most teams use:

1. Intent (mandatory). What was the call about? "order_status", "password_reset", "billing_inquiry", "complaint".

2. Sub-intent (helpful). More specific. "billing_inquiry" → "billing_inquiry/auto_renewal", "billing_inquiry/refund_request".

3. Outcome. What happened? "resolved", "escalated", "abandoned", "voicemail_left".

4. Sentiment. "positive", "neutral", "negative".

5. Resolution method. "agent_resolved", "agent_escalated", "self_serve_redirect".

6. Flags (optional). Things worth attention. "anger_detected", "compliance_concern", "feature_request_mentioned".
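The six categories above can be held in one record per call. The following is a minimal sketch, not a prescribed schema: the class name CallTags and the rule that a sub-intent must start with its parent intent are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallTags:
    """One tag record per call, mirroring the six categories (hypothetical shape)."""
    intent: str                                     # mandatory, e.g. "billing_inquiry"
    outcome: str                                    # e.g. "resolved"
    sentiment: str                                  # "positive", "neutral", or "negative"
    sub_intent: Optional[str] = None                # e.g. "billing_inquiry/refund_request"
    resolution_method: Optional[str] = None         # e.g. "agent_resolved"
    flags: List[str] = field(default_factory=list)  # optional attention markers

    def __post_init__(self) -> None:
        # Assumed rule: a sub-intent is written as "<intent>/<detail>".
        if self.sub_intent and not self.sub_intent.startswith(self.intent + "/"):
            raise ValueError(
                f"sub_intent {self.sub_intent!r} does not belong to intent {self.intent!r}"
            )


example = CallTags(
    intent="billing_inquiry",
    sub_intent="billing_inquiry/refund_request",
    outcome="escalated",
    sentiment="negative",
    resolution_method="agent_escalated",
    flags=["anger_detected"],
)
```

Keeping intent, outcome, and sentiment as required fields means a record cannot be stored without them, while the optional categories default to empty.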

The taxonomy problem

Without discipline, tag taxonomies grow uncontrollably. After a year you can end up with 500 tags, half of them duplicates, and reporting that is useless.

Four practices keep the taxonomy under control:

Controlled vocabulary. Tags come from a predefined list. New tags require a process to add.

Hierarchical. Use intent / sub-intent rather than flat tag lists.

Periodic cleanup. Quarterly review; merge duplicates, retire unused tags.

Naming convention. snake_case, lowercase, no spaces. Consistent.
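The controlled vocabulary and the naming convention can both be enforced in code before a tag is stored. This is a sketch under stated assumptions: the regular expression is one possible reading of "snake_case, lowercase, no spaces" (with "/" allowed once, between intent and sub-intent), and ALLOWED_INTENTS is a placeholder list built from the examples earlier in this article.

```python
import re
from typing import FrozenSet, List

# Assumed convention: lowercase snake_case, with an optional "/sub_intent" part.
_SNAKE = r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*"
TAG_PATTERN = re.compile(rf"{_SNAKE}(?:/{_SNAKE})?")

# Placeholder vocabulary; a real one would be loaded from wherever it is maintained.
ALLOWED_INTENTS: FrozenSet[str] = frozenset(
    {"order_status", "password_reset", "billing_inquiry", "complaint"}
)


def is_well_formed(tag: str) -> bool:
    """True when the tag follows the naming convention."""
    return TAG_PATTERN.fullmatch(tag) is not None


def check_tag(tag: str, allowed: FrozenSet[str]) -> List[str]:
    """Return a list of problems with the tag; an empty list means it is acceptable."""
    problems = []
    if not is_well_formed(tag):
        problems.append("not snake_case")
    if tag not in allowed:
        problems.append("not in controlled vocabulary")
    return problems
```

For example, check_tag("billing_question", ALLOWED_INTENTS) reports that the tag is well formed but not in the vocabulary, which is the point at which the process for adding a new tag would start.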

Auto-tagging

Tag every call automatically. Three approaches:

Rule-based. If the agent called function X, tag the call as intent Y. Works for clear cases.

LLM-based. After the call, an LLM reads the transcript and assigns tags from the vocabulary. Most flexible.

Hybrid. Rules for clear cases; LLM for the rest.

LLM-based is most common in 2026. Cost: roughly $0.001 per call, though the exact figure depends on the model and on transcript length.
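A hybrid tagger can be sketched as rules first, LLM second. Everything named here is hypothetical: the function names in FUNCTION_TO_INTENT stand in for whatever tools your agent calls, and llm_tagger is any callable that takes a transcript and returns an intent from the vocabulary.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from agent function calls to intents.
FUNCTION_TO_INTENT: Dict[str, str] = {
    "lookup_order": "order_status",
    "send_password_reset": "password_reset",
}


def rule_based_intent(functions_called: List[str]) -> Optional[str]:
    """Return an intent only when exactly one rule matches (a clear case)."""
    matches = {FUNCTION_TO_INTENT[f] for f in functions_called if f in FUNCTION_TO_INTENT}
    return matches.pop() if len(matches) == 1 else None


def hybrid_intent(
    functions_called: List[str],
    transcript: str,
    llm_tagger: Callable[[str], str],
) -> str:
    """Use the rule when the case is clear; otherwise fall back to the LLM tagger."""
    intent = rule_based_intent(functions_called)
    if intent is not None:
        return intent
    return llm_tagger(transcript)
```

Calls that match zero rules or several conflicting rules go to the LLM, so the rules only ever handle the unambiguous cases.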

How to tag with an LLM

Prompt:

Read this call transcript and assign tags from the
following vocabulary:

Intents: [list of allowed intents]
Sub-intents: [list of allowed sub-intents, or null if none applies]
Outcomes: [list of allowed outcomes]
Sentiments: positive, neutral, negative
Flags: [list of allowed flags]

Return only valid JSON in this shape:
{
  "intent": "...",
  "sub_intent": "...",
  "outcome": "...",
  "sentiment": "...",
  "flags": ["..."]
}

Transcript:
[transcript]

Run after every call. Store the result in your call record.
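Before the result is stored, it is worth checking that the model's reply is valid JSON and stays inside the vocabulary. The sketch below builds the prompt and validates the reply. The model call itself is left out because it depends on your provider, and the VOCAB contents are placeholders drawn from the examples in this article.

```python
import json

# Placeholder vocabulary; replace with your own controlled lists.
VOCAB = {
    "intent": ["order_status", "password_reset", "billing_inquiry", "complaint"],
    "outcome": ["resolved", "escalated", "abandoned", "voicemail_left"],
    "sentiment": ["positive", "neutral", "negative"],
    "flags": ["anger_detected", "compliance_concern", "feature_request_mentioned"],
}


def build_prompt(transcript: str) -> str:
    """Fill the tagging prompt with the allowed values and the transcript."""
    return (
        "Read this call transcript and assign tags from the following vocabulary:\n\n"
        f"Intents: {', '.join(VOCAB['intent'])}\n"
        f"Outcomes: {', '.join(VOCAB['outcome'])}\n"
        f"Sentiments: {', '.join(VOCAB['sentiment'])}\n"
        f"Flags: {', '.join(VOCAB['flags'])}\n\n"
        'Return only valid JSON with the keys "intent", "sub_intent", '
        '"outcome", "sentiment", and "flags".\n\n'
        f"Transcript:\n{transcript}"
    )


def parse_tags(raw_reply: str) -> dict:
    """Parse the model's JSON reply and reject anything outside the vocabulary."""
    tags = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed output
    for key in ("intent", "outcome", "sentiment"):
        if tags.get(key) not in VOCAB[key]:
            raise ValueError(f"{key}={tags.get(key)!r} is not in the vocabulary")
    unknown = [f for f in tags.get("flags", []) if f not in VOCAB["flags"]]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    return tags
```

Rejecting out-of-vocabulary values at parse time keeps invented tags out of the call record, so a failed reply can be retried or sent for manual tagging.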

Real-time vs post-call tagging

Post-call is the default. Wait for the call to end, then have the LLM tag the full transcript. This works for analytics and async workflows.

Real-time is sometimes needed, for example to detect sentiment during the call and trigger escalation when anger is detected. Tags update as the call progresses.

Real-time tagging adds latency and complexity. Use only for cases that genuinely need real-time response.
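As an illustration of tags that update during a call, the sketch below folds in a sentiment label for each caller utterance and signals escalation once. It rests on simplifying assumptions: a separate, unspecified classifier supplies the per-utterance label, and "anger" is approximated as two consecutive negative utterances, a threshold chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LiveCallTags:
    """Running tags for a call in progress (hypothetical structure)."""
    negative_streak_to_escalate: int = 2  # assumed threshold, tune for your traffic
    sentiment: str = "neutral"
    flags: List[str] = field(default_factory=list)
    _streak: int = field(default=0, init=False, repr=False)

    def update(self, utterance_sentiment: str) -> bool:
        """Fold in one utterance; return True the first time escalation should fire."""
        self.sentiment = utterance_sentiment
        if utterance_sentiment == "negative":
            self._streak += 1
        else:
            self._streak = 0
        should_flag = (
            self._streak >= self.negative_streak_to_escalate
            and "anger_detected" not in self.flags
        )
        if should_flag:
            self.flags.append("anger_detected")
        return should_flag
```

The classifier that produces each label is where the added latency comes from, which is why this path is worth building only for cases that need an in-call response.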

Validating tag accuracy

Auto-tagging is noisier than human tagging, so validate it periodically:

Pick 50 random tagged calls per week.
Manually re-tag.
Compare to the auto-tags.
Track agreement rate (target: 85%+).

If agreement drops, revise the tagging prompt or update the vocabulary.
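The weekly check above amounts to a random sample and an agreement rate. In this sketch, a call counts as agreeing only when the auto-tags and the human re-tags match on every compared field. That is one strict definition among several, and the choice of fields is an assumption.

```python
import random
from typing import Dict, List, Optional, Sequence


def sample_for_review(call_ids: Sequence[str], k: int = 50, seed: Optional[int] = None) -> List[str]:
    """Pick up to k distinct calls at random for manual re-tagging."""
    ids = list(call_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def agreement_rate(
    auto_tags: List[Dict[str, str]],
    human_tags: List[Dict[str, str]],
    fields: Sequence[str] = ("intent", "outcome", "sentiment"),
) -> float:
    """Fraction of sampled calls where auto-tags equal human tags on all fields."""
    if len(auto_tags) != len(human_tags):
        raise ValueError("auto_tags and human_tags must cover the same calls")
    if not auto_tags:
        raise ValueError("no calls to compare")
    matches = sum(
        all(auto.get(f) == human.get(f) for f in fields)
        for auto, human in zip(auto_tags, human_tags)
    )
    return matches / len(auto_tags)
```

Computing the rate per field as well (by passing a single field name in fields) shows whether intent, outcome, or sentiment is the one dragging agreement below the 85% target.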

Using tags for routing

Real-time examples:

sub_intent=billing_inquiry/refund_request + amount>100: Auto-escalate to human reviewer.

sentiment=negative + outcome=escalated: Route to most experienced agent.

flag=compliance_concern: Flag for compliance team review.

These rules turn tags into action.
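The three example rules can be written as a small function that maps tags to actions. The action names and the refund_amount parameter are hypothetical. The amount is assumed to come from the order system rather than from the tags, and the refund check accepts the tag as either an intent or a sub-intent because taxonomies differ on where it lives.

```python
from typing import List, Optional


def route(tags: dict, refund_amount: Optional[float] = None) -> List[str]:
    """Return the downstream actions triggered by a call's tags (hypothetical names)."""
    actions = []

    # Refund requests may be recorded as an intent or as a sub-intent.
    is_refund_request = (
        tags.get("intent") == "refund_request"
        or (tags.get("sub_intent") or "").endswith("/refund_request")
    )
    if is_refund_request and refund_amount is not None and refund_amount > 100:
        actions.append("escalate_to_human_reviewer")

    if tags.get("sentiment") == "negative" and tags.get("outcome") == "escalated":
        actions.append("route_to_senior_agent")

    if "compliance_concern" in tags.get("flags", []):
        actions.append("notify_compliance_team")

    return actions
```

Returning a list rather than a single action lets one call trigger several rules at once, such as an angry escalation that also carries a compliance flag.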

Using tags for analytics

Aggregate views:

Volume by intent. Top 10 intents / week. Trend over time.

Resolution rate by intent. Which intents the AI resolves vs escalates.

Sentiment by intent. Which intents produce negative sentiment most often.

Escalation reasons. Why are calls escalating? Patterns?

Anomaly detection. A sudden spike in any tag is worth investigating.
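Several of these views are simple counts over the stored tag records. The sketch below assumes each call is a dict with "intent" and "outcome" keys. The spike check is a deliberately crude heuristic (this period's count at least double the baseline, with a minimum volume), and both thresholds are assumptions to tune.

```python
from collections import Counter
from typing import Dict, List, Tuple


def volume_by_intent(calls: List[dict], top_n: int = 10) -> List[Tuple[str, int]]:
    """Top intents by call count."""
    return Counter(c["intent"] for c in calls).most_common(top_n)


def resolution_rate_by_intent(calls: List[dict]) -> Dict[str, float]:
    """Share of calls per intent whose outcome is 'resolved'."""
    totals: Counter = Counter()
    resolved: Counter = Counter()
    for call in calls:
        totals[call["intent"]] += 1
        if call["outcome"] == "resolved":
            resolved[call["intent"]] += 1
    return {intent: resolved[intent] / totals[intent] for intent in totals}


def spiking_tags(
    current: Counter, baseline: Counter, ratio: float = 2.0, min_count: int = 10
) -> List[str]:
    """Tags whose current count is at least `ratio` times the baseline count."""
    return sorted(
        tag
        for tag, count in current.items()
        if count >= min_count and count >= ratio * baseline.get(tag, 0)
    )
```

A tag that is absent from the baseline is reported as soon as it reaches min_count, which also surfaces newly introduced tags.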

Common tagging mistakes

Too many tags per call. A call gets 15 tags; meaning is diluted. Limit to 5-7.

Inconsistent vocabulary. "billing", "billing_question", "billing_issue" all in use. Pick one.

No process for new tags. Engineers add tags ad-hoc; vocabulary explodes.

Tags without action. Tagged but never used in routing or analytics. Useless.

No validation. Auto-tags assumed accurate; never checked.
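The inconsistent-vocabulary mistake above is usually repaired with a merge map that points each retired tag at its canonical replacement. This sketch assumes "billing_inquiry" is the name being kept and that call records are dicts with an "intent" key. The same map can be reused when re-tagging historical calls.

```python
from typing import Dict

# Retired tag -> canonical tag. The choice of canonical name is an assumption.
MERGE_MAP: Dict[str, str] = {
    "billing": "billing_inquiry",
    "billing_question": "billing_inquiry",
    "billing_issue": "billing_inquiry",
}


def canonical(tag: str) -> str:
    """Map a retired tag to its canonical name; leave other tags unchanged."""
    return MERGE_MAP.get(tag, tag)


def retag(record: dict) -> dict:
    """Return a copy of a call record with its intent replaced by the canonical tag."""
    updated = dict(record)
    updated["intent"] = canonical(record["intent"])
    return updated
```

Because retag returns a copy, the original record stays available for auditing until the migration has been verified.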

A reasonable starter taxonomy

For a typical ecommerce AI agent:

Intents:

order_status
returns
refunds
shipping_change
account_update
password_reset
product_question
complaint
compliment
general_info

Outcomes:

resolved
escalated
abandoned
voicemail_left
transferred

Flags:

anger_detected
compliance_concern
urgent
vip_customer
recurring_caller

Start narrow; expand based on actual call patterns.

For more on the broader measurement framework, see how to measure voice agent quality.

FAQ

How many intents should I have? 20-50 for most businesses. More than that usually means the taxonomy is over-segmented.

Should every call have a sentiment tag? Yes, even when the sentiment is neutral. Tagging every call makes it easier to filter for the negative ones.

Can I change the taxonomy after deployment? Yes — but plan for re-tagging historical calls if the changes are significant.

What about multilingual tagging? Use the same tag vocabulary regardless of call language. The LLM tagger can handle multilingual transcripts.

How does this work with cross-channel data? Use the same taxonomy across voice, chat, and email so that the analytics are comparable.
