Diarization: Knowing Who's Speaking in a Voice Conversation
Speaker diarization is the task of answering "who spoke when?" Given audio with multiple speakers, diarization outputs time-stamped segments labeled by speaker. For most voice agent use cases — one caller, one agent — diarization is trivial (channel-based separation works). But when the scene gets complicated (multi-party calls, ambient voices, families on a shared phone), diarization becomes real infrastructure. This piece covers when you need it and how to handle it.
TL;DR
- Most voice agents don't need diarization — channel separation suffices.
- Where it matters: multi-person calls, shared phones, ambient background speech, legal/compliance recording.
- Modern diarization is decent but imperfect — 5-15% speaker error rate typical.
- Real-time diarization is harder than offline.
- Combine with voice biometrics where identification (not just separation) matters.
Diarization vs identification
Diarization: "There were two speakers; speaker A said X, speaker B said Y."
Doesn't know who A and B are, just that they're different.
Identification: "Speaker A is Jamie, speaker B is Michael."
Adds identity — requires voice biometrics or prior recordings.
The channel-separation shortcut
For most voice agents:
- Caller is on one RTP stream (ingress).
- Agent is on another stream (egress).
- Separation is mechanical.
No diarization algorithm needed.
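As a rough sketch of that shortcut: if each stream is transcribed on its own, labeling speakers is only a merge by timestamp. The `Segment` type and `label_by_channel` helper below are illustrative names, not part of any provider's API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds from call start
    end: float
    text: str

def label_by_channel(caller_segments, agent_segments):
    """Merge per-stream transcripts into one time-ordered, speaker-labeled list."""
    labeled = [("caller", s) for s in caller_segments]
    labeled += [("agent", s) for s in agent_segments]
    # The speaker label comes from the stream, not from the audio content.
    return sorted(labeled, key=lambda pair: pair[1].start)

conversation = label_by_channel(
    [Segment(0.0, 3.2, "Hi, I'm calling about my account.")],
    [Segment(3.3, 6.1, "Sure, can I get your name?")],
)
for speaker, seg in conversation:
    print(f"{seg.start:.1f}-{seg.end:.1f}s {speaker}: {seg.text}")
```

This only holds while one person is on each stream. A second voice on the caller's stream is exactly the case where real diarization starts to matter.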
When diarization matters
Multi-party calls. Family on speaker: "Mom says Thursday works."
Shared phones. Husband and wife sharing a line; different account access.
Ambient speech. Call from an office with background conversation.
Legal / compliance. Multi-party recording where who-said-what matters.
Conference bridge. Multiple external parties.
Real-time challenges
Offline diarization has access to the full audio. Real-time:
- Limited context.
- Must make decisions on the fly.
- Higher error rate.
2026 real-time diarization error: 10-25% typical.
Providers
- Deepgram Diarize.
- AssemblyAI Diarization.
- Pyannote (open-source).
- AWS Transcribe speaker identification.
- Azure Speaker Recognition.
Quality varies. Test on your audio.
The voice biometric layer
For identification (not just separation):
- Voice biometrics compares current voice to stored profile.
- Registered user profiles pre-enrolled.
- Recognition confidence per speaker.
Voice cloning complicates this: an attacker with a cloned voice can pass the biometric check.
See How AI Support Agents Should Handle Account Verification.
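A minimal sketch of the profile comparison, assuming a speaker-embedding model has already turned the live audio and the enrolled profile into fixed-length vectors. The function names and both thresholds are hypothetical and would need tuning on real audio.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_speaker(live_embedding, stored_profile, accept=0.80, reject=0.50):
    """Compare a live voice embedding to one stored profile.

    Returns (score, decision). The thresholds are illustrative assumptions.
    """
    score = cosine_similarity(live_embedding, stored_profile)
    if score >= accept:
        # Treat a match as one signal only: a cloned voice can also score high.
        return score, "match"
    if score <= reject:
        return score, "no_match"
    return score, "uncertain"

print(verify_speaker([1.0, 0.0], [1.0, 1.0]))
```

Because a cloned voice can land in the "match" band, a design like this would typically pair the score with a second factor rather than grant access on voice alone.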
Use cases in depth
Family account on shared phone.
- Caller identifies: "This is Mom."
- Voice biometric confirms.
- Agent proceeds with that person's account.
Multi-language family.
- Some family members speak Spanish, others English.
- Agent responds in language of primary speaker per turn.
Conference calls with multiple decision-makers.
- Agent captures which opinion belongs to whom.
- Structured meeting notes: "CTO said yes; CFO said need more info."
Challenges
Similar voices. Brothers, twins, speakers of the same gender. High error rate.
Accented speakers in the same call. Training data may not cover their accents.
Noisy environments. Background talk confuses diarization.
Overlap. Two people talking at once. Hard for any system.
Accuracy metrics
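Diarization Error Rate (DER). Share of speech time that is diarized incorrectly: missed speech, false alarms, and speaker confusion added together, divided by the total reference speech time.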
- Under 10% DER: good.
- 10-20% DER: acceptable for some use cases.
- Over 20% DER: error-prone.
Missed speech. Diarization missed speech entirely.
False alarms. Diarization thought speech was present when it was not.
Speaker confusion. Speech attributed to the wrong speaker.
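The three error types add up to DER. A small sketch of the arithmetic, using the standard definition (error time divided by total reference speech time) and the rating bands listed above; the function names are illustrative.

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_speech):
    """All arguments are durations in seconds."""
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return (missed + false_alarm + confusion) / reference_speech

def rate(der):
    """Map a DER value to the bands used in this article."""
    if der < 0.10:
        return "good"
    if der <= 0.20:
        return "acceptable for some use cases"
    return "error-prone"

# 60 s of reference speech: 1.5 s missed, 0.5 s false alarm, 2.5 s confused.
der = diarization_error_rate(1.5, 0.5, 2.5, 60.0)
print(f"DER = {der:.1%} ({rate(der)})")
```

Since false alarms count against a denominator of reference speech only, DER can exceed 100% on very noisy audio.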
Testing
- Build test audio with known speaker timings.
- Compare diarization output.
- Calculate DER.
- Identify failure patterns.
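The steps above can be sketched as a frame-based scorer. This is a simplified illustration, not a reference implementation: it assumes no overlapping speech, applies no forgiveness collar around boundaries, and finds the best speaker mapping by brute force, which is only practical for a handful of speakers. Established tools such as pyannote.metrics handle those cases properly.

```python
from itertools import permutations

def to_frames(segments, duration, step=0.1):
    """One label per frame; None means nobody is speaking."""
    n = int(round(duration / step))
    frames = [None] * n
    for start, end, speaker in segments:
        first = int(round(start / step))
        last = min(n, int(round(end / step)))
        for i in range(first, last):
            frames[i] = speaker
    return frames

def score_diarization(reference, hypothesis, duration, step=0.1):
    """Segments are (start_s, end_s, speaker_label) tuples."""
    ref = to_frames(reference, duration, step)
    hyp = to_frames(hypothesis, duration, step)
    pairs = list(zip(ref, hyp))
    missed = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = sorted({r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})
    # Pad with None so surplus hypothesis speakers can map to "no one".
    extra = max(0, len(hyp_speakers) - len(ref_speakers))
    candidates = ref_speakers + [None] * extra
    best_correct = 0
    for assignment in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        correct = sum(1 for r, h in both if mapping[h] == r)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct

    ref_speech = sum(1 for r in ref if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    return {
        "missed_frames": missed,
        "false_alarm_frames": false_alarm,
        "confusion_frames": confusion,
        "der": (missed + false_alarm + confusion) / ref_speech,
    }

reference = [(0.0, 3.0, "A"), (3.0, 6.0, "B")]
hypothesis = [(0.0, 3.5, "S1"), (3.5, 5.5, "S2")]
result = score_diarization(reference, hypothesis, duration=6.0)
print(result)
```

Breaking the score into its three components is what makes failure patterns visible: a late speaker change shows up as confusion, a clipped ending as missed speech.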
Pre-enrollment
For known-speaker use cases:
- Pre-record samples.
- Compute speaker embeddings from the samples.
- Match live audio against those embeddings in real time.
Better accuracy than generic diarization.
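A toy sketch of enrollment and matching. It assumes some embedding model (not shown) turns each recording into a vector; the three-dimensional vectors, the `enroll` and `identify` helpers, and the 0.75 threshold are all made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enroll(samples_by_speaker):
    """Average each speaker's sample embeddings into one profile vector."""
    profiles = {}
    for name, samples in samples_by_speaker.items():
        profiles[name] = [sum(dim) / len(samples) for dim in zip(*samples)]
    return profiles

def identify(embedding, profiles, threshold=0.75):
    """Return (name, score), or (None, score) if nobody is close enough."""
    best_name, best_score = None, -1.0
    for name, profile in profiles.items():
        score = cosine(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # unknown voice: do not force a match
    return best_name, best_score

profiles = enroll({
    "Jamie": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "Michael": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
})
print(identify([0.9, 0.2, 0.0], profiles))
print(identify([0.5, 0.5, 0.7], profiles))
```

Returning `None` for a voice that matches no profile matters in practice: a guest on the call should not be silently mapped to the nearest enrolled speaker.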
Privacy considerations
- Voice recordings are biometric data.
- Storage and processing regulated (GDPR, CCPA, BIPA).
- Enrollment consent required.
Don't be cavalier with voice biometrics.
Common pitfalls
Assuming diarization works perfectly. It doesn't. Plan for errors.
Over-reliance for authentication. Voice cloning defeats voice-only checks.
No fallback. If diarization fails, what happens?
Privacy oversight. Biometric data without proper handling.
Testing with demo data. Works in lab; fails with real mixed audio.
Integration with STT
Most STT providers offer diarization as an add-on:
- Same transcript API.
- Adds speaker label per segment.
- Slight latency increase.
- Slight cost increase.
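Per-segment speaker labels often have to be assembled from word-level output. The sketch below assumes a hypothetical response shape, a list of words each carrying `word`, `start`, `end`, and `speaker`; real providers use different field names, so check the API reference for the one you use.

```python
def words_to_turns(words):
    """Group consecutive words from the same speaker into turns."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns

# Hypothetical word-level output from an STT call with diarization enabled.
words = [
    {"word": "Hi", "start": 0.0, "end": 0.3, "speaker": 0},
    {"word": "there", "start": 0.3, "end": 0.6, "speaker": 0},
    {"word": "Sure", "start": 0.9, "end": 1.2, "speaker": 1},
    {"word": "thanks", "start": 1.5, "end": 1.9, "speaker": 0},
]
for turn in words_to_turns(words):
    print(f'{turn["start"]:.1f}-{turn["end"]:.1f}s speaker {turn["speaker"]}: {turn["text"]}')
```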
Multi-party call routing
For a voice agent on a conference bridge:
- Diarize participants.
- Track per-participant context.
- Attribute statements to participants.
More complex than single-caller. Specialized use case.
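One way to picture the bookkeeping, as a sketch only: keep a context object per diarization label, attach a display name once it is known, and attribute each statement as it arrives. `ParticipantContext` and `ConferenceTracker` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParticipantContext:
    label: str                          # diarization label, e.g. "SPEAKER_A"
    display_name: Optional[str] = None  # filled in once the person is known
    statements: list = field(default_factory=list)  # (start_seconds, text)

class ConferenceTracker:
    def __init__(self):
        self.participants = {}

    def _context(self, label):
        if label not in self.participants:
            self.participants[label] = ParticipantContext(label)
        return self.participants[label]

    def attribute(self, label, start, text):
        self._context(label).statements.append((start, text))

    def set_name(self, label, name):
        self._context(label).display_name = name

    def notes(self):
        """Time-ordered 'who: what' lines across all participants."""
        rows = []
        for ctx in self.participants.values():
            who = ctx.display_name or ctx.label
            rows += [(start, who, text) for start, text in ctx.statements]
        return [f"{who}: {text}" for _, who, text in sorted(rows)]

tracker = ConferenceTracker()
tracker.attribute("SPEAKER_B", 15.5, "I need more info first.")
tracker.attribute("SPEAKER_A", 12.0, "Yes, let's proceed.")
tracker.set_name("SPEAKER_A", "CTO")
tracker.set_name("SPEAKER_B", "CFO")
print("\n".join(tracker.notes()))
```

Keying everything by the diarization label means a later name assignment applies retroactively to statements captured before the speaker was identified.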
Sample output
0.0-3.2s: SPEAKER_A: "Hi, I'm calling about my account."
3.3-6.1s: SPEAKER_B: "Sure, can I get your name?"
6.2-8.9s: SPEAKER_A: "It's under my wife's name..."
9.0-11.2s: SPEAKER_C: "This is Sarah."
11.3-13.5s: SPEAKER_A (interrupting): "Yeah what Sarah just said."
Agent now knows there are three speakers and who said what.
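To use output like this downstream, it has to be parsed into structured segments. A small sketch that assumes exactly the line format shown above; the regular expression and field names are illustrative and would need adapting to a real provider's format.

```python
import re

LINE = re.compile(
    r'^(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)s: '
    r'(?P<speaker>SPEAKER_\w+)(?: \((?P<note>[^)]*)\))?: "(?P<text>.*)"$'
)

def parse_diarized(text):
    """Parse lines like: 0.0-3.2s: SPEAKER_A: "..." into dicts."""
    segments = []
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that does not fit the assumed format
        segments.append({
            "start": float(m["start"]),
            "end": float(m["end"]),
            "speaker": m["speaker"],
            "note": m["note"],  # e.g. "interrupting", or None
            "text": m["text"],
        })
    return segments

sample = '''
0.0-3.2s: SPEAKER_A: "Hi, I'm calling about my account."
3.3-6.1s: SPEAKER_B: "Sure, can I get your name?"
6.2-8.9s: SPEAKER_A: "It's under my wife's name..."
9.0-11.2s: SPEAKER_C: "This is Sarah."
11.3-13.5s: SPEAKER_A (interrupting): "Yeah what Sarah just said."
'''
segments = parse_diarized(sample)
speakers = sorted({s["speaker"] for s in segments})
print(len(segments), "segments from", speakers)
```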
When not to bother
- Single caller, no background. Don't over-engineer.
- Low-stakes conversations. Errors don't matter much.
- Short calls. Diarization setup cost exceeds value.
Most voice agent deployments fall here. Skip diarization.
When to include
- Multi-party calls are always possible. Family accounts, conference bridges.
- Compliance recording. Legal attribution required.
- Advanced analytics. Who drove conversation?
- Voice biometric auth.
Related reading
- Audio Codecs for Voice Agents: Opus, PCMU, and More
- Text-to-Speech in 2026: The State of the Art
- Latency Engineering for Real-Time Voice Agents
- Streaming Audio Over WebRTC for Voice Agents
FAQ
Can AI handle a family passing the phone around? Yes, with diarization. Each change of speaker becomes a new turn.
Does diarization add latency? 5-50ms typically. Usually acceptable.
What about voice biometric spoofing? Real concern. Don't rely on voice alone for auth.
Can we do speaker attribution post-call? Yes. Offline diarization is more accurate than real-time, which makes it useful for analytics.
How often does diarization fail? In practice, 10-20% of multi-speaker segments have errors. Plan accordingly.

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
