Diarization: Knowing Who's Speaking in a Voice Conversation

Tyler Weitzman
March 16, 2026 · 5 min read
Speechify

Speaker diarization is the task of answering "who spoke when?" Given audio with multiple speakers, diarization outputs time-stamped segments labeled by speaker. For most voice agent use cases — one caller, one agent — diarization is trivial (channel-based separation works). But when the scene gets complicated (multi-party calls, ambient voices, families on a shared phone), diarization becomes real infrastructure. This piece covers when you need it and how to handle it.

TL;DR

  • Most voice agents don't need diarization — channel separation suffices.
  • Where it matters: multi-person calls, shared phones, ambient background speech, legal/compliance recording.
  • Modern diarization is decent but imperfect — 5-15% speaker error rate typical.
  • Real-time diarization is harder than offline.
  • Combine with voice biometrics where identification (not just separation) matters.

Diarization vs identification

Diarization: "There were two speakers; speaker A said X, speaker B said Y."

Doesn't know who A and B are, just that they're different.

Identification: "Speaker A is Jamie, speaker B is Michael."

Adds identity — requires voice biometrics or prior recordings.

The channel-separation shortcut

For most voice agents:

  • Caller is on one RTP stream (ingress).
  • Agent is on another stream (egress).
  • Separation is mechanical.

No diarization algorithm needed.
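As a minimal sketch of how mechanical this is, suppose a call recording arrives as interleaved 16-bit stereo PCM with the caller on the left channel and the agent on the right. That layout, the native byte order, and the function names are assumptions for illustration; your telephony stack may deliver the two streams already separated.

```python
from array import array

def split_stereo(interleaved: bytes):
    """De-interleave 16-bit stereo PCM (native byte order) into (left, right)."""
    samples = array("h")          # "h" = signed 16-bit integers
    samples.frombytes(interleaved)
    return samples[0::2], samples[1::2]

def label_channels(interleaved: bytes, labels=("caller", "agent")):
    """Attach a speaker label to each channel. No model involved."""
    left, right = split_stereo(interleaved)
    return {labels[0]: left, labels[1]: right}

# Three stereo frames: positive samples on the left, negative on the right.
pcm = array("h", [1, -1, 2, -2, 3, -3]).tobytes()
tracks = label_channels(pcm)
print(list(tracks["caller"]), list(tracks["agent"]))
```

Each labeled track can then go to speech-to-text on its own, so every transcript line already carries the right speaker.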

When diarization matters

Multi-party calls. Family on speaker: "Mom says Thursday works."

Shared phones. Husband and wife sharing a line; different account access.

Ambient speech. Call from an office with background conversation.

Legal / compliance. Multi-party recording where who-said-what matters.

Conference bridge. Multiple external parties.

Real-time challenges

Offline diarization has access to the full audio. Real-time:

  • Limited context.
  • Must make decisions on the fly.
  • Higher error rate.

2026 real-time diarization error: 10-25% typical.
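To make the "decisions on the fly" constraint concrete, here is a rough sketch of greedy online clustering. It assumes some upstream model has already turned each short audio chunk into a speaker embedding (a list of floats). The class name, the 0.7 threshold, and the toy 2-D vectors are illustrative assumptions, not any provider's actual method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class OnlineSpeakerTracker:
    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # running mean embedding per speaker
        self.counts = []     # chunks seen per speaker

    def assign(self, embedding):
        """Label one chunk immediately; earlier labels are never revised."""
        scores = [cosine(embedding, c) for c in self.centroids]
        if scores and max(scores) >= self.threshold:
            idx = scores.index(max(scores))
            n = self.counts[idx]
            # Fold the new chunk into that speaker's running mean.
            self.centroids[idx] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[idx], embedding)
            ]
            self.counts[idx] = n + 1
        else:
            # Nothing is close enough: treat this chunk as a new speaker.
            idx = len(self.centroids)
            self.centroids.append(list(embedding))
            self.counts.append(1)
        return f"SPEAKER_{idx}"

tracker = OnlineSpeakerTracker()
for chunk in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]):
    print(tracker.assign(chunk))
```

An offline system can cluster all chunks at once and revisit early decisions. This tracker cannot, which is one reason real-time error rates run higher.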

Providers

  • Deepgram Diarize.
  • AssemblyAI Diarization.
  • Pyannote (open-source).
  • Amazon Transcribe speaker diarization (speaker partitioning).
  • Azure AI Speech diarization (Azure's Speaker Recognition service covers identification rather than separation).

Quality varies. Test on your audio.

The voice biometric layer

For identification (not just separation):

  • Voice biometrics compares the current voice to a stored profile.
  • Registered users enroll their voice profiles in advance.
  • The system reports a recognition confidence per speaker.

Voice cloning complicates this: an attacker with a cloned voice can pass a biometric check.

See how AI support agents should handle account verification.

Use cases in depth

Family account on shared phone.

  • Caller identifies: "This is Mom."
  • Voice biometric confirms.
  • Agent proceeds with that person's account.

Multi-language family.

  • Some family members speak Spanish, others English.
  • Agent responds in language of primary speaker per turn.

Conference calls with multiple decision-makers.

  • Agent captures which opinion belongs to whom.
  • Structured meeting notes: "CTO said yes; CFO said need more info."

Challenges

Similar voices. Siblings, twins, speakers of the same gender with similar pitch. Expect a high error rate.

Accented speakers in the same call. Training data may not cover their accents.

Noisy environments. Background talk confuses diarization.

Overlap. Two people talking at once. Hard for any system.

Accuracy metrics

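Diarization Error Rate (DER). The sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. Because false alarms count against it, DER can exceed 100%.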

  • Under 10% DER: good.
  • 10-20% DER: acceptable for some use cases.
  • Over 20% DER: error-prone.

Missed speech. Speech the system failed to detect at all.

False alarms. Non-speech the system labeled as speech.

Speaker confusion. Speech detected but attributed to the wrong speaker.
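The three components combine into DER as sketched below. This is a simplified frame-level version: it assumes one speaker label per fixed-length frame, with None for non-speech, and ignores overlapping speech and the scoring collars that standard evaluation tools apply. The function name and label format are illustrative assumptions.

```python
from itertools import permutations

def diarization_error_rate(reference, hypothesis):
    """Frame-level DER. Each list has one label per frame; None = no speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    frames = list(zip(reference, hypothesis))
    ref_speech = sum(r is not None for r, _ in frames)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    missed = sum(r is not None and h is None for r, h in frames)
    false_alarm = sum(r is None and h is not None for r, h in frames)
    both = [(r, h) for r, h in frames if r is not None and h is not None]

    # Hypothesis labels are arbitrary, so try every one-to-one mapping onto
    # reference speakers and keep the one that agrees on the most frames.
    # Brute force is fine for a handful of speakers.
    ref_ids = list({r for r, _ in both})
    hyp_ids = list({h for _, h in both})
    targets = ref_ids + [None] * len(hyp_ids)  # None = left unmatched
    best_correct = 0
    for perm in permutations(targets, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct

    return {
        "missed": missed,
        "false_alarm": false_alarm,
        "confusion": confusion,
        "der": (missed + false_alarm + confusion) / ref_speech,
    }

reference = ["A"] * 4 + [None] * 2 + ["B"] * 4
hypothesis = ["x"] * 3 + [None, None] + ["y"] * 4 + ["x"]
print(diarization_error_rate(reference, hypothesis))
# 1 missed + 1 false alarm + 1 confused frame over 8 speech frames -> der 0.375
```

Reporting the three components separately, rather than DER alone, shows whether the problem lies in speech detection or in telling speakers apart.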

Testing

  • Build test audio with known speaker timings.
  • Compare diarization output.
  • Calculate DER.
  • Identify failure patterns.

Pre-enrollment

For known-speaker use cases:

  • Record enrollment samples from each known speaker.
  • Compute a speaker embedding from those samples.
  • At run time, match incoming speech against the stored embeddings.

Better accuracy than generic diarization.
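A minimal sketch of that enroll-then-match flow, assuming an upstream model already produces embeddings as lists of floats. The function names, the 0.75 threshold, and the toy 2-D vectors are assumptions for illustration; real embeddings have hundreds of dimensions and thresholds need tuning on your own audio.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enroll(samples):
    """Average several embeddings from one speaker into a single profile."""
    n = len(samples)
    return [sum(dim) / n for dim in zip(*samples)]

def identify(embedding, profiles, threshold=0.75):
    """Return (name, score) for the best profile, or (None, score) if too weak."""
    best_name, best_score = None, -1.0
    for name, profile in profiles.items():
        score = cosine(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= threshold else None), best_score

profiles = {
    "jamie": enroll([[1.0, 0.0], [0.8, 0.2]]),
    "michael": enroll([[0.0, 1.0], [0.2, 0.8]]),
}
name, score = identify([0.9, 0.1], profiles)
print(name, round(score, 3))
```

Returning None below the threshold is deliberate: an unrecognized voice should fall through to another verification path rather than be forced onto the nearest profile.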

Privacy considerations

  • Voice recordings are biometric data.
  • Storage and processing regulated (GDPR, CCPA, BIPA).
  • Enrollment consent required.

Don't be cavalier with voice biometrics.

Common pitfalls

Assuming diarization works perfectly. It doesn't. Plan for errors.

Over-reliance for authentication. Voice cloning defeats voice-only checks.

No fallback. If diarization fails, what happens?

Privacy oversight. Biometric data without proper handling.

Testing with demo data. Works in lab; fails with real mixed audio.

Integration with STT

Most STT providers offer diarization as an add-on:

  • Same transcript API.
  • Adds speaker label per segment.
  • Slight latency increase.
  • Slight cost increase.
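Providers that label speakers per word leave it to you to assemble turns. The sketch below assumes a hypothetical word-level response with `word`, `start`, `end`, and `speaker` fields. Real field names differ by provider, so treat these as placeholders.

```python
def words_to_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = w["end"]
            turns[-1]["text"] += " " + w["word"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns

words = [
    {"word": "Thursday", "start": 0.0, "end": 0.5, "speaker": 0},
    {"word": "works", "start": 0.5, "end": 0.9, "speaker": 0},
    {"word": "Great", "start": 1.2, "end": 1.6, "speaker": 1},
]
for turn in words_to_turns(words):
    print(turn)
```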

Multi-party call routing

For a voice agent on a conference bridge:

  • Diarize participants.
  • Track per-participant context.
  • Attribute statements to participants.

More complex than single-caller. Specialized use case.
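One way to picture the per-participant bookkeeping: keep a small context object per diarized label and update it as segments arrive. The class, its fields, and the (speaker, start, end, text) tuple layout are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ParticipantContext:
    statements: list = field(default_factory=list)
    talk_time: float = 0.0  # seconds of attributed speech

def track_participants(segments):
    """Group (speaker, start, end, text) segments by speaker label."""
    contexts = defaultdict(ParticipantContext)
    for speaker, start, end, text in segments:
        ctx = contexts[speaker]
        ctx.statements.append(text)
        ctx.talk_time += end - start
    return dict(contexts)

segments = [
    ("SPEAKER_A", 0.0, 2.0, "Yes, let's go ahead."),
    ("SPEAKER_B", 2.5, 5.0, "I need more info first."),
    ("SPEAKER_A", 5.0, 6.0, "Fair enough."),
]
for speaker, ctx in track_participants(segments).items():
    print(speaker, ctx.talk_time, ctx.statements)
```

Mapping labels such as SPEAKER_A to roles such as "CTO" still has to come from somewhere else: self-introduction, a calendar invite, or voice biometrics.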

Sample output

0.0-3.2s: SPEAKER_A: "Hi, I'm calling about my account."
3.3-6.1s: SPEAKER_B: "Sure, can I get your name?"
6.2-8.9s: SPEAKER_A: "It's under my wife's name..."
9.0-11.2s: SPEAKER_C: "This is Sarah."
11.3-13.5s: SPEAKER_A (interrupting): "Yeah what Sarah just said."

Agent now knows there are three speakers and who said what.
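If diarized output reaches your agent as text in the layout shown above, a small parser turns it into structured segments. The regular expression below is written against this article's sample format only. It is an assumption that your provider emits anything similar, since most return JSON.

```python
import re

LINE = re.compile(
    r'^(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)s: '
    r'(?P<speaker>SPEAKER_\w+)(?: \((?P<note>[^)]*)\))?: "(?P<text>.*)"$'
)

SAMPLE = '''\
0.0-3.2s: SPEAKER_A: "Hi, I'm calling about my account."
3.3-6.1s: SPEAKER_B: "Sure, can I get your name?"
6.2-8.9s: SPEAKER_A: "It's under my wife's name..."
9.0-11.2s: SPEAKER_C: "This is Sarah."
11.3-13.5s: SPEAKER_A (interrupting): "Yeah what Sarah just said."
'''

def parse_segments(text):
    """Parse sample-format lines into dicts; lines that do not match are skipped."""
    segments = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            segments.append({
                "start": float(m["start"]),
                "end": float(m["end"]),
                "speaker": m["speaker"],
                "note": m["note"],  # e.g. "interrupting", or None
                "text": m["text"],
            })
    return segments

segments = parse_segments(SAMPLE)
speakers = sorted({s["speaker"] for s in segments})
print(len(segments), speakers)  # 5 ['SPEAKER_A', 'SPEAKER_B', 'SPEAKER_C']
```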

When not to bother

  • Single caller, no background. Don't over-engineer.
  • Low-stakes conversations. Errors don't matter much.
  • Short calls. Diarization setup cost exceeds value.

Most voice agent deployments fall here. Skip diarization.

When to include

  • Multi-party calls are a regular possibility. Family accounts, conference bridges.
  • Compliance recording. Legal attribution required.
  • Advanced analytics. Who drove the conversation?
  • Voice biometric auth.

FAQ

Can AI handle a family passing the phone around? Yes, with diarization. Each change of speaker starts a new turn.

Does diarization add latency? Typically 5-50 ms. Usually acceptable.

What about voice biometric spoofing? It is a real concern. Don't rely on voice alone for auth.

Can we do speaker attribution post-call? Yes. Offline diarization is more accurate than real-time, which makes post-call attribution useful for analytics.

How often does diarization fail? In practice, 10-20% of multi-speaker segments have errors. Plan accordingly.

Tyler Weitzman
Co-Founder & Head of AI, Speechify

Tyler Weitzman is co-founder and Head of AI at Speechify. He has spent the past decade building the speech-synthesis stack that powers millions of users. Tyler writes about the engineering of real-time conversational systems — text-to-speech, speech recognition, latency budgets, model serving, and the architectural choices that separate prototypes from production-grade voice agents.
