Speaker diarization
Automatically detect each speaker in multi-speaker audio recordings.
Request parameters:

- num_speakers: expected number of speakers; leave empty for automatic detection
- min_speakers / max_speakers: range for speaker detection
- exclusive: enable exclusive diarization mode, equivalent to diarization but without overlapping speech; useful for easier reconciliation with STT/ASR results
- model: choose the diarization model
- confidence: include confidence scores
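As an illustration, a request body combining these parameters might be assembled as below. The field names and the audio-URL field are assumptions for this sketch; consult the API reference for the authoritative schema.

```python
# Sketch of a diarization request payload built from the parameters above.
# Field names and the audio URL are illustrative assumptions, not the
# confirmed API schema.
import json

payload = {
    "url": "https://example.com/meeting.wav",  # hypothetical audio location
    "num_speakers": None,   # leave unset for automatic detection
    "min_speakers": 2,      # lower bound when the exact count is unknown
    "max_speakers": 4,      # upper bound when the exact count is unknown
    "exclusive": True,      # exclusive mode: no overlapping speech in output
    "confidence": True,     # include per-segment confidence scores
}

# Drop unset parameters before sending the request.
payload = {k: v for k, v in payload.items() if v is not None}
print(json.dumps(payload, indent=2))
```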
Speaker Identification vs. Diarization
Diarization answers “who spoke when?” with generic labels (SPEAKER_00, SPEAKER_01, etc.).
Identification answers “who is speaking?” by recognizing specific known voices using voiceprints.
Voiceprint
Captures a speaker's voice to identify that person in other audio recordings. Best practices:

- Use clear, high-quality audio (max 30 seconds)
- One voiceprint per speaker
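A quick pre-upload check against the 30-second limit can be done with the standard library alone; the helper name below is hypothetical and the cap comes from the guidance above.

```python
# Check a candidate voiceprint WAV clip against the 30-second limit before
# uploading. Pure standard library; voiceprint_ok is a hypothetical helper
# name, not part of any SDK.
import wave

def voiceprint_ok(path: str, max_seconds: float = 30.0) -> bool:
    """Return True if the WAV clip is non-empty and at most max_seconds long."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return 0 < duration <= max_seconds
```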
Confidence Scores
Receive confidence scores for each speaker segment to assess reliability and perform human-in-the-loop correction. Set the confidence parameter to true in your diarization or identification request.
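One way to use these scores is to route only low-confidence segments to a human reviewer. The segment structure below mirrors the diarization output described in this guide, and the 0.7 threshold is an arbitrary choice for illustration.

```python
# Human-in-the-loop triage sketch: flag segments whose confidence falls
# below a chosen threshold. Segment fields mirror the diarization output;
# the threshold value is an illustrative assumption.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2, "confidence": 0.95},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 7.8, "confidence": 0.55},
]

THRESHOLD = 0.7
needs_review = [s for s in segments if s["confidence"] < THRESHOLD]
for s in needs_review:
    print(f'{s["speaker"]} {s["start"]}-{s["end"]}s '
          f'(confidence {s["confidence"]})')
```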
Overlapped speech detection
Detect when multiple speakers talk over each other and attribute overlapping speech to the correct speakers. Find overlapping speech by comparing the timestamps of segments from different speakers. For example:
SPEAKER_00 and SPEAKER_01 are both speaking between 12.5 and 14.0 seconds.
You can also use the segment timestamps to calculate statistics such as total
speaking time per speaker, total overlap duration, and the percentage of
overlapped speech.
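The timestamp comparison described above can be sketched in a few lines of plain Python; the two hand-written segments reproduce the 12.5-14.0 s overlap example.

```python
# Find overlapping speech and per-speaker speaking time by comparing
# segment timestamps, as described above. Segments are hand-written
# stand-ins matching the example in the text.
from itertools import combinations

segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 14.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 16.0},
]

def overlaps(segments):
    """Yield (speaker_a, speaker_b, start, end) for each overlapping pair."""
    for a, b in combinations(segments, 2):
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if a["speaker"] != b["speaker"] and start < end:
            yield a["speaker"], b["speaker"], start, end

total_overlap = sum(end - start for *_, start, end in overlaps(segments))

speaking_time = {}
for s in segments:
    speaking_time[s["speaker"]] = (
        speaking_time.get(s["speaker"], 0.0) + s["end"] - s["start"]
    )

print(total_overlap)   # 1.5 seconds: SPEAKER_00 and SPEAKER_01 overlap 12.5-14.0
print(speaking_time)
```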
Transcription (STT/ASR)
We currently do not offer transcription (STT/ASR) services. However, you can
easily integrate pyannoteAI with popular transcription services such as Whisper
or NVIDIA Parakeet to build a complete speaker diarization and transcription
pipeline.
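The core of such a pipeline is reconciling word-level STT timestamps with diarization segments. A minimal sketch, assuming the STT service (e.g. Whisper with word timestamps) returns words with start/end times: attribute each word to the speaker whose segment overlaps it the most. Both inputs below are hand-written stand-ins, not real service output.

```python
# Reconcile word-level STT timestamps with diarization segments: each word
# is attributed to the speaker whose segment overlaps it the most. The
# words and segments are illustrative stand-ins for real service output.
def assign_speakers(words, segments):
    labeled = []
    for w in words:
        best, best_overlap = None, 0.0
        for s in segments:
            overlap = min(w["end"], s["end"]) - max(w["start"], s["start"])
            if overlap > best_overlap:
                best, best_overlap = s["speaker"], overlap
        labeled.append({**w, "speaker": best})
    return labeled

words = [
    {"word": "hello", "start": 0.1, "end": 0.5},
    {"word": "there", "start": 4.3, "end": 4.7},
]
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 8.0},
]
print(assign_speakers(words, segments))
```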