
Speaker diarization

Automatically detect each speaker in multi-speaker audio recordings.
Example diarization output
[
  {
    "speaker": "SPEAKER_00",
    "start": 10.0,
    "end": 15.0
  },
  {
    "speaker": "SPEAKER_01",
    "start": 12.5,
    "end": 14.0
  }
]
Key input parameters (see the request sketch after this list):
  • num_speakers: Expected number of speakers; leave empty for automatic detection
  • min_speakers/max_speakers: Lower and upper bounds on the number of detected speakers
  • exclusive: Enable exclusive diarization mode, which returns the same diarization but with overlapping speech removed, so each moment is attributed to at most one speaker. Useful for easier reconciliation with STT/ASR results.
  • model: Choose the diarization model
  • confidence: Include per-segment confidence scores
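
As a quick illustration, here is a minimal diarization request using Python and the requests library. The endpoint path and exact field names are assumptions for illustration only; consult the API reference for the authoritative request schema.

# Minimal request sketch. The endpoint path and exact JSON field names
# are illustrative assumptions, not the documented API schema.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

response = requests.post(
    "https://api.pyannote.ai/v1/diarize",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/meeting.wav",  # audio file to diarize
        # "num_speakers": 3,  # set only if the speaker count is known
        "min_speakers": 2,    # lower bound for automatic detection
        "max_speakers": 5,    # upper bound for automatic detection
        "exclusive": True,    # remove overlapping speech from the output
        "confidence": True,   # include per-segment confidence scores
    },
)
response.raise_for_status()
print(response.json())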

Speaker Identification vs. Diarization

Diarization answers “who spoke when?” with generic labels (SPEAKER_00, SPEAKER_01, etc.). Identification answers “who is speaking?” by recognizing specific known voices using voiceprints.

Voiceprint

Captures a speaker’s voice so that person can be identified in other audio recordings (see the sketch after the list below). Best practices:
  • Use clear, high-quality audio (max 30 seconds)
  • One voiceprint per speaker
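
A hedged sketch of the two-step flow: create one voiceprint per speaker from a short, clean reference clip, then pass the voiceprints to an identification request. The endpoint paths and response field names below are assumptions for illustration only.

# Hypothetical voiceprint + identification flow; endpoint paths and
# field names are assumptions, not the documented API.
import requests

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

# 1. One voiceprint per speaker, from clear audio of at most 30 seconds.
vp = requests.post(
    "https://api.pyannote.ai/v1/voiceprint",  # assumed endpoint
    headers=headers,
    json={"url": "https://example.com/alice_reference.wav"},
)
vp.raise_for_status()
alice_voiceprint = vp.json()["voiceprint"]  # assumed response field

# 2. Identify Alice in a new recording using her voiceprint.
ident = requests.post(
    "https://api.pyannote.ai/v1/identify",  # assumed endpoint
    headers=headers,
    json={
        "url": "https://example.com/meeting.wav",
        "voiceprints": [{"label": "alice", "voiceprint": alice_voiceprint}],
    },
)
ident.raise_for_status()
print(ident.json())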

Confidence Scores

Receive a confidence score for each speaker segment to assess reliability and support human-in-the-loop correction. Set the confidence parameter to true in your diarization or identification request.
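
For example, a simple post-processing pass can route low-confidence segments to a reviewer. This sketch assumes each returned segment carries a confidence field alongside speaker, start, and end; the threshold is illustrative.

# Flag low-confidence segments for human review. Assumes each segment
# includes a "confidence" field when confidence=true is requested.
REVIEW_THRESHOLD = 0.7  # illustrative cutoff; tune for your data

segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0, "confidence": 0.95},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0, "confidence": 0.55},
]

for s in segments:
    if s["confidence"] < REVIEW_THRESHOLD:
        print(f'{s["speaker"]} {s["start"]:.1f}-{s["end"]:.1f}s '
              f'(confidence {s["confidence"]:.2f}) -> send to human review')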

Overlapped speech detection

Detect when multiple speakers talk over each other and attribute overlapping speech to the correct speakers. Find overlapping speech by comparing timestamps of segments from different speakers. For example:
Example diarization output
[
  {
    "speaker": "SPEAKER_00",
    "start": 10.0,
    "end": 15.0
  },
  {
    "speaker": "SPEAKER_01",
    "start": 12.5,
    "end": 14.0
  }
]
In this example, both SPEAKER_00 and SPEAKER_01 are talking between 12.5 and 14.0 seconds.
You can also use the segment timestamps to compute statistics such as total speaking time per speaker, total overlap duration, and the percentage of overlapped speech.
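
A minimal sketch of both computations, using only the segment format shown above (no API calls). Note that summing pairwise overlaps double-counts spans where three or more speakers overlap; merge intervals first if that matters for your audio.

from collections import defaultdict
from itertools import combinations

segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0},
]

# Pairwise overlaps between segments from different speakers.
overlaps = []
for a, b in combinations(segments, 2):
    if a["speaker"] != b["speaker"]:
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # the two segments share a time span
            overlaps.append((a["speaker"], b["speaker"], start, end))

# Total speaking time per speaker and overall overlap statistics.
speaking_time = defaultdict(float)
for s in segments:
    speaking_time[s["speaker"]] += s["end"] - s["start"]

total_speech = sum(speaking_time.values())
total_overlap = sum(end - start for _, _, start, end in overlaps)

print(overlaps)             # [('SPEAKER_00', 'SPEAKER_01', 12.5, 14.0)]
print(dict(speaking_time))  # {'SPEAKER_00': 5.0, 'SPEAKER_01': 1.5}
print(f"{100 * total_overlap / total_speech:.1f}% overlapped")  # 23.1% overlapped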

Transcription (STT/ASR)

We currently do not offer transcription (STT/ASR) services. However, you can integrate pyannoteAI with popular transcription services such as Whisper or NVIDIA Parakeet to build a complete speaker diarization and transcription pipeline.
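
For instance, here is a sketch that labels Whisper transcript segments with diarized speakers by maximum temporal overlap. It assumes the open-source openai-whisper package and the diarization output format shown above; exclusive mode (no overlapping segments) makes this reconciliation unambiguous.

import whisper  # pip install openai-whisper

def overlap(a_start, a_end, b_start, b_end):
    """Duration (in seconds) shared by two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

# Diarization segments, e.g. fetched from the diarization API.
diarization = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0},
]

# Transcribe with Whisper; result["segments"] items carry start/end/text.
model = whisper.load_model("base")
result = model.transcribe("meeting.wav")

for seg in result["segments"]:
    # Assign each transcript segment to the speaker it overlaps most.
    best = max(
        diarization,
        key=lambda d: overlap(seg["start"], seg["end"], d["start"], d["end"]),
    )
    print(f'{best["speaker"]}: {seg["text"].strip()}')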