Speaker diarization

Automatically detect each speaker in multi-speaker audio recordings.
Example diarization output
[
  {
    "speaker": "SPEAKER_00",
    "start": 10.0,
    "end": 15.0
  },
  {
    "speaker": "SPEAKER_01",
    "start": 12.5,
    "end": 14.0
  }
]
Key input parameters:
  • num_speakers: Expected number of speakers; leave empty for automatic detection
  • min_speakers/max_speakers: Range of speaker counts to consider during automatic detection
  • exclusive: Enable exclusive diarization mode, which produces standard diarization without overlapping speech. Useful for easier reconciliation with STT/ASR results.
  • model: The diarization model to use
  • confidence: Include per-segment confidence scores
Learn how to diarize an audio file
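As a sketch, the parameters above might be combined into a JSON-style request body like this. Only the parameter names (num_speakers, min_speakers, max_speakers, exclusive, confidence) come from the list above; the "url" field and the rest of the request shape are assumptions for illustration:

```python
# Hypothetical diarization request body. The "url" field name is an
# assumption; the remaining keys are the documented parameters.
auto_payload = {
    "url": "https://example.com/meeting.wav",  # assumed field: audio to diarize
    "min_speakers": 2,    # consider at least 2 speakers...
    "max_speakers": 5,    # ...and at most 5 during automatic detection
    "exclusive": True,    # exclusive mode: no overlapping speech in the output
    "confidence": True,   # include per-segment confidence scores
}

# When the speaker count is known in advance, set num_speakers
# instead of a min/max range:
fixed_payload = {
    "url": "https://example.com/meeting.wav",
    "num_speakers": 3,    # exactly 3 speakers expected
    "confidence": True,
}
```

Omitting num_speakers, min_speakers, and max_speakers entirely leaves speaker-count detection fully automatic.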

Speaker Identification vs. Diarization

Diarization answers “who spoke when?” with generic labels (SPEAKER_00, SPEAKER_01, etc.). Identification answers “who is speaking?” by recognizing specific known voices using voiceprints.

Voiceprint

A voiceprint captures a speaker’s voice so that person can be identified in other audio recordings. Best practices:
  • Use clear, high-quality audio (max 30 seconds)
  • One voiceprint per speaker
Learn how to identify speakers with voiceprints

Confidence scores

Receive confidence scores for each speaker segment to assess reliability and perform human-in-the-loop correction. Set the confidence parameter to true in your diarization or identification request.
Understanding confidence scores
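A minimal sketch of human-in-the-loop triage, assuming each segment carries a "confidence" score in [0, 1] when the request sets confidence to true (the field name, scale, and threshold below are assumptions):

```python
# Flag low-confidence diarization segments for manual review.
# Assumption: each segment includes a "confidence" float in [0, 1].
segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0, "confidence": 0.97},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0, "confidence": 0.42},
]

REVIEW_THRESHOLD = 0.6  # tune per application; illustrative value

needs_review = [s for s in segments if s["confidence"] < REVIEW_THRESHOLD]
for s in needs_review:
    print(f"review {s['speaker']} at {s['start']}-{s['end']}s "
          f"(confidence {s['confidence']:.2f})")
```

High-confidence segments can flow straight through a pipeline while the flagged ones are routed to a reviewer.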

Overlapped speech detection

Detect when multiple speakers talk over each other and attribute overlapping speech to the correct speakers. Find overlapping speech by comparing timestamps of segments from different speakers. For example:
Example diarization output
[
  {
    "speaker": "SPEAKER_00",
    "start": 10.0,
    "end": 15.0
  },
  {
    "speaker": "SPEAKER_01",
    "start": 12.5,
    "end": 14.0
  }
]
In this example, both SPEAKER_00 and SPEAKER_01 are talking between 12.5 and 14.0 seconds. You can also use the segment timestamps to compute statistics such as total speaking time per speaker, total overlap duration, and the percentage of overlapped speech.
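The timestamp comparison described above can be sketched as a pairwise intersection over segments (using the example output; the statistics at the end are one possible definition):

```python
# Find overlapping speech by intersecting segments from different speakers.
segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0},
]

def overlaps(segments):
    """Yield (speaker_a, speaker_b, start, end) for each pairwise overlap."""
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a["speaker"] == b["speaker"]:
                continue  # same speaker: not overlapped speech
            start = max(a["start"], b["start"])
            end = min(a["end"], b["end"])
            if start < end:  # non-empty intersection
                yield (a["speaker"], b["speaker"], start, end)

overlap_list = list(overlaps(segments))
total_overlap = sum(end - start for *_, start, end in overlap_list)
total_speech = sum(s["end"] - s["start"] for s in segments)
overlap_pct = 100 * total_overlap / total_speech
```

For the example segments this finds one overlap, from 12.5 s to 14.0 s; per-speaker totals follow the same pattern by grouping segment durations by the "speaker" field.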

STT Orchestration: Speaker-attributed transcripts

We host open-source transcription models like Nvidia Parakeet-tdt-0.6b-v3 with specialized STT + diarization reconciliation logic for speaker-attributed transcripts. To use this feature, make a request to the diarize API endpoint with the transcription: true flag.
Learn more about speech to text with diarization
Already have your own transcript? Merge it with our diarization results using this tutorial.
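For the bring-your-own-transcript case, one simple merge strategy is to attribute each word to the diarization segment containing its midpoint. This is a sketch, not the tutorial's exact method, and it assumes your STT output has word-level timestamps; the data shapes are illustrative:

```python
# Merge an existing word-timestamped transcript with diarization output
# by midpoint lookup. Data shapes are assumptions for illustration.
diarization = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 15.2, "end": 18.0},
]
words = [
    {"word": "hello", "start": 10.1, "end": 10.4},
    {"word": "there", "start": 15.3, "end": 15.6},
]

def speaker_at(t, diarization):
    """Return the speaker whose segment contains time t, else None."""
    for seg in diarization:
        if seg["start"] <= t <= seg["end"]:
            return seg["speaker"]
    return None  # word falls in a gap between segments

attributed = [
    {**w, "speaker": speaker_at((w["start"] + w["end"]) / 2, diarization)}
    for w in words
]
```

Midpoint lookup sidesteps words that straddle a segment boundary; words landing in gaps between segments come back with speaker None and can be assigned to the nearest segment as a fallback.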