Use this tutorial only if you have your own transcripts from another STT service (such as OpenAI Whisper or Google Speech-to-Text) that you want to combine with diarization results.
First, get diarization segments from a diarization job (see how to diarize and Get job).
Set the exclusive parameter to true when requesting diarization for speaker-attributed transcripts.
This removes overlapping speech, ensuring each segment contains exactly one speaker. That makes alignment easier, since most STT/ASR systems do not handle overlapping speech well.
Note: Exclusive diarization results are provided in the exclusiveDiarization field of the job output, alongside the regular diarization results.
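A minimal sketch of this step using the requests library; the endpoint URLs, payload shape, and jobId field here are assumptions based on this guide, so check the diarization and Get job references for the exact API:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your API key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Request diarization with exclusive=true (endpoint and payload shape
# are assumptions here; see the diarization guide for the exact request).
job = requests.post(
    "https://api.pyannote.ai/v1/diarize",  # assumed endpoint
    headers=HEADERS,
    json={
        "url": "https://example.com/call.wav",  # your audio file
        "exclusive": True,  # one speaker per segment, no overlap
    },
).json()

# Once the job has completed (poll it as described in Get job),
# read the exclusive segments from the job output.
result = requests.get(
    f"https://api.pyannote.ai/v1/jobs/{job['jobId']}",  # assumed endpoint
    headers=HEADERS,
).json()
diarization_segments = result["output"]["exclusiveDiarization"]
# Each segment looks like {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}
```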
Transcribe the same audio with your chosen ASR service and request segment timestamps. Here is an example of transcript output from the OpenAI gpt-4o-transcribe and whisper-1 APIs, with segment timestamps:
Example OpenAI transcript segments
{ "task": "transcribe", "duration": 42.7, "text": "Agent: Thanks for calling OpenAI support.\nCustomer: Hi, I need help with diarization.", "segments": [ { "type": "transcript.text.segment", "id": "seg_001", "start": 0.0, "end": 5.2, "text": "Thanks for calling OpenAI support." }, { "type": "transcript.text.segment", "id": "seg_002", "start": 5.2, "end": 12.8, "text": "Hi, I need help with diarization." } ], "usage": { "type": "duration", "seconds": 43 }}
Here, the segments array contains start and end timestamps in seconds along with the transcribed text.
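If you use the OpenAI Python SDK, a sketch like the following produces such segments; whisper-1 supports response_format="verbose_json" with segment timestamps (the file name and the conversion to plain dicts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Transcribe the same audio file that was diarized.
with open("call.wav", "rb") as audio_file:  # illustrative file name
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",       # includes segment timestamps
        timestamp_granularities=["segment"],
    )

# Normalize to plain dicts for the merge step below.
transcript_result = {
    "segments": [
        {"start": s.start, "end": s.end, "text": s.text}
        for s in response.segments
    ]
}
```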
Combine the diarization segments with the ASR transcript segments by aligning them based on their timestamps. You can use the following Python code, taken from WhisperX's diarize.py, to achieve this:
merge_diarization_asr.py
```python
import numpy as np
import pandas as pd

# Assuming diarization_segments is a list of dictionaries from Step 1
# Example: diarization_segments = [{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}, ...]
diarize_df = pd.DataFrame(diarization_segments)

# Assuming transcript_result is a dictionary from Step 2
# Example: transcript_result = {"segments": [{"start": 0.0, "end": 5.2, "text": "..."}, ...]}
transcript_segments = transcript_result["segments"]

# If True, assign speakers even when there's no direct time overlap
fill_nearest = True

for seg in transcript_segments:
    # assign speaker to segment (if any)
    diarize_df['intersection'] = np.minimum(diarize_df['end'], seg['end']) - np.maximum(diarize_df['start'], seg['start'])
    diarize_df['union'] = np.maximum(diarize_df['end'], seg['end']) - np.minimum(diarize_df['start'], seg['start'])
    # remove no hit, otherwise we look for closest (even negative intersection...)
    if not fill_nearest:
        dia_tmp = diarize_df[diarize_df['intersection'] > 0]
    else:
        dia_tmp = diarize_df
    if len(dia_tmp) > 0:
        # sum over speakers
        speaker = dia_tmp.groupby("speaker")["intersection"].sum().sort_values(ascending=False).index[0]
        seg["speaker"] = speaker
```
For each transcript segment, the loop sums the time overlap with the diarization segments per speaker and assigns the speaker with the largest total overlap (with fill_nearest set to True, the closest speaker is assigned even when nothing overlaps). The resulting merged segments will look something like this:
Merged diarization + ASR segments
[ { "start": 0.0, "end": 5.2, "text": "Thanks for calling OpenAI support.", "speaker": "SPEAKER_00" }, { "start": 5.2, "end": 12.8, "text": "Hi, I need help with diarization.", "speaker": "SPEAKER_01" }]