
# How to merge Diarization and STT results

> Learn how to combine diarization results with automatic speech recognition to get transcribed speaker segments.

<Card title="Get speaker-attributed transcripts with our diarize endpoint" icon="message-captions" href="/tutorials/speech-to-text-diarization" cta="View tutorial">
  Use our hosted open-source STT models with specialized reconciliation to obtain speaker-attributed transcripts.
</Card>

## Prerequisites

<Warning>
  Use this tutorial only if you have your own transcripts from another STT service (like OpenAI Whisper, Google Speech-to-Text, etc.) that you want to combine with diarization results.
</Warning>

* Diarization results from pyannoteAI
* Transcript segments from your chosen ASR service

## Step 1: Get diarization segments

First, get diarization segments from a diarization job (see [how to diarize](/tutorials/how-to-diarize-audio) and [Get job](/api-reference/get-job)).

<Tip>
  Set the `exclusive` parameter to `true` when requesting diarization for speaker-attributed transcripts.

  * This removes overlapping speech so that each segment contains exactly one speaker. That makes the segments easier to align with STT/ASR output, which typically does not handle overlapping speech well.
  * **Note**: Exclusive diarization results are provided in the `exclusiveDiarization` field of the job output, alongside the regular diarization results.
</Tip>

Here is an example of some diarization segments:

```json Example diarization segments theme={null}
[
  {
    "start": 0.5,
    "end": 5.2,
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 7.8,
    "speaker": "SPEAKER_01"
  },
  {
    "start": 8.1,
    "end": 12.4,
    "speaker": "SPEAKER_01"
  }
]
```

Note that the segments contain `start` and `end` timestamps in seconds along with speaker labels.
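As a minimal sketch of picking the right field, the helper below reads segments from a job response that has already been parsed into a dictionary. The function name `get_diarization_segments` is hypothetical, and the layout it expects (results under an `output` key, with `diarization` and `exclusiveDiarization` lists) is an assumption; check the [Get job](/api-reference/get-job) reference for the exact response shape.

```python
from typing import Any


def get_diarization_segments(job: dict[str, Any], exclusive: bool = True) -> list[dict[str, Any]]:
    """Return diarization segments from a parsed job response, sorted by start time.

    Assumption: results live under job["output"], with regular segments in
    "diarization" and exclusive segments in "exclusiveDiarization".
    """
    output = job.get("output") or {}
    key = "exclusiveDiarization" if exclusive else "diarization"
    segments = output.get(key)
    if segments is None:
        raise KeyError(f"job output has no '{key}' field")
    return sorted(segments, key=lambda s: s["start"])


# Hypothetical job response built from the example segments above (unsorted on purpose).
example_job = {
    "output": {
        "exclusiveDiarization": [
            {"start": 5.2, "end": 7.8, "speaker": "SPEAKER_01"},
            {"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00"},
            {"start": 8.1, "end": 12.4, "speaker": "SPEAKER_01"},
        ]
    },
}

diarization_segments = get_diarization_segments(example_job)
```

Raising an error when the requested field is missing, rather than silently falling back to the regular results, makes it obvious when a job was submitted without the `exclusive` parameter.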

## Step 2: Get transcript segments with timestamps

Transcribe the same audio with your chosen ASR service and request segment-level timestamps. Here is an example of OpenAI `gpt-4o-transcribe` and `whisper-1` API transcript output with segment timestamps:

```json Example OpenAI transcript segments theme={null}
{
  "task": "transcribe",
  "duration": 42.7,
  "text": "Agent: Thanks for calling OpenAI support.\nCustomer: Hi, I need help with diarization.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 5.2,
      "text": "Thanks for calling OpenAI support."
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 5.2,
      "end": 12.8,
      "text": "Hi, I need help with diarization."
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 43
  }
}
```

Here, the `segments` array contains `start` and `end` timestamps in seconds along with the transcribed text.
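Only `start`, `end`, and `text` are needed for the merge. As an illustrative sketch, the hypothetical helper `load_transcript_segments` below parses a raw JSON response string and keeps just those three fields, skipping segments without usable timestamps. It assumes the response follows the shape shown above; other ASR services may use different field names.

```python
import json
from typing import Any


def load_transcript_segments(raw_json: str) -> list[dict[str, Any]]:
    """Parse an ASR response and keep start, end, and text for each segment."""
    result = json.loads(raw_json)
    segments = []
    for seg in result.get("segments", []):
        start, end = seg.get("start"), seg.get("end")
        # Skip segments with missing or inverted timestamps.
        if start is None or end is None or end < start:
            continue
        segments.append(
            {"start": float(start), "end": float(end), "text": seg.get("text", "").strip()}
        )
    return sorted(segments, key=lambda s: s["start"])


# Trimmed copy of the example response above.
raw = """
{
  "segments": [
    {"id": "seg_001", "start": 0.0, "end": 5.2, "text": "Thanks for calling OpenAI support."},
    {"id": "seg_002", "start": 5.2, "end": 12.8, "text": "Hi, I need help with diarization."}
  ]
}
"""
transcript_segments = load_transcript_segments(raw)
```

The returned list has the same `start`, `end`, and `text` keys that Step 3 reads from each transcript segment.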

## Step 3: Merge results

Combine the diarization segments with the ASR transcript segments by aligning them based on their timestamps. You can use the following segment-level adaptation of WhisperX's current [`assign_word_speakers` logic in diarize.py](https://github.com/m-bain/whisperX/blob/main/whisperx/diarize.py):

```python merge_diarization_asr.py theme={null}
# Assuming diarization_segments is a list of dictionaries from Step 1
# Example: diarization_segments = [{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}, ...]
diarization_segments = sorted(diarization_segments, key=lambda x: x["start"])

# Assuming transcript_result is a dictionary from Step 2
# Example: transcript_result = {"segments": [{"start": 0.0, "end": 5.2, "text": "..."}, ...]}
transcript_segments = transcript_result.get("segments", [])

# Set to True to assign the nearest speaker when there is no overlap.
fill_nearest = False

for seg in transcript_segments:
    seg_start = seg.get("start", 0.0)
    seg_end = seg.get("end", 0.0)
    speaker_overlap: dict[str, float] = {}

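    for dia in diarization_segments:
        # Overlap duration between this diarization turn and the transcript segment.
        intersection = min(dia["end"], seg_end) - max(dia["start"], seg_start)
        if intersection <= 0:
            continue

        # Accumulate the total overlap per speaker.
        speaker = dia["speaker"]
        speaker_overlap[speaker] = speaker_overlap.get(speaker, 0.0) + intersection

    if speaker_overlap:
        # Assign the speaker with the largest total overlap.
        seg["speaker"] = max(speaker_overlap.items(), key=lambda x: x[1])[0]
        continue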

    if fill_nearest and diarization_segments:
        midpoint = (seg_start + seg_end) / 2
        nearest = min(
            diarization_segments,
            key=lambda x: abs(((x["start"] + x["end"]) / 2) - midpoint),
        )
        seg["speaker"] = nearest["speaker"]
        continue

    seg["speaker"] = "UNKNOWN"
```

The merged segments will look something like this (fields not relevant to the merge, such as `type` and `id`, are omitted):

```json Merged diarization + ASR segments theme={null}
[
  {
    "start": 0.0,
    "end": 5.2,
    "text": "Thanks for calling OpenAI support.",
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 12.8,
    "text": "Hi, I need help with diarization.",
    "speaker": "SPEAKER_01"
  }
]
```
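To check the logic against the example data from Steps 1 and 2, the merge loop can be wrapped in a function. The sketch below is one way to do that; the name `assign_speakers` is hypothetical, and unlike the loop in Step 3 it returns new dictionaries rather than modifying the transcript segments in place.

```python
def assign_speakers(transcript_segments, diarization_segments, fill_nearest=False):
    """Label each transcript segment with the speaker that overlaps it the most."""
    merged = []
    for seg in transcript_segments:
        overlap = {}
        for dia in diarization_segments:
            # Length of the time range shared by the two segments (<= 0 means none).
            shared = min(dia["end"], seg["end"]) - max(dia["start"], seg["start"])
            if shared > 0:
                overlap[dia["speaker"]] = overlap.get(dia["speaker"], 0.0) + shared

        if overlap:
            # Speaker with the largest total overlap wins.
            speaker = max(overlap, key=overlap.get)
        elif fill_nearest and diarization_segments:
            # No overlap: fall back to the turn whose midpoint is closest.
            mid = (seg["start"] + seg["end"]) / 2
            nearest = min(
                diarization_segments,
                key=lambda d: abs((d["start"] + d["end"]) / 2 - mid),
            )
            speaker = nearest["speaker"]
        else:
            speaker = "UNKNOWN"
        merged.append({**seg, "speaker": speaker})
    return merged


# Example data from Steps 1 and 2.
diarization_segments = [
    {"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 7.8, "speaker": "SPEAKER_01"},
    {"start": 8.1, "end": 12.4, "speaker": "SPEAKER_01"},
]
transcript_segments = [
    {"start": 0.0, "end": 5.2, "text": "Thanks for calling OpenAI support."},
    {"start": 5.2, "end": 12.8, "text": "Hi, I need help with diarization."},
]

merged = assign_speakers(transcript_segments, diarization_segments)
for seg in merged:
    print(f'{seg["speaker"]}: {seg["text"]}')
```

The first transcript segment (0.0 to 5.2) shares about 4.7 seconds with the `SPEAKER_00` turn and nothing with the others, so it is labeled `SPEAKER_00`. The second (5.2 to 12.8) overlaps only the two `SPEAKER_01` turns, for roughly 2.6 and 4.3 seconds, so it is labeled `SPEAKER_01`.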

Learn more about [WhisperX on GitHub](https://github.com/m-bain/whisperX/) and [OpenAI Whisper](https://github.com/openai/whisper).
