While pyannoteAI does not offer built-in ASR capabilities, you can easily combine its diarization results with any ASR service (such as OpenAI Whisper or Google Speech-to-Text) to obtain speaker-attributed transcriptions.

Prerequisites

  • Diarization results from pyannoteAI
  • Transcript segments from your chosen ASR service

Step 1: Get diarization segments

First, get diarization segments from a completed diarization job (see how to diarize and Get job). Here is an example of diarization segments:
Example diarization segments
[
  {
    "start": 0.5,
    "end": 5.2,
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 7.8,
    "speaker": "SPEAKER_01"
  },
  {
    "start": 8.1,
    "end": 12.4,
    "speaker": "SPEAKER_01"
  }
]
Note that the segments contain start and end timestamps in seconds along with speaker labels.
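
If you use the pyannoteAI REST API directly, submitting a job and polling for its result looks roughly like the sketch below. Treat the endpoint paths, status values, and output field names as assumptions to verify against the API reference and the Get job documentation:
fetch_diarization_segments.py
import time
import requests

API_KEY = "YOUR_PYANNOTEAI_API_KEY"  # placeholder, keep your real key secret
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a diarization job for an audio file reachable by URL
# (endpoint path assumed from the pyannoteAI API reference)
response = requests.post(
    "https://api.pyannote.ai/v1/diarize",
    headers=HEADERS,
    json={"url": "https://example.com/audio.wav"},  # placeholder audio URL
)
response.raise_for_status()
job_id = response.json()["jobId"]

# Poll the job until it reaches a terminal status
# (status values assumed; see the Get job documentation)
while True:
    job = requests.get(
        f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=HEADERS
    ).json()
    if job["status"] not in ("created", "processing"):
        break
    time.sleep(5)

# On success, the output contains segments shaped like the example above
diarization_segments = job["output"]["diarization"]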

Step 2: Get transcript segments with timestamps

Get transcript segments with timestamps for the same audio from your chosen ASR service. Here is an example of transcript output with segment timestamps from the OpenAI gpt-4o-transcribe and whisper-1 APIs:
Example OpenAI transcript segments
{
  "task": "transcribe",
  "duration": 42.7,
  "text": "Agent: Thanks for calling OpenAI support.\nCustomer: Hi, I need help with diarization.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 5.2,
      "text": "Thanks for calling OpenAI support."
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 5.2,
      "end": 12.8,
      "text": "Hi, I need help with diarization."
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 43
  }
}
Here, the segments array contains start and end timestamps in seconds along with the transcribed text.
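
As a concrete example, the official openai Python SDK can produce such segments with whisper-1, which supports response_format="verbose_json" with segment-level timestamps. A minimal sketch (other models and SDK versions may return a different response shape):
get_transcript_segments.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe the same audio file that was sent for diarization
with open("audio.wav", "rb") as audio_file:  # placeholder path
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

# Normalize into the {"segments": [{"start", "end", "text"}, ...]} shape
# expected by the merge code in Step 3
transcript_result = {
    "segments": [
        {"start": s.start, "end": s.end, "text": s.text}
        for s in transcription.segments
    ]
}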

Step 3: Merge results

Combine the diarization segments with the ASR transcript segments by aligning them on their timestamps. You can use the following Python code, adapted from WhisperX's diarize.py, to achieve this:
merge_diarization_asr.py
import numpy as np
import pandas as pd

# Assuming diarization_segments is a list of dictionaries from Step 1
# Example: diarization_segments = [{"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00"}, ...]
diarize_df = pd.DataFrame(diarization_segments)

# Assuming transcript_result is a dictionary from Step 2
# Example: transcript_result = {"segments": [{"start": 0.0, "end": 5.2, "text": "..."}, ...]}
transcript_segments = transcript_result["segments"]

# If True, assign speakers even when there's no direct time overlap
fill_nearest = True

for seg in transcript_segments:
    # assign speaker to segment (if any)
    # Signed overlap between this transcript segment and each diarization
    # segment: positive values mean the two segments intersect in time
    diarize_df['intersection'] = np.minimum(diarize_df['end'], seg['end']) - np.maximum(diarize_df['start'], seg['start'])
    if not fill_nearest:
        # keep only diarization segments that actually overlap this one
        dia_tmp = diarize_df[diarize_df['intersection'] > 0]
    else:
        # keep all segments; the least negative intersection is the nearest
        dia_tmp = diarize_df
    if len(dia_tmp) > 0:
        # assign the speaker with the largest total overlap with this segment
        speaker = dia_tmp.groupby("speaker")["intersection"].sum().sort_values(ascending=False).index[0]
        seg["speaker"] = speaker
The resulting merged segments will look something like this:
Merged diarization + ASR segments
[
  {
    "start": 0.0,
    "end": 5.2,
    "text": "Thanks for calling OpenAI support.",
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 12.8,
    "text": "Hi, I need help with diarization.",
    "speaker": "SPEAKER_01"
  }
]
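
Finally, a small loop can render these merged segments as a readable, speaker-labeled transcript. A minimal sketch, where merged_segments is the list above (segments that never overlapped a diarization segment may lack a "speaker" key):
print_transcript.py
# merged_segments: the speaker-attributed list produced in Step 3
for seg in merged_segments:
    speaker = seg.get("speaker", "UNKNOWN")  # fall back when no speaker was assigned
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {speaker}: {seg['text']}")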
Learn more about WhisperX and OpenAI Whisper on GitHub.