While pyannoteAI does not offer built-in ASR capabilities, you can easily combine its diarization results with any ASR service (such as OpenAI Whisper or Google Speech-to-Text) to obtain speaker-attributed transcriptions.

Prerequisites

  • Diarization results from pyannoteAI
  • Transcript segments from your chosen ASR service

Step 1: Get diarization segments

First, get diarization segments from a completed diarization job (see how to diarize and Get job). Here is an example of diarization segments:
Example diarization segments
[
  {
    "start": 0.5,
    "end": 5.2,
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 7.8,
    "speaker": "SPEAKER_01"
  },
  {
    "start": 8.1,
    "end": 12.4,
    "speaker": "SPEAKER_01"
  }
]
Note that the segments contain start and end timestamps in seconds along with speaker labels.
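
If you use the pyannoteAI REST API directly, submitting a job and polling for its result looks roughly like the sketch below. Treat the endpoint paths, status values, and output field names as assumptions to verify against the API reference and the Get job documentation:
fetch_diarization_segments.py
import time
import requests

API_KEY = "YOUR_PYANNOTEAI_API_KEY"  # placeholder, keep your real key secret
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a diarization job for an audio file reachable by URL
# (endpoint path assumed from the pyannoteAI API reference)
response = requests.post(
    "https://api.pyannote.ai/v1/diarize",
    headers=HEADERS,
    json={"url": "https://example.com/audio.wav"},  # placeholder audio URL
)
response.raise_for_status()
job_id = response.json()["jobId"]

# Poll the job until it reaches a terminal status
# (status values assumed; see the Get job documentation)
while True:
    job = requests.get(
        f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=HEADERS
    ).json()
    if job["status"] not in ("created", "processing"):
        break
    time.sleep(5)

# On success, the output contains segments shaped like the example above
diarization_segments = job["output"]["diarization"]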

Step 2: Get transcript segments with timestamps

Get transcript segments with timestamps for the same audio from your chosen ASR service. Here is an example of transcript output with segment timestamps from the OpenAI gpt-4o-transcribe and whisper-1 APIs:
Example OpenAI transcript segments
{
  "task": "transcribe",
  "duration": 42.7,
  "text": "Agent: Thanks for calling OpenAI support.\nCustomer: Hi, I need help with diarization.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 5.2,
      "text": "Thanks for calling OpenAI support."
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 5.2,
      "end": 12.8,
      "text": "Hi, I need help with diarization."
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 43
  }
}
Here, the segments array contains start and end timestamps in seconds along with the transcribed text.
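
As a concrete example, the official openai Python SDK can produce such segments with whisper-1, which supports response_format="verbose_json" with segment-level timestamps. A minimal sketch (other models and SDK versions may return a different response shape):
get_transcript_segments.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe the same audio file that was sent for diarization
with open("audio.wav", "rb") as audio_file:  # placeholder path
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

# Normalize into the {"segments": [{"start", "end", "text"}, ...]} shape
# expected by the merge code in Step 3
transcript_result = {
    "segments": [
        {"start": s.start, "end": s.end, "text": s.text}
        for s in transcription.segments
    ]
}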

Step 3: Merge results

Combine the diarization segments with the ASR transcript segments by aligning them on their timestamps. You can use the following Python code, adapted from WhisperX's diarize.py, to achieve this:
merge_diarization_asr.py
import numpy as np
import pandas as pd

# Assuming diarization_segments is a list of dictionaries from Step 1
# Example: diarization_segments = [{"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00"}, ...]
diarize_df = pd.DataFrame(diarization_segments)

# Assuming transcript_result is a dictionary from Step 2
# Example: transcript_result = {"segments": [{"start": 0.0, "end": 5.2, "text": "..."}, ...]}
transcript_segments = transcript_result["segments"]

# If True, assign speakers even when there's no direct time overlap
fill_nearest = True

for seg in transcript_segments:
    # assign speaker to segment (if any)
    # Signed overlap between this transcript segment and each diarization
    # segment: positive values mean the two segments intersect in time
    diarize_df['intersection'] = np.minimum(diarize_df['end'], seg['end']) - np.maximum(diarize_df['start'], seg['start'])
    if not fill_nearest:
        # keep only diarization segments that actually overlap this one
        dia_tmp = diarize_df[diarize_df['intersection'] > 0]
    else:
        # keep all segments; the least negative intersection is the nearest
        dia_tmp = diarize_df
    if len(dia_tmp) > 0:
        # assign the speaker with the largest total overlap with this segment
        speaker = dia_tmp.groupby("speaker")["intersection"].sum().sort_values(ascending=False).index[0]
        seg["speaker"] = speaker
The resulting merged segments will look something like this:
Merged diarization + ASR segments
[
  {
    "start": 0.0,
    "end": 5.2,
    "text": "Thanks for calling OpenAI support.",
    "speaker": "SPEAKER_00"
  },
  {
    "start": 5.2,
    "end": 12.8,
    "text": "Hi, I need help with diarization.",
    "speaker": "SPEAKER_01"
  }
]
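
Finally, a small loop can render these merged segments as a readable, speaker-labeled transcript. A minimal sketch, where merged_segments is the list above (segments that never overlapped a diarization segment may lack a "speaker" key):
print_transcript.py
# merged_segments: the speaker-attributed list produced in Step 3
for seg in merged_segments:
    speaker = seg.get("speaker", "UNKNOWN")  # fall back when no speaker was assigned
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {speaker}: {seg['text']}")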
Learn more about WhisperX and OpenAI Whisper on GitHub.