Enable speaker-attributed transcription in your diarization jobs. Our API runs the precision-2 diarization model and Nvidia’s Parakeet-tdt-0.6b-v3 STT model, then applies specialized reconciliation logic to match transcript segments to speakers with high accuracy.
If you already have transcripts from another service and want to combine them with our diarization results, see how to merge diarization and STT results.

Prerequisites

Before you start, you’ll need a pyannoteAI account with credits or an active subscription, and a pyannoteAI API key. For help creating an account and getting your API key, see the quickstart guide.
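The examples in this guide pass the key from a plain variable. To keep it out of source code, you can load it from an environment variable instead; a minimal sketch, assuming a PYANNOTE_API_KEY variable has been exported (the variable name is a convention, not a requirement):
load_api_key.py
import os

# Read the key from the environment; raises KeyError if it is not set
api_key = os.environ["PYANNOTE_API_KEY"]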
Transcription is available as an add-on feature for diarization jobs. Please note the following constraints:
  • Diarization model: Only available with the precision-2 model
  • No identification: Cannot be used with speaker identification jobs
  • Supported languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, or Ukrainian
  • STT model: Nvidia Parakeet-tdt-0.6b-v3 model for transcription

1. Create diarization job with transcription

Send a POST request to the diarize endpoint with transcription: true.
Learn more about the diarize endpoint in the how to diarize tutorial.
diarize_with_transcription.py
import requests

url = "https://api.pyannote.ai/v1/diarize"
api_key = "YOUR_API_KEY"  # In production, use environment variables: os.getenv("PYANNOTE_API_KEY")

headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
data = {
    "url": "https://files.pyannote.ai/marklex1min.wav",  # publicly accessible audio file
    "transcription": True  # enable speaker-attributed transcription
}

response = requests.post(url, headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code} - {response.text}")
else:
    print(response.json())
The response will include a jobId that you can use to track the job progress:
Example response
{
  "jobId": "3c8a89a5-dcc6-4edb-a75d-ffd64739674d",
  "status": "created"
}
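Grab the jobId from the response; you will need it to retrieve the results in the next step:
get_job_id.py
# Extract the job identifier from the creation response
job_id = response.json()["jobId"]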

2. Get the speaker-attributed transcription results

Once you have a jobId, retrieve the results using either polling or webhooks. See how to get results for detailed examples.
Job results for all endpoints are automatically deleted after 24 hours, so make sure to save them in your own database.
When transcription is enabled, the completed job output includes your standard diarization results (the diarization object, plus exclusiveDiarization if enabled) and two additional transcription fields in the output object:
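For example, a minimal polling sketch using the GET /v1/jobs/{jobId} endpoint covered in the how to get results guide (the exact status values checked below are assumptions; webhooks avoid polling entirely):
poll_job.py
import time

import requests

def wait_for_job(job_id, api_key, interval=5, timeout=600):
    """Poll the job until it finishes and return the completed job payload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = requests.get(
            f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=headers
        ).json()
        if job["status"] == "succeeded":
            return job
        if job["status"] in ("failed", "canceled"):
            raise RuntimeError(f"Job ended with status: {job['status']}")
        time.sleep(interval)
    raise TimeoutError("Job did not finish before the timeout")

completed_job = wait_for_job(job_id, api_key)
output = completed_job["output"]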

Word-level transcription

Individual words with precise timestamps and speaker attribution:
wordLevelTranscription
{
  ...,
  "wordLevelTranscription": [
    {
      "start": 0.5,
      "end": 0.8,
      "text": "Hello",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 0.9,
      "end": 1.2,
      "text": "everyone",
      "speaker": "SPEAKER_00"
    }
  ]
}

Turn-level transcription

Complete speaker turns with full text, ideal for creating readable transcripts:
turnLevelTranscription
{
  ...,
  "turnLevelTranscription": [
    {
      "start": 0.5,
      "end": 3.2,
      "text": "Hello everyone, welcome to the meeting.",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 3.5,
      "end": 6.8,
      "text": "Hi, thanks for having me.",
      "speaker": "SPEAKER_01"
    }
  ]
}

3. Practical use cases

Word-level transcription

Word-level transcription is ideal for applications requiring precise timing:
  • Subtitles and captions: Generate accurate timestamps for each word to create synchronized subtitles (see the SRT sketch after this list)
  • Video editing: Enable precise word-level navigation for editing tools
  • Detailed analysis: Analyze speaking patterns, word timing, and more
  • Search and indexing: Create searchable transcripts with exact word positions
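As an illustration of the subtitles use case, here is a minimal sketch (not part of the API) that converts wordLevelTranscription into SRT captions, grouping consecutive words from the same speaker:
words_to_srt.py
def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group consecutive same-speaker words into numbered SRT captions."""
    captions, current = [], []
    for word in words:
        # Start a new caption on speaker change or when the caption is full
        if current and (word["speaker"] != current[-1]["speaker"] or len(current) >= max_words):
            captions.append(current)
            current = []
        current.append(word)
    if current:
        captions.append(current)

    lines = []
    for i, caption in enumerate(captions, start=1):
        text = " ".join(w["text"] for w in caption)
        lines.append(
            f"{i}\n{srt_time(caption[0]['start'])} --> {srt_time(caption[-1]['end'])}\n{text}\n"
        )
    return "\n".join(lines)

# Example usage with the completed job output from step 2
print(words_to_srt(output["wordLevelTranscription"]))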

Turn-level transcription

Turn-level transcription provides complete speaker utterances, making it more suitable for the following (a short analysis sketch follows this list):
  • Meeting notes: Generate readable transcripts of conversations and meetings
  • Interview transcripts: Create clean, easy-to-read interview documentation
  • Customer service logs: Document support calls with speaker-attributed dialogue
  • Content summarization: Feed into AI summarization tools for generating meeting summaries
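Turn-level output is also easy to analyze directly. A minimal, illustrative sketch that tallies speaking time per speaker from the output retrieved in step 2:
speaking_time.py
from collections import defaultdict

def speaking_time(turns):
    """Total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)

print(speaking_time(output["turnLevelTranscription"]))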

4. Format the transcript

Here’s an example of formatting the turn-level transcription into a readable transcript:
format_transcript.py
def format_transcript(turn_level_transcription):
    """Format turn-level transcription as a readable transcript"""
    transcript = []
    
    for turn in turn_level_transcription:
        speaker = turn["speaker"]
        text = turn["text"]
        timestamp = f"{int(turn['start'] // 60)}:{int(turn['start'] % 60):02d}"
        
        transcript.append(f"{speaker} ({timestamp}): {text}")
    
    return "\n\n".join(transcript)

# Example usage: the transcription lives in the completed job payload
# retrieved in step 2, not in the initial POST response
output = completed_job["output"]
print(format_transcript(output["turnLevelTranscription"]))
This will produce a transcript like:
SPEAKER_00 (0:00): Hello everyone, welcome to the meeting.
SPEAKER_01 (0:03): Hi, thanks for having me.

Limitations and considerations

Current limitations and future development:
  • Diarization model: Only available with the precision-2 model. Support for additional diarization models is planned.
  • STT model: Currently supports Nvidia Parakeet-tdt-0.6b-v3 only. We are working on adding support for additional transcription models.
  • Not compatible with identification: Transcription cannot be used with speaker identification jobs yet.
  • Language support: See the Prerequisites section for the full list of supported languages.
Processing time and cost:
  • Transcription jobs take longer to process than diarization-only jobs, as they run both diarization and speech recognition models.
  • Enabling transcription costs more than a standard precision-2 diarization job. View current pricing on the billing page in your dashboard.