Enable speaker-attributed transcription in your diarization jobs. Our API runs the precision-2 diarization model and Nvidia’s Parakeet-tdt-0.6b-v3 STT model, then applies specialized reconciliation logic to match transcript segments to speakers with high accuracy.
If you already have transcripts from another service and want to combine them with our diarization results, see how to merge diarization and STT results.

Prerequisites

Before you start, you’ll need a pyannoteAI account with credits or an active subscription, and a pyannoteAI API key. For help creating an account and getting your API key, see the quickstart guide.
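The examples in this guide pass the key from a plain variable. To keep it out of source code, you can load it from an environment variable instead; a minimal sketch, assuming a PYANNOTE_API_KEY variable has been exported (the variable name is a convention, not a requirement):
load_api_key.py
import os

# Read the key from the environment; raises KeyError if it is not set
api_key = os.environ["PYANNOTE_API_KEY"]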
Transcription is available as an add-on feature for diarization jobs. Please note the following constraints:
  • Diarization model: Only available with the precision-2 model
  • No identification: Cannot be used with speaker identification jobs
  • Supported languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, or Ukrainian
  • STT model: Nvidia Parakeet-tdt-0.6b-v3 model for transcription

1. Create diarization job with transcription

Send a POST request to the diarize endpoint with transcription: true.
Learn more about the diarize endpoint in the how to diarize tutorial.
diarize_with_transcription.py
import requests

url = "https://api.pyannote.ai/v1/diarize"
api_key = "YOUR_API_KEY"  # In production, use environment variables: os.getenv("PYANNOTE_API_KEY")

headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
data = {
    "url": "https://files.pyannote.ai/marklex1min.wav",  # publicly accessible audio file
    "transcription": True  # enable speaker-attributed transcription
}

response = requests.post(url, headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code} - {response.text}")
else:
    print(response.json())
The response will include a jobId that you can use to track the job progress:
Example response
{
  "jobId": "3c8a89a5-dcc6-4edb-a75d-ffd64739674d",
  "status": "created"
}
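Grab the jobId from the response; you will need it to retrieve the results in the next step:
get_job_id.py
# Extract the job identifier from the creation response
job_id = response.json()["jobId"]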

2. Get the speaker-attributed transcription results

Once you have a jobId, retrieve the results using either polling or webhooks. See how to get results for detailed examples.
Job results for all endpoints are automatically deleted after 24 hours, so make sure to save them in your own database.
When transcription is enabled, the completed job output includes your standard diarization results (the diarization object, plus exclusiveDiarization if enabled) and two additional transcription fields in the output object:
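For example, a minimal polling sketch using the GET /v1/jobs/{jobId} endpoint covered in the how to get results guide (the exact status values checked below are assumptions; webhooks avoid polling entirely):
poll_job.py
import time

import requests

def wait_for_job(job_id, api_key, interval=5, timeout=600):
    """Poll the job until it finishes and return the completed job payload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = requests.get(
            f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=headers
        ).json()
        if job["status"] == "succeeded":
            return job
        if job["status"] in ("failed", "canceled"):
            raise RuntimeError(f"Job ended with status: {job['status']}")
        time.sleep(interval)
    raise TimeoutError("Job did not finish before the timeout")

completed_job = wait_for_job(job_id, api_key)
output = completed_job["output"]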

Word-level transcription

Individual words with precise timestamps and speaker attribution:
wordLevelTranscription
{
  ...,
  "wordLevelTranscription": [
    {
      "start": 0.5,
      "end": 0.8,
      "text": "Hello",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 0.9,
      "end": 1.2,
      "text": "everyone",
      "speaker": "SPEAKER_00"
    }
  ]
}

Turn-level transcription

Complete speaker turns with full text, ideal for creating readable transcripts:
turnLevelTranscription
{
  ...,
  "turnLevelTranscription": [
    {
      "start": 0.5,
      "end": 3.2,
      "text": "Hello everyone, welcome to the meeting.",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 3.5,
      "end": 6.8,
      "text": "Hi, thanks for having me.",
      "speaker": "SPEAKER_01"
    }
  ]
}

3. Practical use cases

Word-level transcription

Word-level transcription is ideal for applications requiring precise timing:
  • Subtitles and captions: Generate accurate timestamps for each word to create synchronized subtitles (see the SRT sketch after this list)
  • Video editing: Enable precise word-level navigation for editing tools
  • Detailed analysis: Analyze speaking patterns, word timing, and more
  • Search and indexing: Create searchable transcripts with exact word positions
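As an illustration of the subtitles use case, here is a minimal sketch (not part of the API) that converts wordLevelTranscription into SRT captions, grouping consecutive words from the same speaker:
words_to_srt.py
def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group consecutive same-speaker words into numbered SRT captions."""
    captions, current = [], []
    for word in words:
        # Start a new caption on speaker change or when the caption is full
        if current and (word["speaker"] != current[-1]["speaker"] or len(current) >= max_words):
            captions.append(current)
            current = []
        current.append(word)
    if current:
        captions.append(current)

    lines = []
    for i, caption in enumerate(captions, start=1):
        text = " ".join(w["text"] for w in caption)
        lines.append(
            f"{i}\n{srt_time(caption[0]['start'])} --> {srt_time(caption[-1]['end'])}\n{text}\n"
        )
    return "\n".join(lines)

# Example usage with the completed job output from step 2
print(words_to_srt(output["wordLevelTranscription"]))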

Turn-level transcription

Turn-level transcription provides complete speaker utterances, making it more suitable for the following (a short analysis sketch follows this list):
  • Meeting notes: Generate readable transcripts of conversations and meetings
  • Interview transcripts: Create clean, easy-to-read interview documentation
  • Customer service logs: Document support calls with speaker-attributed dialogue
  • Content summarization: Feed into AI summarization tools for generating meeting summaries
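Turn-level output is also easy to analyze directly. A minimal, illustrative sketch that tallies speaking time per speaker from the output retrieved in step 2:
speaking_time.py
from collections import defaultdict

def speaking_time(turns):
    """Total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)

print(speaking_time(output["turnLevelTranscription"]))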

4. Format the transcript

Here’s an example of formatting the turn-level transcription into a readable transcript:
format_transcript.py
def format_transcript(turn_level_transcription):
    """Format turn-level transcription as a readable transcript"""
    transcript = []
    
    for turn in turn_level_transcription:
        speaker = turn["speaker"]
        text = turn["text"]
        timestamp = f"{int(turn['start'] // 60)}:{int(turn['start'] % 60):02d}"
        
        transcript.append(f"{speaker} ({timestamp}): {text}")
    
    return "\n\n".join(transcript)

# Example usage: the transcription lives in the completed job payload
# retrieved in step 2, not in the initial POST response
output = completed_job["output"]
print(format_transcript(output["turnLevelTranscription"]))
This will produce a transcript like:
SPEAKER_00 (0:00): Hello everyone, welcome to the meeting.
SPEAKER_01 (0:03): Hi, thanks for having me.

Limitations and considerations

Current limitations and future development:
  • Diarization model: Only available with the precision-2 model. Support for additional diarization models is planned.
  • STT model: Currently supports Nvidia Parakeet-tdt-0.6b-v3 only. We are working on adding support for additional transcription models.
  • Not compatible with identification: Transcription cannot be used with speaker identification jobs yet.
  • Language support: See the Prerequisites section for the full list of supported languages.
Processing time and cost:
  • Transcription jobs take longer to process than diarization-only jobs, as they run both diarization and speech recognition models.
  • Enabling transcription costs more than a standard precision-2 diarization job. View current pricing on the billing page in your dashboard.