When transcription is enabled, pyannoteAI combines the `precision-2` diarization model with Nvidia’s Parakeet-tdt-0.6b-v3 STT model, then applies specialized STT reconciliation logic to match transcript segments and speakers with highly accurate results.
If you already have transcripts from another service and want to combine them with our diarization results, see how to merge diarization and STT results.
Prerequisites
Before you start, you’ll need a pyannoteAI account with credits or an active subscription, and a pyannoteAI API key. For help creating an account and getting your API key, see the quickstart guide.

Transcription is available as an add-on feature for diarization jobs. Please note the following constraints:
- Diarization model: Only available with the `precision-2` model
- No identification: Cannot be used with speaker identification jobs
- Supported languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, or Ukrainian
- STT model: Nvidia Parakeet-tdt-0.6b-v3 model for transcription
1. Create diarization job with transcription
Send a POST request to the diarize endpoint with `transcription: true`.
Learn more about the diarize endpoint in the how to diarize tutorial.
diarize_with_transcription.py
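A minimal sketch of such a request, using the requests library. The endpoint URL, the `url` media field, and the response shape are assumptions based on the diarize tutorial; check the API reference for the authoritative contract.

```python
# Sketch: create a diarization job with transcription enabled.
# The endpoint path and `url` field are assumptions from the diarize tutorial.
import os

import requests

API_KEY = os.environ["PYANNOTEAI_API_KEY"]

response = requests.post(
    "https://api.pyannote.ai/v1/diarize",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com/audio.wav",  # publicly reachable media file
        "model": "precision-2",                  # transcription requires precision-2
        "transcription": True,                   # enable the transcription add-on
    },
)
response.raise_for_status()
print(response.json())
```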
The response contains a `jobId` that you can use to track the job progress:
Example response
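The exact fields may differ; a representative response looks like this (the `jobId` and `status` values below are illustrative):

```json
{
  "jobId": "3f2c9a1e-0000-0000-0000-000000000000",
  "status": "created"
}
```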
2. Get the speaker-attributed transcription results
Once you have a `jobId`, retrieve the results using either polling or webhooks. See how to get results for detailed examples.
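For a rough illustration, here is a polling sketch; the `/v1/jobs/{jobId}` status endpoint and the terminal status values are assumptions, so defer to the how to get results guide for the authoritative flow.

```python
# Sketch: poll the job until it reaches a terminal state.
# Endpoint path and status values are assumptions; see the results guide.
import os
import time

import requests

API_KEY = os.environ["PYANNOTEAI_API_KEY"]
job_id = "<your-job-id>"

while True:
    response = requests.get(
        f"https://api.pyannote.ai/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    response.raise_for_status()
    job = response.json()
    if job["status"] in ("succeeded", "failed", "canceled"):
        break
    time.sleep(5)  # back off between polls

output = job.get("output", {})
```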
When transcription is enabled, the completed job output includes both your standard diarization results (the `diarization` object, plus `exclusiveDiarization` if enabled) and two additional transcription fields in the `output` object:
Word-level transcription (`wordLevelTranscription`)
Individual words with precise timestamps and speaker attribution.
Turn-level transcription (`turnLevelTranscription`)
Complete speaker turns with full text, ideal for creating readable transcripts.
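For illustration only, a hypothetical excerpt of the `output` object; the per-entry field names below are assumptions, not the authoritative schema:

```json
{
  "diarization": [
    { "start": 0.5, "end": 4.2, "speaker": "SPEAKER_00" }
  ],
  "wordLevelTranscription": [
    { "start": 0.5, "end": 0.9, "word": "Hello", "speaker": "SPEAKER_00" },
    { "start": 1.0, "end": 1.6, "word": "everyone", "speaker": "SPEAKER_00" }
  ],
  "turnLevelTranscription": [
    { "start": 0.5, "end": 4.2, "text": "Hello everyone, welcome to the meeting.", "speaker": "SPEAKER_00" }
  ]
}
```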
3. Practical use cases
Word-level transcription
Word-level transcription is ideal for applications requiring precise timing (see the SRT sketch after this list):
- Subtitles and captions: Generate accurate timestamps for each word to create synchronized subtitles
- Video editing: Enable precise word-level navigation for editing tools
- Detailed analysis: Analyze speaking patterns, word timing, and more
- Search and indexing: Create searchable transcripts with exact word positions
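As one example of the subtitles use case, here is a sketch that groups word-level entries into SRT captions; the `start`, `end`, and `word` fields are taken from the illustrative output excerpt above, not from the official schema.

```python
# Sketch: convert word-level entries into SRT captions by grouping words.
# Entry fields (`start`, `end`, `word`) are assumptions from the excerpt above.
def to_srt_timestamp(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict], words_per_caption: int = 7) -> str:
    blocks = []
    for i in range(0, len(words), words_per_caption):
        chunk = words[i:i + words_per_caption]
        start = to_srt_timestamp(chunk[0]["start"])
        end = to_srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"
```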
Turn-level transcription
Turn-level transcription provides complete speaker utterances, making it more suitable for:
- Meeting notes: Generate readable transcripts of conversations and meetings
- Interview transcripts: Create clean, easy-to-read interview documentation
- Customer service logs: Document support calls with speaker-attributed dialogue
- Content summarization: Feed into AI summarization tools for generating meeting summaries
4. Format transcript example
Here’s an example of formatting the turn-level transcription into a readable transcript:

format_transcript.py
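The snippet below is a sketch of what such a script could look like; the `start`, `end`, `speaker`, and `text` fields follow the illustrative output excerpt above.

```python
# Sketch of format_transcript.py: render turn-level entries as readable lines.
# Entry fields are assumptions based on the illustrative output excerpt.
def format_timestamp(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    return f"{m:02}:{s:02}"

def format_transcript(turns: list[dict]) -> str:
    lines = []
    for turn in turns:
        stamp = f"[{format_timestamp(turn['start'])} - {format_timestamp(turn['end'])}]"
        lines.append(f"{stamp} {turn['speaker']}: {turn['text']}")
    return "\n".join(lines)

# Example with a hypothetical turn:
turns = [
    {
        "start": 0.5,
        "end": 4.2,
        "speaker": "SPEAKER_00",
        "text": "Hello everyone, welcome to the meeting.",
    },
]
print(format_transcript(turns))
# [00:00 - 00:04] SPEAKER_00: Hello everyone, welcome to the meeting.
```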
Limitations and considerations
Current limitations and future development:
- Diarization model: Only available with the `precision-2` model. Additional diarization model support is coming later.
- STT model: Currently supports Nvidia Parakeet-tdt-0.6b-v3 only. We are working on adding support for additional transcription models.
- Not compatible with identification: Transcription cannot be used with speaker identification jobs yet.
- Language support: See the Prerequisites section for the full list of supported languages.