# STT Orchestration: Speech-to-text with speaker diarization

> Learn how to get transcribed speaker segments with automatic speech recognition and diarization in a single API call.

<Badge icon="sparkles" color="green" shape="pill" size="lg">New</Badge>

Enable speaker-attributed transcription in your diarization jobs. The API runs the `precision-2` diarization model alongside an STT model (Nvidia's Parakeet-tdt-0.6b-v3 or OpenAI's whisper-large-v3-turbo), then applies specialized STT reconciliation logic to accurately match transcript segments to speakers.

<Info>
  If you already have transcripts from another service and want to combine them with our diarization results, see [how to merge diarization and STT results](/tutorials/diarization-asr-merge).
</Info>

### Prerequisites

Before you start, you'll need a pyannoteAI account with credits or an active subscription, and a pyannoteAI API key. For help creating an account and getting your API key, see the [quickstart guide](/quickstart). For pricing and charging details, see [Billing](/administration/billing).

<Note>
  Transcription is available as an add-on feature for diarization jobs. Please note the following constraints:

  * **Diarization model**: Only available with the `precision-2` model
  * **No identification**: Cannot be used with speaker identification jobs
  * **Supported languages**: 100 languages in total, although the exact number may vary with the chosen transcription model. For the complete list, refer to the [API Reference](/api-reference/diarize#body-transcription-config)
  * **STT models**: [Nvidia Parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) or [OpenAI whisper-large-v3-turbo](https://huggingface.co/dropbox-dash/faster-whisper-large-v3-turbo)
</Note>

## 1. Create diarization job with transcription

Send a POST request to the diarize endpoint with `transcription: true`.

<Info>
  Learn more about the diarize endpoint in the [how to diarize tutorial](/tutorials/how-to-diarize-audio).
</Info>

<CodeGroup dropdown>
  ```python diarize_with_transcription.py theme={null}
  import requests

  url = "https://api.pyannote.ai/v1/diarize"
  api_key = "YOUR_API_KEY"  # In production, use environment variables: os.getenv("PYANNOTE_API_KEY")

  headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
  data = {
      "url": "https://files.pyannote.ai/marklex1min.wav",
      "transcription": True
  }

  response = requests.post(url, headers=headers, json=data)

  if response.status_code != 200:
      print(f"Error: {response.status_code} - {response.text}")
  else:
      print(response.json())
  ```

  ```bash theme={null}
  # Set your API key as an environment variable in production
  # export PYANNOTE_API_KEY="your_api_key_here"
  curl -X POST "https://api.pyannote.ai/v1/diarize" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://files.pyannote.ai/marklex1min.wav",
      "transcription": true
    }'
  ```

  ```typescript diarize_with_transcription.ts theme={null}
  const url = "https://api.pyannote.ai/v1/diarize";
  const apiKey = "YOUR_API_KEY"; // In production, use environment variables: process.env.PYANNOTE_API_KEY
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  const data = {
    url: "https://files.pyannote.ai/marklex1min.wav",
    transcription: true,
  };

  const response = await fetch(url, {
    method: "POST",
    headers,
    body: JSON.stringify(data),
  });

  if (!response.ok) {
    console.error(`Error: ${response.status} - ${await response.text()}`);
  } else {
    console.log(await response.json());
  }
  ```
</CodeGroup>

The response will include a `jobId` that you can use to track the job progress:

```json Example response theme={null}
{
  "jobId": "3c8a89a5-dcc6-4edb-a75d-ffd64739674d",
  "status": "created"
}
```

<Info>
  To further configure transcription, see [Additional options](/tutorials/speech-to-text-diarization#5-configure-additional-transcription-options).
</Info>

***

## 2. Get the speaker-attributed transcription results

Once you have a `jobId`, retrieve the results using either polling or webhooks. See [how to get results](/tutorials/how-to-diarize-audio#2-get-diarization-result) for detailed examples.
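
As a rough illustration of the polling approach, the sketch below repeatedly fetches the job until it reaches a terminal status. `wait_for_job` is a hypothetical helper, and both the status names and the jobs URL in the usage comment are assumptions rather than details taken from this page, so check them against the API reference.

```python
import time
from typing import Callable

# Assumed terminal status names; verify the exact values in the API reference.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job() until the job reaches a terminal status, then return it."""
    for _ in range(max_attempts):
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_seconds)
    raise TimeoutError(f"Job still running after {max_attempts} attempts")


# Hypothetical usage (the jobs URL is an assumption, not taken from this page):
#   import requests
#   job = wait_for_job(
#       lambda: requests.get(
#           f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=headers
#       ).json()
#   )
#   output = job["output"]
```

Passing the fetch step in as a callable keeps the loop independent of any HTTP client, which also makes it easy to exercise with canned responses.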

<Warning>
  **Job results are automatically deleted after 24 hours**, for all endpoints. Make sure to save your results in your own database.
</Warning>

When transcription is enabled, the completed job output will include both your standard diarization results (`diarization` object, and `exclusiveDiarization` if enabled) **and** two additional transcription fields in the `output` object:

### Word-level transcription

Individual words with precise timestamps and speaker attribution:

```json wordLevelTranscription theme={null}
{
  ...,
  "wordLevelTranscription": [
    {
      "start": 0.5,
      "end": 0.8,
      "text": "Hello",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 0.9,
      "end": 1.2,
      "text": "everyone",
      "speaker": "SPEAKER_00"
    }
  ]
}
```
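
For instance, a keyword lookup over the word-level entries might look like the sketch below. `find_word` is a hypothetical helper, not part of the API, and the sample data mirrors the example above.

```python
import string


def find_word(word_level_transcription, query):
    """Return (start, speaker) for each word matching query, ignoring case and punctuation."""
    target = query.lower()
    hits = []
    for word in word_level_transcription:
        # Strip surrounding punctuation such as "everyone," before comparing.
        normalized = word["text"].strip(string.punctuation + string.whitespace).lower()
        if normalized == target:
            hits.append((word["start"], word["speaker"]))
    return hits


words = [
    {"start": 0.5, "end": 0.8, "text": "Hello", "speaker": "SPEAKER_00"},
    {"start": 0.9, "end": 1.2, "text": "everyone", "speaker": "SPEAKER_00"},
]
print(find_word(words, "everyone"))  # [(0.9, 'SPEAKER_00')]
```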

### Turn-level transcription

Complete speaker turns with full text, ideal for creating readable transcripts:

```json turnLevelTranscription theme={null}
{
  ...,
  "turnLevelTranscription": [
    {
      "start": 0.5,
      "end": 3.2,
      "text": "Hello everyone, welcome to the meeting.",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 3.5,
      "end": 6.8,
      "text": "Hi, thanks for having me.",
      "speaker": "SPEAKER_01"
    }
  ]
}
```
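
Because each turn carries `start`, `end`, and `speaker`, simple statistics fall out directly. The sketch below sums talk time per speaker; `speaking_time` is a hypothetical helper, and it assumes `start` and `end` are expressed in seconds.

```python
from collections import defaultdict


def speaking_time(turn_level_transcription):
    """Sum the duration of each speaker's turns (assumes start/end are in seconds)."""
    totals = defaultdict(float)
    for turn in turn_level_transcription:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)


turns = [
    {"start": 0.5, "end": 3.2, "text": "Hello everyone, welcome to the meeting.", "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 6.8, "text": "Hi, thanks for having me.", "speaker": "SPEAKER_01"},
]
for speaker, seconds in speaking_time(turns).items():
    print(f"{speaker}: {seconds:.1f}s")
```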

***

## 3. Practical use cases

### Word-level transcription

Word-level transcription is ideal for applications requiring precise timing:

* **Subtitles and captions**: Generate accurate timestamps for each word to create synchronized subtitles
* **Video editing**: Enable precise word-level navigation for editing tools
* **Detailed analysis**: Analyze speaking patterns, word timing, and more
* **Search and indexing**: Create searchable transcripts with exact word positions
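
To make the subtitle use case concrete, here is a minimal sketch that groups word-level entries into SubRip (`.srt`) captions. The helper names and the `max_words` cutoff are illustrative assumptions; a real captioning pipeline would likely also break on pauses and line length.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(word_level_transcription, max_words=7):
    """Group words into captions of up to max_words, breaking on speaker change."""
    captions, current = [], []
    for word in word_level_transcription:
        speaker_changed = current and word["speaker"] != current[0]["speaker"]
        if current and (len(current) >= max_words or speaker_changed):
            captions.append(current)
            current = []
        current.append(word)
    if current:
        captions.append(current)

    blocks = []
    for index, caption in enumerate(captions, start=1):
        start = srt_timestamp(caption[0]["start"])
        end = srt_timestamp(caption[-1]["end"])
        text = " ".join(w["text"] for w in caption)
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks)
```

Calling `words_to_srt(output["wordLevelTranscription"])` on a completed job's `output` would return a string you can write to a `.srt` file.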

### Turn-level transcription

Turn-level transcription provides complete speaker utterances, making it more suitable for:

* **Meeting notes**: Generate readable transcripts of conversations and meetings
* **Interview transcripts**: Create clean, easy-to-read interview documentation
* **Customer service logs**: Document support calls with speaker-attributed dialogue
* **Content summarization**: Feed into AI summarization tools for generating meeting summaries

***

## 4. Format transcript example

Here's an example of formatting the turn-level transcription into a readable transcript:

<CodeGroup dropdown>
  ```python format_transcript.py theme={null}
  def format_transcript(turn_level_transcription):
      """Format turn-level transcription as a readable transcript"""
      transcript = []
      
      for turn in turn_level_transcription:
          speaker = turn["speaker"]
          text = turn["text"]
          timestamp = f"{int(turn['start'] // 60)}:{int(turn['start'] % 60):02d}"
          
          transcript.append(f"{speaker} ({timestamp}): {text}")
      
      return "\n\n".join(transcript)

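  # Example usage: `job` is the completed job result retrieved in step 2,
  # not the response to the initial POST request (which only contains the jobId)
  output = job["output"]
  print(format_transcript(output["turnLevelTranscription"]))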
  ```

  ```typescript format_transcript.ts theme={null}
  function formatTranscript(turnLevelTranscription: any[]) {
    const transcript = turnLevelTranscription.map((turn) => {
      const speaker = turn.speaker;
      const text = turn.text;
      const minutes = Math.floor(turn.start / 60);
      const seconds = Math.floor(turn.start % 60).toString().padStart(2, '0');
      const timestamp = `${minutes}:${seconds}`;
      
      return `${speaker} (${timestamp}): ${text}`;
    });
    
    return transcript.join('\n\n');
  }

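  // Example usage: `job` is the parsed JSON of the completed job retrieved in step 2,
  // not the response to the initial POST request (which only contains the jobId)
  const output = job.output;
  console.log(formatTranscript(output.turnLevelTranscription));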
  ```
</CodeGroup>

This will produce a transcript like:

```
SPEAKER_00 (0:00): Hello everyone, welcome to the meeting.

SPEAKER_01 (0:03): Hi, thanks for having me.
```

***

## 5. Configure additional transcription options

[Nvidia Parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) is the default transcription model. To use a different one, such as [OpenAI whisper-large-v3-turbo](https://huggingface.co/dropbox-dash/faster-whisper-large-v3-turbo), set it explicitly in the `transcriptionConfig` object:

<CodeGroup dropdown>
  ```python diarize_with_transcription_whisper.py theme={null}
  import requests

  url = "https://api.pyannote.ai/v1/diarize"
  api_key = "YOUR_API_KEY"  # In production, use environment variables: os.getenv("PYANNOTE_API_KEY")

  headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
  data = {
      "url": "https://files.pyannote.ai/marklex1min.wav",
      "transcription": True,
      "transcriptionConfig": {
        "model": "faster-whisper-large-v3-turbo"
      }
  }

  response = requests.post(url, headers=headers, json=data)

  if response.status_code != 200:
      print(f"Error: {response.status_code} - {response.text}")
  else:
      print(response.json())
  ```

  ```bash theme={null}
  # Set your API key as an environment variable in production
  # export PYANNOTE_API_KEY="your_api_key_here"
  curl -X POST "https://api.pyannote.ai/v1/diarize" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://files.pyannote.ai/marklex1min.wav",
      "transcription": true,
      "transcriptionConfig": {
        "model": "faster-whisper-large-v3-turbo"
      }
    }'
  ```

  ```typescript diarize_with_transcription_whisper.ts theme={null}
  const url = "https://api.pyannote.ai/v1/diarize";
  const apiKey = "YOUR_API_KEY"; // In production, use environment variables: process.env.PYANNOTE_API_KEY
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  const data = {
    url: "https://files.pyannote.ai/marklex1min.wav",
    transcription: true,
    transcriptionConfig: {
      model: "faster-whisper-large-v3-turbo"
    }
  };

  const response = await fetch(url, {
    method: "POST",
    headers,
    body: JSON.stringify(data),
  });

  if (!response.ok) {
    console.error(`Error: ${response.status} - ${await response.text()}`);
  } else {
    console.log(await response.json());
  }
  ```
</CodeGroup>

<Info>
  For the full list of configuration options, refer to the [API reference](/api-reference/diarize).
</Info>
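
If you switch models often, one option is to build the request body with a small helper, as sketched below. `build_diarize_payload` is a hypothetical name; the sketch simply omits `transcriptionConfig` when no model is given, so the default model applies.

```python
from typing import Optional


def build_diarize_payload(audio_url: str, stt_model: Optional[str] = None) -> dict:
    """Build the JSON body for a diarization job with transcription enabled."""
    payload = {"url": audio_url, "transcription": True}
    if stt_model is not None:
        # Only send transcriptionConfig when overriding the default STT model.
        payload["transcriptionConfig"] = {"model": stt_model}
    return payload


default_payload = build_diarize_payload("https://files.pyannote.ai/marklex1min.wav")
whisper_payload = build_diarize_payload(
    "https://files.pyannote.ai/marklex1min.wav",
    stt_model="faster-whisper-large-v3-turbo",
)
```

Either dictionary can be passed as `json=` to `requests.post`, as in the examples above.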

***

## Limitations and considerations

<Callout>
  **Current limitations and future development:**

  * **Diarization model**: Only available with the `precision-2` model. Additional diarization model support is coming later.
  * **STT model**: Currently supports [Nvidia Parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) and [OpenAI whisper-large-v3-turbo](https://huggingface.co/dropbox-dash/faster-whisper-large-v3-turbo). We are working on adding support for additional transcription models.
  * **Not compatible with identification**: Transcription cannot be used with speaker identification jobs yet.
  * **Language support**: Refer to the [API Reference](/api-reference/diarize#body-transcription-config) for the complete list of supported languages.
</Callout>

<Warning>
  **Processing time and cost:**

  * Transcription jobs take longer to process than diarization-only jobs, as they run both diarization and speech recognition models.
  * Enabling transcription adds cost on top of standard diarization with `precision-2`. View current pricing on your billing page in the [dashboard](https://dashboard.pyannote.ai), and see [Billing](/administration/billing) for how charges are calculated.
</Warning>
