Documentation Index
Fetch the complete documentation index at: https://docs.pyannote.ai/llms.txt
Use this file to discover all available pages before exploring further.
Streaming is currently in beta. To request access, fill in the form at pyannoteai.typeform.com/streaming-beta. If creating a stream session returns a 403 Streaming is disabled error, your team does not yet have access. Use the form above to request it.
Introduction
Streaming diarization lets you identify who is speaking in real time over a WebSocket connection. As you stream audio, the API continuously emits speaker turn events, telling you which speaker started or stopped talking and when. Use cases include live captioning, real-time meeting assistants, call center monitoring, and any application that needs to attribute speech to speakers without waiting for the full audio to be recorded.
Auth
All requests to the streaming API require a valid API key. You can generate an API key from your pyannote.ai dashboard. Pass your key as a Bearer token in the Authorization header when creating a stream session.
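For example, with Python's standard library, a session-creation request can attach the key like this. The endpoint path below is a placeholder for illustration, not the real route; consult the API reference linked above for the actual path:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # generated from your pyannote.ai dashboard

# Placeholder path -- the real session-creation route is documented
# in the API reference linked at the top of this page.
req = urllib.request.Request(
    "https://api.pyannote.ai/v1/<create-stream-session>",
    data=json.dumps({}).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would return a JSON body containing the
# one-time WebSocket url for this stream.
```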
Quickstart
Getting real-time diarization takes three steps:
1. Create a stream session
Create a stream session with your API key; the response contains a one-time WebSocket url. You can hand this URL directly to your end-user’s client: it only grants access to this one stream and carries no team credentials or API key.
2. Connect to the WebSocket URL
Open a WebSocket connection to the url returned above. The connection authenticates automatically via the token embedded in the URL; no additional headers are needed.
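As a sketch of the connect-and-stream flow, using the third-party websockets package and assuming the end_of_stream frame is a minimal object with a type field (confirm both against the event reference):

```python
import asyncio
import json

CHUNK_BYTES = 6400  # 100 ms of 16 kHz mono pcm_f32le: 16000 * 4 * 0.1

def iter_chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Yield fixed-size slices of a pcm_f32le byte buffer."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]

async def stream(session_url: str, pcm: bytes) -> None:
    import websockets  # third-party: pip install websockets

    # The token embedded in session_url authenticates the connection;
    # websockets.connect() returns once the handshake completes, i.e.
    # once the socket is open and it is safe to send audio.
    async with websockets.connect(session_url) as ws:
        for chunk in iter_chunks(pcm):
            await ws.send(chunk)       # binary frame -> audio_chunk
            await asyncio.sleep(0.1)   # pace the stream in real time
        # Assumed frame shape; confirm against the event reference.
        await ws.send(json.dumps({"type": "end_of_stream"}))
        async for message in ws:       # JSON text frames from the server
            print(json.loads(message))
```

asyncio.run(stream(url, pcm)) drives the whole flow; replace the print with your own event handling.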
Cold starts may delay the WebSocket connection by a few seconds. Wait for the connection to be fully open before sending audio; the open event (or its equivalent in your WebSocket client) is your signal that it is safe to start streaming.
3. Stream audio and handle events
Send audio chunks as binary frames and handle the JSON speaker turn events the server emits, as described below.
Input events
audio_chunk
Send audio as raw binary WebSocket frames. The audio must meet these requirements:

| Property | Value |
|---|---|
| Format | PCM float 32-bit little-endian (pcm_f32le) |
| Sample rate | 16 kHz |
| Channels | Mono |
| Chunk duration | 100 ms |
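At 16 kHz mono float32, each 100 ms chunk is exactly 6400 bytes. A small stdlib sketch of the math, plus a hypothetical helper for converting 16-bit WAV samples (e.g. read via the wave module) to pcm_f32le:

```python
import array

SAMPLE_RATE = 16_000        # Hz
BYTES_PER_SAMPLE = 4        # 32-bit float
CHUNK_SECONDS = 0.1         # 100 ms

# Mono, so one sample per frame: 16000 * 4 * 0.1 = 6400 bytes per chunk.
chunk_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)

def int16_to_f32le(samples: bytes) -> bytes:
    """Convert 16-bit signed PCM to pcm_f32le (assumes a little-endian host)."""
    ints = array.array("h", samples)
    floats = array.array("f", (s / 32768.0 for s in ints))
    return floats.tobytes()
```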
end_of_stream
When you have no more audio to send, signal the end of the stream by sending an end_of_stream JSON text frame. Sending this message is recommended over abruptly closing the socket, which may cause final outputs to be lost.
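The exact payload is not shown above; assuming the frame is a minimal object with a type field, it can be built as follows (confirm the shape against the event reference):

```python
import json

# Assumed shape -- confirm against the end_of_stream event reference.
end_of_stream_frame = json.dumps({"type": "end_of_stream"})
# Send this as a text (not binary) frame, then keep reading until the
# server has delivered the final speaker events before closing.
```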
Output events
The server emits JSON text frames with the following event types:
diarization_speaker_start
Emitted when a speaker begins a turn.
diarization_speaker_end
Emitted when a speaker’s turn ends.
timestamp is in seconds, relative to the start of the stream. speaker is a string label that remains stable for the duration of the session.
error
Emitted when the server encounters a problem processing a frame (e.g. wrong chunk size).
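A minimal handler for these frames might track who is currently speaking. The field names below (type, speaker) follow the descriptions above but should be confirmed against the event reference:

```python
import json

def apply_event(active: set, message: str) -> set:
    """Update the set of currently-active speakers from one JSON text frame."""
    event = json.loads(message)
    kind = event.get("type")
    if kind == "diarization_speaker_start":
        active.add(event["speaker"])       # speaker began a turn
    elif kind == "diarization_speaker_end":
        active.discard(event["speaker"])   # speaker finished a turn
    elif kind == "error":
        raise RuntimeError(event)          # e.g. wrong chunk size
    return active
```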
Pricing
Streaming is free during the beta period.
Limits
| Limit | Value |
|---|---|
| Concurrent running streams per team | 10 streams |
| Idle timeout (no audio received) | 5 seconds |
| Maximum duration per stream | 5 hours |