Speaker identification is the process of determining who is speaking in an audio file by comparing voice characteristics against known voiceprints. Unlike diarization, which only separates speakers into generic labels (SPEAKER_00, SPEAKER_01, etc.), identification assigns specific identities to those speakers.

What are voiceprints?

A voiceprint is a unique digital representation of a person’s voice characteristics, similar to a fingerprint but for voice. It captures the distinctive features of how someone speaks, allowing the system to recognize that person in future audio recordings.
Voiceprints are for identification only; they do not improve the accuracy of diarization. Diarization separates speakers, while identification assigns names or labels to those speakers.

Voiceprint requirements

  • One voiceprint per speaker: Create only one voiceprint for each person.
  • Single speaker only: The recording must contain only the target speaker’s voice with no overlapping speakers.
  • Maximum duration: Audio samples must be at most 30 seconds long (a quick local check is sketched after this list).
  • Consistent speaking style: The voiceprint should capture the person’s normal speaking voice.
  • Language: Our models are language agnostic, so voiceprints can be created in any spoken language.
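As a quick sanity check before uploading, you can verify a sample's length locally. The sketch below uses Python's standard-library wave module, so it assumes an uncompressed WAV file with a placeholder path; other formats need a third-party library such as soundfile.
check_duration.py
import wave

MAX_VOICEPRINT_SECONDS = 30  # maximum sample length, per the requirements above

# Assumes a local, uncompressed WAV file; the path is a placeholder.
with wave.open("speaker-voice-sample.wav", "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()

if duration > MAX_VOICEPRINT_SECONDS:
    print(f"Sample is {duration:.1f}s; trim it to {MAX_VOICEPRINT_SECONDS}s or less.")
else:
    print(f"Sample is {duration:.1f}s, fine for voiceprint creation.")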

Prerequisites

Before you start, you’ll need:
  • A pyannoteAI account with credit or an active subscription
  • An API key
  • An audio recording of a single speaker for the voiceprint creation
  • An audio recording with multiple speakers for diarization + identification
For help creating an account and getting your API key, see the quickstart guide.

1. Create a voiceprint

First, create a voiceprint for each speaker you want to identify. This is a one-time process for each person. Send a POST request to the voiceprint endpoint with an audio file containing the speaker’s voice.
create_voiceprint.py
import requests

url = "https://api.pyannote.ai/v1/voiceprint"
api_key = "YOUR_API_KEY"  # In production, use environment variables: os.getenv("PYANNOTE_API_KEY")

headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
data = {"url": "https://example.com/speaker-voice-sample.wav"}

response = requests.post(url, headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code} - {response.text}")
else:
    print(response.json())
The response will include a jobId to track the voiceprint creation:
Example response
{
  "jobId": "3c8a89a5-dcc6-4edb-a75d-ffd64739674d",
  "status": "created"
}

Get voiceprint results

To retrieve the voiceprint results, use the same polling or webhook approach described in the How to diarize an audio file tutorial; the process works identically for voiceprint jobs. A minimal polling sketch follows the example output below.
Save voiceprints to your own data storage
  • Job outputs (including voiceprints) are automatically deleted after 24 hours.
  • Voiceprints are reusable, so store them securely for future identification requests.
Example voiceprint job output
{
  "jobId": "3c8a89a5-dcc6-4edb-a75d-ffd64739674d",
  "status": "succeeded",
  "createdAt": "2024-02-20T12:00:00Z",
  "updatedAt": "2024-02-20T12:00:00Z",
  "output": {
    "voiceprint": "U29tZVZvaWNlUHJpbnREYXRhMQ=="
  }
}
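As a concrete illustration, here is a minimal polling sketch that waits for the job to finish and writes the voiceprint to disk. It assumes the GET https://api.pyannote.ai/v1/jobs/{jobId} endpoint covered in the diarization tutorial, and the terminal status names other than "succeeded" are assumptions.
poll_voiceprint.py
import time

import requests

api_key = "YOUR_API_KEY"
job_id = "3c8a89a5-dcc6-4edb-a75d-ffd64739674d"  # jobId from the creation response
headers = {"Authorization": f"Bearer {api_key}"}

# Poll the jobs endpoint (described in the diarization tutorial) until the
# job reaches a terminal status ("failed"/"canceled" are assumed names).
while True:
    job = requests.get(f"https://api.pyannote.ai/v1/jobs/{job_id}", headers=headers).json()
    if job["status"] in ("succeeded", "failed", "canceled"):
        break
    time.sleep(5)  # webhooks avoid polling entirely; see the diarization tutorial

if job["status"] == "succeeded":
    # Persist the voiceprint yourself: job outputs expire after 24 hours.
    with open("john_doe.voiceprint", "w") as f:
        f.write(job["output"]["voiceprint"])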

2. Identify speakers in audio

Now that you have a voiceprint, you can identify a speaker in new audio recordings. Send a POST request to the identify endpoint with the audio file URL and the voiceprints you want to match against.
identify_speakers.py
import requests

url = "https://api.pyannote.ai/v1/identify"
api_key = "YOUR_API_KEY"

headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
data = {
    "url": "https://example.com/meeting-audio.wav",
    "voiceprints": [
        {
            "label": "John Doe", # The speaker label you want to assign
            "voiceprint": "U29tZVZvaWNlUHJpbnREYXRhMQ=="  # Replace with actual voiceprint
        },
        # Add more voiceprints as needed
    ],
    # Optional matching parameters
    "matching": {
        "threshold": 50,  # Only match if confidence is 50% or higher
        "exclusive": True  # Prevent multiple speakers matching same voiceprint
    }
}

response = requests.post(url, headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code} - {response.text}")
else:
    print(response.json())
The response will include a jobId for tracking the identification job:
Example response
{
  "jobId": "4d9b9ab6-edd7-5feca-b86e-gee75840775e",
  "status": "created"
}
Multiple voiceprints: You can add multiple voiceprints for different people in the same request. Each voiceprint must have a unique label. The system will attempt to match all provided voiceprints against the audio.
Voiceprint selection: Voiceprints may match speakers even when the person isn’t actually in the audio. Be cautious about including voiceprints of people who may not be present; review confidence scores carefully and set an appropriate threshold when unsure about speaker presence.

3. Get identification results

To retrieve the identification results, use the same polling or webhook approach described in the How to diarize an audio file tutorial. The process works identically for identification jobs.
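If you registered a webhook when creating the job, you can receive the result as a push instead of polling. The sketch below is illustrative only: it assumes the webhook POSTs the same job JSON shown in these examples to a URL you control, and it uses Flask purely for brevity.
webhook_server.py
from flask import Flask, request

app = Flask(__name__)

# Hypothetical route; you would register its public URL as the job's webhook.
@app.route("/pyannote-webhook", methods=["POST"])
def handle_job():
    job = request.get_json()
    # Assumes the payload mirrors the job JSON shown in this guide.
    if job.get("status") == "succeeded":
        for segment in job["output"]["identification"]:
            print(segment["speaker"], segment["start"], segment["end"])
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)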
Example identification output
{
  "jobId": "4d9b9ab6-edd7-5feca-b86e-gee75840775e",
  "status": "succeeded",
  "createdAt": "2025-11-06T09:07:49.932Z",
  "updatedAt": "2025-11-06T09:07:53.229Z",
  "output": {
    "diarization": [
      {
        "speaker": "SPEAKER_00",
        "start": 3.005,
        "end": 5.945
      },
      {
        "speaker": "SPEAKER_01",
        "start": 6.345,
        "end": 9.565
      },
      ...
    ],
    "identification": [
      {
        "speaker": "John Doe",
        "start": 3.005,
        "end": 5.945,
        "diarizationSpeaker": "SPEAKER_00",
        "match": "John Doe"
      },
      {
        "speaker": "SPEAKER_01",
        "start": 6.345,
        "end": 9.565,
        "diarizationSpeaker": "SPEAKER_01",
        "match": null
      },
      ...
    ],
    "voiceprints": [
      {
        "speaker": "SPEAKER_00",
        "match": "John Doe",
        "confidence": {
          "John Doe": 86
        }
      },
      {
        "speaker": "SPEAKER_01",
        "match": null,
        "confidence": {
          "John Doe": 16
        }
      }
    ]
  }
}
See the identification schema reference for details about each field of the identification output.
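To make the structure concrete, here is a minimal sketch that turns the output above into a simple timeline, falling back to the generic diarization label when no voiceprint matched. Field names are taken from the example output; the file name is a placeholder for wherever you saved the job JSON.
print_segments.py
import json

# Load the finished job's JSON (saved from polling or a webhook).
with open("identification_job.json") as f:
    job = json.load(f)

for segment in job["output"]["identification"]:
    # "match" is null (None) when no voiceprint cleared the threshold,
    # so fall back to the generic diarization label.
    name = segment["match"] or segment["diarizationSpeaker"]
    print(f"{segment['start']:7.3f}s - {segment['end']:7.3f}s  {name}")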

Understanding the results

Diarization vs. identification

  • Diarization: Separates audio into speaker segments with generic labels (SPEAKER_00, SPEAKER_01, etc.)
  • Identification: Matches those segments to known voiceprints with specific labels (John Doe, Jane Smith, etc.)

Confidence scores

The confidence scores show how well each voiceprint matches each speaker segment:
  • Higher scores indicate better matches
  • Use the threshold parameter to filter out low-confidence matches (a client-side equivalent is sketched after this list)
  • Consider context such as audio quality and segment length when interpreting scores
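For example, if a job ran with a low threshold, you can still re-filter the matches client-side with a stricter cutoff. This sketch works against the voiceprints block of the example output; the 70 cutoff is arbitrary and the file name is a placeholder.
filter_matches.py
import json

MIN_CONFIDENCE = 70  # arbitrary client-side cutoff, stricter than the job's threshold

with open("identification_job.json") as f:  # placeholder path, as in the earlier sketch
    job = json.load(f)

for entry in job["output"]["voiceprints"]:
    match = entry["match"]
    if match and entry["confidence"][match] >= MIN_CONFIDENCE:
        print(f"{entry['speaker']} -> {match} ({entry['confidence'][match]}%)")
    else:
        print(f"{entry['speaker']} -> no confident match")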

Matching options

  • matching.threshold: Minimum confidence score required for a match (0-100, default: 0). Set higher values (50-70) for stricter matching, lower values for more lenient matching.
  • matching.exclusive: Prevent multiple speakers from matching the same voiceprint (default: true). Set to false to allow several speakers to match the same voiceprint.