Confidence scores measure how certain the model is about its predictions. They range from 0 to 100, with higher values indicating greater confidence.

Types of confidence scores

There are three types of confidence scores available:

1. Sample-level confidence scores

Sample-level confidence scores provide granular confidence values at regular intervals throughout the audio. To include sample-level confidence scores in your diarization or identification results, add "confidence": true to your request body. When enabled, the job output will include:
  • The confidence object containing:
    • score, an array with the confidence score for each sample
    • resolution, indicating the time interval in seconds between confidence score samples
For example, if the resolution is 0.02, it means each confidence score represents a 20-millisecond interval in the audio.
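As a sketch of how sample-level scores can be consumed, the snippet below maps each score to its time window using the resolution and flags low-confidence regions. The shape of the confidence object follows the description above; the score values are illustrative, not real output.

```python
def low_confidence_windows(confidence, threshold=70):
    """Return (start_time, end_time, score) tuples for samples below threshold.

    Assumes a confidence object with a "score" array and a "resolution"
    in seconds, as described above.
    """
    resolution = confidence["resolution"]
    windows = []
    for i, score in enumerate(confidence["score"]):
        if score < threshold:
            start = i * resolution
            windows.append((start, start + resolution, score))
    return windows

# Illustrative values, not real job output
confidence = {"resolution": 0.02, "score": [95, 91, 62, 58, 88]}
for start, end, score in low_confidence_windows(confidence):
    print(f"{start:.2f}-{end:.2f}s: score {score}")
```

With a resolution of 0.02, sample index 2 covers roughly 0.04–0.06 seconds, which is how the 20-millisecond interval mentioned above translates into timestamps.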

2. Turn-level confidence scores

Turn-level confidence scores provide confidence values for each diarization segment (turn), making it easier to assess the quality of specific speaker assignments. To include turn-level confidence scores in your results, add "turnLevelConfidence": true to your request body. When enabled, the job output will include a "confidence" object for each diarization segment. The object contains each speaker as the key and a number from 0-100 indicating the confidence score for each speaker assignment.
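For illustration, a single diarization segment with turn-level confidence enabled might look like this (field names follow the description above; the values are made up):

```json
{
  "start": 12.4,
  "end": 15.1,
  "speaker": "SPEAKER_01",
  "confidence": {
    "SPEAKER_00": 8,
    "SPEAKER_01": 89
  }
}
```

Here the model assigns the segment to SPEAKER_01 with high confidence, while the low score for SPEAKER_00 indicates that alternative was considered unlikely.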

3. Identification confidence scores

Identification confidence scores are specific to speaker identification tasks and show how well each voiceprint matches each speaker segment. These scores are included automatically when using the identify endpoint. The identification output includes a voiceprints array where each speaker has a confidence object containing the confidence scores for each voiceprint label:
{
  "voiceprints": [
    {
      "speaker": "SPEAKER_00",
      "match": "John Doe",
      "confidence": {
        "John Doe": 86,
        "Jane Smith": 12
      }
    }
  ]
}
Identification confidence scores are different from diarization confidence scores. They measure how well a voiceprint matches a speaker, not how certain the diarization model is about the speaker assignment.
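A minimal sketch of reading this output in Python, mapping each diarized speaker to its best-matching voiceprint label and score (the input shape follows the voiceprints array shown above):

```python
def top_matches(identification_result):
    """Map each diarized speaker to its highest-confidence voiceprint match.

    Expects a result containing the "voiceprints" array shown above.
    """
    results = {}
    for vp in identification_result["voiceprints"]:
        # Pick the voiceprint label with the highest confidence score
        label, score = max(vp["confidence"].items(), key=lambda item: item[1])
        results[vp["speaker"]] = (label, score)
    return results

# Illustrative result matching the example output above
result = {"voiceprints": [{"speaker": "SPEAKER_00", "match": "John Doe",
                           "confidence": {"John Doe": 86, "Jane Smith": 12}}]}
print(top_matches(result))  # {'SPEAKER_00': ('John Doe', 86)}
```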

Enabling confidence scores

By default, confidence scores are not included in the output. You must explicitly enable them in your request:
{
  "url": "https://example.com/audio.wav",
  "confidence": true,
  "turnLevelConfidence": true
}
For identification jobs, confidence scores are included automatically in the voiceprints section of the output.

Using confidence scores

Quality assessment

Confidence scores can be particularly useful for:
  • Identifying segments where the model is less certain, which may benefit from manual review
  • Filtering results based on a confidence threshold for higher accuracy
  • Analyzing the overall reliability of the diarization or identification for a given audio file

Human-in-the-loop correction

Confidence scores enable efficient human-in-the-loop workflows by highlighting segments that need attention:

Low confidence detection

Identify segments with confidence scores below your threshold (e.g., < 70) for manual review

Prioritized review

Focus human review time on the most uncertain segments rather than reviewing entire transcripts

Quality control

Use confidence scores as a quality metric to ensure diarization meets your accuracy requirements

Example: Human review workflow with turn-level confidence

Here’s how you might implement a human-in-the-loop correction workflow:
confidence_review.py
def prioritize_for_review(diarization_result, confidence_threshold=70):
    low_confidence_segments = []
    
    for index, segment in enumerate(diarization_result["segments"]):
        confidence = segment["confidence"]
        speaker_confidence = confidence[segment["speaker"]]
        
        if speaker_confidence < confidence_threshold:
            low_confidence_segments.append({
                "segment_index": index,
                "start_time": segment["start"],
                "end_time": segment["end"],
                "speaker": segment["speaker"],
                "confidence": speaker_confidence,
                "audio_url": generate_segment_audio_url(segment)
            })
    
    # Sort by lowest confidence first
    return sorted(low_confidence_segments, key=lambda x: x["confidence"])

# Generate review queue for human correction
review_queue = prioritize_for_review(diarization_result, 75)
print(f"Found {len(review_queue)} segments needing review")

Example: Identification confidence filtering

For identification tasks, you can filter results based on identification confidence:
identification_filter.py
def filter_by_identification_confidence(identification_result, threshold=50):
    filtered_voiceprints = []
    
    for vp in identification_result["voiceprints"]:
        # Find the highest confidence match
        top_match = max(vp["confidence"].items(), key=lambda x: x[1])
        
        if top_match[1] >= threshold:
            filtered_voiceprints.append(vp)
    
    return filtered_voiceprints

high_confidence_matches = filter_by_identification_confidence(identification_result, 60)
print(f"Found {len(high_confidence_matches)} high confidence matches")

Best practices

  • Threshold selection: The optimal confidence threshold depends on your use case. For instance, in critical applications sensitive to false positives, you might want to use a higher threshold.
  • Combine multiple types: Using different confidence score types together gives you comprehensive insight: turn-level for segment assessment, sample-level for detailed analysis, and identification confidence for voiceprint matching.
  • Performance impact: Enabling confidence scores may slightly increase processing time and output size. Only enable them when you actually need the confidence information.

Interpreting confidence scores

Confidence scores range from 0 to 100, with higher values indicating greater confidence in the model’s predictions. Use these scores to identify segments that may need human review or verification based on your specific accuracy requirements. For identification confidence scores, consider using the matching.threshold parameter in your identification request to automatically filter out low-confidence matches. By leveraging confidence scores effectively, you can build robust diarization and identification workflows that balance automation with human oversight, ensuring high-quality results while minimizing manual effort.
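For example, a request that asks the service to drop low-confidence identification matches might look like the following. Note this is a sketch: the nested "matching" object is an assumption based on the dotted matching.threshold parameter name, and the threshold value is illustrative.

```json
{
  "url": "https://example.com/audio.wav",
  "matching": {
    "threshold": 60
  }
}
```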