"Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences"
https://arxiv.org/abs/2502.01126
The paper addresses the unreliability of confidence scores obtained by asking LLMs directly how certain they are about their answers. These direct, absolute confidence estimates are often coarse-grained and fail to reflect whether the answers are actually correct.
This paper proposes relative confidence estimation. It moves away from direct confidence scoring. Instead, it asks the LLM to compare its confidence between pairs of questions. These pairwise preferences are then aggregated into meaningful confidence scores using rank aggregation methods.
-----
📌 Relative confidence leverages models' strength in comparative judgments. It replaces complex absolute score generation with simpler binary preferences, improving estimation reliability.
📌 Rank aggregation methods like TrueSkill effectively convert noisy pairwise preferences into calibrated confidence scores. This bypasses the need for direct confidence calibration in LLMs.
📌 This method is readily deployable. It uses simple prompting and post-processing. It enhances existing LLMs' confidence without retraining or architectural changes.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces relative confidence estimation. This method asks LLMs to compare confidence levels between pairs of questions.
→ For each question, the method pairs it with several other questions. The LLM is prompted to choose which question it is more confident in answering correctly. This generates confidence preference data.
→ The paper utilizes three rank aggregation techniques: Elo rating, TrueSkill, and Bradley-Terry. These methods convert pairwise confidence preferences into overall confidence scores for each question.
→ Elo rating iteratively updates question scores based on matchup outcomes. TrueSkill, a Bayesian model, refines score estimates by reducing uncertainty with more comparisons. Bradley-Terry uses maximum likelihood estimation to optimize scores across all comparisons simultaneously.
→ The confidence scores are then normalized to a 0-1 range for easier interpretation. A minimal code sketch of this pipeline appears below.
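To make the pipeline concrete, here is a minimal Python sketch of pairwise confidence prompting followed by Elo aggregation and normalization. The `ask_llm` client, the prompt wording, and the hyperparameters (K-factor, comparisons per question) are illustrative assumptions rather than the paper's exact setup; TrueSkill or Bradley-Terry can be swapped in for the Elo step.

```python
import random

# Hypothetical stand-in for a model client; not from the paper's code.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def confidence_preference(q_a: str, q_b: str) -> int:
    """Ask which of two questions the model is more confident about.
    Returns 1 if it prefers q_a, else 0. The paper also reports that
    showing the model its own answers helps; omitted here for brevity."""
    prompt = (
        "Here are two questions. Answer with 'A' or 'B' to indicate which "
        "one you are more confident you can answer correctly.\n"
        f"A: {q_a}\nB: {q_b}"
    )
    return 1 if ask_llm(prompt).strip().upper().startswith("A") else 0

def relative_confidence_elo(questions, pairs_per_q=5, k=32, base=400.0):
    """Aggregate pairwise preferences into per-question confidence via Elo,
    then min-max normalize the ratings to [0, 1]."""
    ratings = {q: 1000.0 for q in questions}
    matchups = []
    for q in questions:
        opponents = [o for o in questions if o != q]
        for opp in random.sample(opponents, min(pairs_per_q, len(opponents))):
            matchups.append((q, opp))
    random.shuffle(matchups)
    for q_a, q_b in matchups:
        won_a = confidence_preference(q_a, q_b)        # 1 if A preferred
        exp_a = 1.0 / (1.0 + 10 ** ((ratings[q_b] - ratings[q_a]) / base))
        ratings[q_a] += k * (won_a - exp_a)            # standard Elo update
        ratings[q_b] += k * ((1 - won_a) - (1 - exp_a))
    lo, hi = min(ratings.values()), max(ratings.values())
    return {q: (r - lo) / (hi - lo) if hi > lo else 0.5
            for q, r in ratings.items()}
```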
-----
Key Insights 💡:
→ LLMs struggle with absolute confidence estimation, often producing coarse-grained and uninformative scores.
→ Relative confidence estimation, by comparing question pairs, offers a more reliable way to gauge LLM confidence.
→ Rank aggregation methods can effectively translate pairwise confidence preferences into meaningful confidence scores.
→ Access to the model's own answer significantly enhances the reliability of relative confidence judgments.
-----
Results 📊:
→ Relative confidence estimation improves selective classification AUC (area under the curve) by 3.5% on average over direct absolute confidence estimation, across five LLMs and 14 datasets.
→ Relative confidence estimation shows a 1.7% average AUC gain over self-consistency confidence methods.
→ Llama 3.1 405B saw the largest AUC improvement of 6.1% over direct prompting and 4.9% over self-consistency.
→ TrueSkill rank aggregation generally performed best across models for relative confidence estimation.
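For context on the metric: selective classification AUC measures how well confidence scores rank correct answers above incorrect ones, by averaging accuracy over coverage levels as low-confidence examples are abstained on. Below is a short sketch of one common formulation; the paper's exact evaluation code may differ.

```python
import numpy as np

def selective_classification_auc(confidences, correct):
    """Area under the accuracy-coverage curve: sort examples by descending
    confidence and average the running accuracy as coverage grows from the
    most confident example to full coverage. One common formulation; the
    paper's exact evaluation may differ."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    hits = np.asarray(correct, dtype=float)[order]
    running_acc = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(running_acc.mean())

# A confidence ranking that puts the wrong answer last scores higher:
print(selective_classification_auc([0.9, 0.8, 0.2], [1, 1, 0]))  # ~0.89
print(selective_classification_auc([0.2, 0.8, 0.9], [1, 1, 0]))  # ~0.39
```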