Getting machines to know when they don't know - tackled with multi-LLM consensus.
This paper proposes a framework called Calib-n that improves LLM calibration by aggregating responses from multiple models and using specialized loss functions.
-----
https://arxiv.org/abs/2501.03991
Original Problem 🔍:
→ Current calibration methods for LLMs lack generalization across different prompt styles and model sizes
→ Existing studies evaluate only one or two LLMs and prompt types
→ There's no comprehensive analysis of how response agreement and loss functions affect calibration
-----
Solution in this Paper 🛠️:
→ Calib-n framework trains an auxiliary model that combines outputs from multiple LLMs to estimate confidence
→ It incorporates three loss functions: binary cross-entropy, focal loss, and AUC surrogate loss
→ The evaluation covers 12 LLMs and four prompt styles: verbalized, chain-of-thought, zero-shot, and few-shot
→ Response agreement between models helps reduce overconfidence and makes calibration more reliable (both the agreement signal and the auxiliary-model training are sketched below)
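
A minimal sketch of how response agreement can be turned into a feature for a confidence estimator. The function name and the simple exact-match normalization are illustrative assumptions, not the paper's exact implementation:

```python
from collections import Counter

def response_agreement(answers: list[str]) -> float:
    """Fraction of LLM answers that match the majority answer.

    `answers` holds one normalized answer string per LLM for the same
    question; higher agreement is treated as evidence of higher confidence.
    """
    normalized = [a.strip().lower() for a in answers]
    majority_count = Counter(normalized).most_common(1)[0][1]
    return majority_count / len(normalized)

# Example: 3 of 4 models give the same answer -> agreement = 0.75
print(response_agreement(["Paris", "paris", "Paris", "Lyon"]))
```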
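
And a hedged sketch of the auxiliary-model idea: a small network maps features derived from the LLMs' responses (e.g., the agreement score above plus per-model probabilities) to a confidence score, trained with focal loss against correctness labels. The feature layout, network size, and gamma value are assumptions for illustration; setting gamma = 0 recovers plain binary cross-entropy:

```python
import torch
import torch.nn as nn

class AuxCalibrator(nn.Module):
    """Tiny auxiliary network: response-derived features -> confidence logit."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # raw logit; apply sigmoid for confidence

def focal_loss(logits, targets, gamma: float = 2.0):
    """Focal loss on binary correctness labels; gamma = 0 reduces to BCE."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    p_t = torch.exp(-bce)  # probability assigned to the true label
    return ((1 - p_t) ** gamma * bce).mean()

# Toy training step: features = [agreement, mean prob, ...], labels = correctness
model = AuxCalibrator(n_features=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
features = torch.rand(8, 3)                     # one row per question
labels = torch.randint(0, 2, (8,)).float()      # 1 if the answer was correct
opt.zero_grad()
loss = focal_loss(model(features), labels)
loss.backward()
opt.step()
```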
-----
Key Insights 💡:
→ Few-shot prompts are most effective for auxiliary model-based methods
→ Focal loss outperforms other loss functions in most settings
→ Response agreement significantly improves calibration performance
→ Auxiliary models maintain stable calibration across varying accuracy levels
-----
Results 📊:
→ Calib-n with focal loss achieved the lowest expected calibration error (ECE) across 4 datasets (ECE sketched after this list)
→ Auxiliary models outperform LLMs' internal probabilities in 78% of test cases
→ Few-shot prompts showed 32% better calibration than other prompt styles
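
For reference, ECE bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence. This is a standard equal-width-bin sketch, not the paper's evaluation code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return float(ece)

# Overconfident toy example: ~0.9 average confidence but only 50% accuracy
print(expected_calibration_error([0.9, 0.95, 0.9, 0.85], [1, 0, 1, 0]))
```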