Taming Overconfidence in LLMs: Reward Calibration in RLHF
New training methods teach LLMs to stop being overconfident about wrong answers.
PPO-M and PPO-C make LLMs admit when they're unsure, just like humans do
Original Problem 🚨:
RLHF-trained LLMs tend to be overconfident, expressing high confidence in their responses regardless of quality. This stems from a bias in reward models toward responses that verbalize high confidence, which leads to poor calibration.
Solution in this Paper 🛠️:
PPO-M (Proximal Policy Optimization with Calibrated Reward Modeling):
Integrates explicit confidence scores into reward model training.
Encourages alignment between verbalized confidence and actual response quality (see the sketch below).
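A minimal sketch of the calibrated reward-modeling idea, assuming a pairwise Bradley-Terry-style objective in which the reward model learns to prefer high-confidence phrasings of chosen responses and low-confidence phrasings of rejected ones. The `rm` callable and the tensor inputs are hypothetical placeholders, not the paper's exact implementation:

```python
import torch.nn.functional as F

def calibrated_rm_loss(rm, chosen_hi, chosen_lo, rejected_hi, rejected_lo):
    """Sketch of a calibrated reward-model loss.

    `rm` maps a batch of tokenized sequences to scalar rewards (hypothetical
    interface). Each input is the same response with a high- or low-confidence
    statement appended.
    """
    r_c_hi = rm(chosen_hi)    # chosen answer stated with high confidence
    r_c_lo = rm(chosen_lo)    # chosen answer stated with low confidence
    r_r_hi = rm(rejected_hi)  # rejected answer stated with high confidence
    r_r_lo = rm(rejected_lo)  # rejected answer stated with low confidence

    # Bradley-Terry-style terms: confidence should track response quality,
    # so high confidence is rewarded on good answers and penalized on bad ones.
    loss_chosen = -F.logsigmoid(r_c_hi - r_c_lo).mean()
    loss_rejected = -F.logsigmoid(r_r_lo - r_r_hi).mean()
    return loss_chosen + loss_rejected
```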
PPO-C (Proximal Policy Optimization with Calibrated Reward Calculation):
Adjusts reward scores during PPO training using the model's verbalized confidence and the gap to a moving average of past rewards.
Reduces the bias toward uniformly high-confidence responses (sketched below).
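A rough sketch of the calibrated reward calculation, assuming the per-sample reward is adjusted by the model's verbalized confidence scaled by the gap between the current reward and a running mean. The class name, `alpha`, and `weight` are illustrative assumptions, not the paper's exact formula:

```python
class CalibratedRewardAdjuster:
    """Sketch of a PPO-C-style reward adjustment (illustrative, not verbatim)."""

    def __init__(self, alpha=0.9, weight=1.0):
        self.alpha = alpha    # smoothing factor for the moving average
        self.weight = weight  # strength of the confidence adjustment
        self.running_mean = 0.0
        self.initialized = False

    def adjust(self, reward: float, confidence: float) -> float:
        """`confidence` is the model's verbalized confidence in [0, 1]."""
        if not self.initialized:
            self.running_mean = reward
            self.initialized = True
        else:
            self.running_mean = (
                self.alpha * self.running_mean + (1 - self.alpha) * reward
            )
        # Above-average responses gain reward for being confident;
        # below-average responses lose reward for it, nudging confidence
        # to track response quality.
        return reward + self.weight * confidence * (reward - self.running_mean)
```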
Key Insights from this Paper 💡:
Overconfidence in RLHF-LLMs is linked to biased reward models.
Calibration can be improved without additional golden labels.
Both PPO-M and PPO-C can be integrated into existing RLHF frameworks.
Results 📊:
PPO-M and PPO-C reduce Expected Calibration Error (ECE; see the sketch after this list) while maintaining accuracy.
On Llama3-8B, PPO-M reduces ECE by 6.44 points and increases accuracy by 2.73 points on GSM8K.
Both methods preserve capabilities in open-ended conversation settings.
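For reference, ECE measures the average gap between a model's stated confidence and its actual accuracy across confidence bins. A minimal NumPy sketch; the bin count and binning convention are illustrative choices:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average |accuracy - mean confidence| per bin.

    `confidences`: predicted confidences in [0, 1];
    `correct`: 0/1 correctness indicators.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0  # include exact zeros in the first bin
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece
```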