Cal-DPO fixes the reward problem in LLM training by keeping good responses actually likely, instead of letting their probability drift down.
Cal-DPO improves LLM alignment by calibrating the rewards learned from human preferences, preventing them (and the likelihood of chosen responses) from degrading while preserving the correct preference ordering.
-----
https://arxiv.org/abs/2412.14516v1
🤔 Original Problem:
→ Current LLM alignment methods like DPO optimize only the relative reward difference between chosen and rejected responses, ignoring absolute reward values (see the objective below)
→ This causes the likelihood of chosen responses to decrease during training, leading to poor performance on tasks like reasoning and math
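For context, here is the standard DPO objective (the usual formulation, not something new in this paper); note that it only sees the difference between the two implicit rewards:

```latex
% Standard DPO objective, written with the implicit reward
% \hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
      \Big[ \log \sigma\big( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \big) \Big]
```

Because only the margin enters the loss, both implicit rewards can drift downward together (the chosen response becomes less likely) as long as the gap keeps widening.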
-----
🔧 Solution in this Paper:
→ Cal-DPO adds a calibration term to match implicit rewards with ground-truth rewards
→ It pushes the chosen response's implicit reward toward +1/(2β) and the rejected response's toward -1/(2β)
→ Implementation requires just one additional line of code on top of DPO (a sketch follows this list)
→ No extra hyperparameters needed
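A minimal sketch of what that extra line could look like inside a PyTorch-style DPO loss. The variable names and the squared-error form of the calibration term are my assumptions based on the description above, not the paper's exact code:

```python
import torch.nn.functional as F

def cal_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    chosen_r = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_r = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: depends only on the reward margin
    dpo_term = -F.logsigmoid(chosen_r - rejected_r)

    # Calibration term (the "one extra line"): pull the absolute rewards
    # toward the targets +1/(2*beta) for chosen and -1/(2*beta) for rejected
    cal_term = (chosen_r - 1 / (2 * beta)) ** 2 + (rejected_r + 1 / (2 * beta)) ** 2

    return (dpo_term + cal_term).mean()
```

Usage: pass the summed per-token log-probabilities of each response under the policy and the frozen reference model; the only hyperparameter involved is DPO's existing β.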
-----
💡 Key Insights:
→ Calibrating rewards prevents likelihood degradation while maintaining preference ordering
→ Exhibits mode-seeking behavior like RLHF
→ Shows a negative-gradient property that actively reduces undesirable responses (illustrated after this list)
→ Theoretically guaranteed to yield the optimal policy
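Under the squared-error reading of the calibration term sketched above (my assumption, not the paper's exact derivation), the negative-gradient claim is easy to see:

```latex
% Gradient of the rejected-response calibration term, with
% \hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\nabla_\theta \Big( \hat{r}_\theta(x, y_l) + \tfrac{1}{2\beta} \Big)^2
  = 2\beta \Big( \hat{r}_\theta(x, y_l) + \tfrac{1}{2\beta} \Big)
    \, \nabla_\theta \log \pi_\theta(y_l \mid x)
```

Whenever the rejected response's reward sits above its target of -1/(2β), gradient descent directly lowers log π_θ(y_l | x), actively suppressing the undesirable response rather than only widening the margin.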
-----
📊 Results:
→ 63.1% improvement on the IFEval benchmark
→ 12.5% gain on math tasks
→ Higher win rates against both the SFT baseline and the chosen responses in human evaluation
→ Consistently outperforms DPO across reasoning benchmarks
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/