AlphaPO improves LLM alignment by adjusting the reward function shape. This addresses likelihood displacement and over-optimization in direct alignment algorithms.
-----
https://arxiv.org/abs/2501.03884
Original Problem 🤔:
→ Direct alignment algorithms (DAAs) such as DPO and SimPO often suffer from likelihood displacement.
→ During training, preferred responses become less likely while undesired responses become more likely.
-----
Solution in this Paper 💡:
→ AlphaPO introduces an α parameter that controls the shape of the reward function in DAAs.
→ This parameter generalizes the standard log reward used in SimPO (rough sketch below).
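To make the idea concrete, here is a minimal PyTorch sketch, not the authors' code. It assumes the α-shaped reward is a Box-Cox-style transform of SimPO's length-normalized log-likelihood that falls back to the plain log reward as α → 0; the exact functional form, sign conventions, and the β, γ hyperparameters follow SimPO-style notation and should be taken from the paper. The function names and toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def alpha_shaped_reward(sum_logps, lengths, alpha=0.1):
    """Alpha-controlled reward on top of SimPO's length-normalized log-likelihood.

    sum_logps: summed token log-probs of each response under the policy.
    lengths:   response lengths in tokens.
    As alpha -> 0 this reduces to the average log-prob (SimPO's implicit
    reward); a nonzero alpha bends the reward curve. The exact form and
    sign conventions used by AlphaPO are in the paper; this Box-Cox-style
    transform is only illustrative.
    """
    avg_logp = sum_logps / lengths
    if abs(alpha) < 1e-8:
        return avg_logp
    return (torch.exp(alpha * avg_logp) - 1.0) / alpha


def alpha_po_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens,
                  alpha=0.1, beta=2.0, gamma=0.5):
    """SimPO-style margin loss computed on the alpha-shaped rewards."""
    r_chosen = alpha_shaped_reward(chosen_logps, chosen_lens, alpha)
    r_rejected = alpha_shaped_reward(rejected_logps, rejected_lens, alpha)
    # -log sigmoid(beta * (r_chosen - r_rejected) - gamma), averaged over the batch
    return -F.logsigmoid(beta * (r_chosen - r_rejected) - gamma).mean()


# Toy usage with made-up summed log-probs and token counts.
chosen_logps = torch.tensor([-120.0, -80.0])
rejected_logps = torch.tensor([-150.0, -95.0])
chosen_lens = torch.tensor([100.0, 60.0])
rejected_lens = torch.tensor([110.0, 70.0])
print(alpha_po_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens, alpha=0.1))
```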
-----
Key Insights from this Paper 🔑:
→ Reward shape significantly influences preference optimization in LLMs.
→ AlphaPO provides fine-grained control over likelihood displacement.
→ Tuning α improves alignment performance and generalization.
-----
Results 📊:
→ AlphaPO shows 7%-10% relative improvement in length-controlled win rate over SimPO on AlpacaEval 2 for Llama3-8B and Mistral-7B.
→ Combined with SPPO on PairRM-regenerated UltraFeedback data, AlphaPO reaches a 47.42% length-controlled win rate, surpassing SimPO+SPPO.