
"AlphaPO -- Reward shape matters for LLM alignment"

I generated the podcast below on this paper with Google's Illuminate.

AlphaPO improves LLM alignment by adjusting the reward function shape. This addresses likelihood displacement and over-optimization in direct alignment algorithms.

-----

https://arxiv.org/abs/2501.03884

Original Problem 🤔:

→ Direct alignment algorithms (DAAs) like DPO and SimPO often suffer from likelihood displacement.

→ That is, preferred responses become less likely during training, while dispreferred responses can become more likely.

-----

Solution in this Paper 💡:

→ AlphaPO introduces an α parameter to control the reward function shape in DAAs.

→ This parameter modifies the standard log reward used in SimPO.
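
For intuition, here is a minimal sketch (not the authors' implementation) of one α-parameterized reward shape built on the length-normalized log-likelihood that SimPO uses. The function name and the exact power-transform parameterization are illustrative assumptions; the point is that such a transform recovers the standard log reward in the limit α → 0 and bends the reward curve for nonzero α.

```python
import torch

def alpha_shaped_reward(logp_sum: torch.Tensor,
                        length: torch.Tensor,
                        beta: float = 2.0,
                        alpha: float = 0.0) -> torch.Tensor:
    # Length-normalized log-likelihood of a response, as in SimPO:
    # avg_logp = (1/|y|) * sum_t log pi_theta(y_t | x, y_<t)
    avg_logp = logp_sum / length
    if abs(alpha) < 1e-8:
        # alpha -> 0 limit: the standard SimPO log reward, beta * log p
        return beta * avg_logp
    # Illustrative power transform (p**alpha - 1) / alpha of the likelihood p
    # (here p = exp(avg_logp)); it equals log p at alpha = 0 and changes
    # shape as |alpha| grows.
    return beta * (torch.exp(alpha * avg_logp) - 1.0) / alpha
```

Tuning α changes how quickly the reward saturates for high- and low-likelihood responses, which is the lever the paper uses to limit likelihood displacement.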

-----

Key Insights from this Paper 🔑:

→ Reward shape significantly influences preference optimization in LLMs.

→ AlphaPO provides fine-grained control over likelihood displacement.

→ Tuning α improves alignment performance and generalization.
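
To see where α enters training, here is a hedged sketch of how such a shaped reward could be plugged into a SimPO-style pairwise margin loss (variable names and defaults are mine, not the paper's; γ is the target reward margin SimPO already uses):

```python
import torch
import torch.nn.functional as F

def alpha_po_pair_loss(logp_chosen: torch.Tensor, len_chosen: torch.Tensor,
                       logp_rejected: torch.Tensor, len_rejected: torch.Tensor,
                       beta: float = 2.0, gamma: float = 0.5,
                       alpha: float = 0.1) -> torch.Tensor:
    def shaped_reward(logp_sum, length):
        avg_logp = logp_sum / length  # length-normalized log-likelihood
        if abs(alpha) < 1e-8:
            return beta * avg_logp    # reduces to SimPO's log reward
        return beta * (torch.exp(alpha * avg_logp) - 1.0) / alpha

    # Bradley-Terry style margin objective: the chosen response's shaped
    # reward should exceed the rejected response's by at least gamma.
    # The value of alpha changes how the gradient is split between raising
    # the chosen likelihood and pushing down the rejected one.
    margin = shaped_reward(logp_chosen, len_chosen) - shaped_reward(logp_rejected, len_rejected)
    return -F.logsigmoid(margin - gamma).mean()
```

Sweeping α here with everything else fixed is the kind of reward-shape tuning the paper reports as improving alignment performance and generalization.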

-----

Results 📊:

→ AlphaPO shows 7%-10% relative improvement in length-controlled win rate over SimPO on AlpacaEval 2 for Llama3-8B and Mistral-7B.

→ When combined with SPPO on PairRM-regenerated UltraFeedback data, AlphaPO achieves a 47.42% length-controlled win rate, surpassing SimPO+SPPO.
