"Self-Play Preference Optimization for Language Model Alignment"

This podcast was generated from the paper with Google's Illuminate, a specialized tool that creates podcasts exclusively from arXiv papers.

By treating preference learning as a two-player game, Self-Play Preference Optimization (SPPO) trains LLMs to match complex human preferences.
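Concretely, in a minimal sketch using the paper's notation (π_t is the policy at iteration t, and P(y ≻ π_t | x) is the probability that response y is preferred to a response drawn from π_t), the game and the multiplicative-weights update toward its Nash equilibrium can be written as:

```latex
% Two-player constant-sum game over policies (sketch; notation follows the paper):
% each player picks a policy and the payoff is the expected preference probability.
\[
\max_{\pi}\;\min_{\pi'}\;
  \mathbb{E}_{x \sim \mathcal{X}}\,
  \mathbb{E}_{y \sim \pi(\cdot\mid x),\; y' \sim \pi'(\cdot\mid x)}
  \big[\, \mathbb{P}(y \succ y' \mid x) \,\big]
\]

% Multiplicative weights update used to approach the Nash equilibrium, with step
% size \eta and win rate
% \mathbb{P}(y \succ \pi_t \mid x) = \mathbb{E}_{y' \sim \pi_t(\cdot\mid x)}\big[\mathbb{P}(y \succ y' \mid x)\big]:
\[
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
  \exp\!\big( \eta\, \mathbb{P}(y \succ \pi_t \mid x) \big)
\]
```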

📚 https://arxiv.org/pdf/2405.00675

Original Problem 🎯:

Standard RLHF approaches using parametric models like Bradley-Terry cannot capture intransitive and irrational human preferences, limiting LLM alignment accuracy.

-----

Solution in this Paper 🔧:

• Introduces Self-Play Preference Optimization (SPPO), which treats preference learning as a two-player constant-sum game

• Uses the multiplicative weights update algorithm (sketched above) to find the Nash equilibrium policy

• Proposes a new objective function that avoids pairwise comparisons, motivated by game theory and policy gradient (a loss sketch follows this list)

• Implicitly encourages token-level optimal value function learning

• Converges to approximate Nash equilibrium with theoretical guarantees
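Below is a minimal, hedged sketch of what that objective looks like in code, assuming a PyTorch-style setup in which per-response log-probabilities and estimated win rates have already been computed; the function name `sppo_loss` and the default `eta` are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of the SPPO square-loss objective (illustrative names, see note above):
# regress the log-density ratio log(pi_theta(y|x) / pi_t(y|x)) onto the scaled,
# centered win rate eta * (P(y beats pi_t | x) - 1/2).
import torch

def sppo_loss(logp_theta: torch.Tensor,  # log pi_theta(y|x), summed over response tokens
              logp_ref: torch.Tensor,    # log pi_t(y|x) under the frozen current policy
              win_rate: torch.Tensor,    # estimated P(y beats pi_t | x), in [0, 1]
              eta: float = 1.0) -> torch.Tensor:
    """Square loss matching the log-likelihood ratio to the centered win rate."""
    log_ratio = logp_theta - logp_ref
    target = eta * (win_rate - 0.5)       # a 50% win rate maps to a zero target
    return ((log_ratio - target) ** 2).mean()

# Toy batch of three responses with precomputed quantities.
loss = sppo_loss(
    logp_theta=torch.tensor([-42.0, -37.5, -51.2], requires_grad=True),
    logp_ref=torch.tensor([-41.0, -39.0, -50.0]),
    win_rate=torch.tensor([0.62, 0.48, 0.55]),
)
loss.backward()
```

Because the regression target is a single response's win rate against the current policy, no paired (chosen, rejected) comparison is needed, which is the sense in which the objective avoids pairwise comparisons.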

-----

Key Insights 💡:

• Human preferences often exhibit inconsistency and intransitivity

• Direct preference probability prediction outperforms reward modeling (a win-rate estimation sketch follows this list)

• Square-loss objective more directly matches response likelihood to the estimated win rate

• Token-level optimization more effective than pairwise comparison

• Self-play mechanism enables stable iterative improvement
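As a rough illustration of the direct-prediction and self-play points above, here is a hedged sketch of how the win rate P(y ≻ π_t | x) can be estimated by Monte Carlo: sample several opponent responses from the current policy and average a preference model's probabilities. `sample_from_policy` and `preference_prob` are hypothetical stand-ins for the policy's sampler and a pairwise preference model (the paper uses a small off-the-shelf preference model in this role).

```python
# Hedged sketch: Monte Carlo estimate of P(response beats the current policy | prompt).
# `sample_from_policy` and `preference_prob` are illustrative stand-ins (see note above).
from typing import Callable, List

def estimate_win_rate(prompt: str,
                      response: str,
                      sample_from_policy: Callable[[str, int], List[str]],  # draws k responses from pi_t
                      preference_prob: Callable[[str, str, str], float],    # P(a beats b | prompt)
                      k: int = 5) -> float:
    """Average probability that `response` is preferred over k self-play samples."""
    opponents = sample_from_policy(prompt, k)
    probs = [preference_prob(prompt, response, opponent) for opponent in opponents]
    return sum(probs) / len(probs)
```

The returned estimate plays the role of `win_rate` in the loss sketch earlier in this post.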

-----

Results 📊:

• Achieves 28.53% length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0 using Mistral-7B

• Reaches 38.77% win rate starting from Llama-3-8B-Instruct

• Consistent gains across MT-Bench, Arena-Hard, and the Open LLM Leaderboard

• Outperforms DPO/IPO while maintaining shorter response lengths

• All achieved without external supervision from stronger models
