By treating preference learning as a two-player game, Self-Play Preference Optimization (SPPO) trains LLMs to match complex human preferences.
📚 https://arxiv.org/pdf/2405.00675
Original Problem 🎯:
Standard RLHF approaches rely on parametric preference models such as Bradley-Terry, which cannot capture intransitive and irrational human preferences, limiting LLM alignment accuracy.
-----
Solution in this Paper 🔧:
• Introduces Self-Play Preference Optimization (SPPO), which treats preference learning as a two-player constant-sum game
• Uses the multiplicative weights update algorithm to approximate the Nash equilibrium policy (formulas sketched below)
• Proposes a new objective function, motivated by game theory and policy gradients, that avoids pairwise comparisons
• Implicitly encourages learning of a token-level optimal value function
• Provably converges to an approximate Nash equilibrium
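A rough sketch of the core update and objective, as reconstructed from the paper's description (η is a step-size hyperparameter and P(y ≻ π_t | x) is the probability that response y beats a sample from the current policy π_t):

```latex
% Multiplicative weights update toward the Nash equilibrium
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,\exp\!\big(\eta\, P(y \succ \pi_t \mid x)\big)

% SPPO approximates this update with a square loss that matches the
% log-density ratio to the centered win rate; the log-normalizer is
% approximated by eta/2, since a policy's expected win rate against
% itself is 1/2.
\ell_{\mathrm{SPPO}}(\theta) \;=\;
\mathbb{E}_{x,\; y \sim \pi_t(\cdot \mid x)}
\Big[\Big(\log \tfrac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)}
\;-\; \eta\big(P(y \succ \pi_t \mid x) - \tfrac{1}{2}\big)\Big)^{2}\Big]
```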
-----
Key Insights 💡:
• Human preferences often exhibit inconsistency and intransitivity
• Direct preference probability prediction outperforms reward modeling
• A square-loss objective better matches response likelihood to win rate (see the code sketch after this list)
• Token-level optimization more effective than pairwise comparison
• Self-play mechanism enables stable iterative improvement
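To make the square-loss objective concrete, here is a minimal PyTorch sketch. It is not the authors' implementation; the tensor names, the η default, and the use of an external preference model (e.g., PairRM) to estimate win rates are assumptions for illustration.

```python
import torch

def sppo_loss(policy_logps: torch.Tensor,
              ref_logps: torch.Tensor,
              win_rate: torch.Tensor,
              eta: float = 1.0) -> torch.Tensor:
    """Square loss matching the log-density ratio to the centered win rate.

    policy_logps: summed token log-probs of each response under the model being trained
    ref_logps:    same quantity under the frozen previous-iteration policy pi_t
    win_rate:     estimated P(y beats a sample from pi_t | x), e.g. from a preference model
    eta:          step-size hyperparameter (placeholder value)
    """
    log_ratio = policy_logps - ref_logps   # log pi_theta(y|x) - log pi_t(y|x)
    target = eta * (win_rate - 0.5)        # eta * (P(y > pi_t | x) - 1/2)
    return ((log_ratio - target) ** 2).mean()

# Toy usage with dummy tensors
policy_logps = torch.randn(8, requires_grad=True)
ref_logps = torch.randn(8)
win_rate = torch.rand(8)
loss = sppo_loss(policy_logps, ref_logps, win_rate)
loss.backward()
```

Because the target depends only on each response's own estimated win rate, no chosen/rejected pair is needed per example, unlike DPO-style pairwise losses.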
-----
Results 📊:
• Achieves a 28.53% length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0 with a Mistral-7B base model
• Reaches a 38.77% length-controlled win rate when starting from Llama-3-8B-Instruct
• Consistent gains across MT-Bench, Arena-Hard, and the Open LLM Leaderboard
• Outperforms DPO and IPO while keeping responses shorter
• All achieved without external supervision from stronger models