Instead of humans teaching AI good behavior, the AI learns it through practice matches.
The eva framework (Evolving Alignment via Asymmetric Self-Play) lets LLMs teach themselves alignment through competitive self-play.
Two players, a creator and a solver, play a game over self-evolving prompts to discover better alignment strategies.
https://arxiv.org/abs/2411.00062
🎯 Original Problem:
Current LLM alignment methods rely on static human-written prompts, limiting models' ability to generalize beyond the training distribution and adapt to new scenarios.
-----
🔧 Solution in this Paper:
→ Introduces eva (Evolving Alignment via Asymmetric Self-Play), a framework that casts alignment as a game between two players
→ Creator generates informative prompt distributions using reward model feedback
→ Solver learns to produce preferred responses on creator's prompts
→ Uses an advantage-based proxy to estimate prompt informativeness and evolve new prompts (see the sketch below)
→ Integrates with existing preference optimization algorithms like DPO, SPPO
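
A minimal Python sketch of one creator-solver iteration under the assumptions above. The `creator`, `solver`, and `reward_model` objects and all method names here are placeholder interfaces for illustration, not the paper's code, and the advantage proxy shown (best-minus-average reward over sampled responses) is one plausible instantiation of the idea:

```python
# Hypothetical sketch of one eva iteration; `creator`, `solver`, and
# `reward_model` are placeholder interfaces, not the paper's actual API.

def estimate_informativeness(prompt, solver, reward_model, n_samples=4):
    """Advantage-based proxy (assumed form): gap between the best and the
    average reward over sampled responses. A large gap suggests the prompt
    sits at the edge of the solver's current ability."""
    responses = [solver.generate(prompt) for _ in range(n_samples)]
    rewards = [reward_model.score(prompt, r) for r in responses]
    return max(rewards) - sum(rewards) / len(rewards)

def eva_iteration(prompt_pool, creator, solver, reward_model, top_k=100):
    # Creator step: rank prompts by informativeness, then evolve new
    # prompts from the most informative ones.
    scored = [(p, estimate_informativeness(p, solver, reward_model))
              for p in prompt_pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    evolved = [creator.evolve(p) for p, _ in scored[:top_k]]

    # Solver step: build preference pairs on the evolved prompts and run
    # any preference-optimization update (DPO, SPPO, SimPO, ...).
    preference_data = []
    for p in evolved:
        responses = [solver.generate(p) for _ in range(2)]
        rewards = [reward_model.score(p, r) for r in responses]
        chosen, rejected = (responses if rewards[0] >= rewards[1]
                            else responses[::-1])
        preference_data.append(
            {"prompt": p, "chosen": chosen, "rejected": rejected})
    solver.preference_optimize(preference_data)
    return prompt_pool + evolved
```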
-----
💡 Key Insights:
→ Dynamic prompt evolution leads to better generalization than static prompts
→ Framework creates an auto-curriculum just beyond the model's current capabilities
→ Approach provides worst-case guarantees through a minimax regret objective (schematic form below)
→ Method scales effectively with larger reward models
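
The worst-case guarantee comes from viewing the creator-solver game through a minimax-regret lens. A schematic form, with r the reward model, x a prompt, and π the solver policy (notation illustrative, not the paper's exact statement):

```latex
\min_{\pi} \; \max_{x \in \mathcal{X}} \;
  \Big[ \underbrace{\max_{\pi'} \, \mathbb{E}_{y \sim \pi'(\cdot \mid x)} \, r(x, y)
        \;-\; \mathbb{E}_{y \sim \pi(\cdot \mid x)} \, r(x, y)}_{\text{regret of solver } \pi \text{ on prompt } x} \Big]
```

Intuitively, the creator pushes toward prompts x where the solver's regret is highest, and the solver trains to close that gap, which is what yields the auto-curriculum described above.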
-----
📊 Results:
→ Improves win rate on Arena-Hard: +8.5% (DPO), +3.2% (SPPO), +8.4% (SimPO)
→ Outperforms models trained on additional human prompts by 5.5%
→ Shows consistent gains on AlpacaEval and MT-Bench benchmarks