"Evolving Alignment via Asymmetric Self-Play"

The podcast on this paper is generated with Google's Illuminate.

Instead of humans teaching AI good behavior, the AI learns it through practice matches against itself.

The eva framework (Evolving Alignment via Asymmetric Self-Play) lets LLMs teach themselves alignment through competitive self-play.

Two LLMs play a creator-solver game to discover better alignment strategies with self-evolving prompts.

https://arxiv.org/abs/2411.00062

🎯 Original Problem:

Current LLM alignment methods rely on static human-written prompts, limiting models' ability to generalize beyond the training distribution and adapt to new scenarios.

-----

🔧 Solution in this Paper:

→ Introduces the eva (Evolving Alignment via Asymmetric Self-Play) framework, which casts alignment as a game between two players

→ Creator generates informative prompt distributions using reward model feedback

→ Solver learns to produce preferred responses on creator's prompts

→ Uses advantage-based proxy to estimate prompt informativeness and evolve new prompts

→ Integrates with existing preference optimization algorithms such as DPO and SPPO; a minimal sketch of one eva iteration follows below
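
Below is a minimal Python sketch of how one eva iteration could fit together under the creator-solver framing. The helper names (`solver`, `reward_model`, `evolve`, `dpo_update`) and the exact advantage formula are illustrative assumptions, not the paper's actual API; the point is only to show how informativeness scoring, prompt evolution, and preference optimization interleave in a single loop.

```python
# Minimal, illustrative sketch of one eva creator-solver iteration.
# `solver`, `reward_model`, `evolve`, and `dpo_update` are hypothetical
# stand-ins for the paper's components, not its actual API.

def advantage(prompt, solver, reward_model, n=4):
    """Assumed advantage-based informativeness proxy:
    best sampled reward minus the mean sampled reward."""
    rewards = [reward_model(prompt, solver.generate(prompt)) for _ in range(n)]
    return max(rewards) - sum(rewards) / len(rewards)

def eva_iteration(prompts, solver, reward_model, evolve, dpo_update, k=256):
    # Creator: rank prompts by estimated informativeness and keep the top-k.
    ranked = sorted(prompts, key=lambda x: advantage(x, solver, reward_model),
                    reverse=True)
    informative = ranked[:k]

    # Creator: evolve new prompt variants from the informative ones.
    curriculum = informative + [evolve(x) for x in informative]

    # Solver: build preference pairs on the evolved curriculum and optimize
    # with any contrastive objective (DPO / SPPO / SimPO).
    for x in curriculum:
        candidates = sorted((solver.generate(x) for _ in range(4)),
                            key=lambda y: reward_model(x, y), reverse=True)
        chosen, rejected = candidates[0], candidates[-1]
        dpo_update(solver, x, chosen, rejected)

    return curriculum
```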

-----

💡 Key Insights:

→ Dynamic prompt evolution leads to better generalization than static prompts

→ Framework creates auto-curricula just beyond the model's current capabilities

→ Approach provides worst-case guarantees through a minimax-regret objective (sketched after this list)

→ Method scales effectively with larger reward models
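
For the worst-case-guarantee claim, here is a hedged sketch of the minimax-regret reading of the creator-solver game; the notation (π for the solver policy, φ for the creator's prompt distribution, r for the reward model) is assumed rather than quoted from the paper.

```latex
% Assumed notation, not quoted from the paper:
% \pi = solver policy, \phi = creator prompt distribution, r = reward model.
\[
\min_{\pi}\;\max_{\phi}\;
\mathbb{E}_{x \sim \phi}\big[\mathrm{Regret}(x,\pi)\big],
\qquad
\mathrm{Regret}(x,\pi)
= \max_{\pi'} \mathbb{E}_{y \sim \pi'(\cdot\mid x)}\, r(x,y)
- \mathbb{E}_{y \sim \pi(\cdot\mid x)}\, r(x,y).
\]
```

Under this reading, the creator seeks prompts where the solver's regret is largest, and the solver trains to shrink that regret, which is where the worst-case guarantee comes from.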

-----

📊 Results:

→ Improves win rate on Arena-Hard: +8.5% (DPO), +3.2% (SPPO), +8.4% (SimPO)

→ Outperforms models trained on additional human prompts by 5.5%

→ Shows consistent gains on AlpacaEval and MT-Bench benchmarks
