Instead of humans teaching AI good behavior, the AI learns it through practice matches.
The eva framework (Evolving Alignment via Asymmetric Self-Play) lets LLMs teach themselves alignment through competitive self-play.
Two players, a creator and a solver, play a game over self-evolving prompts to discover better alignment strategies.
https://arxiv.org/abs/2411.00062
🎯 Original Problem:
Current LLM alignment methods rely on static human-written prompts, limiting models' ability to generalize beyond the training distribution and adapt to new scenarios.
-----
🔧 Solution in this Paper:
→ Introduces eva (Evolving Alignment via Asymmetric Self-Play), a framework that casts alignment as a game between two players
→ Creator generates informative prompt distributions using reward model feedback
→ Solver learns to produce preferred responses on creator's prompts
→ Uses an advantage-based proxy to estimate prompt informativeness and evolve new prompts (see the sketch below)
→ Integrates with existing preference optimization algorithms like DPO, SPPO
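
A minimal Python sketch of one creator-solver iteration under the assumptions above. The `creator`, `solver`, and `reward_model` objects and all method names here are placeholder interfaces for illustration, not the paper's code, and the advantage proxy shown (best-minus-average reward over sampled responses) is one plausible instantiation of the idea:

```python
# Hypothetical sketch of one eva iteration; `creator`, `solver`, and
# `reward_model` are placeholder interfaces, not the paper's actual API.

def estimate_informativeness(prompt, solver, reward_model, n_samples=4):
    """Advantage-based proxy (assumed form): gap between the best and the
    average reward over sampled responses. A large gap suggests the prompt
    sits at the edge of the solver's current ability."""
    responses = [solver.generate(prompt) for _ in range(n_samples)]
    rewards = [reward_model.score(prompt, r) for r in responses]
    return max(rewards) - sum(rewards) / len(rewards)

def eva_iteration(prompt_pool, creator, solver, reward_model, top_k=100):
    # Creator step: rank prompts by informativeness, then evolve new
    # prompts from the most informative ones.
    scored = [(p, estimate_informativeness(p, solver, reward_model))
              for p in prompt_pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    evolved = [creator.evolve(p) for p, _ in scored[:top_k]]

    # Solver step: build preference pairs on the evolved prompts and run
    # any preference-optimization update (DPO, SPPO, SimPO, ...).
    preference_data = []
    for p in evolved:
        responses = [solver.generate(p) for _ in range(2)]
        rewards = [reward_model.score(p, r) for r in responses]
        chosen, rejected = (responses if rewards[0] >= rewards[1]
                            else responses[::-1])
        preference_data.append(
            {"prompt": p, "chosen": chosen, "rejected": rejected})
    solver.preference_optimize(preference_data)
    return prompt_pool + evolved
```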
-----
💡 Key Insights:
→ Dynamic prompt evolution leads to better generalization than static prompts
→ Framework creates an auto-curriculum just beyond the model's current capabilities
→ Approach provides worst-case guarantees through a minimax regret objective (schematic form below)
→ Method scales effectively with larger reward models
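
The worst-case guarantee comes from viewing the creator-solver game through a minimax-regret lens. A schematic form, with r the reward model, x a prompt, and π the solver policy (notation illustrative, not the paper's exact statement):

```latex
\min_{\pi} \; \max_{x \in \mathcal{X}} \;
  \Big[ \underbrace{\max_{\pi'} \, \mathbb{E}_{y \sim \pi'(\cdot \mid x)} \, r(x, y)
        \;-\; \mathbb{E}_{y \sim \pi(\cdot \mid x)} \, r(x, y)}_{\text{regret of solver } \pi \text{ on prompt } x} \Big]
```

Intuitively, the creator pushes toward prompts x where the solver's regret is highest, and the solver trains to close that gap, which is what yields the auto-curriculum described above.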
-----
📊 Results:
→ Improves win rate on Arena-Hard: +8.5% (DPO), +3.2% (SPPO), +8.4% (SimPO)
→ Outperforms models trained on additional human prompts by 5.5%
→ Shows consistent gains on AlpacaEval and MT-Bench benchmarks