Great paper from @Meta on Constrained Generative Policy Optimization (CGPO) for multi-task large language model alignment.
Mixture of judges in CGPO prevents reward hacking while enhancing LLM capabilities in multi-task settings.
CGPO avoids PPO's severe regression on coding tasks.
📚 https://arxiv.org/pdf/2409.20370
Original Problem 🔍:
Current RLHF methods struggle with reward hacking and with conflicting objectives in multi-task LLM alignment. Collapsing multiple reward models into a single linear combination discards key information about the individual objectives.
Reward hacking occurs when the model learns to exploit imperfections or blind spots in the reward model rather than genuinely improving in line with human preferences.
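To see why a fixed linear combination loses information, here is a toy numeric sketch (the scores and weights below are invented for illustration, not taken from the paper): two responses with very different safety profiles collapse to the same scalar reward, so the optimizer cannot tell them apart.

```python
# Toy illustration (hypothetical scores): a fixed weighted sum of per-objective
# rewards hides the fact that response "A" violates the safety objective.
responses = {
    "A": {"helpfulness": 0.9, "safety": 0.1},  # unsafe but very helpful
    "B": {"helpfulness": 0.5, "safety": 0.5},  # balanced and safe enough
}
weights = {"helpfulness": 0.5, "safety": 0.5}  # fixed linear combination

for name, scores in responses.items():
    combined = sum(weights[k] * v for k, v in scores.items())
    print(name, combined)  # both print 0.5 -- the safety violation in "A" is invisible
```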
-----
Solution in this Paper 💡:
• Introduces the CGPO framework with a mixture of judges (MoJs), both rule-based and LLM-based, to detect reward hacking (see the sketch after this list)
• Uses primal-type constrained RL optimizers: Calibrated-Regularized Policy Gradient (CRPG), Calibrated-Regularized Reward Ranking Finetuning (CRRAFT), and Constrained Online DPO (CODPO)
• Tailors reward models, judges, and optimizers to each task
• Calibrated rewards make reward-model scores comparable, addressing miscalibration
• Warm-up phase with DPO before online RLHF
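A minimal sketch of the mixture-of-judges gating idea, assuming hypothetical judge functions and pre-scored rollouts (the names and the keyword-based safety check are illustrative, not the paper's implementation): generations that violate any judge are filtered out before the constrained policy update, instead of having violations blended into a single reward.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (prompt, response) -> True if the constraint is satisfied

def rule_based_length_judge(prompt: str, response: str) -> bool:
    # Rule-based judge: reject degenerate, overly short generations.
    return len(response.split()) >= 5

def keyword_safety_judge(prompt: str, response: str) -> bool:
    # Stand-in for an LLM-based safety judge (a real one would call a classifier).
    return "UNSAFE" not in response

def filter_with_judges(
    samples: List[Tuple[str, str, float]],  # (prompt, response, calibrated reward)
    judges: List[Judge],
) -> List[Tuple[str, str, float]]:
    """Keep only constraint-satisfying samples; a CRPG/CRRAFT/CODPO-style
    policy update would then be computed on this filtered set."""
    kept = []
    for prompt, response, reward in samples:
        if all(judge(prompt, response) for judge in judges):
            kept.append((prompt, response, reward))
    return kept

# Usage: one batch of hypothetical rollouts from the current policy.
batch = [
    ("Explain recursion.", "Recursion is when a function calls itself until a base case stops it.", 0.8),
    ("Explain recursion.", "UNSAFE content that still scores well on helpfulness.", 0.9),
]
print(filter_with_judges(batch, [rule_based_length_judge, keyword_safety_judge]))
```

The key design choice this sketch tries to convey is that constraints are enforced by judges rather than folded into the scalar reward, which is what blocks reward hacking from leaking into the policy update.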
-----
Key Insights from this Paper 💡:
• MoJs crucial for preventing reward hacking and boosting performance
• Task-specific optimization outperforms uniform approaches
• CGPO improves consistently across benchmarks, whereas PPO regresses (notably on coding tasks)
• Warm-up phase significantly enhances RLHF performance
-----
Results 📊:
CGPO outperforms PPO and DPO baselines across benchmarks:
• AlpacaEval-2: +18.4% (CRRAFT)
• Arena-Hard: +12.5% (CRRAFT)
• IFEval: +2% (CRPG/CRRAFT)
• MATH: +2% (CRPG)
• 0-shot HumanEval: +17% (CRPG)
• ARC Challenge: +2% (CRPG)