Great paper from @Meta on Constrained Generative Policy Optimization (CGPO) for multi-task large language model alignment.
Mixture of judges in CGPO prevents reward hacking while enhancing LLM capabilities in multi-task settings.
CGPO avoids PPO's severe regression on coding tasks.
📚 https://arxiv.org/pdf/2409.20370
Original Problem 🔍:
Current RLHF methods struggle with reward hacking and with conflicting objectives in multi-task LLM alignment. Collapsing multiple reward models into a single linear combination discards key information about the individual objectives.
Reward hacking occurs when the model learns to exploit imperfections or blind spots in the reward model rather than genuinely improving in line with human preferences.
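To see why a fixed linear combination loses information, here is a toy numeric sketch (the scores and weights below are invented for illustration, not taken from the paper): two responses with very different safety profiles collapse to the same scalar reward, so the optimizer cannot tell them apart.

```python
# Toy illustration (hypothetical scores): a fixed weighted sum of per-objective
# rewards hides the fact that response "A" violates the safety objective.
responses = {
    "A": {"helpfulness": 0.9, "safety": 0.1},  # unsafe but very helpful
    "B": {"helpfulness": 0.5, "safety": 0.5},  # balanced and safe enough
}
weights = {"helpfulness": 0.5, "safety": 0.5}  # fixed linear combination

for name, scores in responses.items():
    combined = sum(weights[k] * v for k, v in scores.items())
    print(name, combined)  # both print 0.5 -- the safety violation in "A" is invisible
```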
-----
Solution in this Paper 💡:
• Introduces the CGPO framework with a mixture of judges (MoJs), both rule-based and LLM-based, to detect reward hacking (see the sketch after this list)
• Uses primal-type constrained RL optimizers: Calibrated-Regularized Policy Gradient (CRPG), Calibrated-Regularized Reward Ranking Finetuning (CRRAFT), and Constrained Online DPO (CODPO)
• Tailors reward models, judges, and optimizers to each task
• Calibrated rewards make reward-model scores comparable, addressing miscalibration
• Warm-up phase with DPO before online RLHF
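A minimal sketch of the mixture-of-judges gating idea, assuming hypothetical judge functions and pre-scored rollouts (the names and the keyword-based safety check are illustrative, not the paper's implementation): generations that violate any judge are filtered out before the constrained policy update, instead of having violations blended into a single reward.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (prompt, response) -> True if the constraint is satisfied

def rule_based_length_judge(prompt: str, response: str) -> bool:
    # Rule-based judge: reject degenerate, overly short generations.
    return len(response.split()) >= 5

def keyword_safety_judge(prompt: str, response: str) -> bool:
    # Stand-in for an LLM-based safety judge (a real one would call a classifier).
    return "UNSAFE" not in response

def filter_with_judges(
    samples: List[Tuple[str, str, float]],  # (prompt, response, calibrated reward)
    judges: List[Judge],
) -> List[Tuple[str, str, float]]:
    """Keep only constraint-satisfying samples; a CRPG/CRRAFT/CODPO-style
    policy update would then be computed on this filtered set."""
    kept = []
    for prompt, response, reward in samples:
        if all(judge(prompt, response) for judge in judges):
            kept.append((prompt, response, reward))
    return kept

# Usage: one batch of hypothetical rollouts from the current policy.
batch = [
    ("Explain recursion.", "Recursion is when a function calls itself until a base case stops it.", 0.8),
    ("Explain recursion.", "UNSAFE content that still scores well on helpfulness.", 0.9),
]
print(filter_with_judges(batch, [rule_based_length_judge, keyword_safety_judge]))
```

The key design choice this sketch tries to convey is that constraints are enforced by judges rather than folded into the scalar reward, which is what blocks reward hacking from leaking into the policy update.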
-----
Key Insights from this Paper 💡:
• MoJs crucial for preventing reward hacking and boosting performance
• Task-specific optimization outperforms uniform approaches
• CGPO improves consistently across benchmarks, whereas PPO regresses (notably on coding tasks)
• Warm-up phase significantly enhances RLHF performance
-----
Results 📊:
CGPO outperforms PPO and DPO baselines across benchmarks:
• AlpacaEval-2: +18.4% (CRRAFT)
• Arena-Hard: +12.5% (CRRAFT)
• IFEval: +2% (CRPG/CRRAFT)
• MATH: +2% (CRPG)
• 0-shot HumanEval: +17% (CRPG)
• ARC Challenge: +2% (CRPG)