Stop reward hacking! Causal rewards make LLMs fairer and more reliable.
Reinforcement Learning from Human Feedback (RLHF) is susceptible to reward hacking due to spurious correlations in training data. This paper introduces Causal Reward Modeling (CRM) to mitigate these correlations by enforcing counterfactual invariance.
-----
https://arxiv.org/abs/2501.09620
Original Problem 🤔:
→ LLMs trained with RLHF often exhibit biases like length bias, sycophancy, concept bias, and discrimination due to spurious correlations in the reward data.
-----
Solution in this Paper 💡:
→ Causal Reward Modeling (CRM) integrates causal inference into reward modeling.
→ CRM enforces counterfactual invariance, ensuring that reward predictions remain consistent when irrelevant variables are altered.
→ This is done by applying Maximum Mean Discrepancy (MMD) regularization to minimize discrepancies in reward predictions across groups defined by spurious factors (a minimal sketch follows this list).
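To make the idea concrete, here is a minimal sketch (not the paper's code) of a Bradley-Terry reward loss with an RBF-kernel MMD penalty that pushes reward distributions to match across groups split by a spurious factor (e.g., a response-length bucket). The names `spurious_group`, `crm_loss`, the bandwidth, and the weight `lam` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, bandwidth=1.0):
    """Squared MMD between two 1-D samples of rewards under an RBF kernel."""
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def crm_loss(r_chosen, r_rejected, rewards, spurious_group, lam=1.0):
    """
    r_chosen, r_rejected: reward scores for preferred / dispreferred responses (B,)
    rewards:              reward scores for all responses in the batch (N,)
    spurious_group:       0/1 indicator of the spurious factor per response (N,)
    lam:                  weight on the invariance penalty (assumed hyperparameter)
    """
    # Standard Bradley-Terry preference loss on chosen vs. rejected pairs
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # MMD penalty: reward distribution should look the same in both groups,
    # approximating counterfactual invariance to the spurious factor
    g0 = rewards[spurious_group == 0]
    g1 = rewards[spurious_group == 1]
    if len(g0) > 1 and len(g1) > 1:
        penalty = rbf_mmd(g0, g1)
    else:
        penalty = rewards.new_zeros(())

    return pref_loss + lam * penalty
```

In this sketch the penalty is driven to zero only when the reward model's outputs are distributed identically across the spurious groups, which is one practical way to operationalize the counterfactual-invariance constraint described above.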
-----
Key Insights from this Paper 🔑:
→ Counterfactual invariance is a key principle for robust reward modeling.
→ MMD regularization provides a practical way to enforce counterfactual invariance.
→ CRM can mitigate various types of biases in LLMs, improving alignment with human preferences.
-----
Results 💯:
→ Reduces sycophancy from 92.67% to 19.78% on semi-synthetic data.
→ Improves win rate against the SFT baseline on the Alpaca dataset while mitigating length bias.
→ Reduces concept bias by up to 97% on the Yelp dataset.
→ Lowers discrimination scores across various demographic attributes.