Improve RLHF alignment by influencing human preferences, not just reward functions.
RLHF algorithms assume a model of human preferences, but a mismatch between the assumed model and how humans generate preferences can lead to misaligned reward functions.
This paper investigates methods to influence human preferences to better conform to a desired model, improving alignment.
https://arxiv.org/abs/2501.06416
Original Problem 🤔:
→ LLMs trained with Reinforcement Learning from Human Feedback (RLHF) rely on models of human preferences.
→ A mismatch between the assumed model and actual human preferences can lead to poorly aligned reward functions (the two commonly assumed models are sketched below).
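For context, two preference models are commonly assumed in this line of work: the partial return (Bradley-Terry-style) model and the regret model. A minimal sketch of both follows; the inverse-temperature β and the exact definition of segment regret are standard-literature assumptions, not details quoted from this paper:

$$P(\sigma_1 \succ \sigma_2) = \frac{\exp\big(\beta \sum_t r_t^{\sigma_1}\big)}{\exp\big(\beta \sum_t r_t^{\sigma_1}\big) + \exp\big(\beta \sum_t r_t^{\sigma_2}\big)} \qquad \text{(partial return)}$$

$$P(\sigma_1 \succ \sigma_2) = \frac{\exp\big(-\beta\,\mathrm{regret}(\sigma_1)\big)}{\exp\big(-\beta\,\mathrm{regret}(\sigma_1)\big) + \exp\big(-\beta\,\mathrm{regret}(\sigma_2)\big)} \qquad \text{(regret)}$$

Here $\sum_t r_t^{\sigma}$ is the sum of rewards along segment $\sigma$, and $\mathrm{regret}(\sigma)$ measures how far $\sigma$ falls short of optimal behavior from its start state.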
Solution in this Paper 💡:
→ This paper proposes influencing human preferences during data collection to better match the assumed model.
→ Three interventions are introduced: showing privileged information (e.g., each segment's regret or partial return; see the sketch after this list), training humans to follow a preference model, and modifying the preference elicitation question.
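As a rough illustration of the privileged-information intervention, the two statistics that could be surfaced to annotators alongside each segment are sketched below. This is a hypothetical Python sketch: the segment format, the discount gamma, and the optimal value function value_fn are assumptions, not details from the paper.

```python
from typing import Callable, List, Tuple

# A segment is a list of (state, action, reward) steps.
Step = Tuple[object, object, float]

def partial_return(segment: List[Step], gamma: float = 1.0) -> float:
    """Discounted sum of rewards along the segment (the 'partial return' statistic)."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(segment))

def regret(segment: List[Step],
           end_state: object,
           value_fn: Callable[[object], float],
           gamma: float = 1.0) -> float:
    """How far the segment falls short of optimal behavior from its start state:
    V*(s_0) minus (partial return plus the discounted value of the state it ends in)."""
    start_state = segment[0][0]
    horizon = len(segment)
    return value_fn(start_state) - (
        partial_return(segment, gamma) + (gamma ** horizon) * value_fn(end_state)
    )
```

Showing annotators partial returns nudges them toward the partial return model, while showing regrets nudges them toward the regret model, without altering the task's reward function.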
Key Insights from this Paper 🔑:
→ Human preferences can be substantially shifted to conform to a specific model, such as the partial return or regret model, without changing the underlying task reward function.
→ Training and interface design can be valuable tools for improving alignment in RLHF.
→ Modifying the preference elicitation question can moderately influence preferences towards certain models, which has practical implications for real-world RLHF deployments.
Results 💯:
→ Showing privileged information significantly increased the likelihood of the collected preferences under the target preference model (p < 0.01).
→ Training humans also significantly improved this likelihood (p < 0.01 for both the regret and partial return models).
→ Changing the elicitation question had a significant effect only for the partial return model (p < 0.05), with a smaller effect size (a sketch of how such likelihoods can be computed follows below).
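For intuition on how "likelihood under the target preference model" can be evaluated, here is a minimal sketch. The logistic link and the noise parameter beta are standard modeling assumptions; the paper's exact likelihood and significance tests are not reproduced here.

```python
import math
from typing import List, Tuple

# Each datapoint: (utility of segment 1, utility of segment 2, label), where label = 1
# if the human preferred segment 1, else 0. The "utility" is whatever statistic the
# candidate model scores segments with (partial return, or negative regret).
Preference = Tuple[float, float, int]

def log_likelihood(data: List[Preference], beta: float = 1.0) -> float:
    """Log-likelihood of the preference labels under a logistic (Bradley-Terry) model."""
    ll = 0.0
    for u1, u2, label in data:
        # Numerically stable logistic: P(segment 1 preferred) = sigmoid(beta * (u1 - u2))
        p1 = 0.5 * (1.0 + math.tanh(0.5 * beta * (u1 - u2)))
        p = p1 if label == 1 else 1.0 - p1
        ll += math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return ll
```

Scoring the same labels with partial-return utilities versus negative-regret utilities indicates which model the collected preferences conform to more closely; the interventions above raise the likelihood under the targeted model.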