Improve RLHF alignment by influencing human preferences, not just reward functions.
RLHF algorithms assume a model of human preferences, but a mismatch between the assumed model and how humans generate preferences can lead to misaligned reward functions.
This paper investigates methods to influence human preferences to better conform to a desired model, improving alignment.
https://arxiv.org/abs/2501.06416
Original Problem 🤔:
→ LLMs trained with Reinforcement Learning from Human Feedback (RLHF) rely on models of human preferences.
→ A mismatch between the assumed model and actual human preferences can lead to poorly aligned reward functions (the two commonly assumed models are sketched below).
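For context, two preference models are commonly assumed in this line of work: the partial return (Bradley-Terry-style) model and the regret model. A minimal sketch of both follows; the inverse-temperature β and the exact definition of segment regret are standard-literature assumptions, not details quoted from this paper:

$$P(\sigma_1 \succ \sigma_2) = \frac{\exp\big(\beta \sum_t r_t^{\sigma_1}\big)}{\exp\big(\beta \sum_t r_t^{\sigma_1}\big) + \exp\big(\beta \sum_t r_t^{\sigma_2}\big)} \qquad \text{(partial return)}$$

$$P(\sigma_1 \succ \sigma_2) = \frac{\exp\big(-\beta\,\mathrm{regret}(\sigma_1)\big)}{\exp\big(-\beta\,\mathrm{regret}(\sigma_1)\big) + \exp\big(-\beta\,\mathrm{regret}(\sigma_2)\big)} \qquad \text{(regret)}$$

Here $\sum_t r_t^{\sigma}$ is the sum of rewards along segment $\sigma$, and $\mathrm{regret}(\sigma)$ measures how far $\sigma$ falls short of optimal behavior from its start state.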
Solution in this Paper 💡:
→ This paper proposes influencing human preferences during data collection to better match the assumed model.
→ Three interventions are introduced: showing privileged information (e.g., each segment's regret or partial return; see the sketch after this list), training humans to follow a preference model, and modifying the preference elicitation question.
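As a rough illustration of the privileged-information intervention, the two statistics that could be surfaced to annotators alongside each segment are sketched below. This is a hypothetical Python sketch: the segment format, the discount gamma, and the optimal value function value_fn are assumptions, not details from the paper.

```python
from typing import Callable, List, Tuple

# A segment is a list of (state, action, reward) steps.
Step = Tuple[object, object, float]

def partial_return(segment: List[Step], gamma: float = 1.0) -> float:
    """Discounted sum of rewards along the segment (the 'partial return' statistic)."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(segment))

def regret(segment: List[Step],
           end_state: object,
           value_fn: Callable[[object], float],
           gamma: float = 1.0) -> float:
    """How far the segment falls short of optimal behavior from its start state:
    V*(s_0) minus (partial return plus the discounted value of the state it ends in)."""
    start_state = segment[0][0]
    horizon = len(segment)
    return value_fn(start_state) - (
        partial_return(segment, gamma) + (gamma ** horizon) * value_fn(end_state)
    )
```

Showing annotators partial returns nudges them toward the partial return model, while showing regrets nudges them toward the regret model, without altering the task's reward function.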
Key Insights from this Paper 🔑:
→ Human preferences can be substantially shifted to conform to a specific model, such as the partial return or regret model, without changing the underlying task reward function.
→ Training and interface design can be valuable tools for improving alignment in RLHF.
→ Modifying the preference elicitation question can moderately influence preferences towards certain models, which has practical implications for real-world RLHF deployments.
Results 💯:
→ Showing privileged information significantly increased the likelihood of the collected preferences under the target preference model (p < 0.01).
→ Training humans also significantly improved this likelihood (p < 0.01 for both the regret and partial return models).
→ Changing the elicitation question had a significant effect only for the partial return model (p < 0.05), with a smaller effect size (a sketch of how such likelihoods can be computed follows below).
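For intuition on how "likelihood under the target preference model" can be evaluated, here is a minimal sketch. The logistic link and the noise parameter beta are standard modeling assumptions; the paper's exact likelihood and significance tests are not reproduced here.

```python
import math
from typing import List, Tuple

# Each datapoint: (utility of segment 1, utility of segment 2, label), where label = 1
# if the human preferred segment 1, else 0. The "utility" is whatever statistic the
# candidate model scores segments with (partial return, or negative regret).
Preference = Tuple[float, float, int]

def log_likelihood(data: List[Preference], beta: float = 1.0) -> float:
    """Log-likelihood of the preference labels under a logistic (Bradley-Terry) model."""
    ll = 0.0
    for u1, u2, label in data:
        # Numerically stable logistic: P(segment 1 preferred) = sigmoid(beta * (u1 - u2))
        p1 = 0.5 * (1.0 + math.tanh(0.5 * beta * (u1 - u2)))
        p = p1 if label == 1 else 1.0 - p1
        ll += math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return ll
```

Scoring the same labels with partial-return utilities versus negative-regret utilities indicates which model the collected preferences conform to more closely; the interventions above raise the likelihood under the targeted model.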