Make LLMs explore beyond human rewards: unleash curiosity.
The paper addresses the challenge of aligning LLMs with human preferences using Reinforcement Learning from Human Feedback (RLHF). It introduces a novel curiosity-driven approach to enhance exploration and improve the efficiency of RLHF.
https://arxiv.org/abs/2501.11463
Original Problem 🧐:
→ Existing RLHF methods for LLMs often struggle with sparse and delayed rewards from human feedback.
→ This can lead to inefficient exploration and suboptimal policy learning.
→ Current approaches may not fully capture the nuances of human preferences, especially in complex tasks.
-----
Solution in this Paper 😎:
→ This paper proposes Curiosity-driven Reinforcement Learning from Human Feedback (CRF).
→ CRF augments the standard RLHF reward with an intrinsic curiosity reward.
→ This curiosity reward encourages the LLM to explore novel and uncertain states during training.
→ The curiosity reward is computed from the prediction error of learned dynamics models.
→ Specifically, it uses the disagreement across an ensemble of dynamics models to estimate uncertainty (see the sketch after this list).
→ This encourages exploration beyond simply maximizing immediate human feedback.
→ CRF aims to improve exploration efficiency and discover policies that are better aligned with human preferences.
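A minimal sketch of how such an ensemble-disagreement bonus could be computed, in the spirit of the method described above. This is not the paper's code; the class and function names, network shapes, and the β weighting are illustrative assumptions:

```python
# Illustrative sketch (not from the paper): each dynamics model predicts the
# next hidden state from the current hidden state and the chosen token.
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """One ensemble member: predicts the next hidden state."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # state: (batch, hidden_dim), action: (batch,) token ids
        return self.net(torch.cat([state, self.token_emb(action)], dim=-1))


def curiosity_reward(ensemble, state, action):
    """Disagreement (variance) across ensemble predictions as an uncertainty bonus."""
    with torch.no_grad():
        preds = torch.stack([m(state, action) for m in ensemble])  # (K, batch, hidden)
    # High variance across members -> novel/uncertain state -> high curiosity.
    return preds.var(dim=0).mean(dim=-1)  # (batch,)


def total_reward(r_human, r_curiosity, beta=0.1):
    """Combined training signal: extrinsic human-preference reward + weighted curiosity bonus."""
    return r_human + beta * r_curiosity


# Usage: an ensemble of K independently initialized models (sizes hypothetical).
ensemble = [DynamicsModel(hidden_dim=768, vocab_size=50257) for _ in range(5)]
```

One note on the design: disagreement across an ensemble tends to be more robust than a single model's raw prediction error, because all members eventually agree on noisy-but-learnable transitions while still diverging on genuinely novel states.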
-----
Key Insights from this Paper 🤔:
→ Intrinsic motivation, specifically curiosity, can be effectively integrated into RLHF for LLMs.
→ Using prediction error from ensemble dynamics models provides a robust signal for curiosity.
→ Curiosity-driven exploration can help overcome the limitations of sparse and delayed human feedback (illustrated in the sketch below).
→ This approach can lead to more efficient and effective alignment of LLMs with human preferences.
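To make the sparse-feedback point concrete, here is one hedged illustration of how the two signals could be combined per token, assuming the standard RLHF setup where the preference reward model scores only the completed sequence (names and the β value are hypothetical):

```python
# Hypothetical per-token reward assembly: the human preference score is sparse
# (arrives only at the end of the generated sequence), while the curiosity
# bonus is dense (available at every token), giving the policy a richer signal.
def assemble_rewards(curiosity_per_token, human_score, beta=0.1):
    rewards = [beta * c for c in curiosity_per_token]
    rewards[-1] += human_score  # extrinsic feedback only at sequence end
    return rewards
```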
-----
Results 🤩:
→ CRF outperforms baseline RLHF methods on text generation tasks.
→ CRF achieves a 9.8% win rate improvement over standard RLHF in pairwise human preference evaluations.
→ CRF shows improved exploration behavior, leading to the discovery of more diverse and preferred text outputs.