"Process Reinforcement through Implicit Rewards"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01456
The challenge in reinforcement learning for Large Language Models (LLMs) lies in the difficulty of acquiring and using high-quality dense rewards for effective online training. Existing methods struggle with the cost of process labels and the risk of reward hacking.
This paper introduces PRIME (Process Reinforcement through Implicit Rewards), which derives implicit process rewards from a reward model that is updated online. This bypasses the need for expensive process labels and reduces the risk of reward hacking in online LLM reinforcement learning.
-----
📌 PRIME ingeniously uses implicit process rewards. This allows online Process Reward Model updates using only outcome labels. This online update is key to preventing reward hacking, a major RL challenge.
📌 PRIME simplifies reward model training. It initializes the Implicit Process Reward Model directly from the Supervised Fine-Tuned model. This eliminates the need for a separate, costly reward model pre-training phase.
📌 PRIME is a versatile method. It integrates seamlessly with reinforcement learning algorithms such as RLOO, REINFORCE, and PPO, consistently boosting performance and sample efficiency across them.
----------
Methods Explored in this Paper 🔧:
→ PRIME uses an Implicit Process Reward Model (PRM). This PRM is trained online with outcome labels from policy rollouts.
→ Implicit process rewards are token-level rewards, computed as the log-probability ratio between the PRM and a reference model (see the sketch after this list).
→ PRIME combines these implicit process rewards with sparse outcome rewards in a Monte Carlo advantage estimate that uses a leave-one-out baseline.
→ The PRM is updated online with a cross-entropy loss computed on policy rollouts and their outcome labels.
→ Policy updates are performed using Proximal Policy Optimization (PPO) with a clip surrogate objective.
→ The PRM is initialized directly from the Supervised Fine-Tuned (SFT) model. This eliminates a separate reward model training phase.
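
As a rough illustration of the two core ingredients above (a minimal sketch, not the authors' implementation): the token-level implicit rewards are scaled log-probability ratios between the PRM and a frozen reference model, and they are combined with sparse outcome rewards through a leave-one-out baseline. The function names, the `beta` scale, and the exact baseline used for the process returns are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def implicit_token_rewards(prm_logits, ref_logits, response_ids, beta=0.05):
    # r_t = beta * log( pi_prm(y_t | y_<t) / pi_ref(y_t | y_<t) )
    # prm_logits, ref_logits: [K, T, V] logits over the response tokens
    # response_ids: [K, T] sampled token ids; beta is an assumed scale
    prm_logp = F.log_softmax(prm_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    ids = response_ids.unsqueeze(-1)                          # [K, T, 1]
    token_prm = prm_logp.gather(-1, ids).squeeze(-1)          # [K, T]
    token_ref = ref_logp.gather(-1, ids).squeeze(-1)          # [K, T]
    return beta * (token_prm - token_ref)                     # token-level rewards

def combined_advantages(process_rewards, outcome_rewards):
    # process_rewards: [K, T] implicit token rewards for K rollouts of one prompt
    # outcome_rewards: [K] sparse correctness rewards for the same rollouts (K > 1)
    K = outcome_rewards.shape[0]
    # Leave-one-out baseline for the sparse outcome reward (RLOO-style)
    outcome_adv = outcome_rewards - (outcome_rewards.sum() - outcome_rewards) / (K - 1)
    # Monte Carlo return of future process rewards at each token (gamma = 1)
    process_return = process_rewards.flip(-1).cumsum(-1).flip(-1)      # [K, T]
    # Simple leave-one-out baseline on per-rollout totals (an assumption here)
    total_proc = process_rewards.sum(-1)                               # [K]
    proc_baseline = (total_proc.sum() - total_proc) / (K - 1)          # [K]
    process_adv = process_return - proc_baseline.unsqueeze(-1)         # [K, T]
    return outcome_adv.unsqueeze(-1) + process_adv                     # [K, T]
```

In practice both models would be causal LMs scoring the sampled response tokens; `prm_logits` and `ref_logits` here simply stand in for their per-token output logits.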
-----
Key Insights 💡:
→ Implicit Process Reward Modeling enables online reward model updates without costly process labels.
→ Online updating of the PRM is crucial. It prevents reward overoptimization and distribution shift during reinforcement learning (a minimal sketch of this update follows this list).
→ Initializing the PRM with the SFT model is effective. It simplifies the process and improves performance.
→ PRIME is a general method. It can be integrated with various reinforcement learning algorithms.
→ Using implicit process rewards as a reward model is more effective than using them as a value model in this context.
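
To make the online PRM update concrete, here is a minimal sketch of the cross-entropy objective typically used for implicit PRMs, trained directly on outcome labels of policy rollouts. The function name, tensor shapes, and `beta` value are assumptions; the paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def prm_cross_entropy_loss(prm_logp_sum, ref_logp_sum, outcome_labels, beta=0.05):
    # prm_logp_sum / ref_logp_sum: [K] summed log-probs of each rollout's response
    # under the PRM and the frozen reference model.
    # outcome_labels: [K] binary correctness labels from an outcome verifier.
    implicit_outcome_reward = beta * (prm_logp_sum - ref_logp_sum)     # [K]
    # Cross-entropy between sigmoid(implicit reward) and the outcome label
    return F.binary_cross_entropy_with_logits(
        implicit_outcome_reward, outcome_labels.float()
    )
```

Because only outcome labels enter this loss, the PRM can be refreshed on every batch of fresh rollouts, which is what keeps it aligned with the shifting policy distribution.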
-----
Results 📊:
→ Achieves a 15.1% average performance improvement across key reasoning benchmarks compared to the SFT model.
→ Eurus-2-7B-PRIME surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks. This is achieved with only 10% of Qwen-Math's training data.
→ Demonstrates a 2.5× gain in sample efficiency and a 6.9% improvement in final reward compared to outcome-reward-only RLOO (REINFORCE Leave-One-Out).
→ Eurus-2-7B-PRIME achieves a 26.7% pass@1 on the AIME 2024 benchmark.