"Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06781
The challenge in mathematical reasoning with LLMs lies in effectively training them with sparse outcome-based rewards. This paper addresses how to optimize LLMs for complex math problems using only binary correctness feedback.
This paper introduces OREAL, an Outcome REwArd-based reinforcement Learning framework. OREAL uses behavior cloning on best-of-N sampled positive trajectories and reward shaping for negative samples. It also incorporates a token-level reward model to handle sparse rewards in long reasoning chains.
-----
📌 OREAL effectively uses binary outcome rewards in Reinforcement Learning. Best-of-N sampling with reward shaping overcomes sparse feedback issues for complex mathematical reasoning in LLMs.
📌 Token-level reward model in OREAL provides lightweight, step-wise credit assignment. This addresses the challenge of long reasoning chains and sparse rewards, focusing learning on crucial tokens.
📌 OREAL's practical strength is enabling smaller 7B models to match or exceed larger 32B models trained via distillation. This highlights efficient Reinforcement Learning for mathematical reasoning.
----------
Methods Explored in this Paper 🔧:
→ The OREAL framework is proposed for training LLMs on mathematical reasoning tasks using only outcome rewards.
→ Behavior cloning on Best-of-N (BoN) sampled positive trajectories provides the positive learning signal. The model learns from successful reasoning paths by imitating the best of its own multiple attempts (see the first sketch after this list).
→ Reward shaping is applied to negative samples to keep their gradients consistent with the positive ones. This adjusts the training signal for incorrect solutions so the model is steered away from failure modes without overwhelming the positive signal (see the loss sketch after this list).
→ A token-level reward model is integrated to address sparse rewards in long reasoning chains. It assigns importance weights to individual tokens, focusing learning on the critical steps of a solution (also reflected in the loss sketch below).
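A minimal sketch of the BoN positive-collection step, assuming hypothetical `sampler` (policy generation) and `verifier` (final-answer checker) callables; these names and the Trajectory container are illustrative, not from the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # binary outcome reward: 1.0 if the final answer is verified correct, else 0.0

def collect_bon_positives(
    sampler: Callable[[str], str],         # hypothetical: draws one response from the current policy
    verifier: Callable[[str, str], bool],  # hypothetical: checks the final answer against ground truth
    prompt: str,
    n: int = 16,
) -> List[Trajectory]:
    """Sample n candidate solutions and keep only the verified-correct ones.

    Behavior cloning then maximizes the likelihood of these positive trajectories,
    i.e. the model imitates its own successful attempts.
    """
    candidates = [sampler(prompt) for _ in range(n)]
    return [
        Trajectory(prompt, response, 1.0)
        for response in candidates
        if verifier(prompt, response)
    ]
```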
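And a schematic of how the three ingredients could combine in a per-trajectory loss: behavior cloning on positives, a down-weighted (reward-shaped) penalty on negatives, and token-level weights from the token reward model. The paper's actual objective is more involved; the shapes, the `shaped_neg_weight` coefficient, and this particular weighting scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def outcome_reward_loss(
    logits: torch.Tensor,            # [T, V] policy logits over the response tokens
    target_ids: torch.Tensor,        # [T]   token ids of the sampled trajectory (long dtype)
    token_weights: torch.Tensor,     # [T]   importance weights from the token-level reward model
    is_positive: bool,               # binary outcome: was the final answer correct?
    shaped_neg_weight: float = 0.1,  # assumed shaping coefficient for negative samples
) -> torch.Tensor:
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [T] log-probs of taken tokens
    weighted = token_weights * token_logp  # token-level credit assignment

    if is_positive:
        # behavior cloning: push up the likelihood of the successful trajectory
        return -weighted.mean()
    # reward shaping on negatives: a scaled term that pushes probability away from
    # failed trajectories while keeping its gradient magnitude comparable to positives
    return shaped_neg_weight * weighted.mean()
```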
-----
Key Insights 💡:
→ Behavior cloning on BoN-sampled positive trajectories is sufficient for KL-regularized optimal policy learning in binary feedback environments. This means focusing on successful examples is key when only final-answer correctness is known (see the KL-regularized objective sketched after this list).
→ Reward reshaping for negative samples is necessary to maintain gradient consistency between positive and negative learning signals. This ensures the model learns from both successes and failures in a balanced way.
→ Token-level reward modeling provides a lightweight credit assignment scheme to handle partial correctness in long reasoning chains. This helps in identifying and reinforcing important reasoning steps within complex solutions.
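For reference, the standard KL-regularized objective this insight rests on, sketched in LaTeX (β is the regularization strength, π_ref the reference policy; this closed-form optimum is a well-known result, not a derivation specific to this paper):

```latex
% KL-regularized RL objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  - \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\;\Longrightarrow\;
\pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big).
```

With a binary reward r ∈ {0, 1}, correct trajectories are up-weighted by a constant factor, so sampling many candidates and imitating only the verified-correct ones (BoN behavior cloning) approximates sampling from π*.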
-----
Results 📊:
→ OREAL-7B achieves 91.0 pass@1 accuracy on MATH-500. This surpasses previous 7B models and even outperforms some 32B models.
→ OREAL-32B achieves 95.0 pass@1 accuracy on MATH-500. This sets a new state-of-the-art result among 32B models, outperforming prior distillation and RL-based approaches.
→ OREAL improves DeepSeek-R1-Distill-Qwen-7B from 92.8 to 94.0 pass@1 accuracy on MATH-500. This demonstrates OREAL's effectiveness even with strong initial models.