"Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06781
The challenge in mathematical reasoning with LLMs lies in effectively training them with sparse outcome-based rewards. This paper addresses how to optimize LLMs for complex math problems using only binary correctness feedback.
This paper introduces OREAL, an Outcome REwArd-based reinforcement Learning framework. OREAL uses behavior cloning on best-of-N sampled positive trajectories and reward shaping for negative samples. It also incorporates a token-level reward model to handle sparse rewards in long reasoning chains.
-----
📌 OREAL effectively uses binary outcome rewards in Reinforcement Learning. Best-of-N sampling with reward shaping overcomes sparse feedback issues for complex mathematical reasoning in LLMs.
📌 Token-level reward model in OREAL provides lightweight, step-wise credit assignment. This addresses the challenge of long reasoning chains and sparse rewards, focusing learning on crucial tokens.
📌 OREAL's practical strength is enabling smaller 7B models to match or exceed larger 32B models trained via distillation. This highlights efficient Reinforcement Learning for mathematical reasoning.
----------
Methods Explored in this Paper 🔧:
→ The OREAL framework is proposed for training LLMs on mathematical reasoning tasks using only outcome rewards.
→ Behavior cloning on Best-of-N (BoN) sampled positive trajectories provides the positive learning signal. The model learns from successful reasoning paths by imitating the best of its own multiple attempts (see the first sketch after this list).
→ Reward shaping is applied to negative samples to keep their gradients consistent with the positive ones. This adjusts the training signal for incorrect solutions so the model is steered away from failure modes without overwhelming the positive signal (see the loss sketch after this list).
→ A token-level reward model is integrated to address sparse rewards in long reasoning chains. It assigns importance weights to individual tokens, focusing learning on the critical steps of a solution (also reflected in the loss sketch below).
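A minimal sketch of the BoN positive-collection step, assuming hypothetical `sampler` (policy generation) and `verifier` (final-answer checker) callables; these names and the Trajectory container are illustrative, not from the paper's released code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float  # binary outcome reward: 1.0 if the final answer is verified correct, else 0.0

def collect_bon_positives(
    sampler: Callable[[str], str],         # hypothetical: draws one response from the current policy
    verifier: Callable[[str, str], bool],  # hypothetical: checks the final answer against ground truth
    prompt: str,
    n: int = 16,
) -> List[Trajectory]:
    """Sample n candidate solutions and keep only the verified-correct ones.

    Behavior cloning then maximizes the likelihood of these positive trajectories,
    i.e. the model imitates its own successful attempts.
    """
    candidates = [sampler(prompt) for _ in range(n)]
    return [
        Trajectory(prompt, response, 1.0)
        for response in candidates
        if verifier(prompt, response)
    ]
```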
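And a schematic of how the three ingredients could combine in a per-trajectory loss: behavior cloning on positives, a down-weighted (reward-shaped) penalty on negatives, and token-level weights from the token reward model. The paper's actual objective is more involved; the shapes, the `shaped_neg_weight` coefficient, and this particular weighting scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def outcome_reward_loss(
    logits: torch.Tensor,            # [T, V] policy logits over the response tokens
    target_ids: torch.Tensor,        # [T]   token ids of the sampled trajectory (long dtype)
    token_weights: torch.Tensor,     # [T]   importance weights from the token-level reward model
    is_positive: bool,               # binary outcome: was the final answer correct?
    shaped_neg_weight: float = 0.1,  # assumed shaping coefficient for negative samples
) -> torch.Tensor:
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [T] log-probs of taken tokens
    weighted = token_weights * token_logp  # token-level credit assignment

    if is_positive:
        # behavior cloning: push up the likelihood of the successful trajectory
        return -weighted.mean()
    # reward shaping on negatives: a scaled term that pushes probability away from
    # failed trajectories while keeping its gradient magnitude comparable to positives
    return shaped_neg_weight * weighted.mean()
```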
-----
Key Insights 💡:
→ Behavior cloning on BoN-sampled positive trajectories is sufficient for KL-regularized optimal policy learning in binary feedback environments. This means focusing on successful examples is key when only final-answer correctness is known (see the KL-regularized objective sketched after this list).
→ Reward reshaping for negative samples is necessary to maintain gradient consistency between positive and negative learning signals. This ensures the model learns from both successes and failures in a balanced way.
→ Token-level reward modeling provides a lightweight credit assignment scheme to handle partial correctness in long reasoning chains. This helps in identifying and reinforcing important reasoning steps within complex solutions.
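For reference, the standard KL-regularized objective this insight rests on, sketched in LaTeX (β is the regularization strength, π_ref the reference policy; this closed-form optimum is a well-known result, not a derivation specific to this paper):

```latex
% KL-regularized RL objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  - \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\;\Longrightarrow\;
\pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big).
```

With a binary reward r ∈ {0, 1}, correct trajectories are up-weighted by a constant factor, so sampling many candidates and imitating only the verified-correct ones (BoN behavior cloning) approximates sampling from π*.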
-----
Results 📊:
→ OREAL-7B achieves 91.0 pass@1 accuracy on MATH-500. This surpasses previous 7B models and even outperforms some 32B models.
→ OREAL-32B achieves 95.0 pass@1 accuracy on MATH-500. This sets a new state-of-the-art result among 32B models, outperforming prior distillation and RL-based approaches.
→ OREAL improves DeepSeek-R1-Distill-Qwen-7B from 92.8 to 94.0 pass@1 accuracy on MATH-500. This demonstrates OREAL's effectiveness even with strong initial models.