R3: Reverse Curriculum Reinforcement Learning provides efficient, step-wise learning for LLM reasoning without process annotations.
📚 https://arxiv.org/pdf/2402.05808
Original Problem 🧠:
Existing RL methods for LLM reasoning face a trade-off: outcome supervision yields only a sparse reward on the final answer, which makes exploration over long reasoning chains difficult, while process supervision gives dense step-wise rewards but requires expensive step-wise annotations.
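To make the contrast concrete, here is a minimal sketch of the two reward schemes; `step_is_correct` is a placeholder for whatever annotation source process supervision would need (human labels or a trained reward model):

```python
def outcome_reward(final_answer: str, gold: str) -> float:
    # Outcome supervision: one sparse signal at the very end of the chain.
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def process_rewards(steps: list[str], step_is_correct) -> list[float]:
    # Process supervision: a dense signal per step, but each step's label
    # comes from costly step-wise annotation.
    return [1.0 if step_is_correct(s) else 0.0 for s in steps]
```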
-----
Solution in this Paper 💡:
• Uses only outcome supervision to achieve benefits of process supervision
• Starts exploration from the end state of a demonstration and gradually slides the start point backwards toward the original question (see the sketch after this list)
• Creates curriculum of increasing difficulty, enabling step-wise learning
• Mixes start states of varying difficulties to improve generalization
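A minimal sketch of the reverse-curriculum construction, assuming a demonstration already split into reasoning steps; the function and variable names here are illustrative, not from the paper:

```python
import random

def make_curriculum(demo_steps: list[str], num_stages: int) -> list[list[str]]:
    """Build start-state prefixes by truncating a demonstration from the end.

    Stage 1 keeps all but the last step (easiest: the model only has to
    produce the final step); each later stage drops one more step, until
    the model must reason from the bare question (hardest).
    """
    n = len(demo_steps)
    return [demo_steps[: max(n - k, 0)] for k in range(1, num_stages + 1)]

def sample_start_state(question: str, stages: list[list[str]],
                       weights: list[float]) -> str:
    """Sample a start state, mixing stages of varying difficulty."""
    prefix = random.choices(stages, weights=weights, k=1)[0]
    return question + "".join(prefix)

# Toy demonstration with three reasoning steps.
question = "Q: 3 apples cost $6. What do 5 apples cost?\n"
demo_steps = [
    "Step 1: one apple costs 6 / 3 = $2.\n",
    "Step 2: five apples cost 5 * 2 = $10.\n",
    "Answer: $10\n",
]

stages = make_curriculum(demo_steps, num_stages=3)
# Uniform mix here; during training the weights would shift toward harder stages.
print(sample_start_state(question, stages, weights=[1.0, 1.0, 1.0]))
```

During training, the policy would generate a completion from the sampled start state and be updated (e.g., with PPO) using only the outcome reward on its final answer.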
-----
Key Insights from this Paper 💡:
• Achieves process-supervision-like benefits from outcome supervision alone: when rollouts start near the demonstration's end, the sparse final reward effectively grades just the few remaining steps
• Shortens the reasoning chain the model must explore, making exploration far more efficient
• Delivers stable, significant optimization gains across diverse reasoning tasks
• Outperforms SFT and RL baselines on both natural language and program-based reasoning
-----
Results 📊:
• Outperforms SFT by 5.4 points and RL by 4.1 points on average across 8 reasoning tasks
• On GSM8K math reasoning:
  - CoT (natural-language chain-of-thought): 50.5% accuracy vs 44.7% RL baseline
  - P-CoT (program-based chain-of-thought): 74.2% accuracy with CodeLlama-7B vs 70.7% RL baseline
• Matches the performance of larger models and of GPT-3.5 on GSM8K