
"Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning"

This podcast was generated with Google's Illuminate.

R3: Reverse Curriculum Reinforcement Learning provides efficient, step-wise learning for LLM reasoning without process annotations.

📚 https://arxiv.org/pdf/2402.05808

Original Problem 🧠:

Existing RL methods for LLM reasoning face a trade-off between sparse rewards and high annotation costs: outcome supervision provides only a single sparse reward for the final answer, while process supervision requires expensive step-wise human annotations.
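
To make the contrast concrete, here is a minimal Python sketch (not from the paper) of the two reward schemes. The `gold` answer and `step_labels` are hypothetical stand-ins for an answer key and human step annotations.

```python
def outcome_rewards(steps: list[str], final_answer: str, gold: str) -> list[float]:
    """Outcome supervision: a single sparse reward at the last step only."""
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if final_answer == gold else -1.0  # only the end is scored
    return rewards

def process_rewards(steps: list[str], step_labels: list[bool]) -> list[float]:
    """Process supervision: every step needs a human-provided correctness label."""
    return [1.0 if ok else -1.0 for ok in step_labels]  # dense signal, costly to annotate
```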

-----

Solution in this Paper 💡:

• Uses only outcome supervision yet achieves the benefits of process supervision

• Starts exploration from the end state of a demonstration and gradually moves the start point backwards (see the sketch after this list)

• Creates a curriculum of increasing difficulty, enabling step-wise learning

• Mixes start states of varying difficulties to improve generalization
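
A minimal Python sketch of the reverse-curriculum idea under assumed interfaces; it is not the paper's code. A demonstration is treated as a list of gold reasoning steps, and `start_states` and `sample_batch` are hypothetical helpers.

```python
import random

def start_states(question: str, demo_steps: list[str], stage: int) -> str:
    """Prompt = question + the first (len(demo_steps) - stage) gold steps.

    stage=1 keeps everything except the last step (easiest case: finish one step);
    stage=len(demo_steps) keeps nothing, i.e. reason from scratch (hardest case).
    """
    kept = demo_steps[: len(demo_steps) - stage]
    return question + "\n" + "\n".join(kept)

def sample_batch(dataset, stage: int, mix_earlier: float = 0.3):
    """Mix the current stage with easier (earlier) stages to improve generalization."""
    batch = []
    for question, demo_steps, answer in dataset:
        s = stage
        if stage > 1 and random.random() < mix_earlier:
            s = random.randint(1, stage - 1)  # occasionally re-sample an easier start state
        s = min(s, len(demo_steps))           # demonstrations can be shorter than the stage
        batch.append((start_states(question, demo_steps, s), answer))
    return batch

# Each stage is then optimized with standard outcome-reward RL (e.g. PPO):
# the reward is 1 only if the generated completion reaches the correct final answer.
```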

-----

Key Insights from this Paper 💡:

• Achieves process-supervision-like benefits using only outcome supervision

• Enables more efficient exploration by shortening reasoning chains

• Provides stable and significant optimization gains across various reasoning tasks

• Outperforms SFT and RL baselines on both natural language and program-based reasoning

-----

Results 📊:

• Outperforms SFT by 5.4 points and RL by 4.1 points on average across 8 reasoning tasks

• On GSM8K math reasoning:

- CoT (natural-language reasoning): 50.5% accuracy (vs. 44.7% RL baseline)

- P-CoT (program-based reasoning) with CodeLlama-7B: 74.2% accuracy (vs. 70.7% RL baseline)

• Matches the performance of larger models and GPT-3.5 on GSM8K
