ReFT: Leveraging reinforcement learning to expand LLMs' mathematical problem-solving capabilities.
Improves over Supervised Fine-Tuning (SFT) on Chain-of-Thought (CoT) data.
📚 https://arxiv.org/pdf/2401.08967
Original Problem 🔍:
SFT on Chain-of-Thought (CoT) annotations for math problem-solving generalizes poorly because the model learns from only a single annotated reasoning path per question.
-----
Solution in this Paper 🧠:
• Reinforced Fine-Tuning (ReFT) approach:
- Warm-up stage with SFT
- Online reinforcement learning using PPO
- Samples multiple reasoning paths
- Rewards derived automatically from ground-truth answers (reward sketch after this list)
• Applies to both natural language and program-based CoTs
• Compatible with majority voting and reward model reranking
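A minimal sketch of how the terminal reward might be computed from a sampled CoT and the ground-truth answer. The "The answer is" extraction pattern and the 0.1 partial-reward value are illustrative assumptions for this sketch, not a verbatim reproduction of the paper's implementation; the PPO update itself is not shown.

```python
import re

def reft_reward(sampled_cot: str, ground_truth: str) -> float:
    """Terminal reward for a sampled chain-of-thought: full reward if the
    extracted final answer matches the ground truth, a small partial reward
    if an answer is extractable but wrong, zero otherwise."""
    # Assumes N-CoT outputs end with "The answer is <value>" (illustrative).
    match = re.search(r"The answer is\s*(-?[\d.,]+)", sampled_cot)
    if match is None:
        return 0.0   # no parsable final answer
    predicted = match.group(1).replace(",", "").rstrip(".")
    if predicted == ground_truth.replace(",", ""):
        return 1.0   # correct final answer
    return 0.1       # parsable but incorrect (assumed partial-reward value)

# Hypothetical use in the online RL stage: sample several CoTs per question,
# score each against the ground-truth answer, then feed rewards to PPO (not shown).
ground_truth_answer = "72"
samples = [
    "48 / 2 = 24 clips in May. 48 + 24 = 72. The answer is 72",
    "48 * 2 = 96. The answer is 96",
]
print([reft_reward(cot, ground_truth_answer) for cot in samples])  # [1.0, 0.1]
```

Because the reward comes directly from the dataset's final answers, no separate reward model has to be trained.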
-----
Key Insights from this Paper 💡:
• ReFT learns from multiple CoT paths, improving generalization
• No need for extra training data or reward models
• Outperforms SFT and self-training baselines
• Effective on small models and various datasets
• A KL penalty against the warm-up (SFT) policy is crucial for policy stability (sketch after this list)
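A minimal sketch of how that KL term can enter the PPO stage, assuming the common RLHF-style shaping where a per-token KL penalty against the frozen warm-up policy is combined with the answer reward on the final token; the beta value and the log-prob-difference approximation of KL are illustrative assumptions.

```python
import torch

def kl_penalized_rewards(logprobs_policy: torch.Tensor,
                         logprobs_ref: torch.Tensor,
                         terminal_reward: float,
                         beta: float = 0.01) -> torch.Tensor:
    """Per-token reward signal for the PPO stage: a KL penalty against the
    frozen warm-up (SFT) policy at every token, plus the ground-truth answer
    reward on the last generated token."""
    # Per-token KL is approximated by the log-prob gap of the sampled tokens.
    kl = logprobs_policy - logprobs_ref            # shape: (seq_len,)
    rewards = -beta * kl                           # discourage drifting from the SFT policy
    rewards[-1] = rewards[-1] + terminal_reward    # answer reward at the final token
    return rewards

# Example with dummy log-probs for a 3-token continuation.
lp_pi  = torch.tensor([-1.2, -0.8, -2.0])
lp_ref = torch.tensor([-1.0, -0.9, -1.5])
print(kl_penalized_rewards(lp_pi, lp_ref, terminal_reward=1.0))
```

The penalty keeps the sampled reasoning paths close to the warm-up policy, which prevents the sparse answer reward from collapsing generation quality.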
-----
Results 📊:
• ReFT outperforms SFT on GSM8K, SVAMP, and MathQA datasets
• 9-point improvement on GSM8K N-CoT with CodeLLAMA
• 3.7-point average improvement on N-CoT across datasets
• 5.9-point average improvement on P-CoT across datasets
• 79.3% accuracy on GSM8K with CodeLLAMA + ReFT + Reranking (P-CoT)
• Surpasses GPT-3.5-turbo performance with only 7B parameters