Self-training LLMs master complex math by studying their successful and failed solution paths
Achieve significant performance gains in complex reasoning tasks using AlphaLLM-CPL.
The paper enhances LLM reasoning through Monte Carlo Tree Search (MCTS) behavior distillation and curriculum preference learning.
📚 https://arxiv.org/abs/2410.06508
Original Problem 🔍:
LLMs struggle with complex reasoning tasks like mathematical problem-solving, and existing self-improvement methods fail to fully exploit the rich trajectory information produced by Monte Carlo Tree Search (MCTS).
-----
Solution in this Paper 🧠:
AlphaLLM-CPL, a novel pairwise training framework for LLM self-improvement through MCTS behavior distillation:
• Constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree
• Introduces curriculum preference learning to dynamically adjust training sequence of trajectory pairs
• Prioritizes critical learning steps and mitigates overfitting (see the sketch after this list)
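A minimal sketch of the pairing idea, assuming a simple tree node with a `prefix` (partial reasoning trace) and an MCTS `value` estimate; these field names and the traversal are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Node:
    prefix: str                   # reasoning steps generated so far
    value: float                  # MCTS value estimate for this step
    children: list = field(default_factory=list)

def stepwise_pairs(root):
    """Pair up sibling nodes that share a parent: the higher-value child
    becomes the 'chosen' step, the lower-value one the 'rejected' step."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        for a, b in combinations(node.children, 2):
            chosen, rejected = (a, b) if a.value >= b.value else (b, a)
            pairs.append({
                "prompt": node.prefix,        # shared context up to the parent
                "chosen": chosen.prefix,
                "rejected": rejected.prefix,
                "reward_gap": chosen.value - rejected.value,
            })
        stack.extend(node.children)
    return pairs
```

Because both trajectories in a pair diverge at a single step, the resulting preference data carries step-level rather than only outcome-level signal.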
-----
Key Insights from this Paper 💡:
• Stepwise trajectory pairs provide crucial step-level information for effective MCTS behavior distillation
• Combining the preference reward gap and the policy prediction gap in the curriculum learning metric improves performance (see the sketch after this list)
• AlphaLLM-CPL continues improving model performance even after multiple epochs of offline training
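A hedged sketch of how such a curriculum metric could be combined; the weighting `alpha` and the exact form are assumptions, the point is only that pairs with a clear reward preference that the current policy has not yet learned get scheduled earlier:

```python
def curriculum_score(reward_gap, logprob_chosen, logprob_rejected, alpha=0.5):
    """Illustrative curriculum metric: mix the MCTS preference reward gap
    with the policy prediction gap (how strongly the current LLM already
    prefers the chosen step)."""
    policy_gap = logprob_chosen - logprob_rejected
    # High reward gap + low/negative policy gap => high learning value.
    return alpha * reward_gap - (1 - alpha) * policy_gap

# Example: reorder trajectory pairs each epoch, most informative first.
pairs = [
    {"reward_gap": 0.8, "lp_chosen": -3.2, "lp_rejected": -1.1},
    {"reward_gap": 0.3, "lp_chosen": -0.9, "lp_rejected": -4.0},
]
ordered = sorted(
    pairs,
    key=lambda p: curriculum_score(p["reward_gap"], p["lp_chosen"], p["lp_rejected"]),
    reverse=True,
)
```

Re-ranking the pairs every epoch is what lets offline training keep improving the model instead of plateauing or overfitting.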
-----
Results 📊:
• LLaMA-2 7B: GSM8K score improved from 14.6 to 36.5 (150% increase)
• Mistral 7B: GSM8K score improved from 38.5 to 57.3 (48.8% increase)
• LLaMA3-8B-instruct: MATH score improved from 28.2 to 33.1 (17.4% increase)
• Consistently outperformed other MCTS behavior distillation methods