Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning

This podcast was generated with Google's Illuminate.

Self-training LLMs master complex math by studying both their successful and failed solution paths.

AlphaLLM-CPL achieves significant performance gains on complex reasoning tasks.

This paper enhances LLM reasoning through Monte Carlo Tree Search (MCTS) behavior distillation and curriculum preference learning.

📚 https://arxiv.org/abs/2410.06508

Original Problem 🔍:

LLMs struggle with complex reasoning tasks such as mathematical problem-solving, and existing self-improvement methods do not fully leverage the rich trajectory information produced by Monte Carlo Tree Search (MCTS).

-----

Solution in this Paper 🧠:

AlphaLLM-CPL, a novel pairwise training framework for LLM self-improvement through MCTS behavior distillation:

• Constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree (sketched in code after this list)

• Introduces curriculum preference learning to dynamically adjust training sequence of trajectory pairs

• Prioritizes critical learning steps and mitigates overfitting
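
A minimal sketch of the pairing step, assuming a simple tree node carrying an MCTS value estimate; the `Node` class, `q_value` field, and `min_gap` threshold are illustrative choices, not the paper's code:

```python
# Sketch (not the authors' implementation): build stepwise preference pairs
# from sibling nodes in an MCTS search tree. Siblings share the same parent
# context, so comparing them isolates the quality of a single reasoning step.
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Node:
    state: str                                  # prompt plus reasoning steps so far
    q_value: float                              # value estimate accumulated during MCTS
    children: list = field(default_factory=list)

def stepwise_pairs(root, min_gap=0.1):
    """Collect (context, preferred step, rejected step) triples from sibling nodes."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        for a, b in combinations(node.children, 2):
            better, worse = (a, b) if a.q_value >= b.q_value else (b, a)
            if better.q_value - worse.q_value >= min_gap:  # keep only informative pairs
                pairs.append((node.state, better.state, worse.state))
        stack.extend(node.children)
    return pairs
```

The resulting triples can then serve as step-level preference data for a pairwise (DPO-style) training objective, which is the role the paper assigns to MCTS behavior distillation.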

-----

Key Insights from this Paper 💡:

• Stepwise trajectory pairs provide crucial step-level information for effective MCTS behavior distillation

• Combining the preference reward gap and the policy prediction gap in the curriculum learning metric improves performance (see the sketch after this list)

• AlphaLLM-CPL continues improving model performance even after multiple epochs of offline training
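
A minimal sketch of one way such a curriculum metric could order the trajectory pairs; the mixing weight `alpha`, the sign convention, and the `policy_logprob` helper are assumptions for illustration, not the paper's exact formulation:

```python
# Sketch: rank stepwise trajectory pairs for curriculum preference learning.
# reward_gap comes from MCTS value estimates; the policy prediction gap is the
# current model's log-probability margin between the preferred and rejected steps.
def curriculum_order(pairs, policy_logprob, alpha=0.5):
    """Sort pairs so the pairs treated as most critical are trained on first."""
    scored = []
    for prompt, chosen, rejected, reward_gap in pairs:
        # How strongly the current policy already prefers the chosen step
        # (a small or negative margin means there is more to learn).
        pred_gap = policy_logprob(prompt, chosen) - policy_logprob(prompt, rejected)
        # Combined metric: large reward gap but small policy margin => critical pair.
        score = alpha * reward_gap - (1 - alpha) * pred_gap
        scored.append((score, prompt, chosen, rejected))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(p, c, r) for _, p, c, r in scored]
```

Re-scoring and re-ordering the pairs at each offline epoch is what lets the training sequence adapt as the policy improves, which is consistent with the continued gains across epochs noted above.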

-----

Results 📊:

• LLaMA-2 7B: GSM8K score improved from 14.6 to 36.5 (150% increase)

• Mistral 7B: GSM8K score improved from 38.5 to 57.3 (48.8% increase)

• LLaMA3-8B-instruct: MATH score improved from 28.2 to 33.1 (17.4% increase)

• Consistently outperformed other MCTS behavior distillation methods
