When an expert LLM and a deliberately weaker "amateur" LLM disagree, that disagreement itself becomes a reward signal for better reasoning - and with speculative decoding the search now runs 51.9% faster, letting small LLMs reason as deeply as much larger ones.
Lets smaller LLMs match the reasoning capabilities of models 5x their size
With it, LLMs can explore reasoning paths 51.9% faster per node by combining expert-amateur model disagreement with speculative decoding
📚 https://arxiv.org/abs/2410.01707
Original Problem 🎯:
MCTS reasoning in LLMs faces three key challenges: slow speed compared to Chain of Thought (CoT), dependency on complex reward models requiring multiple LLMs, and limited analysis of MCTS components from an interpretability perspective.
-----
Solution in this Paper 🔧:
• Introduced SC-MCTS* (Speculative Contrastive Monte Carlo Tree Search) with three core components:
- Novel contrastive reward model using expert/amateur model divergence
- Statistical method to combine multiple reward functions
- Speculative decoding integration for 51.9% speed improvement
• Key mechanisms:
- Action-level Jensen-Shannon divergence between expert/amateur models
- Multi-RM method for statistically normalizing and combining rewards that live on different scales (see the sketch after this list)
- Refined UCT strategy with optimized exploration constant
- Enhanced backpropagation favoring steadily improving paths
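The contrastive reward idea is compact enough to sketch. Below is a minimal, illustrative Python sketch (not the authors' code): a candidate reasoning step is scored by the Jensen-Shannon divergence between the expert and amateur models' next-token distributions over that step, and the result is combined with a second reward via z-score normalization in the spirit of the Multi-RM method. The function names (`js_divergence`, `contrastive_reward`, `combine_rewards`) and the running-statistics normalization are assumptions made for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log(a / b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def contrastive_reward(expert_dists, amateur_dists):
    """Action-level contrastive reward (illustrative): mean JS divergence
    between expert and amateur next-token distributions over all tokens of
    one reasoning step (an 'action' in the MCTS tree)."""
    divs = [js_divergence(p, q) for p, q in zip(expert_dists, amateur_dists)]
    return float(np.mean(divs))

def combine_rewards(rewards, stats):
    """Multi-RM style combination (assumed form): z-score-normalize each
    reward against running statistics so signals on different scales can
    be summed into a single node value."""
    total = 0.0
    for name, value in rewards.items():
        mean, std = stats[name]
        total += (value - mean) / max(std, 1e-8)
    return total

# Toy usage: vocabulary of 5 tokens, a 3-token reasoning step.
rng = np.random.default_rng(0)
expert = [rng.dirichlet(np.ones(5)) for _ in range(3)]
amateur = [rng.dirichlet(np.ones(5)) for _ in range(3)]
r_contrast = contrastive_reward(expert, amateur)
r_total = combine_rewards(
    {"contrastive": r_contrast, "likelihood": -2.3},
    stats={"contrastive": (0.1, 0.05), "likelihood": (-3.0, 1.0)},
)
print(r_contrast, r_total)
```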
-----
Key Insights 💡:
• Reward model is the most crucial component affecting MCTS reasoning performance
• Combining multiple rewards requires careful statistical normalization
• Action-level contrastive decoding outperforms token-level approaches
• UCT strategy's effectiveness depends heavily on tuning the exploration constant (illustrated in the sketch below)
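To make the exploration-constant point concrete, here is the standard UCT selection rule (the textbook formula, not the paper's refined variant): each child is scored by its mean reward plus an exploration bonus scaled by a constant c, and changing c shifts the balance between exploiting high-reward reasoning paths and exploring under-visited ones. The node layout and names below are assumptions for illustration.

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.0):
    """Standard UCT: exploitation term (mean reward) plus an exploration
    bonus that shrinks as the child is visited more. The constant c
    controls how aggressively under-visited reasoning steps are explored."""
    if child_visits == 0:
        return float("inf")  # always expand unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.0):
    """Pick the child (candidate reasoning step) with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )

# Toy usage: a small c favors the well-tried step, a larger c the rarely visited one.
children = [
    {"name": "step_A", "value_sum": 4.0, "visits": 10},
    {"name": "step_B", "value_sum": 0.1, "visits": 1},
]
print(select_child(children, parent_visits=11, c=0.1)["name"])  # step_A
print(select_child(children, parent_visits=11, c=2.0)["name"])  # step_B
```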
-----
Results 📊:
• Outperformed OpenAI's o1-mini by 17.4% using Llama-3.1-70B on the Blocksworld dataset
• Achieved 51.9% speed improvement per node using speculative decoding
• Surpassed 4-shot Chain of Thought across both easy and hard modes
• In easy mode, matched the performance of Llama-3.1-405B using only Llama-3.1-70B