Two LLMs collaborate to solve math problems step-by-step, learning from each other's mistakes
📚 https://arxiv.org/abs/2410.22304
🎯 Original Problem:
Mathematical reasoning remains a significant challenge for LLMs. While datasets with questions and answers exist, generating detailed, accurate reasoning steps is difficult. Human-annotated steps are often too concise or disorganized for effective training.
-----
🔧 Solution in this Paper:
→ Introduces Flow-DPO, a multi-agent framework in which two LLMs work together (sketched in code after this list):
- Answer LLM: Generates the next small chunk of the solution
- Stop LLM: Decides whether the partial solution is complete
→ Uses online Direct Preference Optimization (DPO) with random rollouts:
- Generates alternative answer paths at each node
- Forms DPO training pairs when rollouts from the same node lead to different final-answer correctness
- Updates models in real-time as new data arrives
→ Allows flexible chunk sizes instead of predefined reasoning steps
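
Below is a minimal Python sketch of how the chunk-by-chunk flow and the rollout-based pair construction could fit together, based on the bullets above. The interfaces (AnswerLLM, StopLLM, is_correct) and helper names (rollout, collect_dpo_pairs) are illustrative assumptions, not the paper's actual code.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the paper's two agents:
AnswerLLM = Callable[[str, str], str]   # (question, partial solution) -> next chunk
StopLLM = Callable[[str, str], bool]    # (question, partial solution) -> answer complete?


def rollout(question: str, start: str, answer_llm: AnswerLLM, stop_llm: StopLLM,
            max_chunks: int = 16) -> Tuple[str, List[str]]:
    """Alternate Answer LLM chunks with Stop LLM checks until the solution is complete.
    Returns the full trace and the partial solutions (nodes) visited along the way."""
    partial, nodes = start, []
    for _ in range(max_chunks):
        if partial and stop_llm(question, partial):
            break
        nodes.append(partial)
        partial += answer_llm(question, partial)   # small chunk, not a full solution
    return partial, nodes


def collect_dpo_pairs(question: str, gold_answer: str,
                      answer_llm: AnswerLLM, stop_llm: StopLLM,
                      is_correct: Callable[[str, str], bool],
                      n_rollouts: int = 4) -> List[Tuple[str, str, str]]:
    """Random rollouts at each node; pair alternative chunks whose completions
    disagree on final-answer correctness (chosen = leads to a correct answer)."""
    pairs = []  # (context, chosen_chunk, rejected_chunk)
    _, nodes = rollout(question, "", answer_llm, stop_llm)
    for partial in nodes:
        good, bad = [], []
        for _ in range(n_rollouts):
            chunk = answer_llm(question, partial)   # alternative next chunk at this node
            full, _ = rollout(question, partial + chunk, answer_llm, stop_llm)
            (good if is_correct(full, gold_answer) else bad).append(chunk)
        pairs.extend((question + partial, g, b) for g, b in zip(good, bad))
    return pairs


# In the online setting, each batch of pairs would be used to update the Answer LLM
# with a standard DPO loss before moving on to the next question.
```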
-----
💡 Key Insights:
→ Multi-agent collaboration beats single-model inference
→ Real-time learning with dense rewards improves performance
→ Incremental verification works better than final-answer checking
→ Compatible with other enhancement techniques
-----
📊 Results:
→ Llama-3-8B-Instruct: 20% accuracy improvement within 2000 training instances
→ Phi-3-medium: Improved from 79% to 83% accuracy
→ Flow-generated traces outperformed both ground truth and self-generated traces on GSM8K and MATH benchmarks