
"Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning"

The podcast on this paper was generated with Google's Illuminate.

Two LLMs collaborate to solve math problems step-by-step, learning from each other's mistakes

📚 https://arxiv.org/abs/2410.22304

🎯 Original Problem:

Mathematical reasoning remains a significant challenge for LLMs. While datasets with questions and final answers exist, generating detailed, accurate intermediate reasoning steps is difficult, and human-annotated steps are often too concise or disorganized to serve as effective training data.

-----

🔧 Solution in this Paper:

→ Introduces Flow-DPO, a multi-agent framework in which two LLMs work together:

- Answer LLM: Generates small solution chunks

- Stop LLM: Checks after each chunk whether the final answer has been reached

→ Uses online Direct Preference Optimization (DPO) with random rollouts (sketched in code after this list):

- Generates alternative answer paths at each node

- Forms DPO training pairs when paths lead to different outcomes

- Updates models in real-time as new data arrives

→ Allows flexible chunk sizes instead of predefined reasoning steps
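
A minimal sketch of the loop described above, assuming hypothetical callables `answer_llm(prompt) -> str` and `stop_llm(prompt) -> str` (any sampling-enabled chat-model wrapper) and a checker `is_correct(answer, gold)`. It illustrates the idea only and is not the authors' implementation; treating the last chunk as the final answer is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical text-in, text-out model wrapper

@dataclass
class DPOPair:
    prompt: str    # question plus the partial reasoning trace so far
    chosen: str    # chunk whose continuation reached a correct final answer
    rejected: str  # chunk whose continuation reached an incorrect final answer

def run_flow(question: str, prefix: List[str],
             answer_llm: LLM, stop_llm: LLM, max_chunks: int = 20) -> List[str]:
    """Answer LLM emits small solution chunks; Stop LLM decides when to stop."""
    chunks = list(prefix)
    for _ in range(max_chunks):
        context = question + "\n" + "\n".join(chunks)
        if chunks and stop_llm(context).strip().lower().startswith("yes"):
            break  # Stop LLM judges the answer complete
        chunks.append(answer_llm(context))
    return chunks

def collect_dpo_pairs(question: str, gold: str,
                      answer_llm: LLM, stop_llm: LLM,
                      is_correct: Callable[[str, str], bool]) -> List[DPOPair]:
    """Random rollouts at each node of a trace; outcome disagreement -> DPO pair."""
    trace = run_flow(question, [], answer_llm, stop_llm)
    pairs: List[DPOPair] = []
    for i, chunk in enumerate(trace):
        prefix = trace[:i]
        # Resample an alternative chunk at this node (the "random rollout").
        alt_chunk = answer_llm(question + "\n" + "\n".join(prefix))
        # Roll each candidate out to a final answer (last chunk taken as the
        # answer in this sketch) and check it against the ground truth.
        ok = is_correct(run_flow(question, prefix + [chunk],
                                 answer_llm, stop_llm)[-1], gold)
        alt_ok = is_correct(run_flow(question, prefix + [alt_chunk],
                                     answer_llm, stop_llm)[-1], gold)
        if ok != alt_ok:  # the two paths lead to different outcomes
            chosen, rejected = (chunk, alt_chunk) if ok else (alt_chunk, chunk)
            pairs.append(DPOPair(prompt=question + "\n" + "\n".join(prefix),
                                 chosen=chosen, rejected=rejected))
    return pairs
```

The resulting (prompt, chosen, rejected) pairs would then feed an online DPO update of the models as new data arrives, which is the real-time, dense training signal the paper describes.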

-----

💡 Key Insights:

→ Multi-agent collaboration beats single-model inference

→ Real-time learning with dense rewards improves performance

→ Incremental verification works better than checking only the final answer

→ Compatible with other enhancement techniques

-----

📊 Results:

→ Llama-3-8B-Instruct: 20% accuracy improvement within 2,000 training instances

→ Phi-3-medium: Improved from 79% to 83% accuracy

→ Flow-generated traces outperformed both ground truth and self-generated traces on GSM8K and MATH benchmarks
