Two LLMs collaborate to solve math problems step-by-step, learning from each other's mistakes
📚 https://arxiv.org/abs/2410.22304
🎯 Original Problem:
Mathematical reasoning remains a significant challenge for LLMs. While datasets with questions and answers exist, generating detailed, accurate reasoning steps is difficult. Human-annotated steps are often too concise or disorganized for effective training.
-----
🔧 Solution in this Paper:
→ Introduces Flow-DPO, a multi-agent framework in which two LLMs work together (sketched in code after this list):
- Answer LLM: Generates the next small chunk of the solution
- Stop LLM: Decides whether the partial solution is complete
→ Uses online Direct Preference Optimization (DPO) with random rollouts:
- Generates alternative answer paths at each node
- Forms DPO training pairs when rollouts from the same node lead to different final-answer correctness
- Updates models in real-time as new data arrives
→ Allows flexible chunk sizes instead of predefined reasoning steps
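
Below is a minimal Python sketch of how the chunk-by-chunk flow and the rollout-based pair construction could fit together, based on the bullets above. The interfaces (AnswerLLM, StopLLM, is_correct) and helper names (rollout, collect_dpo_pairs) are illustrative assumptions, not the paper's actual code.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the paper's two agents:
AnswerLLM = Callable[[str, str], str]   # (question, partial solution) -> next chunk
StopLLM = Callable[[str, str], bool]    # (question, partial solution) -> answer complete?


def rollout(question: str, start: str, answer_llm: AnswerLLM, stop_llm: StopLLM,
            max_chunks: int = 16) -> Tuple[str, List[str]]:
    """Alternate Answer LLM chunks with Stop LLM checks until the solution is complete.
    Returns the full trace and the partial solutions (nodes) visited along the way."""
    partial, nodes = start, []
    for _ in range(max_chunks):
        if partial and stop_llm(question, partial):
            break
        nodes.append(partial)
        partial += answer_llm(question, partial)   # small chunk, not a full solution
    return partial, nodes


def collect_dpo_pairs(question: str, gold_answer: str,
                      answer_llm: AnswerLLM, stop_llm: StopLLM,
                      is_correct: Callable[[str, str], bool],
                      n_rollouts: int = 4) -> List[Tuple[str, str, str]]:
    """Random rollouts at each node; pair alternative chunks whose completions
    disagree on final-answer correctness (chosen = leads to a correct answer)."""
    pairs = []  # (context, chosen_chunk, rejected_chunk)
    _, nodes = rollout(question, "", answer_llm, stop_llm)
    for partial in nodes:
        good, bad = [], []
        for _ in range(n_rollouts):
            chunk = answer_llm(question, partial)   # alternative next chunk at this node
            full, _ = rollout(question, partial + chunk, answer_llm, stop_llm)
            (good if is_correct(full, gold_answer) else bad).append(chunk)
        pairs.extend((question + partial, g, b) for g, b in zip(good, bad))
    return pairs


# In the online setting, each batch of pairs would be used to update the Answer LLM
# with a standard DPO loss before moving on to the next question.
```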
-----
💡 Key Insights:
→ Multi-agent collaboration beats single-model inference
→ Real-time learning with dense rewards improves performance
→ Incremental verification works better than final-answer checking
→ Compatible with other enhancement techniques
-----
📊 Results:
→ Llama-3-8B-Instruct: 20% accuracy improvement within 2000 training instances
→ Phi-3-medium: Improved from 79% to 83% accuracy
→ Flow-generated traces outperformed both ground truth and self-generated traces on GSM8K and MATH benchmarks