
"Mars-PO: Multi-Agent Reasoning System Preference Optimization"

The podcast on this paper is generated with Google's Illuminate.

Multi-agent collaboration beats single-model performance in solving complex math problems

Mars-PO introduces a multi-agent framework that enhances mathematical reasoning in LLMs by combining the strengths of multiple models. It uses hybrid preference optimization to construct high-quality training samples from diverse agents, addressing the limitations of traditional Direct Preference Optimization (DPO) methods.

-----

https://arxiv.org/abs/2411.19039

🤔 Original Problem:

→ LLMs struggle with mathematical reasoning tasks due to errors and inconsistencies in multi-step problem-solving

→ Traditional Direct Preference Optimization (DPO) methods show limited improvement or even degrade performance on math tasks
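
For background, the standard DPO objective (from the original DPO paper, not quoted from Mars-PO) trains the policy to prefer a chosen response y_w over a rejected response y_l relative to a frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      \;-\;
      \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

Mars-PO keeps this objective but changes how the (y_w, y_l) pairs are constructed, as described below.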

-----

🔧 Solution in this Paper:

→ Mars-PO operates in three stages: response generation from multiple agents, preference pair construction, and hybrid optimization (sketched in code after this list)

→ Multiple agents generate diverse solutions for each math problem

→ A reward model evaluates and combines the best solutions into a hybrid positive sample set

→ Agent-specific negative samples are paired with shared positive samples for targeted improvement

→ Iterative training process progressively refines each agent's capabilities
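
To make these stages concrete, here is a minimal Python sketch under stated assumptions: `sample_solutions` and `reward_score` stand in for an LLM sampling call and a trained reward model, and all names are hypothetical placeholders rather than the authors' implementation.

```python
"""Minimal sketch of the Mars-PO pair construction (hypothetical names, not the authors' code).

Stage 1: each agent samples several candidate solutions per problem.
Stage 2: a reward model scores all candidates; the best ones pooled across agents
         form the shared hybrid positive set, while each agent's weakest candidates
         become its own negative samples.
Stage 3: the resulting (problem, positive, negative) triples feed a DPO-style update.
"""
import random

def sample_solutions(agent, problem, k=4, temperature=0.8):
    # Stand-in for an LLM sampling call; returns k candidate solutions.
    return [f"{agent}: candidate {i} for {problem!r} (T={temperature})" for i in range(k)]

def reward_score(problem, solution):
    # Stand-in for a trained reward model; here just a random score in [0, 1].
    return random.random()

def build_preference_pairs(problems, agents, top_n=2, temperature=0.8):
    pairs = {agent: [] for agent in agents}
    for problem in problems:
        scored = {agent: [(reward_score(problem, s), s)
                          for s in sample_solutions(agent, problem, temperature=temperature)]
                  for agent in agents}
        # Hybrid positive set: best-scoring solutions pooled across ALL agents.
        pooled = sorted((item for cands in scored.values() for item in cands), reverse=True)
        positives = [s for _, s in pooled[:top_n]]
        # Agent-specific negatives: each agent's lowest-scoring candidates.
        for agent, cands in scored.items():
            negatives = [s for _, s in sorted(cands)[:top_n]]
            pairs[agent] += [(problem, pos, neg) for pos in positives for neg in negatives]
    return pairs

if __name__ == "__main__":
    pairs = build_preference_pairs(["What is 17 * 24?"], ["agent_A", "agent_B", "agent_C"])
    for agent, agent_pairs in pairs.items():
        print(agent, "->", len(agent_pairs), "preference pairs")
```

The shared positives give every agent access to the best reasoning found anywhere in the pool, while the per-agent negatives target each model's own failure modes.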

-----

💡 Key Insights:

→ Combining strengths of multiple agents produces better training data than single-agent approaches

→ Agent-specific negative samples help address individual weaknesses while maintaining shared knowledge

→ Iterative training with temperature adjustment enhances model robustness (a minimal loop is sketched after this list)
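
A minimal sketch of that iterative refinement, reusing the hypothetical `build_preference_pairs` from the previous sketch (append this to it to run). `dpo_update` is a placeholder for a real DPO training step, and the temperature schedule shown is an illustrative assumption, not a value from the paper.

```python
def dpo_update(agent, preference_pairs, beta=0.1):
    # Placeholder for one round of DPO training on (problem, chosen, rejected) triples;
    # a real implementation would update and return the agent's model weights.
    print(f"updating {agent} on {len(preference_pairs)} pairs (beta={beta})")
    return agent

def iterative_mars_po(agents, problems, rounds=3, temperature=0.7):
    for round_idx in range(rounds):
        # Rebuild the hybrid positives and agent-specific negatives with the current agents.
        pairs = build_preference_pairs(problems, agents, temperature=temperature)
        agents = [dpo_update(agent, pairs[agent]) for agent in agents]
        # Illustrative temperature adjustment: sample more broadly in later rounds.
        temperature = min(1.0, temperature + 0.1)
        print(f"after round {round_idx}: sampling temperature -> {temperature:.1f}")
    return agents

if __name__ == "__main__":
    iterative_mars_po(["agent_A", "agent_B", "agent_C"], ["What is 17 * 24?"])
```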

-----

📊 Results:

→ Improved Llama3.1-8B-Instruct accuracy on the MATH benchmark from 50.38% to 57.82%

→ Consistently outperformed vanilla DPO and supervised fine-tuning across all models

→ Achieved performance gains up to 8% on mathematical reasoning tasks
