Multi-agent collaboration beats single-model performance in solving complex math problems
Mars-PO introduces a multi-agent framework that enhances mathematical reasoning in LLMs by combining the strengths of multiple models. It uses hybrid preference optimization to construct high-quality training samples from diverse agents, addressing the limitations of traditional Direct Preference Optimization methods.
-----
https://arxiv.org/abs/2411.19039
🤔 Original Problem:
→ LLMs struggle with mathematical reasoning tasks due to errors and inconsistencies in multi-step problem-solving
→ Traditional Direct Preference Optimization (DPO) methods show limited improvement or even degrade performance on math tasks
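For context, vanilla DPO fine-tunes a policy π_θ against a frozen reference π_ref on pairs of chosen (y_w) and rejected (y_l) responses by minimizing the standard objective below; Mars-PO's changes center on how those pairs are constructed, as the solution section describes:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$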
-----
🔧 Solution in this Paper:
→ Mars-PO operates in three stages: response generation from multiple agents, preference pair construction, and hybrid optimization (a minimal sketch follows this list)
→ Multiple agents generate diverse solutions for each math problem
→ A reward model evaluates and combines the best solutions into a hybrid positive sample set
→ Agent-specific negative samples are paired with shared positive samples for targeted improvement
→ Iterative training process progressively refines each agent's capabilities
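A minimal sketch of the pair-construction stages, assuming hypothetical `generate` (an agent's sampling routine) and `score` (the reward model) callables; the real pipeline runs this over the full math training set:

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # drawn from the shared hybrid positive set
    rejected: str  # this agent's own low-reward response


def build_pairs(prompt, agents, generate, score, samples_per_agent=4, top_k=2):
    """Construct Mars-PO-style preference pairs for one math problem."""
    # Stage 1: every agent samples several candidate solutions.
    responses = {a: [generate(a, prompt) for _ in range(samples_per_agent)]
                 for a in agents}

    # Stage 2a: pool all candidates and keep the highest-reward ones as the
    # shared hybrid positive set (combining the strengths of all agents).
    pooled = [r for rs in responses.values() for r in rs]
    positives = sorted(pooled, key=lambda r: score(prompt, r), reverse=True)[:top_k]

    # Stage 2b: pair each agent's own lowest-reward response (its weakness)
    # with the shared positives, yielding agent-specific preference pairs.
    pairs = {}
    for agent in agents:
        worst = min(responses[agent], key=lambda r: score(prompt, r))
        pairs[agent] = [PreferencePair(prompt, pos, worst) for pos in positives]
    return pairs
```

Stage 3 (hybrid optimization) then runs preference-optimization updates on these pairs, which the loop sketched in the next section illustrates.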
-----
💡 Key Insights:
→ Combining the strengths of multiple agents produces better training data than single-agent approaches
→ Agent-specific negative samples help address individual weaknesses while maintaining shared knowledge
→ Iterative training with temperature adjustment enhances model robustness (a rough loop is sketched below)
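A rough sketch of the iterative-refinement loop, reusing `build_pairs` from the sketch above and assuming a hypothetical `dpo_update()` helper that applies one preference-optimization round to an agent. The temperature schedule is illustrative, not the paper's exact setting; only the overall loop structure is taken from the summary:

```python
def mars_po_iterate(agents, problems, generate, score, dpo_update,
                    rounds=3, start_temp=0.7, temp_step=0.1):
    """Iteratively refine each agent on freshly built hybrid preference pairs."""
    temperature = start_temp
    for _ in range(rounds):
        for prompt in problems:
            # Re-sample with the current temperature so later rounds still
            # explore alternative solution paths.
            pairs = build_pairs(
                prompt, agents,
                generate=lambda a, p: generate(a, p, temperature),
                score=score,
            )
            # Hybrid optimization: shared positives vs. each agent's own negatives.
            for agent, agent_pairs in pairs.items():
                dpo_update(agent, agent_pairs)
        temperature += temp_step  # adjust sampling diversity between rounds
    return agents
```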
-----
📊 Results:
→ Improved Llama3.1-8B-Instruct accuracy on MATH benchmark from 50.38% to 57.82%
→ Consistently outperformed vanilla DPO and supervised fine-tuning across all models
→ Achieved performance gains of up to 8% on mathematical reasoning tasks