Multi-agent debate with diverse models beats GPT-4 at math reasoning, using just medium-sized models.
This paper demonstrates that using diverse medium-capacity models in multi-agent debate significantly improves mathematical reasoning, outperforming larger individual models like GPT-4.
-----
https://arxiv.org/abs/2410.12853
🤔 Original Problem:
→ LLMs often produce incorrect responses in mathematical reasoning tasks despite appearing confident.
-----
🔧 Solution in this Paper:
→ The paper implements a multi-agent debate framework using three diverse models (Gemini-Pro, Mixtral 7B, PaLM 2-M).
→ Models engage in structured debate rounds, each providing responses to mathematical problems.
→ A fourth model (Gemini-Pro) summarizes the debate after each round.
→ The framework iteratively refines answers through multiple debate rounds; a minimal sketch of this loop follows.
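
The sketch below illustrates the debate loop described above; it is not the paper's actual implementation. The `ask(model, prompt)` helper, the model identifier strings, and the round count of 3 are illustrative assumptions that would need to be wired to each provider's real API.

```python
# Minimal sketch of a diverse multi-agent debate loop (illustrative, not the
# paper's code). `ask(model, prompt)` is a hypothetical helper that would wrap
# each provider's API.

DEBATERS = ["gemini-pro", "mixtral", "palm-2-m"]  # three diverse medium-capacity models
SUMMARIZER = "gemini-pro"                         # fourth model that summarizes each round
ROUNDS = 3                                        # assumed; the post does not specify a count

def ask(model: str, prompt: str) -> str:
    """Hypothetical stand-in: replace with a real call to the model's API."""
    raise NotImplementedError

def debate(question: str, rounds: int = ROUNDS) -> str:
    summary = ""  # running summary of the debate so far
    answers = {}
    for _ in range(rounds):
        # Each debater answers, conditioned on the question and the prior-round summary.
        for model in DEBATERS:
            prompt = f"Question: {question}\n"
            if summary:
                prompt += f"Summary of the debate so far:\n{summary}\n"
            prompt += "Give your step-by-step solution and final answer."
            answers[model] = ask(model, prompt)
        # The summarizer condenses all responses before the next round.
        joined = "\n\n".join(f"{m}: {a}" for m, a in answers.items())
        summary = ask(SUMMARIZER, f"Summarize these solutions to '{question}':\n{joined}")
    return summary  # consolidated answer after the final round

# Usage (once ask() is implemented):
# answer = debate("Natalia sold clips to 48 of her friends in April, ...")
```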
-----
💡 Key Insights:
→ Diversity of thought in debating models is more important than model size
→ Multi-agent debate helps improve reasoning at any model scale
→ Medium-capacity models in diverse configurations can outperform larger individual models
-----
📊 Results:
→ Diverse medium-capacity models achieved 91% accuracy on the GSM-8K benchmark, surpassing GPT-4
→ Set a new state-of-the-art of 94% on the ASDiv benchmark
→ A homogeneous setup with three Gemini-Pro instances reached only 82% accuracy