LLMs struggle to combine learned knowledge for novel math problems, unlike humans.
📚 https://arxiv.org/abs/2405.06680
Original Problem 🔍:
LLMs struggle with systematic compositionality in mathematical reasoning, despite impressive performance on complex tasks. This paper investigates their ability to combine learned knowledge components to solve novel problems.
-----
Solution in this Paper 💡:
• Constructs MATHTRAP dataset by adding logical traps to MATH/GSM8K problems
• Traps require combining math knowledge with trap-related knowledge
• Evaluates LLMs on original, trap, and conceptual variants of each problem (a minimal sketch of this loop follows the list)
• Explores interventions: prompts, few-shot demos, fine-tuning
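A minimal sketch of the three-way evaluation described above. The `query_model` callable and the dataset schema are illustrative assumptions for this summary, not the paper's released code:

```python
def is_correct(model_answer: str, reference: str) -> bool:
    # Naive exact-match grading; the paper's actual grading may differ.
    return model_answer.strip() == reference.strip()

def evaluate(problems, query_model):
    # Score one model on the original, trap, and conceptual variant
    # of every problem, returning per-variant accuracy.
    correct = {"original": 0, "trap": 0, "conceptual": 0}
    for p in problems:
        for variant in correct:
            answer = query_model(p[variant]["question"])
            correct[variant] += is_correct(answer, p[variant]["answer"])
    n = len(problems)
    return {k: v / n for k, v in correct.items()}
```

Comparing the three per-variant accuracies is what exposes the compositional gap: a model can score well on the original and conceptual splits while still failing the traps.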
-----
Key Insights from this Paper 💡:
• LLMs fail to spontaneously combine knowledge to solve trap problems
• Stark performance gap between humans and LLMs on compositional tasks
• External interventions can improve LLM performance on trap problems
• Compositional generalization remains a key challenge for LLMs
-----
Results 📊:
• Closed-source LLMs: >70% accuracy on conceptual problems, but <50% accuracy ratio on trap problems (ratio metric sketched after this list)
• Open-source LLMs: ~40% accuracy on conceptual/original problems, <20% accuracy ratio on traps
• Humans: 83.8% on trap problems without being alerted to the traps, 95.1% when alerted
• Interventions helped, e.g., 5-shot demonstrations boosted GPT-3.5 from 7.6% to 23.9% on trap problems
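The "accuracy ratio" above plausibly means trap-problem accuracy normalized by original-problem accuracy; a minimal sketch under that assumption (the function name is illustrative):

```python
def accuracy_ratio(trap_acc: float, original_acc: float) -> float:
    # Trap accuracy as a fraction of original-problem accuracy
    # (assumed definition of the "accuracy ratio" reported above).
    return trap_acc / original_acc if original_acc else 0.0

# Example: 30% on traps vs. 80% on originals -> ratio 0.375 (37.5%)
print(accuracy_ratio(0.30, 0.80))  # 0.375
```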