Large Language Models don't reason, says an Apple paper. 🤔
It reveals that LLMs lack robust mathematical reasoning, relying on pattern matching rather than genuine conceptual understanding.
Until now, LLMs have shown impressive performance on grade-school math benchmarks like GSM8K, but it has been unclear whether they truly possess mathematical reasoning abilities or whether the reported metrics are reliable.
📚 https://arxiv.org/pdf/2410.05229
Solution in this Paper 🔬:
• Introduces the GSM-Symbolic benchmark, which uses templates to generate diverse question variants (see the sketch after this list)
• Allows evaluating LLM performance as a distribution across different instantiations
• Examines impact of changing names vs numbers, increasing difficulty, adding irrelevant info
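A minimal sketch of how such a template-based variant generator might look. The template text, names, and number ranges here are illustrative assumptions, not taken from the paper:

```python
import random

# Hypothetical GSM-Symbolic-style template: the name and the numbers are
# placeholders that get re-instantiated to produce question variants.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def instantiate(rng: random.Random) -> tuple[str, int]:
    """Sample a name and numbers; return the question and its ground-truth answer."""
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y)
    return question, x + y  # the answer follows directly from the template

rng = random.Random(0)
for question, answer in (instantiate(rng) for _ in range(3)):
    print(question, "->", answer)
```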
Key Insights from this Paper 💡:
• LLMs show high variance in performance across different variants of the same question (see the sketch after this list)
• More sensitive to changes in numbers than names
• Performance degrades and variance increases with more complex questions
• Adding irrelevant but plausible info causes major drops in accuracy (up to 65%)
• Results suggest LLMs lack true mathematical reasoning, rely on pattern matching
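A minimal sketch of reporting performance as a distribution across benchmark instantiations rather than a single score. The accuracy values are placeholder numbers for illustration, not results from the paper:

```python
import statistics

# Placeholder accuracies: one value per GSM-Symbolic instantiation of the
# benchmark for a single model (illustrative numbers only).
accuracies = [0.79, 0.76, 0.81, 0.74, 0.78, 0.80, 0.75, 0.77]

mean = statistics.mean(accuracies)
std = statistics.stdev(accuracies)
print(f"accuracy: {mean:.1%} ± {std:.1%} "
      f"(min {min(accuracies):.1%}, max {max(accuracies):.1%})")
```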
Results 📊:
• Performance on GSM-Symbolic is lower than on GSM8K for most models (e.g. 87% on GSM8K vs 79.1% on GSM-Symbolic for Gemma2-9b)
• Changing only numbers drops performance more than changing only names
• Accuracy decreases as question difficulty increases (e.g. 84.4% → 79.1% → 68.1% → 41.8% for Gemma2-9b)
• The GSM-NoOp dataset, which adds irrelevant information, causes 20-65% accuracy drops across all models (illustrative example below)
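A minimal sketch of a GSM-NoOp-style perturbation. The wording is an illustrative assumption, not an actual question from the dataset: an irrelevant but plausible clause is inserted, and the correct answer stays the same.

```python
# Base question with a known answer (12 + 7 = 19).
base_facts = "Liam picks 12 apples on Monday and 7 apples on Tuesday."
# Irrelevant but plausible clause: it mentions the apples, yet changes nothing.
noop_clause = "Five of the apples are slightly smaller than average."
question = f"{base_facts} {noop_clause} How many apples does Liam have in total?"
print(question)  # the correct answer is still 19; a pattern-matching model may be tempted to subtract the 5
```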