
"GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"

Generated this podcast with Google's Illuminate.

Large Language Models don't reason, says an Apple paper. 🤔

The paper reveals that LLMs lack robust mathematical reasoning, relying on pattern matching rather than genuine conceptual understanding.

Until now, LLMs have shown impressive performance on grade-school math benchmarks like GSM8K, but it has remained unclear whether they truly possess mathematical reasoning abilities or whether the reported metrics are reliable.

📚 https://arxiv.org/pdf/2410.05229

Solution in this Paper 🔬:

• Introduces the GSM-Symbolic benchmark, which uses templates to generate diverse question variants (see the sketch after this list)

• Allows evaluating LLM performance as a distribution across different instantiations

• Examines the impact of changing names vs. numbers, increasing difficulty, and adding irrelevant info
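A minimal Python sketch of the template idea (the question, names, and number ranges here are invented for illustration; the paper's actual templates also include conditions that keep sampled values consistent):

```python
import random

# Sketch only, not the paper's code: a GSM8K-style question becomes a
# template with name and number slots, and each sample yields a
# logically equivalent variant with a freshly computed ground truth.

TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one variant and return (question, ground-truth answer)."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])  # assumed name pool
    x, y = rng.randint(2, 50), rng.randint(2, 50)         # assumed value range
    return TEMPLATE.format(name=name, x=x, y=y), x + y

# 100 distinct instantiations of the same underlying question:
variants = [instantiate(seed) for seed in range(100)]
print(variants[0][0])
```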

Key Insights from this Paper 💡:

• LLMs show high variance in performance across different variants of the same question (see the evaluation sketch after this list)

• They are more sensitive to changes in numbers than to changes in names

• Performance degrades and variance increases with more complex questions

• Adding irrelevant but plausible info causes major drops in accuracy (up to 65%)

• Results suggest LLMs lack true mathematical reasoning, rely on pattern matching
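A rough sketch of the distribution-based evaluation, reusing `instantiate` from the sketch above. `fake_model` is a stand-in for a real LLM call, and its 80% hit rate is an arbitrary assumption:

```python
import random
from statistics import mean, stdev

def fake_model(question: str, answer: int, rng: random.Random) -> int:
    # Placeholder for the LLM under test: right ~80% of the time.
    return answer if rng.random() < 0.8 else answer + 1

def accuracy_over_variants(seed: int, n: int = 50) -> float:
    """Accuracy on one set of n freshly sampled instantiations."""
    rng = random.Random(seed)
    qa = [instantiate(rng.randrange(10**6)) for _ in range(n)]
    return sum(fake_model(q, a, rng) == a for q, a in qa) / n

# Repeat the evaluation on many instantiation sets and report the
# spread, rather than quoting a single accuracy number.
accs = [accuracy_over_variants(s) for s in range(20)]
print(f"accuracy: {mean(accs):.1%} ± {stdev(accs):.1%}")
```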

Results 📊:

• Performance on GSM-Symbolic is lower than on GSM8K for most models (e.g., Gemma2-9b drops from 87% on GSM8K to 79.1% on GSM-Symbolic)

• Changing only numbers drops performance more than changing only names

• Accuracy decreases as question difficulty increases (e.g., 84.4% → 79.1% → 68.1% → 41.8% for Gemma2-9b across the GSM-M1, GSM-Symbolic, GSM-P1, and GSM-P2 difficulty levels)

• The GSM-NoOp dataset, which adds irrelevant info, causes 20-65% accuracy drops across all models (see the illustration below)
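An illustration of the GSM-NoOp construction (the wording is loosely modeled on the paper's kiwi example, not an actual dataset sample): a plausible but irrelevant clause is appended, and the ground-truth answer must stay unchanged.

```python
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")

noop = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "Five of the kiwis are a bit smaller than average. "
        "How many kiwis does Oliver have?")

# Both questions have the answer 102. The paper reports that models
# frequently act on the distractor (e.g., subtracting the 5), which is
# what the 20-65% accuracy drops quantify.
assert 44 + 58 == 102
```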
