U-MATH challenges LLMs with real university math problems, exposing their current limitations.
U-MATH is a comprehensive university-level mathematical benchmark of 1,100 problems for evaluating LLMs' advanced mathematical reasoning; it includes visual elements and ships with a meta-evaluation framework for assessing how reliably LLMs judge solutions.
-----
https://arxiv.org/abs/2412.03205
🤔 Original Problem:
→ Current mathematical benchmarks for LLMs are limited to elementary/high-school problems, lack diversity, and don't adequately test visual mathematical reasoning.
→ Existing benchmarks are becoming saturated, with GPT-4 scoring over 92% on GSM8K and 80% on MATH.
-----
🔍 Solution in this Paper:
→ U-MATH provides 1,100 unpublished university-level problems from real teaching materials.
→ Problems span 6 core subjects: Precalculus, Algebra, Differential Calculus, Integral Calculus, Multivariable Calculus, and Sequences & Series.
→ 20% of problems incorporate visual elements such as graphs and diagrams (see the loading sketch after this list).
→ A companion μ-MATH meta-evaluation framework assesses how reliably LLMs judge the correctness of mathematical solutions.
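A minimal sketch of how one might load and inspect the benchmark, assuming it is published on Hugging Face under an identifier like "toloka/u-math" with fields such as "subject" and "has_image" — the dataset id, split, and field names here are assumptions, not confirmed API:

```python
# Hypothetical U-MATH inspection: count problems per subject and split
# text-only vs. visual items. Dataset id, split, and field names are assumed.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("toloka/u-math", split="test")  # assumed identifier and split

by_subject = Counter(row["subject"] for row in ds)        # 6 core subjects
visual = [row for row in ds if row.get("has_image")]      # ~20% of problems

print(by_subject)
print(f"visual problems: {len(visual)} / {len(ds)}")
```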
-----
💡 Key Insights:
→ LLMs struggle significantly with university-level mathematics, especially visual problems.
→ Specialized math models outperform general-purpose LLMs on text-only problems.
→ Solution assessment remains challenging even for advanced LLMs.
-----
📊 Results:
→ Best accuracy: 63% on text-based tasks, 45% on visual problems.
→ The top LLM judge achieves an 80% F1-score on μ-MATH (a minimal judge-scoring sketch follows this list).
→ Proprietary models outperform open-source ones by 18.5% on visual tasks.
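μ-MATH scores LLM judges by comparing their correct/incorrect verdicts on candidate solutions against gold labels and summarizing agreement with an F1-score. A minimal sketch of that kind of scoring, using scikit-learn and made-up placeholder labels rather than data from the paper:

```python
# Illustrative judge meta-evaluation: compare an LLM judge's binary verdicts
# against gold human labels. The label arrays are placeholders, not paper data.
from sklearn.metrics import f1_score

gold     = [1, 0, 1, 1, 0, 1, 0, 0]   # human-verified correctness of solutions
verdicts = [1, 0, 1, 0, 0, 1, 1, 0]   # judge LLM's correct/incorrect calls

print(f"judge macro-F1: {f1_score(gold, verdicts, average='macro'):.3f}")
```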