This paper introduces UGMathBench, a dynamic benchmark for evaluating undergraduate-level mathematical reasoning with LLMs.
It addresses limitations of existing benchmarks, which often lack coverage or suffer from test-set contamination.
-----
Paper - https://arxiv.org/abs/2501.13766
Original Problem 😠:
→ Existing mathematical reasoning benchmarks are insufficient for rigorously evaluating LLMs.
→ They lack broad coverage of undergraduate-level math problems.
→ Their fixed, public test sets are susceptible to test-set contamination.
-----
Solution in this Paper 🤔:
→ UGMathBench is proposed as a diverse and dynamic benchmark.
→ It comprises 5,062 undergraduate-level problems across 16 subjects and 111 topics, with 10 distinct answer types.
→ Each problem comes in three randomized versions, with additional versions planned for release.
→ Two new metrics are also proposed: effective accuracy (EAcc) and reasoning gap (Δ).
→ EAcc is the percentage of problems a model solves correctly across all randomized versions.
→ The reasoning gap Δ measures reasoning robustness as the difference between the average accuracy over the randomized versions and EAcc (see the sketch after this list).
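A minimal sketch of how these two metrics could be computed from per-version correctness results. This is not the authors' evaluation code; the function name `compute_metrics` and the input format are assumptions for illustration only.

```python
# Sketch of the two UGMathBench metrics, assuming each problem has K
# randomized versions and we know, per problem and version, whether the
# model's final answer was correct. Names here are hypothetical.
from typing import Dict, List

def compute_metrics(correct: Dict[str, List[bool]]) -> Dict[str, float]:
    """correct maps problem_id -> list of per-version correctness flags."""
    num_problems = len(correct)
    num_versions = len(next(iter(correct.values())))  # e.g., 3 in UGMathBench

    # Average accuracy: fraction of all (problem, version) pairs solved.
    aacc = sum(sum(flags) for flags in correct.values()) / (num_problems * num_versions)

    # Effective accuracy (EAcc): fraction of problems solved in *every* version.
    eacc = sum(all(flags) for flags in correct.values()) / num_problems

    # Reasoning gap: difference between average and effective accuracy.
    delta = aacc - eacc
    return {"AAcc": aacc, "EAcc": eacc, "Delta": delta}

# Example: 2 problems, 3 versions each.
results = {"p1": [True, True, True], "p2": [True, False, True]}
print(compute_metrics(results))  # AAcc ≈ 0.833, EAcc = 0.5, Delta ≈ 0.333
```

The point of the gap: a model that solves a problem in only some of its randomized versions inflates average accuracy but not EAcc, so a larger Δ signals less robust reasoning.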
-----
Key Insights from this Paper 😲:
→ Current LLMs struggle with undergraduate-level mathematical reasoning.
→ The highest EAcc achieved is only 56.3%, by OpenAI o1-mini.
→ All evaluated LLMs exhibit a large reasoning gap.
→ Calculation errors are a major concern for LLMs in solving math problems.
-----
Results 😎:
→ OpenAI o1-mini achieves the highest EAcc of 56.30%.
→ All LLMs exhibit high reasoning gaps, with Robustness Efficiency (RE) ranging from 20.78% to 196.6%.
→ Arithmetic has the highest average EAcc (62.8%) among all subjects.