"UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models"

The accompanying podcast on this paper was generated with Google's Illuminate.

This paper introduces UGMathBench, a dynamic benchmark for evaluating undergraduate-level mathematical reasoning with LLMs.

It addresses limitations of existing benchmarks, which often lack broad coverage of undergraduate-level topics or suffer from test-set contamination.

-----

Paper - https://arxiv.org/abs/2501.13766

Original Problem 😠:

→ Existing mathematical reasoning benchmarks are insufficient for evaluating LLMs.

→ These benchmarks lack extensive coverage of undergraduate-level math problems.

→ They may suffer from test-set contamination.

-----

Solution in this Paper 🤔:

→ UGMathBench is proposed as a diverse and dynamic benchmark.

→ It comprises 5,062 undergraduate-level problems across 16 subjects and 111 topics, with 10 distinct answer types.

→ Each problem includes three randomized versions, with more planned for future release.

→ Two new metrics are also proposed: effective accuracy (EAcc) and reasoning gap (Δ).

→ EAcc is the percentage of problems a model solves correctly in every randomized version.

→ The reasoning gap Δ quantifies robustness as the difference between average accuracy across all versions (AAcc) and EAcc; an ideal, robust reasoner would show high EAcc with Δ = 0.
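A minimal sketch (not the authors' code) of how these two metrics can be computed from per-version grading results; the problem IDs and outcomes below are made up for illustration:

```python
# Sketch: computing effective accuracy (EAcc), average accuracy (AAcc),
# and the reasoning gap (Delta) over a benchmark with 3 randomized versions per problem.
# results[problem_id] = list of booleans, one per randomized version (hypothetical data).
results = {
    "calc_001": [True, True, True],    # solved in every version
    "calc_002": [True, False, True],   # brittle: fails on one randomization
    "alg_017":  [False, False, False], # never solved
}

num_problems = len(results)
num_versions = 3

# EAcc: fraction of problems solved correctly in *all* randomized versions
eacc = sum(all(v) for v in results.values()) / num_problems

# AAcc: accuracy averaged over every (problem, version) pair
aacc = sum(sum(v) for v in results.values()) / (num_problems * num_versions)

# Reasoning gap: Delta = AAcc - EAcc (0 for a perfectly robust reasoner)
delta = aacc - eacc

print(f"EAcc = {eacc:.2%}, AAcc = {aacc:.2%}, reasoning gap = {delta:.2%}")
```

With this toy data, EAcc is 33.3% while AAcc is 55.6%, so Δ ≈ 22.2 points: inconsistency across randomized versions is what widens the gap.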

-----

Key Insights from this Paper 😲:

→ Current LLMs struggle with undergraduate-level mathematical reasoning.

→ The best EAcc achieved is only 56.3%, by OpenAI o1-mini.

→ All evaluated LLMs exhibit a large reasoning gap.

→ Calculation errors are a major concern for LLMs in solving math problems.

-----

Results 😎:

→ OpenAI o1-mini achieves the highest EAcc of 56.3%.

→ All LLMs exhibit high reasoning gaps, with Robustness Efficiency (RE) ranging from 20.78% to 196.6%.

→ Arithmetic has the highest average EAcc (62.8%) among all subjects.
