
"HARP: A challenging human-annotated math reasoning benchmark"

The podcast on this paper is generated with Google's Illuminate.

HARP introduces a challenging math benchmark dataset with 5,409 problems from US national math competitions, featuring multiple difficulty levels and human-written solutions to evaluate LLM reasoning capabilities.

-----

https://arxiv.org/abs/2412.08819

Original Problem 🤔:

→ Existing math benchmarks like MATH and GSM8K are becoming saturated, with frontier models achieving over 90% accuracy, making it difficult to meaningfully evaluate and improve mathematical reasoning capabilities.

-----

Solution in this Paper 🔧:

→ HARP sources 5,409 problems from prestigious US math competitions including AMC, AIME, and USAMO.

→ The dataset features 4,780 short-answer problems with programmatically checkable answers across 6 difficulty levels (see the answer-checking sketch after this list).

→ Each problem includes multiple human-written solutions (average 2.14 solutions per problem) and expert-annotated difficulty levels and subject labels.

→ For 4,110 problems, multiple choice options are provided to enable research on different evaluation formats.
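
Below is a minimal sketch of what programmatically checking a short-answer response could look like. The record fields (including the multiple-choice options), the normalization rules, and the `is_correct` helper are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
import re

def normalize(answer: str) -> str:
    """Strip LaTeX wrappers, whitespace, and common formatting noise."""
    answer = answer.strip().strip("$")
    answer = re.sub(r"\\boxed\{(.*)\}", r"\1", answer)   # unwrap \boxed{...}
    return answer.replace(" ", "").replace(",", "")       # drop spaces and thousands separators

def is_correct(model_output: str, gold_answer: str) -> bool:
    """Take the last boxed expression in the output (or the last line) and compare."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    prediction = boxed[-1] if boxed else model_output.strip().splitlines()[-1]
    return normalize(prediction) == normalize(gold_answer)

# Illustrative problem record; the field names and the multiple-choice options
# are assumptions about the dataset schema, not taken from the paper.
example = {
    "problem": "What is $1 + 2 + \\dots + 10$?",
    "answer": "55",
    "choices": ["45", "50", "55", "60", "66"],
    "difficulty": 1,
    "subject": "Algebra",
}

print(is_correct("The sum is $\\boxed{55}$.", example["answer"]))  # True
```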

-----

Key Insights 💡:

→ Models intrinsically scale their inference-time compute for harder problems (see the measurement sketch after this list)

→ Multiple choice format improves performance compared to short answer format

→ Performance drops significantly with increasing problem difficulty across all models

→ Models struggle most with Geometry and Precalculus problems
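
A minimal sketch of how the inference-time-compute observation could be checked on one's own generation logs: group per-problem output lengths by annotated difficulty and compare the averages. The log format and the numbers below are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-problem logs: (difficulty level, tokens generated, answered correctly).
# These values are placeholders for illustration only.
logs = [
    (1, 310, True), (1, 280, True),
    (3, 720, True), (3, 650, False),
    (6, 1900, False), (6, 2100, False),
]

tokens_by_level = defaultdict(list)
correct_by_level = defaultdict(list)
for level, n_tokens, correct in logs:
    tokens_by_level[level].append(n_tokens)
    correct_by_level[level].append(correct)

# Harder levels should show longer generations and lower accuracy.
for level in sorted(tokens_by_level):
    print(f"level {level}: "
          f"avg tokens = {mean(tokens_by_level[level]):.0f}, "
          f"accuracy = {mean(correct_by_level[level]):.0%}")
```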

-----

Results 📊:

→ Best-performing model (o1-mini) achieves only 41.1% accuracy on the highest-difficulty problems

→ Gemini 1.5 Pro reaches just 9.6% on the hardest problems

→ Multiple choice evaluation achieves 80% accuracy vs 64.2% for short answer format
