This paper introduces HARP, a challenging math benchmark of 5,409 problems from US national math competitions, with multiple difficulty levels and human-written solutions for evaluating LLM reasoning capabilities.
-----
https://arxiv.org/abs/2412.08819
Original Problem 🤔:
→ Existing math benchmarks like MATH and GSM8K are becoming saturated, with frontier models achieving over 90% accuracy, making it difficult to meaningfully evaluate and improve mathematical reasoning capabilities.
-----
Solution in this Paper 🔧:
→ HARP sources 5,409 problems from prestigious US math competitions including AMC, AIME, and USAMO.
→ The dataset features 4,780 short-answer problems with programmatically-checkable answers across 6 difficulty levels.
→ Each problem includes multiple human-written solutions (average 2.14 solutions per problem) and expert-annotated difficulty levels and subject labels.
→ For 4,110 problems, multiple-choice options are also provided to enable research on different evaluation formats (a hypothetical record layout and answer check are sketched below).
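To make the dataset structure concrete, here is a minimal Python sketch of what one short-answer record and a naive programmatic answer check could look like. The field names and the exact-match normalization are illustrative assumptions, not HARP's actual schema or grading code.

```python
# Hypothetical record layout for one HARP-style short-answer problem, plus a
# naive answer check. All field names are assumptions for illustration only.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class HarpProblem:
    problem: str                          # LaTeX problem statement
    answer: str                           # short, programmatically-checkable answer
    level: int                            # difficulty level, e.g. 1 (easiest) to 6 (hardest)
    subject: str                          # e.g. "Geometry", "Precalculus"
    solutions: list[str] = field(default_factory=list)  # human-written solutions
    choices: list[str] | None = None      # multiple-choice options, when available

def is_correct(model_output: str, gold: str) -> bool:
    """Naive exact-match check on normalized strings; a real grader would also
    normalize LaTeX and numeric forms (e.g., via sympy)."""
    normalize = lambda s: s.strip().replace(" ", "").lower()
    return normalize(model_output) == normalize(gold)

# Toy example (not an actual HARP problem):
example = HarpProblem(
    problem=r"What is $1+2+\dots+10$?",
    answer="55",
    level=1,
    subject="Algebra",
    solutions=[r"Pair terms: $\tfrac{10 \cdot 11}{2} = 55$."],
)
assert is_correct(" 55 ", example.answer)
```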
-----
Key Insights 💡:
→ Models intrinsically scale their inference-time compute, generating longer responses on harder problems (one way to measure this is sketched after this list)
→ The multiple-choice format yields higher accuracy than the short-answer format
→ Performance drops significantly with increasing problem difficulty across all models
→ Models struggle most with Geometry and Precalculus problems
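One way to quantify the inference-time-compute observation above is to bucket model outputs by difficulty level and compare average response lengths. The sketch below assumes a generic generate(problem) callable and uses a whitespace word count as a crude token proxy; both are stand-ins, not the paper's measurement code.

```python
# Rough sketch: average output length per difficulty level as a proxy for
# inference-time compute. `generate` is any callable that maps a problem
# string to the model's full response (hypothetical stand-in).
from collections import defaultdict
from statistics import mean

def avg_output_length_by_level(problems, generate):
    """problems: iterable of (difficulty_level, problem_text) pairs."""
    lengths = defaultdict(list)
    for level, text in problems:
        response = generate(text)
        lengths[level].append(len(response.split()))  # crude proxy for tokens used
    return {level: mean(vals) for level, vals in sorted(lengths.items())}

# Usage with a dummy "model" that writes more for problems marked hard:
if __name__ == "__main__":
    dummy = lambda text: "step " * (5 + 10 * text.count("hard"))
    data = [(1, "easy sum"), (6, "hard olympiad geometry problem")]
    print(avg_output_length_by_level(data, dummy))  # e.g. {1: 5, 6: 15}
```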
-----
Results 📊:
→ The best-performing model (o1-mini) achieves only 41.1% accuracy on the highest-difficulty problems
→ Gemini 1.5 Pro reaches just 9.6% on the hardest problems
→ Multiple-choice evaluation achieves 80% accuracy vs. 64.2% in the short-answer format (both prompt formats are sketched below)
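A simple way to probe that multiple-choice vs. short-answer gap is to pose the same problem in both formats and compare accuracy. The prompt templates below are hypothetical wording, not the paper's exact templates.

```python
# Hypothetical prompt templates for posing the same problem in short-answer
# and multiple-choice formats; the wording is an assumption for illustration.
from string import ascii_uppercase

def short_answer_prompt(problem: str) -> str:
    return f"{problem}\n\nGive only the final answer."

def multiple_choice_prompt(problem: str, choices: list[str]) -> str:
    lettered = "\n".join(f"({ascii_uppercase[i]}) {c}" for i, c in enumerate(choices))
    return f"{problem}\n\n{lettered}\n\nAnswer with the letter of the correct choice."

# Toy example (not an actual HARP problem):
problem = r"What is $2^{10}$?"
print(short_answer_prompt(problem))
print(multiple_choice_prompt(problem, ["512", "1024", "2048", "4096", "8192"]))
```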