ProcessBench is a benchmark for identifying errors in LLMs' step-by-step mathematical reasoning: 3,400 test cases with expert-annotated error locations, drawn primarily from competition- and Olympiad-level problems.
-----
https://arxiv.org/abs/2412.06559
🔧 ProcessBench, introduced in this paper:
→ ProcessBench evaluates models' ability to identify the earliest erroneous step in step-by-step mathematical solutions (a minimal evaluation sketch follows this list).
→ The benchmark contains 3,400 test cases, primarily covering competition and Olympiad-level problems.
→ Human experts annotate error locations in solutions generated by various open-source models.
→ Two types of models are evaluated: Process Reward Models (PRMs) and critic models.
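The per-sample scoring is easy to mock up. Below is a minimal Python sketch, assuming a hypothetical prompt builder and scorer; the prompt wording, step indexing, and function names are illustrative, not the paper's exact template. A prediction counts as correct only when it matches the expert-annotated earliest error step, with -1 reserved for error-free solutions.

```python
# Minimal sketch of a ProcessBench-style query for critic models.
# The prompt wording, step indexing, and both helper functions are
# illustrative assumptions, not the paper's exact template.

from typing import List


def build_critic_prompt(problem: str, steps: List[str]) -> str:
    """Ask a critic model for the index of the earliest wrong step,
    or -1 if every step is correct."""
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps))
    return (
        f"Problem:\n{problem}\n\n"
        f"Solution steps:\n{numbered}\n\n"
        "Reply with the index of the earliest incorrect step, "
        "or -1 if all steps are correct."
    )


def score_prediction(predicted_step: int, annotated_step: int) -> bool:
    """A prediction counts only if it matches the expert-annotated
    earliest error (or -1 for an error-free solution)."""
    return predicted_step == annotated_step


# Toy usage: the solution below adds before multiplying, so step 0
# is the earliest error (2 + 2 * 3 = 8, not 12).
prompt = build_critic_prompt(
    "Compute 2 + 2 * 3.",
    ["2 + 2 = 4", "4 * 3 = 12", "The answer is 12."],
)
print(score_prediction(predicted_step=0, annotated_step=0))  # True
```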
-----
🎯 Key Insights:
→ Existing process reward models (PRMs) struggle to generalize beyond the relatively simple problems (GSM8K/MATH level) they are typically trained on
→ General LLMs prompted as critic models identify errors better than specialized PRMs
→ QwQ-32B-Preview matches GPT-4o in critique capability but still lags behind o1-mini
-----
📊 Results:
→ Best open-source model (QwQ-32B-Preview) achieves a 71.5% F1 score (metric sketched below)
→ Proprietary o1-mini leads with an 87.9% F1 score
→ Process reward models (PRMs) show a sharp performance drop on Olympiad-level problems
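The F1 above is the harmonic mean of two subset accuracies: accuracy on samples that contain an error (earliest error step located correctly) and accuracy on error-free samples (the model answers that no error exists). A minimal sketch, with illustrative inputs rather than the paper's actual subset scores:

```python
# Sketch of the benchmark's F1 metric: the harmonic mean of accuracy on
# erroneous samples and accuracy on error-free samples. The input values
# below are illustrative, not the paper's actual subset scores.

def processbench_f1(acc_erroneous: float, acc_correct: float) -> float:
    """Harmonic mean of the two subset accuracies, each in [0, 1]."""
    if acc_erroneous + acc_correct == 0:
        return 0.0
    return 2 * acc_erroneous * acc_correct / (acc_erroneous + acc_correct)


# e.g., 0.70 on erroneous samples and 0.73 on error-free samples
print(round(processbench_f1(0.70, 0.73), 3))  # 0.715
```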