
"ProcessBench: Identifying Process Errors in Mathematical Reasoning"

The podcast on this paper is generated with Google's Illuminate.

ProcessBench is a benchmark for identifying errors in LLMs' mathematical reasoning: 3,400 test cases with expert-annotated error locations, focused primarily on competition- and Olympiad-level problems.

-----

https://arxiv.org/abs/2412.06559

🔧 ProcessBench introduced in this paper:

→ ProcessBench evaluates a model's ability to identify the earliest erroneous step in a step-by-step mathematical solution (see the protocol sketch after this list).

→ The benchmark contains 3,400 test cases, primarily covering competition and Olympiad-level problems.

→ Human experts annotate error locations in solutions generated by various open-source models.

→ Two types of models are evaluated: Process Reward Models (PRMs) and critic models.
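
To make the task concrete, here is a minimal sketch of a ProcessBench-style critic protocol: given a problem and its numbered solution steps, the model returns the index of the earliest incorrect step, or -1 if all steps are correct. The prompt wording is illustrative (not the authors' exact prompt), and `query_llm` is a hypothetical stand-in for any chat-completion call.

```python
import re

def build_critic_prompt(problem: str, steps: list[str]) -> str:
    # Number the solution steps so the model can reference them by index.
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps))
    return (
        f"Problem:\n{problem}\n\nSolution:\n{numbered}\n\n"
        "Identify the earliest step that contains an error. "
        "Answer with the step index only, or -1 if all steps are correct."
    )

def parse_prediction(response: str) -> int:
    # Take the last integer in the reply as the predicted step index.
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else -1

def judge_solution(problem: str, steps: list[str], query_llm) -> int:
    # query_llm: hypothetical callable that sends a prompt and returns the model's text reply.
    return parse_prediction(query_llm(build_critic_prompt(problem, steps)))
```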

-----

🎯 Key Insights:

→ Current process reward models (PRMs) struggle to generalize beyond simple math problems

→ General-purpose LLMs prompted as critic models identify errors better than specialized PRMs

→ QwQ-32B-Preview matches GPT-4o in critique capability but still lags behind o1-mini

-----

📊 Results:

→ Best open-source model (QwQ-32B-Preview) achieves a 71.5% F1 score (see the metric sketch after this list)

→ Proprietary model o1-mini leads with 87.9% F1 score

→ Process reward models (PRMs) show significant performance drop on Olympiad-level problems
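
The F1 scores above follow the paper's scoring rule: for each subset, compute accuracy on erroneous solutions (the exact earliest-error step is identified) and accuracy on error-free solutions (the model answers -1), then report their harmonic mean. A minimal sketch, assuming each sample is a dict with a `label` field holding the annotated earliest-error index or -1:

```python
def processbench_f1(samples: list[dict], predictions: list[int]) -> float:
    # Harmonic mean of accuracy on erroneous samples and accuracy on correct samples.
    err_hits = err_total = ok_hits = ok_total = 0
    for sample, pred in zip(samples, predictions):
        label = sample["label"]  # earliest error step index, or -1 if the solution is correct
        if label == -1:
            ok_total += 1
            ok_hits += int(pred == -1)
        else:
            err_total += 1
            err_hits += int(pred == label)
    acc_err = err_hits / err_total if err_total else 0.0
    acc_ok = ok_hits / ok_total if ok_total else 0.0
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```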
