Process-level Reward Models need better error detection: PRMBENCH shows why
PRMBENCH is a comprehensive benchmark that evaluates Process-level Reward Models' ability to detect various reasoning errors, featuring 6,216 problems and 83,456 step-level labels.
-----
https://arxiv.org/abs/2501.03124
Original Problem 🤔:
Current benchmarks for evaluating Process-level Reward Models (PRMs) focus only on binary correctness checks, so they miss nuanced reasoning errors such as redundancy or domain inconsistency.
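To see the gap, here is a contrived illustration (my own, not drawn from the benchmark): two solution chains in which every step is factually correct, so a binary correctness check scores them identically, even though one contains a redundant detour that a fine-grained PRM should penalize.

```python
# Contrived example, not from PRMBENCH: both chains solve 2x + 6 = 10
# correctly, so step-level binary correctness cannot separate them.
solution_a = [
    "2x + 6 = 10",
    "2x = 4",   # subtract 6 from both sides
    "x = 2",    # divide both sides by 2
]
solution_b = [
    "2x + 6 = 10",
    "2x = 4",   # subtract 6 from both sides
    "4x = 8",   # correct, but a pointless doubling of both sides...
    "2x = 4",   # ...immediately undone again (redundant detour)
    "x = 2",    # divide both sides by 2
]
# A binary checker labels every step above as correct; only a PRM that
# also scores redundancy would rank solution_a above solution_b.
```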
-----
Solution in this Paper 🛠️:
→ PRMBENCH evaluates PRMs across three dimensions: simplicity, soundness, and sensitivity
→ The benchmark includes 6,216 carefully designed problems with fine-grained error categories
→ Each problem tests specific error detection capabilities like redundancy, circular logic, and domain consistency
→ The evaluation framework uses both synthetic and human-verified data generation methods (a minimal evaluation loop is sketched after this list)
→ Quality control involves strict filtering rules and manual verification of a 10% sample
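As a concrete picture of what step-level evaluation involves, here is a minimal sketch of an evaluation loop in the spirit of PRMBENCH. This is not the paper's code: `prm.score_step`, the field names, and the 0.5 acceptance threshold are all assumptions for illustration.

```python
from typing import Dict, List

THRESHOLD = 0.5  # assumed cutoff: a reward above this means "step is fine"

def evaluate_prm(prm, problems: List[Dict]) -> Dict[str, float]:
    """Compare a PRM's per-step judgments against step-level gold labels."""
    correct_by_category: Dict[str, int] = {}
    total_by_category: Dict[str, int] = {}
    for problem in problems:
        context = problem["question"]
        for step, gold_is_good in zip(problem["steps"], problem["step_labels"]):
            reward = prm.score_step(context, step)  # hypothetical API
            predicted_good = reward > THRESHOLD
            cat = problem["error_category"]         # e.g. "redundancy"
            total_by_category[cat] = total_by_category.get(cat, 0) + 1
            if predicted_good == gold_is_good:
                correct_by_category[cat] = correct_by_category.get(cat, 0) + 1
            context += "\n" + step  # later steps are judged with prior steps visible
    return {cat: correct_by_category.get(cat, 0) / n
            for cat, n in total_by_category.items()}
```

Reporting accuracy per error category is what exposes the weaknesses a single aggregate number hides, e.g. a model that nails basic correctness but fails on redundancy.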
-----
Key Insights 🔍:
→ PRMs struggle with detecting subtle reasoning errors beyond basic correctness checks
→ Even top models like Gemini-2-Thinking perform only marginally above random guessing
→ Open-source PRMs significantly underperform proprietary LLMs
→ Models show a clear bias towards either positive or negative rewards in certain scenarios (a quick bias check is sketched after this list)
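One quick way to surface that last point is to measure acceptance rates separately on gold-good and gold-bad steps. The sketch below assumes you already have per-step accept/reject predictions paired with gold labels; it is a diagnostic illustration, not the paper's methodology.

```python
def reward_bias(predictions):
    """predictions: iterable of (predicted_good, gold_good) boolean pairs."""
    accepts_on_good = [p for p, g in predictions if g]
    accepts_on_bad = [p for p, g in predictions if not g]
    # Acceptance rate on genuinely good steps vs. on erroneous steps.
    good_accept_rate = sum(accepts_on_good) / max(len(accepts_on_good), 1)
    bad_accept_rate = sum(accepts_on_bad) / max(len(accepts_on_bad), 1)
    return good_accept_rate, bad_accept_rate

# Both rates near 1.0: the model rubber-stamps everything (positive bias).
# Both near 0.0: it rejects everything (negative bias). A discriminating
# PRM keeps the first rate high and the second low.
```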
-----
Results 📊:
→ Best model (Gemini-2-Thinking) achieves only 68.8% accuracy
→ Open-source PRMs average 50.1% accuracy
→ Human verification of sampled data shows a 92% qualification rate for correctness