This paper improves math reasoning verification by keeping only training data on which multiple verifiers agree.
The paper proposes a consensus filtering mechanism that combines Monte Carlo estimation with LLM-as-judge to improve Process Reward Models in mathematical reasoning.
-----
https://arxiv.org/abs/2501.07301
🤔 Original Problem:
→ Current Process Reward Models (PRMs) suffer from low-quality training data and misleading evaluation metrics, making step-level verification of mathematical reasoning unreliable
→ Monte Carlo estimation methods produce noisy data, while human annotation is expensive
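
To make the noise concrete, here is a minimal sketch of Monte Carlo step-label estimation as commonly used to build PRM data (not the paper's exact code); `sample_completions` and `is_correct` are hypothetical stand-ins for a rollout model and an answer checker:

```python
# Hedged sketch of Monte Carlo step-label estimation for PRM training data.
# `sample_completions` and `is_correct` are assumed interfaces, not the
# paper's implementation.

def mc_step_score(question: str, steps_so_far: list[str],
                  sample_completions, is_correct, n_samples: int = 8) -> float:
    """Estimate how often a partial solution can still reach the right answer."""
    prefix = question + "\n" + "\n".join(steps_so_far)
    completions = sample_completions(prefix, n=n_samples)  # LLM rollouts
    hits = sum(is_correct(c) for c in completions)         # answer-match count
    # Noisy by construction: a flawed step can still luck into the correct
    # final answer, inflating its score.
    return hits / n_samples
```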
-----
🔧 Solution in this Paper:
→ Introduces a consensus filtering mechanism that retains only data samples where Monte Carlo estimation and LLM-as-judge agree on error locations (see the sketch after this list)
→ Implements hard labels instead of soft labels for training, treating steps as correct only if they can lead to correct answers
→ Combines response-level and step-level metrics for more comprehensive evaluation
→ Uses Qwen2.5-72B-Instruct as the judge model to verify reasoning steps
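
A minimal sketch of the consensus-filtering and hard-labeling ideas, assuming each sample records a first-error index from each verifier (with -1 meaning "no error found"); the names and data shapes here are illustrative, not the paper's implementation:

```python
# Hedged sketch: keep samples where both verifiers agree, then assign hard
# (0/1) step labels. Field names like "mc_first_error" are assumptions.

def consensus_filter(samples: list[dict]) -> list[dict]:
    """Keep samples only when both verifiers agree on the error location."""
    return [s for s in samples
            if s["mc_first_error"] == s["judge_first_error"]]

def hard_labels(num_steps: int, first_error: int) -> list[int]:
    """Hard step labels: steps before the first error count as correct."""
    if first_error == -1:            # whole chain judged correct
        return [1] * num_steps
    return [1 if i < first_error else 0 for i in range(num_steps)]
```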
-----
💡 Key Insights:
→ Monte Carlo estimation alone yields inferior performance compared to LLM-as-judge approaches
→ Best-of-N evaluation can be misleading because responses can arrive at correct answers through flawed reasoning (illustrated after this list)
→ PRMs trained solely on Best-of-N tend to drift towards outcome-based rather than process-based assessment
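
An illustrative Best-of-N loop with a PRM-style scorer (`prm_score`, `generate_n`, and `is_correct` are assumed interfaces, not from the paper), showing why the metric checks only the chosen final answer, never the soundness of individual steps:

```python
# Hedged sketch of Best-of-N evaluation with a PRM as the selector.

def best_of_n(responses: list[str], prm_score) -> str:
    """Pick the candidate the PRM scores highest."""
    return max(responses, key=prm_score)

def bon_accuracy(problems, generate_n, prm_score, is_correct) -> float:
    """Fraction of problems where the PRM-selected response is answer-correct."""
    hits = 0
    for p in problems:
        chosen = best_of_n(generate_n(p), prm_score)
        hits += is_correct(p, chosen)  # flawed reasoning can still pass here
    return hits / len(problems)
```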
-----
📊 Results:
→ New PRM achieves 69.3% accuracy on Best-of-8 evaluation, outperforming existing models
→ Demonstrates a 78.3% F1 score on ProcessBench, significantly higher than the 56.5% baseline (a sketch of this F1 computation follows below)
→ Reduces training data by 60% while maintaining performance through consensus filtering
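
For reference, a rough sketch of a ProcessBench-style F1, computed (under my reading of the benchmark) as the harmonic mean of accuracy on erroneous samples, where the earliest wrong step must be located, and on error-free samples, where "no error" must be predicted; the data shapes are assumed:

```python
# Hedged sketch of ProcessBench-style F1. preds/golds hold the first-error
# index per sample, with -1 meaning an error-free chain. Assumes both the
# erroneous and error-free subsets are non-empty.

def processbench_f1(preds: list[int], golds: list[int]) -> float:
    err_hits = err_total = ok_hits = ok_total = 0
    for p, g in zip(preds, golds):
        if g == -1:
            ok_total += 1
            ok_hits += (p == -1)      # correctly said "no error"
        else:
            err_total += 1
            err_hits += (p == g)      # located the earliest wrong step
    acc_err, acc_ok = err_hits / err_total, ok_hits / ok_total
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```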