This paper introduces a method to obtain Process Reward Models automatically from Outcome Reward Models without expensive step-by-step annotations, reducing training costs by 38.8x.
-----
https://arxiv.org/abs/2412.01981
🤔 Original Problem:
Process Reward Models (PRMs) provide step-by-step evaluation of LLM reasoning but require expensive step-level annotations, making them impractical to scale.
-----
🔧 Solution in this Paper:
→ The paper parameterizes the outcome reward as the log-likelihood ratio between a policy model and a reference model: r(y) = β · log[π_policy(y|x) / π_ref(y|x)].
→ With this parameterization, a PRM emerges automatically from standard Outcome Reward Model training: step-level rewards fall out as increments of the cumulative log-likelihood ratio, with no step-level labels needed (see the sketch after this list).
→ The method works with a range of training objectives, including DPO, KTO, NCA, and cross-entropy (CE) loss.
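To make the step-level reward concrete, here is a minimal sketch of how per-step process rewards could be read off such an implicitly trained PRM. It assumes HuggingFace-style causal LMs, a "\n\n" step delimiter, and β = 0.05; the function names and hyperparameters are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch (not the authors' code): per-step process rewards from a policy
# trained as an ORM with reward r(y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ).
# Step delimiter ("\n\n") and beta are illustrative assumptions.
import torch


def per_token_logps(model, input_ids):
    """Per-token log-probabilities the model assigns to the realized next tokens."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]   # position i predicts token i+1
    labels = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # [batch, seq_len-1]


def implicit_process_rewards(policy, ref, tokenizer, prompt, response,
                             beta=0.05, step_sep="\n\n"):
    """Reward of step t = increment of the cumulative log-likelihood ratio at step t."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    # beta * (log pi_theta - log pi_ref) for every response token
    ratio = beta * (per_token_logps(policy, ids) - per_token_logps(ref, ids))
    ratio = ratio[0, n_prompt - 1:]                   # keep response positions only

    # Approximate the last token index of each reasoning step, then take
    # differences of the cumulative log-ratio at those boundaries.
    step_ends, offset = [], 0
    for step in response.split(step_sep):
        offset += len(tokenizer(step + step_sep, add_special_tokens=False).input_ids)
        step_ends.append(min(offset, ratio.shape[0]) - 1)
    cumulative = ratio.cumsum(dim=0)[step_ends]
    return torch.diff(cumulative, prepend=cumulative.new_zeros(1))
```

At test time, such per-step rewards can be aggregated (e.g., summed or min-pooled) to score candidate solutions for best-of-N selection.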
-----
💡 Key Insights:
→ A PRM can be obtained "for free" from an Outcome Reward Model trained only on final-answer (outcome) labels, with no extra training stage (a CE-loss training sketch follows this list)
→ The approach reduces data collection and training overhead by 38.8x
→ The reference model can be omitted without hurting accuracy
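As one concrete instantiation, the CE variant trains the policy as an ORM against binary outcome labels only. The sketch below (illustrative variable names and β, not the paper's code) shows the loss, with the reference log-probs precomputed from a frozen reference model.

```python
# Minimal sketch of the CE instantiation: binary cross-entropy on outcome labels,
# with the implicit reward r(y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# Names and beta are illustrative assumptions.
import torch.nn.functional as F


def ce_orm_loss(policy_logps, ref_logps, outcome_labels, beta=0.05):
    """
    policy_logps, ref_logps: [batch] summed response log-probs under the policy
                             and the frozen reference model.
    outcome_labels:          [batch] 1.0 if the final answer is correct, else 0.0.
    """
    reward = beta * (policy_logps - ref_logps)        # implicit outcome-level reward
    # -( l*log sigmoid(r) + (1-l)*log(1 - sigmoid(r)) ); no step labels required,
    # and unpaired / imbalanced samples are handled naturally.
    return F.binary_cross_entropy_with_logits(reward, outcome_labels)
```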
-----
📊 Results:
→ Outperforms the MCTS-based baseline while using less than 1/38 of its training data
→ Achieves 50.4% average accuracy with the DPO variant
→ Works effectively even with imbalanced, unpaired data when trained with CE loss