"Free Process Rewards without Process Labels"

The podcast on this paper is generated with Google's Illuminate.

This paper introduces a method to obtain Process Reward Models (PRMs) implicitly from Outcome Reward Model (ORM) training, without expensive step-by-step annotations, cutting data collection and training overhead by 38.8x.

-----

https://arxiv.org/abs/2412.01981

🤔 Original Problem:

Process Reward Models (PRMs) provide step-by-step evaluation of LLM reasoning but require expensive step-level annotations, making them impractical to scale.

-----

🔧 Solution in this Paper:

→ The paper parameterizes the outcome reward as the log-likelihood ratio between a policy model and a reference model, r(y) = β log [π(y) / π_ref(y)] (see the sketch after this list).

→ With this parameterization, a PRM emerges automatically from standard Outcome Reward Model training on response-level labels: per-step rewards can be read off as differences of cumulative log-likelihood ratios, with no step annotations.

→ The method works with various training objectives like DPO, KTO, NCA, and Cross-Entropy loss.
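
A minimal sketch of this mechanism in plain Python (the helper name, β value, and log-probabilities are illustrative, not from the paper): the implicit PRM scores step t by the cumulative β-scaled log-likelihood ratio of all tokens up to that step, and the per-step process reward is the difference between consecutive cumulative scores.

```python
# Sketch of an "implicit PRM": process rewards recovered from
# policy-vs-reference log-likelihood ratios, with no step-level labels.
# Helper names, beta, and the toy numbers are illustrative only.

BETA = 0.05  # reward-scale hyperparameter of the parameterization

def step_rewards(policy_logps, ref_logps, step_ends):
    """Return (cumulative Q per step, process reward per step).

    policy_logps / ref_logps: per-token log-probabilities of one response
        under the trained policy (the implicit PRM) and the frozen reference.
    step_ends: index of the last token of each reasoning step.
    """
    q_values, cum, prev_end = [], 0.0, -1
    for end in step_ends:
        # Cumulative score up to step t: sum of beta * log-ratio per token.
        for i in range(prev_end + 1, end + 1):
            cum += BETA * (policy_logps[i] - ref_logps[i])
        q_values.append(cum)
        prev_end = end
    # Process reward of step t is the increment q_t - q_{t-1}.
    rewards = [q_values[0]] + [
        q_values[t] - q_values[t - 1] for t in range(1, len(q_values))
    ]
    return q_values, rewards

# Toy example: a 6-token response split into two reasoning steps.
policy_lp = [-1.2, -0.8, -2.0, -0.5, -1.1, -0.9]
ref_lp    = [-1.5, -1.0, -1.8, -0.9, -1.4, -0.7]
q, r = step_rewards(policy_lp, ref_lp, step_ends=[2, 5])
print("cumulative Q per step:", q)   # ≈ [0.015, 0.040]
print("process reward per step:", r)
```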

-----

💡 Key Insights:

→ PRMs can be obtained "for free" from Outcome Reward Models trained with this parameterization, with no extra annotation or training

→ The approach reduces data collection and training overhead by 38.8x

→ The reference model can be omitted at inference time without hurting accuracy (a reference-free variant is sketched below)
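
As a variation on the earlier sketch (again with toy numbers, and assuming this insight refers to dropping the reference model at scoring time), omitting the reference term means scoring each step from the policy's own log-probabilities alone:

```python
# Reference-free variant: score steps from the policy's log-probs alone,
# dropping the reference-model term (illustrative sketch, toy numbers).
BETA = 0.05

def ref_free_step_rewards(policy_logps, step_ends):
    q_values, cum, prev_end = [], 0.0, -1
    for end in step_ends:
        cum += BETA * sum(policy_logps[prev_end + 1 : end + 1])
        q_values.append(cum)
        prev_end = end
    return [q_values[0]] + [
        q_values[t] - q_values[t - 1] for t in range(1, len(q_values))
    ]

print(ref_free_step_rewards([-1.2, -0.8, -2.0, -0.5, -1.1, -0.9], [2, 5]))
```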

-----

📊 Results:

→ Outperforms the MCTS-based baseline while using less than 1/38 of its training data

→ Achieves 50.4% average accuracy with the DPO variant

→ Works effectively even with imbalanced, unpaired data when trained with a cross-entropy (CE) loss (a minimal form is sketched below)
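
For the cross-entropy case mentioned above, here is a minimal sketch assuming a standard binary cross-entropy over the implicit response-level reward (the toy tensors stand in for summed log-probabilities from real models): each whole response carries only an outcome label, so no paired preference data or step annotations are needed.

```python
import torch
import torch.nn.functional as F

# Sketch of ORM training with a cross-entropy loss over the implicit
# response-level reward r(y) = beta * (log pi_theta(y) - log pi_ref(y)).
# The tensors below are toy stand-ins for summed response log-probs.
beta = 0.05

policy_logp = torch.tensor([-42.0, -55.0, -38.0], requires_grad=True)  # log pi_theta(y)
ref_logp    = torch.tensor([-45.0, -50.0, -40.0])                      # log pi_ref(y)
labels      = torch.tensor([1.0, 0.0, 1.0])   # outcome labels: 1 = correct answer

reward = beta * (policy_logp - ref_logp)       # implicit outcome reward per response
loss = F.binary_cross_entropy_with_logits(reward, labels)
loss.backward()                                # gradients would flow into pi_theta
print(float(loss))
```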
