This paper introduces a method to obtain Process Reward Models automatically from Outcome Reward Models without expensive step-by-step annotations, reducing training costs by 38.8x.
-----
https://arxiv.org/abs/2412.01981
🤔 Original Problem:
Process Reward Models (PRMs) provide step-by-step evaluation of LLM reasoning but require expensive step-level annotations, making them impractical to scale.
-----
🔧 Solution in this Paper:
→ The paper parameterizes the outcome reward as the log-likelihood ratio between a policy model and a reference model: r(y) = β · log[π_policy(y|x) / π_ref(y|x)].
→ With this parameterization, a PRM emerges automatically from standard Outcome Reward Model training: step-level rewards fall out as increments of the cumulative log-likelihood ratio, with no step-level labels needed (see the sketch after this list).
→ The method works with a range of training objectives, including DPO, KTO, NCA, and cross-entropy (CE) loss.
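To make the step-level reward concrete, here is a minimal sketch of how per-step process rewards could be read off such an implicitly trained PRM. It assumes HuggingFace-style causal LMs, a "\n\n" step delimiter, and β = 0.05; the function names and hyperparameters are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch (not the authors' code): per-step process rewards from a policy
# trained as an ORM with reward r(y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ).
# Step delimiter ("\n\n") and beta are illustrative assumptions.
import torch


def per_token_logps(model, input_ids):
    """Per-token log-probabilities the model assigns to the realized next tokens."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]   # position i predicts token i+1
    labels = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # [batch, seq_len-1]


def implicit_process_rewards(policy, ref, tokenizer, prompt, response,
                             beta=0.05, step_sep="\n\n"):
    """Reward of step t = increment of the cumulative log-likelihood ratio at step t."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    # beta * (log pi_theta - log pi_ref) for every response token
    ratio = beta * (per_token_logps(policy, ids) - per_token_logps(ref, ids))
    ratio = ratio[0, n_prompt - 1:]                   # keep response positions only

    # Approximate the last token index of each reasoning step, then take
    # differences of the cumulative log-ratio at those boundaries.
    step_ends, offset = [], 0
    for step in response.split(step_sep):
        offset += len(tokenizer(step + step_sep, add_special_tokens=False).input_ids)
        step_ends.append(min(offset, ratio.shape[0]) - 1)
    cumulative = ratio.cumsum(dim=0)[step_ends]
    return torch.diff(cumulative, prepend=cumulative.new_zeros(1))
```

At test time, such per-step rewards can be aggregated (e.g., summed or min-pooled) to score candidate solutions for best-of-N selection.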
-----
💡 Key Insights:
→ A PRM can be obtained "for free" from an Outcome Reward Model trained only on final-answer (outcome) labels, with no extra training stage (a CE-loss training sketch follows this list)
→ The approach reduces data collection and training overhead by 38.8x
→ The reference model can be omitted without hurting accuracy
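As one concrete instantiation, the CE variant trains the policy as an ORM against binary outcome labels only. The sketch below (illustrative variable names and β, not the paper's code) shows the loss, with the reference log-probs precomputed from a frozen reference model.

```python
# Minimal sketch of the CE instantiation: binary cross-entropy on outcome labels,
# with the implicit reward r(y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# Names and beta are illustrative assumptions.
import torch.nn.functional as F


def ce_orm_loss(policy_logps, ref_logps, outcome_labels, beta=0.05):
    """
    policy_logps, ref_logps: [batch] summed response log-probs under the policy
                             and the frozen reference model.
    outcome_labels:          [batch] 1.0 if the final answer is correct, else 0.0.
    """
    reward = beta * (policy_logps - ref_logps)        # implicit outcome-level reward
    # -( l*log sigmoid(r) + (1-l)*log(1 - sigmoid(r)) ); no step labels required,
    # and unpaired / imbalanced samples are handled naturally.
    return F.binary_cross_entropy_with_logits(reward, outcome_labels)
```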
-----
📊 Results:
→ Outperforms the MCTS-based baseline while using less than 1/38 of its training data
→ Achieves 50.4% average accuracy with the DPO variant
→ Works effectively even with imbalanced, unpaired data when trained with CE loss