Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

The podcast on this paper is generated with Google's Illuminate.

Great paper from @GoogleDeepMind on enhancing LLM reasoning.

Training LLMs with a coaching system that rewards small steps of progress. ✨

Progress-based rewards from complementary provers enhance LLM reasoning via efficient exploration in search and RL.

Auxiliary policy feedback transforms sparse rewards into dense progress signals for LLM reasoning. ✨

• Online RL with dense rewards from Process Advantage Verifiers (PAVs):

- 5-6x improvement in sample efficiency

- >6% gain in accuracy vs. ORMs

• 8x better Pass@N performance, raising the ceiling for test-time re-ranking

https://arxiv.org/abs/2410.08146

Original Problem 🔍:

Process reward models (PRMs) provide step-level feedback for multi-step reasoning, but collecting dense human annotations is not scalable. Automated PRMs have shown limited gains over outcome reward models (ORMs).

-----

Solution in this Paper 🛠️:

• Proposes process rewards that measure "progress": the change in the likelihood of reaching a correct final response before vs. after a step (see the sketch after this list)

• Computes these advantages under a separate "prover" policy, distinct from the base policy being improved

• Characterizes good provers as "complementary" to the base policy: their advantages meaningfully distinguish between the base policy's candidate steps

• Implements Process Advantage Verifiers (PAVs) to predict prover advantages

• Uses PAVs for test-time beam search and as dense rewards for online reinforcement learning
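
One way to read the first two bullets is as a Monte Carlo estimate of the prover's advantage: roll out the prover policy before and after a candidate step and compare the fraction of rollouts that reach a correct final answer. The sketch below is illustrative only; `prover.sample_completion` and `is_correct` are assumed interfaces, not the paper's code, and a trained PAV would amortize these rollouts into a single learned predictor.

```python
def estimate_value(prover, question, prefix_steps, is_correct, n_rollouts=8):
    """Monte Carlo estimate of V_prover(s): the fraction of prover completions
    from the current prefix that end in a correct final answer."""
    hits = 0
    for _ in range(n_rollouts):
        completion = prover.sample_completion(question, prefix_steps)  # assumed interface
        hits += int(is_correct(question, completion))                  # assumed checker
    return hits / n_rollouts


def progress_reward(prover, question, prefix_steps, new_step, is_correct, n_rollouts=8):
    """Prover advantage of `new_step`:
    A_prover(s, a) = Q_prover(s, a) - V_prover(s),
    i.e. how much the step changes the prover's chance of finishing correctly."""
    v_before = estimate_value(prover, question, prefix_steps, is_correct, n_rollouts)
    v_after = estimate_value(prover, question, prefix_steps + [new_step], is_correct, n_rollouts)
    return v_after - v_before
```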

-----

Key Insights from this Paper 💡:

• Progress-based rewards enable better exploration than absolute Q-values

• Complementary provers, even if weaker, can substantially improve stronger base policies

• Advantages from the prover policy boost RL sample efficiency by improving step-level exploration (see the sketch after this list)
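
To make the last bullet concrete, here is a minimal sketch of one way to turn PAV scores into a dense RL reward: each step is rewarded by its (weighted) predicted prover advantage, and the usual sparse outcome reward is kept at the final step. The function name and the weight `alpha` are illustrative assumptions, not the paper's exact recipe.

```python
def dense_rl_rewards(step_advantages, final_correct, alpha=1.0):
    """Convert a sparse outcome reward into per-step rewards for online RL.

    step_advantages : one PAV-predicted prover advantage per reasoning step
    final_correct   : whether the solution's final answer is correct
    alpha           : assumed weight on the progress term (illustrative)
    """
    rewards = [alpha * adv for adv in step_advantages]
    rewards[-1] += float(final_correct)  # outcome reward still arrives at the end
    return rewards


# Example: three steps, where the second step made negative progress.
print(dense_rl_rewards([0.25, -0.5, 0.5], final_correct=True))  # [0.25, -0.5, 1.5]
```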

-----

Results 📊:

• Test-time beam search with PAVs (a sketch follows this list):

- >8% more accurate than ORMs

- 1.5-5x more compute efficient
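
As a rough illustration of how a PAV can steer test-time search: expand each partial solution with a few sampled next steps, score the candidates with the PAV, and keep the best ones. This is a simplified sketch; `policy.sample_next_steps`, `policy.is_finished`, and `pav.score` are assumed interfaces, and a real implementation would also handle finished beams and final re-ranking.

```python
def pav_beam_search(question, policy, pav, beam_width=4, expansions=4, max_rounds=10):
    """Step-level beam search guided by a Process Advantage Verifier (PAV)."""
    beams = [[]]  # each beam is a list of reasoning steps so far
    for _ in range(max_rounds):
        candidates = []
        for steps in beams:
            for step in policy.sample_next_steps(question, steps, n=expansions):
                score = pav.score(question, steps, step)  # predicted prover advantage
                candidates.append((score, steps + [step]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [steps for _, steps in candidates[:beam_width]]
        if policy.is_finished(question, beams[0]):  # stop once the best beam is complete
            break
    return beams[0]
```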
