Great paper from @GoogleDeepMind on enhancing the reasoning of LLMs.
Training LLMs with a coaching system that rewards small steps of progress. ✨
Progress-based rewards from complementary provers enhance LLM reasoning via efficient exploration in search and RL.
Auxiliary policy feedback transforms sparse rewards into dense progress signals for LLM reasoning. ✨
• Online RL with dense rewards from Process Advantage Verifiers (PAVs):
- 5-6x improvement in sample efficiency
- >6% gain in accuracy vs. ORMs
• 8x better Pass@N performance, raising the ceiling for test-time re-ranking
https://arxiv.org/abs/2410.08146
Original Problem 🔍:
Process reward models (PRMs) provide step-level feedback for multi-step reasoning, but collecting dense human annotations is not scalable. Automated PRMs have shown limited gains over outcome reward models (ORMs).
-----
Solution in this Paper 🛠️:
• Proposes process rewards that measure "progress" - the change in likelihood of reaching a correct final response before vs. after a step (see the sketch after this list)
• Computes these advantages under a separate "prover" policy, distinct from the base policy being improved
• Characterizes good provers as "complementary" to the base policy - their advantages meaningfully distinguish between steps taken by the base policy
• Implements Process Advantage Verifiers (PAVs) to predict prover advantages
• Uses PAVs for test-time beam search and as dense rewards for online reinforcement learning
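A minimal sketch of the progress-reward idea, assuming hypothetical helpers `prover_policy.rollout(prefix)` (samples a full completion from the prover given a partial solution) and `is_correct(completion)` (checks the final answer); in the paper, a PAV is trained to predict these advantages so no rollouts are needed at inference time:

```python
# Illustrative sketch, not the authors' code: estimate the "progress" of a step
# as its advantage under a prover policy, via Monte Carlo rollouts.

def value_under_prover(prover_policy, is_correct, prefix, n_rollouts=16):
    """Estimate V_prover(prefix): the fraction of prover rollouts from this
    partial solution that end in a correct final answer."""
    hits = sum(is_correct(prover_policy.rollout(prefix)) for _ in range(n_rollouts))
    return hits / n_rollouts

def progress_reward(prover_policy, is_correct, prefix, step, n_rollouts=16):
    """Progress of `step` = its advantage under the prover:
    A_prover(prefix, step) = V_prover(prefix + step) - V_prover(prefix)."""
    return (value_under_prover(prover_policy, is_correct, prefix + step, n_rollouts)
            - value_under_prover(prover_policy, is_correct, prefix, n_rollouts))
```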
-----
Key Insights from this Paper 💡:
• Progress-based rewards enable better exploration vs. absolute Q-values
• Complementary provers, even if weaker, can substantially improve stronger base policies
• Advantages from the prover policy boost sample efficiency in RL by improving step-level exploration (see the sketch below)
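A hedged sketch of how PAV scores could be folded into online RL as dense per-step rewards on top of the sparse outcome reward; `pav.score(prefix, step)` and `alpha` are assumed names for illustration, not the paper's API:

```python
# Illustrative shaping of per-step RL rewards with a trained PAV.

def shaped_step_rewards(pav, steps, outcome_correct, alpha=1.0):
    """Return one reward per step: a dense PAV bonus for every step, plus the
    sparse 0/1 outcome reward credited on the final step."""
    rewards = []
    prefix = ""
    for i, step in enumerate(steps):
        dense = alpha * pav.score(prefix, step)                    # progress signal
        sparse = float(outcome_correct) if i == len(steps) - 1 else 0.0
        rewards.append(dense + sparse)
        prefix += step
    return rewards
```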
-----
Results 📊:
• Test-time beam search with PAVs (sketch after this list):
- >8% more accurate than ORMs
- 1.5-5x more compute-efficient
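For illustration, a rough sketch of PAV-guided step-level beam search; `policy.propose_steps`, `policy.is_finished`, and `pav.score` are assumed interfaces, not taken from the paper:

```python
# Illustrative step-level beam search that ranks candidate steps by PAV score.

def pav_beam_search(policy, pav, question, beam_width=4, expand_k=4, max_steps=10):
    beams = [(question, 0.0)]                       # (partial solution, cumulative score)
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            if policy.is_finished(prefix):
                candidates.append((prefix, score))  # keep completed solutions as-is
                continue
            for step in policy.propose_steps(prefix, expand_k):
                candidates.append((prefix + step, score + pav.score(prefix, step)))
        # keep only the highest-scoring partial solutions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                              # best-scoring solution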