A Pairwise Reward Model outperforms traditional reward models for LLM best-of-N sampling by up to 60% on hard math problems.
Traditional reward models for best-of-N sampling assign arbitrary, inconsistent absolute scores, which undermines reliable candidate selection. This paper proposes a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament. Instead of scoring each solution in isolation, the Pairwise RM evaluates two candidate solutions side by side and judges which one is correct.
-----
Paper - https://arxiv.org/abs/2501.13007
Original Problem 🤔:
→ Reward models in best-of-N sampling for LLMs often provide inconsistent scores.
→ This limits their ability to reliably select the best solution among multiple generations.
-----
Solution in this Paper 💡:
→ The paper proposes a Pairwise Reward Model.
→ This model compares two candidate solutions simultaneously to determine which is correct.
→ This eliminates the need for assigning arbitrary absolute scores.
→ The model is combined with a knockout tournament for best-of-N sampling.
→ Candidate solutions are paired up and compared round by round (see the sketch below).
→ Solutions judged incorrect are eliminated until a single winner remains.
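Here is a minimal sketch of how such a knockout tournament could be wired up. It assumes a `pairwise_judge` callable that wraps the Pairwise RM and returns whichever of two candidate solutions it judges correct; the function names, random bracket seeding, and bye handling are illustrative, not details from the paper.

```python
import random
from typing import Callable, List

def knockout_best_of_n(
    candidates: List[str],
    pairwise_judge: Callable[[str, str], str],
) -> str:
    """Single-elimination tournament over N candidate solutions.

    `pairwise_judge(a, b)` is assumed to prompt the Pairwise RM with the
    problem plus both solutions and return the one it judges correct.
    """
    pool = list(candidates)
    random.shuffle(pool)  # random initial bracket (illustrative choice)
    while len(pool) > 1:
        next_round = []
        # Compare candidates in pairs; the judged winner advances.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(pairwise_judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate gets a bye
        pool = next_round
    return pool[0]
```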
-----
Key Insights from this Paper 🔑:
→ Pairwise comparison of solutions offers a more robust evaluation than assigning absolute scores.
→ This approach enables cross-validation and improves selection reliability.
→ A knockout tournament provides an efficient way to perform best-of-N sampling with pairwise comparisons.
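→ As a rough sense of scale (simple counting, not a figure from the paper): a single-elimination bracket over N candidates needs only N − 1 pairwise comparisons, versus N(N − 1)/2 for comparing every pair, e.g. 15 instead of 120 comparisons for N = 16.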
-----
Results ✅:
→ On the challenging portion of MATH-500 (top 50% by difficulty), Pairwise RM improves over baselines by 40% to 60%.
→ It outperforms existing discriminative and generative reward models on the MATH-500 and OlympiadBench datasets.