"Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06703
The paper addresses the challenge of improving LLM performance at inference time, specifically on reasoning tasks. Existing Test-Time Scaling (TTS) methods lack a systematic analysis across different policy models, reward models, and levels of problem difficulty.
This paper proposes a reward-aware compute-optimal TTS strategy. It emphasizes adapting TTS based on the policy model, Process Reward Model (PRM), and problem difficulty to maximize performance, particularly for smaller LLMs.
-----
📌 Reward-aware Test-Time Scaling directly addresses compute inefficiency. It allows smaller LLMs to achieve high reasoning performance by intelligently allocating resources at inference.
📌 Process Reward Model quality is paramount for effective Test-Time Scaling. PRM generalization across diverse policy models and tasks is the key limiting factor for robust reasoning enhancement.
📌 Compute-optimal Test-Time Scaling offers an immediate practical advantage. It presents a cost-effective alternative to solely relying on scaling model parameters for improved LLM reasoning.
----------
Methods Explored in this Paper 🔧:
→ This paper explores three Test-Time Scaling (TTS) methods: Best-of-N, Beam Search, and Diverse Verifier Tree Search (DVTS). A minimal Best-of-N sketch follows this list.
→ It introduces a reward-aware compute-optimal TTS strategy, which decides how to spend inference compute by taking the reward model into account alongside the policy model, compute budget, and prompt (see the strategy-selection sketch below).
→ The study defines problem difficulty by absolute Pass@1 accuracy thresholds: easy (50%-100%), medium (10%-50%), and hard (0%-10%), in contrast to the quantile-based difficulty levels used in prior work (see the binning sketch below).
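A minimal Best-of-N sketch with PRM-style scoring, in Python. The helpers `generate_candidates` and `score_steps` are hypothetical stand-ins for the policy model's sampler and a PRM step scorer; the min-over-steps aggregation is one common choice, not necessarily the paper's.

```python
# Minimal Best-of-N sketch (illustrative only, not the paper's implementation).
# `generate_candidates` and `score_steps` are hypothetical stand-ins for the
# policy model's sampler and a Process Reward Model (PRM) step scorer.
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], List[List[str]]],
              score_steps: Callable[[str, List[str]], List[float]],
              n: int = 8) -> List[str]:
    """Sample N candidate solutions (each a list of reasoning steps) and
    return the one with the highest aggregated PRM score."""
    candidates = generate_candidates(prompt, n)  # N full solutions

    def aggregate(steps: List[str]) -> float:
        # Score each step with the PRM and use the minimum step reward as the
        # solution score; last-step reward or product are other common choices.
        rewards = score_steps(prompt, steps)
        return min(rewards) if rewards else float("-inf")

    return max(candidates, key=aggregate)
```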
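Conceptually, the reward-aware compute-optimal strategy reduces to a sweep: for each policy model, PRM, and difficulty bin, pick the TTS method and sampling budget that score best on a validation set. A sketch of that selection loop, with a hypothetical `evaluate` function (not the paper's code):

```python
# Compute-optimal strategy selection as a grid sweep (illustrative sketch).
from itertools import product
from typing import Callable, Iterable, Tuple

def select_strategy(methods: Iterable[str],
                    budgets: Iterable[int],
                    evaluate: Callable[[str, int], float]) -> Tuple[str, int]:
    """Return the (method, budget) pair with the best dev-set accuracy.

    `evaluate(method, budget)` is a hypothetical stand-in that runs the given
    TTS method (e.g. "best_of_n", "beam_search", "dvts") at the given sampling
    budget with a fixed policy model and PRM, and returns accuracy. Repeating
    this sweep per policy model, PRM, and difficulty bin yields the
    per-setting strategy the paper argues for.
    """
    return max(product(methods, budgets),
               key=lambda cfg: evaluate(cfg[0], cfg[1]))
```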
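The absolute-threshold difficulty bins translate to a simple rule. A sketch, assuming Pass@1 is given as a fraction in [0, 1]; exact boundary handling is an assumption:

```python
def difficulty_level(pass_at_1: float) -> str:
    """Bin a problem by the policy model's absolute Pass@1 (fraction in [0, 1]):
    easy (50%-100%), medium (10%-50%), hard (0%-10%).
    Handling of the exact 0.1 and 0.5 boundaries is an assumption."""
    if pass_at_1 > 0.5:
        return "easy"
    if pass_at_1 > 0.1:
        return "medium"
    return "hard"
```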
-----
Key Insights 💡:
→ The compute-optimal TTS strategy is not universal. It is heavily dependent on the choice of policy model, PRM, and the difficulty of the problem.
→ Process Reward Models (PRMs) significantly influence TTS performance. The effectiveness of a PRM is linked to its process supervision ability.
→ Smaller LLMs, when using compute-optimal TTS, can outperform much larger LLMs and even state-of-the-art models on complex reasoning tasks.
-----
Results 📊:
→ A 3B-parameter LLM with compute-optimal TTS surpasses a 405B-parameter LLM on the MATH-500 and AIME24 benchmarks.
→ A 1.5B parameter LLM (DeepSeek-R1-Distill-Qwen-1.5B) with TTS outperforms o1-preview and o1-mini on MATH-500 and AIME24.
→ A 7B parameter LLM (DeepSeek-R1-Distill-Qwen-7B) with TTS beats o1 and DeepSeek-R1 on MATH-500 and AIME24.