"Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06703
The paper addresses the challenge of improving LLM performance at inference time, specifically on reasoning tasks. Existing Test-Time Scaling (TTS) methods lack a systematic analysis across different policy models, reward models, and levels of problem difficulty.
This paper proposes a reward-aware compute-optimal TTS strategy. It emphasizes adapting TTS based on the policy model, Process Reward Model (PRM), and problem difficulty to maximize performance, particularly for smaller LLMs.
-----
📌 Reward-aware Test-Time Scaling directly addresses compute inefficiency. It allows smaller LLMs to achieve high reasoning performance by intelligently allocating resources at inference.
📌 Process Reward Model quality is paramount for effective Test-Time Scaling. PRM generalization across diverse policy models and tasks is the key limiting factor for robust reasoning enhancement.
📌 Compute-optimal Test-Time Scaling offers an immediate practical advantage. It presents a cost-effective alternative to solely relying on scaling model parameters for improved LLM reasoning.
----------
Methods Explored in this Paper 🔧:
→ This paper explores three Test-Time Scaling (TTS) methods: Best-of-N, Beam Search, and Diverse Verifier Tree Search (DVTS). A minimal Best-of-N sketch follows this list.
→ It introduces a reward-aware compute-optimal TTS strategy, which decides how to spend inference compute by taking the reward model into account alongside the policy model, compute budget, and prompt (see the strategy-selection sketch below).
→ The study defines problem difficulty by absolute Pass@1 accuracy thresholds: easy (50%-100%), medium (10%-50%), and hard (0%-10%), in contrast to the quantile-based difficulty levels used in prior work (see the binning sketch below).
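A minimal Best-of-N sketch with PRM-style scoring, in Python. The helpers `generate_candidates` and `score_steps` are hypothetical stand-ins for the policy model's sampler and a PRM step scorer; the min-over-steps aggregation is one common choice, not necessarily the paper's.

```python
# Minimal Best-of-N sketch (illustrative only, not the paper's implementation).
# `generate_candidates` and `score_steps` are hypothetical stand-ins for the
# policy model's sampler and a Process Reward Model (PRM) step scorer.
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], List[List[str]]],
              score_steps: Callable[[str, List[str]], List[float]],
              n: int = 8) -> List[str]:
    """Sample N candidate solutions (each a list of reasoning steps) and
    return the one with the highest aggregated PRM score."""
    candidates = generate_candidates(prompt, n)  # N full solutions

    def aggregate(steps: List[str]) -> float:
        # Score each step with the PRM and use the minimum step reward as the
        # solution score; last-step reward or product are other common choices.
        rewards = score_steps(prompt, steps)
        return min(rewards) if rewards else float("-inf")

    return max(candidates, key=aggregate)
```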
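Conceptually, the reward-aware compute-optimal strategy reduces to a sweep: for each policy model, PRM, and difficulty bin, pick the TTS method and sampling budget that score best on a validation set. A sketch of that selection loop, with a hypothetical `evaluate` function (not the paper's code):

```python
# Compute-optimal strategy selection as a grid sweep (illustrative sketch).
from itertools import product
from typing import Callable, Iterable, Tuple

def select_strategy(methods: Iterable[str],
                    budgets: Iterable[int],
                    evaluate: Callable[[str, int], float]) -> Tuple[str, int]:
    """Return the (method, budget) pair with the best dev-set accuracy.

    `evaluate(method, budget)` is a hypothetical stand-in that runs the given
    TTS method (e.g. "best_of_n", "beam_search", "dvts") at the given sampling
    budget with a fixed policy model and PRM, and returns accuracy. Repeating
    this sweep per policy model, PRM, and difficulty bin yields the
    per-setting strategy the paper argues for.
    """
    return max(product(methods, budgets),
               key=lambda cfg: evaluate(cfg[0], cfg[1]))
```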
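The absolute-threshold difficulty bins translate to a simple rule. A sketch, assuming Pass@1 is given as a fraction in [0, 1]; exact boundary handling is an assumption:

```python
def difficulty_level(pass_at_1: float) -> str:
    """Bin a problem by the policy model's absolute Pass@1 (fraction in [0, 1]):
    easy (50%-100%), medium (10%-50%), hard (0%-10%).
    Handling of the exact 0.1 and 0.5 boundaries is an assumption."""
    if pass_at_1 > 0.5:
        return "easy"
    if pass_at_1 > 0.1:
        return "medium"
    return "hard"
```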
-----
Key Insights 💡:
→ The compute-optimal TTS strategy is not universal. It is heavily dependent on the choice of policy model, PRM, and the difficulty of the problem.
→ Process Reward Models (PRMs) significantly influence TTS performance. The effectiveness of a PRM is linked to its process supervision ability.
→ Smaller LLMs, when using compute-optimal TTS, can outperform much larger LLMs and even state-of-the-art models on complex reasoning tasks.
-----
Results 📊:
→ A 3B-parameter LLM with compute-optimal TTS surpasses a 405B-parameter LLM on the MATH-500 and AIME24 benchmarks.
→ A 1.5B parameter LLM (DeepSeek-R1-Distill-Qwen-1.5B) with TTS outperforms o1-preview and o1-mini on MATH-500 and AIME24.
→ A 7B parameter LLM (DeepSeek-R1-Distill-Qwen-7B) with TTS beats o1 and DeepSeek-R1 on MATH-500 and AIME24.