"s1: Simple test-time scaling"
→ Replicating test-time scaling for reasoning in LLMs is complex.
→ Existing methods often involve reinforcement learning and large datasets.
The paper introduces budget forcing, a simple technique to control computation time during inference, and a curated dataset for efficient training.
-----
https://arxiv.org/abs/2501.19393
📌 Budget forcing offers a surprisingly simple yet effective method to achieve test-time scaling. It directly manipulates the LLM's inference process for controlled compute and performance gains, without complex RL.
📌 The s1K dataset demonstrates that high-quality, diverse, and difficult data is paramount for efficient reasoning fine-tuning. Only 1,000 carefully selected examples are sufficient for strong performance.
📌 The s1-32B model showcases the power of targeted supervised fine-tuning and budget forcing. This combination allows open-source models to approach closed-source performance with minimal training data and compute.
-----
Methods in this Paper 🔧:
→ Budget forcing controls the thinking duration of an LLM.
→ It either forcefully terminates reasoning, by appending the end-of-thinking token delimiter, or lengthens it.
→ Lengthening is achieved by suppressing the end-of-thinking token and appending "Wait", which encourages the model to keep reasoning.
→ A small dataset s1K of 1,000 high-quality reasoning examples was created.
→ Supervised finetuning on s1K with budget forcing creates the s1-32B model.
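In pseudocode terms, the budget-forcing loop above can be sketched roughly as follows. The `ToyModel` stand-in, the `</think>` delimiter, and the token-level API are illustrative assumptions, not the paper's actual implementation:

```python
class ToyModel:
    """Toy stand-in for an LLM decoder: replays a fixed token stream (assumption)."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.i = 0

    def next_token(self, _context):
        tok = self.tokens[min(self.i, len(self.tokens) - 1)]
        self.i += 1
        return tok


def budget_forced_generate(model, prompt, min_think, max_think,
                           end_think="</think>", wait=" Wait"):
    """Force the model's thinking phase to land inside [min_think, max_think] tokens."""
    out = [prompt]
    n = 0  # thinking tokens emitted so far
    while n < max_think:
        tok = model.next_token("".join(out))
        if tok == end_think:
            if n < min_think:
                # Model tried to stop too early: suppress the end-of-thinking
                # token and append "Wait" to encourage further reasoning.
                out.append(wait)
                n += 1
                continue
            break  # budget satisfied, let thinking end naturally
        out.append(tok)
        n += 1
    # Close the thinking phase (forcefully if the budget ran out).
    out.append(end_think)
    return "".join(out)
```

For example, a model that tries to stop after two thinking tokens gets a "Wait" appended and continues, while a model that runs long is cut off at `max_think`.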
-----
Key Insights 💡:
→ Training on only 1,000 carefully selected samples is sufficient for strong reasoning.
→ Budget forcing enables effective test-time scaling with simple next-token prediction training.
→ Careful data selection based on difficulty, diversity, and quality is crucial for sample efficiency.
→ Sequential scaling methods like budget forcing outperform parallel methods such as majority voting for reasoning.
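The difficulty/diversity/quality selection behind s1K can be sketched as a three-stage filter. The field names (`well_formatted`, `baseline_solved`, `trace_len`, `domain`) are hypothetical placeholders for the paper's actual criteria:

```python
import random

def select_s1k(pool, k=1000, seed=0):
    """Hedged sketch of three-stage data selection: quality, difficulty, diversity.

    `pool` is a list of dicts; all field names are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Stage 1 (quality): drop malformed or badly formatted samples.
    pool = [ex for ex in pool if ex["well_formatted"]]
    # Stage 2 (difficulty): keep questions baseline models get wrong,
    # using reasoning-trace length as a rough proxy for hardness.
    pool = [ex for ex in pool if not ex["baseline_solved"]]
    pool.sort(key=lambda ex: ex["trace_len"], reverse=True)
    # Stage 3 (diversity): sample across domains so no topic dominates.
    by_domain = {}
    for ex in pool:
        by_domain.setdefault(ex["domain"], []).append(ex)
    selected, domains = [], list(by_domain)
    while len(selected) < k and domains:
        d = rng.choice(domains)
        selected.append(by_domain[d].pop(0))  # hardest remaining in that domain
        if not by_domain[d]:
            domains.remove(d)
    return selected[:k]
```

The key design point the paper's ablations support: each filter matters, and randomly sampling 1,000 examples without them performs noticeably worse.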
-----
Results 📊:
→ s1-32B exceeds o1-preview by up to 27% on competition math questions.
→ s1-32B achieves 57% on AIME24 through test-time scaling, up from 50% without intervention.
→ Budget forcing achieves 100% controllability of test-time compute under the paper's control metric, i.e., every run's thinking length stays within the specified budget.
→ s1-32B is the most sample-efficient open data reasoning model compared to models like r1-distill and Bespoke-Stratos.
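The controllability result can be read as a simple fraction-of-runs metric. This is a hedged simplification of the paper's control metric; the bounds-per-run representation below is an assumption:

```python
def control_metric(token_counts, budgets):
    """Fraction of runs whose thinking-token count lands inside its
    pre-specified (lower, upper) budget bounds. A sketch, not the
    paper's exact formula."""
    ok = sum(1 for n, (lo, hi) in zip(token_counts, budgets) if lo <= n <= hi)
    return ok / len(token_counts)
```

A method scores 100% when every run respects its budget, which is what budget forcing guarantees by construction: it can always cut thinking short or extend it.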