"s1: Simple test-time scaling"
→ Replicating test-time scaling for reasoning in LLMs is complex.
→ Existing methods often involve reinforcement learning and large datasets.
The paper introduces budget forcing, a simple technique to control computation time during inference, and a curated dataset for efficient training.
-----
https://arxiv.org/abs/2501.19393
→ Budget forcing offers a surprisingly simple yet effective method to achieve test-time scaling. It directly manipulates the LLM's inference process for controlled compute and performance gains, without complex RL.
→ The s1K dataset demonstrates that high-quality, diverse, and difficult data is paramount for efficient reasoning fine-tuning. Only 1,000 carefully selected examples are sufficient for strong performance.
→ The s1-32B model showcases the power of targeted supervised fine-tuning and budget forcing. This combination allows open-source models to approach closed-source performance with minimal training data and compute.
-----
Methods in this Paper:
→ Budget forcing controls the thinking duration of an LLM.
→ It forcefully terminates or lengthens the model's reasoning process.
→ Lengthening is achieved by appending "Wait" to encourage further thinking.
→ A small dataset, s1K, of 1,000 high-quality reasoning examples was created.
→ Supervised fine-tuning on s1K with budget forcing creates the s1-32B model.
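At its core, budget forcing is just an intervention on the decoding loop: cut thinking off once a token budget is spent, or suppress the end-of-thinking marker and append "Wait" to keep the model reasoning. A toy sketch in plain Python (the token generator, the `"</think>"` stop marker, and all names here are illustrative stand-ins, not the paper's code):

```python
def budget_force(generate_token, max_think=8, min_think=0, wait_token="Wait"):
    """Collect reasoning tokens from a model, truncating the trace at
    max_think tokens and, if the model tries to stop before min_think,
    appending wait_token to force further thinking."""
    trace = []
    while True:
        tok = generate_token(trace)
        if tok == "</think>":
            if len(trace) < min_think:
                trace.append(wait_token)  # suppress the stop; keep thinking
                continue
            break
        trace.append(tok)
        if len(trace) >= max_think:
            break  # budget spent: forcefully end the thinking phase
    return trace
```

Per the paper, the appended "Wait" often prompts the model to double-check its earlier steps, which is where the extra test-time compute pays off.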
-----
Key Insights:
→ Training on only 1,000 carefully selected samples is sufficient for strong reasoning.
→ Budget forcing enables effective test-time scaling with simple next-token prediction training.
→ Careful data selection based on difficulty, diversity, and quality is crucial for sample efficiency.
→ Sequential scaling methods like budget forcing are more effective than parallel methods for reasoning.
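For contrast, the parallel-scaling baseline the last point refers to can be as simple as majority voting (self-consistency) over independent samples; a generic sketch under that assumption, not the paper's code:

```python
from collections import Counter

def majority_vote(sample_answer, n=5):
    """Parallel test-time scaling: draw n independent answers and return the
    most frequent one. More compute means more samples, not longer thinking."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Sequential methods like budget forcing instead spend the extra compute deepening a single reasoning trace, which the paper finds scales better for reasoning tasks.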
-----
Results:
→ s1-32B exceeds o1-preview by up to 27% on competition math questions.
→ s1-32B achieves 57% on AIME24 through test-time scaling, up from 50% without intervention.
→ Budget forcing provides 100% controllability over test-time compute on the paper's control metric.
→ s1-32B is the most sample-efficient open-data reasoning model compared to models like r1-distill and Bespoke-Stratos.


