"s1: Simple test-time scaling"
→ Replicating test-time scaling for reasoning in LLMs is complex.
→ Existing methods often involve reinforcement learning and large datasets.
The paper introduces budget forcing, a simple technique to control computation time during inference, and a curated dataset for efficient training.
-----
https://arxiv.org/abs/2501.19393
📌 Budget forcing offers a surprisingly simple yet effective method to achieve test-time scaling. It directly manipulates the LLM's inference process for controlled compute and performance gains, without complex RL.
📌 The s1K dataset demonstrates that high-quality, diverse, and difficult data is paramount for efficient reasoning fine-tuning. Only 1,000 carefully selected examples are sufficient for strong performance.
📌 The s1-32B model showcases the power of targeted supervised fine-tuning and budget forcing. This combination allows open-source models to approach closed-source performance with minimal training data and compute.
-----
Methods in this Paper 🔧:
→ Budget forcing controls the thinking duration of an LLM.
→ It either forcefully terminates reasoning, by appending the end-of-thinking token delimiter, or lengthens it.
→ Lengthening is achieved by suppressing the end-of-thinking token and appending "Wait", which encourages the model to keep reasoning.
→ A small dataset s1K of 1,000 high-quality reasoning examples was created.
→ Supervised finetuning on s1K with budget forcing creates the s1-32B model.
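In pseudocode terms, the budget-forcing loop above can be sketched roughly as follows. The `ToyModel` stand-in, the `</think>` delimiter, and the token-level API are illustrative assumptions, not the paper's actual implementation:

```python
class ToyModel:
    """Toy stand-in for an LLM decoder: replays a fixed token stream (assumption)."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.i = 0

    def next_token(self, _context):
        tok = self.tokens[min(self.i, len(self.tokens) - 1)]
        self.i += 1
        return tok


def budget_forced_generate(model, prompt, min_think, max_think,
                           end_think="</think>", wait=" Wait"):
    """Force the model's thinking phase to land inside [min_think, max_think] tokens."""
    out = [prompt]
    n = 0  # thinking tokens emitted so far
    while n < max_think:
        tok = model.next_token("".join(out))
        if tok == end_think:
            if n < min_think:
                # Model tried to stop too early: suppress the end-of-thinking
                # token and append "Wait" to encourage further reasoning.
                out.append(wait)
                n += 1
                continue
            break  # budget satisfied, let thinking end naturally
        out.append(tok)
        n += 1
    # Close the thinking phase (forcefully if the budget ran out).
    out.append(end_think)
    return "".join(out)
```

For example, a model that tries to stop after two thinking tokens gets a "Wait" appended and continues, while a model that runs long is cut off at `max_think`.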
-----
Key Insights 💡:
→ Training on only 1,000 carefully selected samples is sufficient for strong reasoning.
→ Budget forcing enables effective test-time scaling with simple next-token prediction training.
→ Careful data selection based on difficulty, diversity, and quality is crucial for sample efficiency.
→ Sequential scaling methods like budget forcing outperform parallel methods such as majority voting for reasoning.
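The difficulty/diversity/quality selection behind s1K can be sketched as a three-stage filter. The field names (`well_formatted`, `baseline_solved`, `trace_len`, `domain`) are hypothetical placeholders for the paper's actual criteria:

```python
import random

def select_s1k(pool, k=1000, seed=0):
    """Hedged sketch of three-stage data selection: quality, difficulty, diversity.

    `pool` is a list of dicts; all field names are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Stage 1 (quality): drop malformed or badly formatted samples.
    pool = [ex for ex in pool if ex["well_formatted"]]
    # Stage 2 (difficulty): keep questions baseline models get wrong,
    # using reasoning-trace length as a rough proxy for hardness.
    pool = [ex for ex in pool if not ex["baseline_solved"]]
    pool.sort(key=lambda ex: ex["trace_len"], reverse=True)
    # Stage 3 (diversity): sample across domains so no topic dominates.
    by_domain = {}
    for ex in pool:
        by_domain.setdefault(ex["domain"], []).append(ex)
    selected, domains = [], list(by_domain)
    while len(selected) < k and domains:
        d = rng.choice(domains)
        selected.append(by_domain[d].pop(0))  # hardest remaining in that domain
        if not by_domain[d]:
            domains.remove(d)
    return selected[:k]
```

The key design point the paper's ablations support: each filter matters, and randomly sampling 1,000 examples without them performs noticeably worse.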
-----
Results 📊:
→ s1-32B exceeds o1-preview by up to 27% on competition math questions.
→ s1-32B achieves 57% on AIME24 through test-time scaling, up from 50% without intervention.
→ Budget forcing achieves 100% controllability of test-time compute under the paper's control metric, i.e., every run's thinking length stays within the specified budget.
→ s1-32B is the most sample-efficient open data reasoning model compared to models like r1-distill and Bespoke-Stratos.
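The controllability result can be read as a simple fraction-of-runs metric. This is a hedged simplification of the paper's control metric; the bounds-per-run representation below is an assumption:

```python
def control_metric(token_counts, budgets):
    """Fraction of runs whose thinking-token count lands inside its
    pre-specified (lower, upper) budget bounds. A sketch, not the
    paper's exact formula."""
    ok = sum(1 for n, (lo, hi) in zip(token_counts, budgets) if lo <= n <= hi)
    return ok / len(token_counts)
```

A method scores 100% when every run respects its budget, which is what budget forcing guarantees by construction: it can always cut thinking short or extend it.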