Smaller models can reason better when trained with higher learning rate to batch size ratios.
SmolTulu introduces a novel training approach for smaller language models by optimizing learning rate to batch size ratios, achieving better reasoning capabilities without increasing model size.
-----
https://arxiv.org/abs/2412.08347
🤔 Original Problem:
Training smaller language models to match the reasoning capabilities of larger models remains challenging, especially when applying techniques developed for larger architectures.
-----
🔧 Solution in this Paper:
→ SmolTulu adapts the Tulu 3 post-training pipeline specifically for smaller models by adjusting the learning rate to batch size ratio (see the sketch after this list)
→ Higher ratios are used for reasoning tasks like ARC and GSM8K, while lower ratios work better for pattern recognition tasks like HellaSwag
→ The training process involves supervised fine-tuning (SFT), direct preference optimization (DPO), and reward modeling, each with carefully tuned hyperparameters
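As a rough sketch of how this ratio could be exposed as the tuning knob, assuming a Hugging Face Trainer setup; the helper make_args, the ratio values, and the batch sizes below are illustrative assumptions, not the paper's reported hyperparameters:

```python
# Sketch only: derive TrainingArguments from a target learning-rate-to-batch-size ratio.
from transformers import TrainingArguments

def make_args(output_dir, lr_to_bs_ratio, per_device_bs=4, grad_accum=8, num_devices=1):
    """Build arguments so that learning_rate / effective_batch_size == lr_to_bs_ratio."""
    effective_bs = per_device_bs * grad_accum * num_devices
    return TrainingArguments(
        output_dir=output_dir,
        learning_rate=lr_to_bs_ratio * effective_bs,  # lr scales with the effective batch size
        per_device_train_batch_size=per_device_bs,
        gradient_accumulation_steps=grad_accum,
        lr_scheduler_type="linear",
        warmup_ratio=0.03,
    )

# Higher ratio for reasoning-heavy data (GSM8K/ARC-style),
# lower ratio for pattern-recognition-heavy data (HellaSwag-style).
reasoning_args = make_args("out/sft-reasoning", lr_to_bs_ratio=3e-5)
pattern_args = make_args("out/sft-pattern", lr_to_bs_ratio=5e-6)
```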
-----
🎯 Key Insights:
→ Smaller models require fundamentally different optimization strategies than their larger counterparts
→ Learning rate to batch size ratio significantly impacts model performance in a task-dependent manner
→ Higher ratios help compensate for limited model capacity in complex reasoning tasks (see the toy comparison below)
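To make the ratio concrete, a toy comparison (all numbers are assumed for illustration, not configurations from the paper):

```python
# Toy comparison of learning-rate-to-batch-size ratios (assumed values).
large_model = {"lr": 1e-5, "batch_size": 512}  # typical large-model recipe
small_model = {"lr": 3e-4, "batch_size": 32}   # higher-ratio small-model recipe

for name, cfg in (("large", large_model), ("small", small_model)):
    print(f"{name}: lr/bs = {cfg['lr'] / cfg['batch_size']:.2e}")
# large: lr/bs = 1.95e-08
# small: lr/bs = 9.38e-06  -> roughly 480x higher, i.e. larger updates per example,
# the kind of regime the paper argues benefits capacity-limited models on reasoning tasks.
```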
-----
📊 Results:
→ Achieved 67.7% on IFEval for instruction following (an 11-point improvement)
→ Scored 51.6% on GSM8K for mathematical reasoning (a 3.4-point improvement)
→ Reached 57.1% on ARC (a 5.4-point improvement)