"SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs"

The podcast on this paper is generated with Google's Illuminate.

Smaller models can reason better when trained with higher learning rate to batch size ratios.

SmolTulu introduces a novel training approach for smaller language models by optimizing learning rate to batch size ratios, achieving better reasoning capabilities without increasing model size.
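
The central knob is simply the learning rate divided by the effective batch size. Below is a minimal sketch of that quantity; the values are illustrative placeholders, not the paper's published hyperparameters.

```python
# Learning-rate-to-batch-size ratio -- the quantity SmolTulu tunes per task.
# All values below are illustrative, not the paper's exact settings.
learning_rate = 3.0e-4                 # peak learning rate
per_device_batch = 4
grad_accumulation = 2
num_devices = 1

# Effective batch size = per-device batch * gradient accumulation * devices
effective_batch = per_device_batch * grad_accumulation * num_devices

ratio = learning_rate / effective_batch
print(f"effective batch = {effective_batch}, lr/bs ratio = {ratio:.2e}")
```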

-----

https://arxiv.org/abs/2412.08347

🤔 Original Problem:

Training smaller language models to match the reasoning capabilities of larger models remains challenging, especially when applying techniques developed for larger architectures.

-----

🔧 Solution in this Paper:

→ SmolTulu adapts the Tulu 3 post-training pipeline specifically for smaller models by adjusting learning rate to batch size ratios

→ Higher ratios are used for reasoning tasks like ARC and GSM8K, while lower ratios work better for pattern recognition tasks like HellaSwag

→ The training process involves supervised fine-tuning (SFT), direct preference optimization (DPO), and reward modeling, each with carefully tuned hyperparameters (see the config sketch after this list)
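
A rough sketch of what per-stage configuration might look like. The stage names follow the pipeline above, but every number here is an illustrative placeholder rather than the paper's reported settings.

```python
# Per-stage hyperparameter sketch for a small model (illustrative placeholders only).
# Each post-training stage carries its own learning-rate-to-batch-size ratio,
# which SmolTulu tunes separately instead of reusing Tulu 3's larger-model defaults.
stages = {
    "sft": {"learning_rate": 3.0e-4, "effective_batch_size": 8},
    "dpo": {"learning_rate": 8.0e-7, "effective_batch_size": 12},
}

for name, cfg in stages.items():
    ratio = cfg["learning_rate"] / cfg["effective_batch_size"]
    print(f"{name}: lr={cfg['learning_rate']:.1e} "
          f"bs={cfg['effective_batch_size']} ratio={ratio:.2e}")
```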

-----

🎯 Key Insights:

→ Smaller models require fundamentally different optimization strategies than their larger counterparts

→ The learning rate to batch size ratio significantly impacts model performance, and it does so in a task-dependent manner (see the sweep sketch after this list)

→ Higher ratios help compensate for limited model capacity in complex reasoning tasks
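
To make the task-dependent selection concrete, here is a hedged sketch of a per-task ratio sweep. The `evaluate` function is a stub standing in for an actual fine-tune-and-benchmark run, and the candidate values are made up.

```python
# Hypothetical per-task sweep over learning-rate-to-batch-size ratios.
# evaluate() is a stub: in practice it would fine-tune with (lr, bs) and
# return the benchmark score (e.g. GSM8K or HellaSwag accuracy).
from itertools import product

CANDIDATE_LRS = [1e-5, 9e-5, 3e-4]
CANDIDATE_BATCH_SIZES = [4, 8, 32]

def evaluate(task: str, lr: float, bs: int) -> float:
    return 0.0  # placeholder score; replace with a real training + eval run

def best_ratio_for(task: str) -> float:
    # The paper's finding: reasoning tasks (ARC, GSM8K) tend to prefer the
    # higher lr/bs ratios, pattern-recognition tasks (HellaSwag) the lower ones.
    best_lr, best_bs = max(
        product(CANDIDATE_LRS, CANDIDATE_BATCH_SIZES),
        key=lambda pair: evaluate(task, pair[0], pair[1]),
    )
    return best_lr / best_bs

print(best_ratio_for("gsm8k"))
```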

-----

📊 Results:

→ Achieved 67.7% on IFEval for instruction following (an 11-point gain over the SmolLM2-1.7B-Instruct baseline)

→ Scored 51.6% on GSM8K for mathematical reasoning (a 3.4-point gain)

→ Reached 57.1% on ARC (a 5.4-point gain)
