"2 OLMo 2 Furious"

A podcast on this paper, generated with Google's Illuminate.

OLMo 2 sets a new standard for open-source releases. 🫡

It releases all artifacts: models, datasets, and the training/eval/data code. And the cherry on top: the wandb training logs.

Want to cook up a state-of-the-art LLM? OLMo 2 shares the complete recipe.

-----

https://arxiv.org/abs/2501.00656

🔧 Key methods in this Paper:

→ OLMo 2 introduces a two-stage training approach: pretraining on 4-5T tokens, then mid-training on the specialized Dolmino Mix 1124 dataset.

→ The architecture features improved stability through RMSNorm, reordered normalization, and QK-norm for attention computation.

→ A three-phase instruction tuning pipeline combines supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards.

→ The training infrastructure spans two clusters (Jupiter and Augusta) with optimized workload management through the Beaker system.
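The QK-norm idea from the architecture bullet can be sketched as follows. This is a minimal single-head NumPy illustration under my own simplifications (learned gains omitted, no multi-head reshaping); it is not the paper's actual implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of each row; learned gain omitted.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    # QK-norm: normalize queries and keys *before* the dot product,
    # which bounds attention logits and helps training stability.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_norm_attention(q, k, v)  # shape (4, 8)
```

Because the normalized q and k rows have roughly unit RMS, the pre-softmax logits stay bounded regardless of how large the raw projections grow during training.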

-----

💡 Key Insights:

→ Training stability improves significantly when repeated n-grams are filtered from the data and weights are initialized from a normal distribution

→ Mid-training with high-quality data effectively patches model capabilities

→ Model weight averaging (souping) consistently improves performance

→ Infrastructure optimization is crucial for successful LLM training
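The souping insight above boils down to a uniform average of parameters across checkpoints fine-tuned from the same initialization. A toy NumPy sketch (the `soup` helper and dict-of-arrays checkpoint format are illustrative assumptions, not OLMo 2's code):

```python
import numpy as np

def soup(state_dicts):
    # Model souping: element-wise uniform average of parameters across
    # checkpoints that share an initialization, producing one merged model.
    return {name: np.mean([sd[name] for sd in state_dicts], axis=0)
            for name in state_dicts[0]}

# Toy "checkpoints": one weight matrix each, filled with 0.0, 1.0, 2.0.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(3)]
averaged = soup(ckpts)  # averaged["w"] is all 1.0
```

Averaging costs no extra training compute, which is why it is an easy, consistent win on top of an existing set of fine-tuned checkpoints.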

-----

📊 Results:

→ 7B and 13B models match or outperform Llama 3.1 and Qwen 2.5 using fewer FLOPs

→ GSM8K scores: 67.5 for 7B, 75.1 for 13B

→ MMLU scores: 63.7 for 7B, 67.5 for 13B
