OLMo 2 has set a new standard for open-source releases. 🫡
The team released every artifact: models, datasets, and the full training/eval/data code. And the cherry on top: the wandb logs.
So you want to cook up a state-of-the-art LLM? OLMo 2 shares the complete recipe.
-----
https://arxiv.org/abs/2501.00656
🔧 Key methods in this paper:
→ OLMo 2 introduces a two-stage training approach: pretraining on 4T tokens (7B) or 5T tokens (13B), followed by mid-training on the specialized Dolmino Mix 1124 dataset.
→ The architecture improves training stability through RMSNorm, reordered normalization (each sublayer's output is normalized before the residual add), and QK-norm on the attention queries and keys; see the first sketch after this list.
→ A three-phase instruction-tuning pipeline combines supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR); see the second sketch below.
→ The training infrastructure spans two clusters (Jupiter and Augusta) with optimized workload management through the Beaker system.
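To make the stability tweaks concrete, here's a minimal PyTorch sketch of a transformer block with RMSNorm, reordered (output) normalization, and QK-norm. This is illustrative, not OLMo 2's actual code: the module names, the 4x FFN width, and applying QK-norm to the full projection before the head split are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class QKNormAttention(nn.Module):
    """Self-attention with QK-norm: RMSNorm on queries and keys before
    the dot product, which bounds attention logits and aids stability."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = RMSNorm(dim)
        self.k_norm = RMSNorm(dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.q_norm(q), self.k_norm(k)  # QK-norm
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Reordered norms: each sublayer's *output* is normalized before
    the residual add, instead of pre-norm on the sublayer's input."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = QKNormAttention(dim, n_heads)
        self.attn_norm = RMSNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim, bias=False),
            nn.SiLU(),
            nn.Linear(4 * dim, dim, bias=False),
        )
        self.ffn_norm = RMSNorm(dim)

    def forward(self, x):
        x = x + self.attn_norm(self.attn(x))  # norm after attention
        x = x + self.ffn_norm(self.ffn(x))    # norm after feed-forward
        return x
```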
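And a sketch of the post-training objectives: the standard DPO loss, plus a toy verifiable-reward check of the kind RLVR builds on. Function names, the exact-match verifier, and beta=0.1 are placeholder choices, not taken from the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: raise the policy's margin for the
    chosen response over the rejected one, measured against a frozen
    reference model (log-probs are summed over response tokens)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """RLVR-style binary reward: 1.0 only when the answer can be checked
    programmatically (exact string match as a stand-in verifier here)."""
    return float(model_answer.strip() == gold_answer.strip())
```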
-----
💡 Key Insights:
→ Training stability improves significantly when documents containing heavily repeated n-grams are filtered out and weights are initialized from a simple normal distribution (filter sketched after this list)
→ Mid-training on high-quality data effectively patches gaps in model capabilities
→ Model weight averaging ("souping") consistently improves performance; a sketch follows this list
→ Infrastructure optimization is crucial for successful LLM training
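A rough sketch of the repeated-n-gram filter from the first insight above. The thresholds (n-grams up to 13 tokens, repeated at least 32 times consecutively) are my reading of the paper's stability analysis, so treat them as illustrative:

```python
def has_repeated_ngram(tokens, max_n=13, min_repeats=32):
    """Flag a document if any n-gram (n = 1..max_n) repeats back-to-back
    at least `min_repeats` times; such documents correlate with loss spikes."""
    for n in range(1, max_n + 1):
        for start in range(n):  # try every alignment of the n-gram grid
            run = 1
            for i in range(start + n, len(tokens) - n + 1, n):
                if tokens[i:i + n] == tokens[i - n:i]:
                    run += 1
                    if run >= min_repeats:
                        return True
                else:
                    run = 1
    return False
```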
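And model souping is just a uniform average of checkpoint weights. A minimal sketch, assuming all checkpoints share one architecture (the usage below is hypothetical):

```python
import torch

def soup(state_dicts):
    """Uniform 'model soup': element-wise average of checkpoint weights."""
    return {
        k: torch.stack([sd[k].float() for sd in state_dicts])
             .mean(dim=0)
             .to(state_dicts[0][k].dtype)
        for k in state_dicts[0]
    }

# Hypothetical usage: average checkpoints from runs that differ only in
# data order / random seed, then load the soup into a single model.
# souped = soup([torch.load(p, map_location="cpu") for p in ckpt_paths])
# model.load_state_dict(souped)
```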
-----
📊 Results:
→ The 7B and 13B models match or outperform Llama 3.1 and Qwen 2.5 at comparable scale while using fewer training FLOPs
→ GSM8K scores: 67.5 for 7B, 75.1 for 13B
→ MMLU scores: 63.7 for 7B, 67.5 for 13B