Tulu 3 cracks open the black box of LLM post-training with fully transparent recipes and data
Tulu 3 introduces a comprehensive open-source post-training framework for LLMs whose resulting models match or exceed those trained with proprietary recipes. It provides complete training recipes, datasets, and evaluation tools for strengthening core capabilities like reasoning, math, and instruction following through a multi-stage pipeline that combines supervised finetuning, preference optimization, and reinforcement learning.
-----
https://arxiv.org/abs/2411.15124
🤔 Original Problem:
Open-source post-training methods lag behind proprietary ones due to limited transparency in training data and recipes, making it difficult to achieve comparable performance in core capabilities.
-----
🛠️ Solution in this Paper:
→ Tulu 3 implements a four-stage post-training pipeline starting with careful data curation targeting specific skills.
→ The supervised finetuning stage uses a mix of public and synthetic data optimized for core capabilities.
→ Direct Preference Optimization is then applied to on-policy preference data built by comparing completions from the model being trained against other models.
→ A novel Reinforcement Learning with Verifiable Rewards (RLVR) stage enhances specific skills by rewarding only outputs that match ground-truth outcomes (a minimal sketch follows this list).
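The core idea of RLVR is to replace a learned reward model with a direct check against a verifiable ground truth. Below is a minimal, illustrative Python sketch of that idea; the answer-extraction heuristic and function names are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a "verifiable reward": the policy earns reward 1.0 only
# when its answer can be checked against a known ground-truth outcome (e.g., the
# final number of a math problem). Helper names are hypothetical.
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last number-like token from a model completion (toy heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the ground truth, else 0.0."""
    return 1.0 if extract_final_answer(completion) == ground_truth.strip() else 0.0

# Example: this check stands in for a learned reward model inside an RL loop.
print(verifiable_reward("... so the total is 42.", "42"))  # -> 1.0
print(verifiable_reward("... the answer is 17.", "42"))    # -> 0.0
```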
-----
💡 Key Insights:
→ Targeted synthetic data significantly improves core skills while maintaining general capabilities
→ On-policy preference data generation, using completions from the model being trained, scales better than reusing existing off-policy preference sets (see the sketch after this list)
→ Verifiable rewards in RL outperform standard reward modeling approaches
→ Infrastructure scaling and efficient batch processing are crucial for 70B-parameter models
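For the on-policy preference point above, here is a hedged sketch of how a (chosen, rejected) pair could be assembled: sample several completions for a prompt from the current model, score them with a judge, and keep the best and worst. The `generate` and `judge_score` callables are hypothetical placeholders, not APIs from the Tulu 3 codebase.

```python
# Sketch of on-policy preference pair construction for the DPO stage.
from typing import Callable

def build_preference_pair(prompt: str,
                          generate: Callable[[str], str],
                          judge_score: Callable[[str, str], float],
                          num_samples: int = 4) -> dict:
    """Return a DPO-style example with chosen/rejected completions for one prompt."""
    completions = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(completions, key=lambda c: judge_score(prompt, c))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```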
-----
📊 Results:
→ Tulu 3 70B surpasses Llama 3.1 Instruct, Qwen 2.5, and Mistral on core benchmarks
→ Matches or exceeds closed models like GPT-4o-mini and Claude 3.5 Haiku
→ Achieves a 76% average score across diverse evaluation tasks