A new way to optimally upcycle dense LLMs into sparse mixture-of-experts (MoE) models boosts performance without full retraining.
📚 https://arxiv.org/abs/2410.07524
Original Problem 🔍:
Upcycling pre-trained dense LLMs into sparse mixture-of-experts (MoE) models efficiently increases model capacity without training from scratch. However, optimal techniques for upcycling at scale remain unclear.
-----
Solution in this Paper 🛠️:
• Proposes "virtual group" initialization for fine-grained MoE upcycling (see the initialization sketch after this list)
• Introduces weight scaling approach for both coarse and fine-grained MoEs
• Compares softmax-then-topK vs topK-then-softmax expert routing
• Assesses benefits of higher granularity MoEs and higher topK values
• Provides training recipes for billion-parameter scale LLM upcycling
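To make the initialization idea concrete, here is a minimal PyTorch-style sketch of how a dense FFN's weights could be copied and sharded into an expert pool, with an optional scaling factor on the down-projection. The function name `upcycle_ffn_to_moe`, its arguments, and the exact grouping order are illustrative assumptions, not the paper's implementation.

```python
import torch

def upcycle_ffn_to_moe(dense_up: torch.Tensor,
                       dense_down: torch.Tensor,
                       num_experts: int = 8,
                       granularity: int = 1,
                       weight_scale: float = 1.0):
    """Illustrative upcycling of a dense FFN into MoE expert weights.

    dense_up:   [d_ff, d_model] up-projection weight of the dense MLP
    dense_down: [d_model, d_ff] down-projection weight of the dense MLP

    Coarse upcycling (granularity=1): every expert is a full copy of the dense FFN.
    Fine-grained upcycling (granularity=G): the FFN hidden dimension is split into
    G shards, each shard becoming a smaller expert; the shards of one copy form a
    "virtual group" that together reproduces the original dense FFN.
    """
    d_ff, d_model = dense_up.shape
    assert d_ff % granularity == 0
    shard = d_ff // granularity

    experts = []
    for e in range(num_experts):          # replicate the dense FFN num_experts times
        for g in range(granularity):      # split each copy into G fine-grained experts
            up = dense_up[g * shard:(g + 1) * shard, :].clone()
            down = dense_down[:, g * shard:(g + 1) * shard].clone()
            # Weight scaling (illustrative): scale the down-projection so the
            # upcycled model's initial output stays close to the dense model's
            # output after the router's gate values are applied.
            experts.append((up, weight_scale * down))
    return experts  # len == num_experts * granularity

# Toy usage: a 64x16 dense FFN upcycled into 8 experts at granularity 8
# (64 fine-grained experts in total).
up = torch.randn(64, 16)
down = torch.randn(16, 64)
experts = upcycle_ffn_to_moe(up, down, num_experts=8, granularity=8)
assert len(experts) == 64
```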
-----
Key Insights from this Paper 💡:
• Upcycling outperforms continued dense model training
• Softmax-then-topK routing improves over topK-then-softmax (see the routing sketch after this list)
• Higher granularity MoEs can boost accuracy but require careful tuning
• Resetting learning rate to peak pre-training levels improves upcycling
• Larger batch sizes (up to a point) improve convergence and efficiency
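The two routing orders compared above differ in whether the softmax is taken over all experts or only over the selected ones. The sketch below contrasts them; the function names are illustrative, not the paper's router code.

```python
import torch
import torch.nn.functional as F

def route_softmax_then_topk(logits: torch.Tensor, k: int):
    """Softmax over ALL experts first, then keep the top-k probabilities.
    The kept gate values generally do not sum to 1, which is why a
    weight-scaling correction can matter for upcycled models."""
    probs = F.softmax(logits, dim=-1)
    gates, idx = torch.topk(probs, k, dim=-1)
    return gates, idx

def route_topk_then_softmax(logits: torch.Tensor, k: int):
    """Select the top-k logits first, then softmax only over those k,
    so the gate values for each token always sum to 1."""
    top_logits, idx = torch.topk(logits, k, dim=-1)
    gates = F.softmax(top_logits, dim=-1)
    return gates, idx

# Example: 4 tokens routed over 64 experts with top-8 selection
logits = torch.randn(4, 64)
g1, _ = route_softmax_then_topk(logits, k=8)
g2, _ = route_topk_then_softmax(logits, k=8)
print(g1.sum(-1))  # per-token gate mass < 1
print(g2.sum(-1))  # per-token gate mass == 1
```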
-----
Results 📊:
• Upcycled 64-expert, top-8 fine-grained MoE (E8G8T8): 4.1% lower validation loss than continued dense training
• E8G8T8 MMLU score: 66.2 vs 65.3 for continued dense training
• Weight scaling yields 1.5% better loss for both coarse- and fine-grained MoEs