Upcycling Large Language Models into Mixture of Experts

This podcast was generated with Google's Illuminate.

A new method for optimally upcycling dense LLMs into sparse mixture-of-experts (MoE) models boosts performance without full retraining.

📚 https://arxiv.org/abs/2410.07524

Original Problem 🔍:

Upcycling pre-trained dense LLMs into sparse mixture-of-experts (MoE) models efficiently increases model capacity without training from scratch. However, optimal techniques for upcycling at scale remain unclear.
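
For intuition, here is a minimal sketch of the basic upcycling step (not this paper's exact recipe): each expert starts as a copy of the dense model's FFN, and a new router is initialized from scratch. The helper name, router initialization, and sizes below are assumptions for illustration.

```python
import copy

import torch.nn as nn

def upcycle_dense_ffn(dense_ffn: nn.Module, num_experts: int, hidden_size: int):
    """Hypothetical helper: replicate one dense FFN into every expert and add a
    freshly initialized router that is trained from scratch."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
    router = nn.Linear(hidden_size, num_experts, bias=False)
    nn.init.normal_(router.weight, std=0.02)  # assumed init; not the paper's recipe
    return experts, router

# Toy usage: upcycle a small 2-layer FFN into 8 experts.
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts, router = upcycle_dense_ffn(dense_ffn, num_experts=8, hidden_size=1024)
```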

-----

Solution in this Paper 🛠️:

• Proposes "virtual group" initialization for fine-grained MoE upcycling

• Introduces weight scaling approach for both coarse and fine-grained MoEs

• Compares softmax-then-topK vs topK-then-softmax expert routing (contrasted in the sketch after this list)

• Assesses benefits of higher granularity MoEs and higher topK values

• Provides training recipes for billion-parameter scale LLM upcycling
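
To make the routing comparison concrete, here is a hedged PyTorch sketch contrasting the two gating orders; the function names are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def route_softmax_then_topk(router_logits: torch.Tensor, top_k: int):
    """Softmax over all experts first, then keep the top-k probabilities;
    gate values reflect competition among every expert and need not sum to 1."""
    probs = F.softmax(router_logits, dim=-1)           # [num_tokens, num_experts]
    gate_vals, expert_idx = probs.topk(top_k, dim=-1)  # [num_tokens, top_k]
    return gate_vals, expert_idx

def route_topk_then_softmax(router_logits: torch.Tensor, top_k: int):
    """Select the top-k logits first, then softmax only over the selected
    experts, so the k gate values always sum to 1."""
    top_logits, expert_idx = router_logits.topk(top_k, dim=-1)
    gate_vals = F.softmax(top_logits, dim=-1)
    return gate_vals, expert_idx

# Toy usage: route 4 tokens across 8 experts with top-2 gating.
logits = torch.randn(4, 8)
print(route_softmax_then_topk(logits, top_k=2))
print(route_topk_then_softmax(logits, top_k=2))
```

The practical difference is where normalization happens: softmax-then-topK keeps gate values from the full distribution over all experts, while topK-then-softmax renormalizes over only the k selected experts.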

-----

Key Insights from this Paper 💡:

• Upcycling outperforms continued dense model training

• Softmax-then-topK routing improves over topK-then-softmax

• Higher granularity MoEs can boost accuracy but require careful tuning

• Resetting the learning rate to its peak pre-training value improves upcycling (see the sketch after this list)

• Larger batch sizes (up to a point) improve convergence and efficiency
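
As a hedged illustration of the learning-rate insight (all values below are assumptions, not the paper's settings), an upcycling run can restart its scheduler from the original peak pre-training learning rate rather than the decayed value it ended at:

```python
import torch
import torch.nn as nn

# Hypothetical values: the real peak LR, schedule, and step count depend on the
# base model's pre-training recipe. The finding is only that restarting from the
# peak pre-training LR (not its final decayed value) helps upcycling.
PEAK_PRETRAIN_LR = 3e-4     # assumed for illustration
UPCYCLING_STEPS = 100_000   # assumed for illustration

moe_model = nn.Linear(8, 8)  # stand-in for the upcycled MoE model
optimizer = torch.optim.AdamW(moe_model.parameters(), lr=PEAK_PRETRAIN_LR)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=UPCYCLING_STEPS)
```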

-----

Results 📊:

• Upcycled 64-expert, top-8 MoE (E8G8T8): 4.1% lower validation loss than dense continued training

• E8G8T8 MMLU score: 66.2 vs 65.3 for dense continued training

• Weight scaling yields 1.5% lower loss for both coarse- and fine-grained MoEs
