LibMoE, the framework proposed in this paper, lets you research MoE architectures using just 4 GPUs and ~2 days of training.
https://arxiv.org/abs/2411.00918
🎯 Original Problem:
Training Mixture of Experts (MoE) models requires massive compute resources (256 H100/A100 GPUs), making research inaccessible to most academics and limiting innovation in MoE architectures.
-----
🔧 Solution in this Paper:
LibMoE framework enables MoE research with just 4 A100 GPUs through:
→ Modular design with customizable MoE components, router designs, and expert interactions (see the sketch after this list)
→ Two-stage training: first train the MLP connector between the visual encoder and the LLM, then upcycle the dense model into an MoE model
→ Comprehensive evaluation pipeline supporting 100+ zero-shot benchmarks
→ Integration with state-of-the-art distributed training strategies and model sharding
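Not from the paper's code, but a minimal PyTorch sketch of what such a modular design could look like: an MoE layer that upcycles a dense MLP into expert copies and accepts any router module, so routing strategies can be swapped without touching the expert code. All names (MoELayer, TopKRouter, dense_mlp, ...) are illustrative assumptions, not LibMoE's actual API.

```python
# Hypothetical sketch of a modular, upcycled MoE layer with a pluggable router.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Plain linear router: score experts, keep the top-k per token."""
    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):                              # x: (tokens, hidden_dim)
        weights, indices = self.gate(x).topk(self.k, dim=-1)
        return F.softmax(weights, dim=-1), indices     # (tokens, k) each

class MoELayer(nn.Module):
    """Upcycled MoE block: experts start as copies of a dense MLP; routing is pluggable."""
    def __init__(self, dense_mlp, num_experts, router):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = router                           # any module returning (weights, indices)

    def forward(self, x):                              # x: (tokens, hidden_dim)
        weights, indices = self.router(x)
        out = torch.zeros_like(x)
        for slot in range(indices.shape[-1]):          # combine the top-k expert outputs
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Usage: upcycle a dense MLP into 4 experts routed by a top-2 linear router
# moe = MoELayer(nn.Linear(1024, 1024), num_experts=4, router=TopKRouter(1024, 4, k=2))
```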
-----
💡 Key Insights:
→ Full training pipeline completes in 55 hours on 4 A100 GPUs, with MoE training taking 32 hours
→ The final model checkpoint often isn't the best-performing one, suggesting benefits from early stopping
→ Complex tasks like code reasoning show lower routing entropy, indicating more specialized expert usage (see the entropy sketch after this list)
→ Despite architectural differences, all tested MoE algorithms perform similarly across benchmarks
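A minimal sketch of the entropy diagnostic behind that insight: the entropy of how often each expert is selected on a given task, where lower entropy means a few experts handle most tokens (more specialization). This is an illustrative assumption, not LibMoE's evaluation code.

```python
import torch

def routing_entropy(expert_indices: torch.Tensor, num_experts: int) -> float:
    """Entropy (in nats) of how often each expert is selected for one task."""
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    probs = counts / counts.sum()
    probs = probs[probs > 0]                        # drop unused experts to avoid log(0)
    return float(-(probs * probs.log()).sum())      # uniform routing maximizes this at log(num_experts)

# Lower values on e.g. a code-reasoning benchmark than on a general VQA
# benchmark would indicate that a few experts specialize in that task.
```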
-----
📊 Results:
→ Trained and evaluated 5 MoE algorithms across 11 zero-shot benchmarks
→ Achieved comparable performance to models trained on 256 GPUs
→ The Perturbed Cosine Router showed the fastest convergence among the tested routing strategies (sketched after this list)
→ Reduced training time from weeks to ~2 days on minimal hardware
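The exact Perturbed Cosine Router formulation is in the paper; the sketch below is only an assumed illustration of the idea: route each token by cosine similarity to learned expert embeddings, adding a small noise perturbation during training. Class name, temperature, and noise_std defaults are hypothetical.

```python
# Hedged sketch of a cosine-similarity router with a training-time perturbation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbedCosineRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts, k=2, temperature=0.07, noise_std=0.1):
        super().__init__()
        self.expert_emb = nn.Parameter(torch.randn(num_experts, hidden_dim))
        self.k = k
        self.temperature = temperature
        self.noise_std = noise_std

    def forward(self, x):                                   # x: (tokens, hidden_dim)
        sim = F.normalize(x, dim=-1) @ F.normalize(self.expert_emb, dim=-1).T
        logits = sim / self.temperature                     # cosine similarity, temperature-scaled
        if self.training:                                   # perturb only during training
            logits = logits + torch.randn_like(logits) * self.noise_std
        weights, indices = logits.topk(self.k, dim=-1)
        return F.softmax(weights, dim=-1), indices
```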