Regional optical switching enables efficient MoE training by adapting the network to real-time traffic patterns.
mFabric dynamically rewires optical paths during MoE training so expert-to-expert communication stays fast as token routing shifts.
-----
https://arxiv.org/abs/2501.03905
🤖 Original Problem:
→ Current GPU interconnects use static network topologies that can't adapt to MoE models' dynamic communication patterns
→ MoE training requires frequent all-to-all communication between experts, and the pattern shifts unpredictably from iteration to iteration as the gating network reroutes tokens (see the sketch below)
→ Inflexible architectures therefore waste bandwidth and slow training, since fixed capacity cannot follow the moving traffic
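To make the problem concrete, here is a minimal Python sketch (my illustration, not code from the paper; all names are hypothetical) that builds a per-iteration GPU-to-expert demand matrix from top-k gating decisions. Rerunning it across iterations shows how the all-to-all pattern shifts as the router's choices change:

```python
import numpy as np

def expert_demand_matrix(gate_assignments: np.ndarray, num_experts: int) -> np.ndarray:
    """Entry (g, e) counts the tokens GPU g must dispatch to expert e in one
    MoE all-to-all. gate_assignments has shape (gpus, tokens, top_k)."""
    num_gpus = gate_assignments.shape[0]
    demand = np.zeros((num_gpus, num_experts), dtype=np.int64)
    for gpu in range(num_gpus):
        for expert in gate_assignments[gpu].reshape(-1):  # every top-k choice
            demand[gpu, expert] += 1
    return demand

# Two consecutive "iterations" with fresh gating decisions produce visibly
# different traffic matrices: the moving target a static fabric cannot follow.
rng = np.random.default_rng(0)
for step in range(2):
    gates = rng.integers(0, 8, size=(4, 16, 2))  # 4 GPUs, 16 tokens each, top-2 routing
    print(f"step {step}:\n{expert_demand_matrix(gates, num_experts=8)}\n")
```

A static topology has to be provisioned for the worst case across all of these shifting matrices, which is exactly the bandwidth waste the paper targets.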
-----
🔍 Key Insights:
→ MoE's communication patterns show strong locality within expert groups
→ Expert computation phases provide windows in which network reconfiguration latency can be hidden (see the timing sketch below)
→ Regional reconfiguration, confined to small high-bandwidth domains, is more practical than rewiring the entire fabric at once
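A back-of-the-envelope timing model (my own illustration with assumed numbers, not measurements from the paper) shows why the second insight matters: the optical switch's settling time drops off the critical path whenever the compute phase is long enough to cover it.

```python
def iteration_time(compute_ms: float, comm_ms: float, reconfig_ms: float,
                   overlap: bool) -> float:
    """Per-iteration time for one MoE layer. With overlap, circuits are rewired
    while experts compute, so communication starts once both finish; without
    overlap, reconfiguration sits on the critical path."""
    if overlap:
        return max(compute_ms, reconfig_ms) + comm_ms
    return compute_ms + reconfig_ms + comm_ms

# Assumed numbers: 25 ms expert compute, 8 ms all-to-all, 10 ms switch settling.
print(iteration_time(25.0, 8.0, 10.0, overlap=True))   # 33.0 -> reconfig fully hidden
print(iteration_time(25.0, 8.0, 10.0, overlap=False))  # 43.0 -> reconfig adds 10 ms
```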
-----
⚡ Solution in this Paper:
→ mFabric introduces regionally reconfigurable high-bandwidth domains using optical circuit switching
→ It monitors traffic patterns and predicts communication demands between experts
→ The system reconfigures optical circuits during expert computation phases to match predicted bandwidth demands (a greedy scheduling sketch follows this list)
→ A custom communication manager steers each type of parallelism traffic (e.g., data-, tensor-, and expert-parallel flows) onto the appropriate path
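As a rough illustration of circuit scheduling (a greedy sketch under my own assumptions, not mFabric's actual algorithm), a scheduler could grant optical circuits to the heaviest predicted demands first, subject to each node's port budget:

```python
import numpy as np

def greedy_circuit_plan(demand: np.ndarray, ports_per_node: int):
    """Greedy illustration: repeatedly grant an optical circuit to the heaviest
    remaining (src, dst) demand until ports run out. A sketch of the idea,
    not the paper's scheduling algorithm."""
    demand = demand.astype(float).copy()
    np.fill_diagonal(demand, 0.0)
    ports = [ports_per_node] * demand.shape[0]
    circuits = []
    while demand.max() > 0:
        src, dst = np.unravel_index(np.argmax(demand), demand.shape)
        if ports[src] > 0 and ports[dst] > 0:
            circuits.append((int(src), int(dst)))
            ports[src] -= 1
            ports[dst] -= 1
        demand[src, dst] = 0.0  # pair handled (or unservable); try next-heaviest
    return circuits

# Example predicted demand between 4 nodes (arbitrary units).
d = np.array([[0, 9, 1, 0],
              [2, 0, 7, 1],
              [0, 3, 0, 8],
              [6, 0, 2, 0]])
print(greedy_circuit_plan(d, ports_per_node=2))  # -> [(0, 1), (2, 3), (1, 2), (3, 0)]
```

The bursty expert all-to-all traffic benefits most from these on-demand circuits, while steadier parallelism traffic can stay on fixed paths, consistent with the communication manager's role described above.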
-----
📊 Results:
→ Matches the performance of non-blocking fat-tree networks while improving cost efficiency by 1.2x-1.5x at 100 Gbps
→ Improves cost efficiency by 1.9x-2.3x at 400 Gbps
→ Scales effectively to 30K+ GPUs
→ Outperforms TopoOpt by 2.5x in training speed
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/