
"mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training"

A podcast on this paper was generated with Google's Illuminate.

Regional optical switching enables efficient MoE training by adapting to real-time traffic patterns.

mFabric dynamically rewires network paths during MoE training, reconfiguring the interconnect on the fly to optimize communication between experts.

-----

https://arxiv.org/abs/2501.03905

🤖 Original Problem:

→ Current GPU interconnects use static network topologies that can't adapt to MoE models' dynamic communication patterns

→ MoE models require frequent all-to-all communication between experts, and the traffic pattern changes unpredictably from step to step (see the sketch after this list)

→ Existing solutions waste bandwidth and slow down training due to inflexible architectures
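
To make the second point above concrete, here is a minimal sketch (not code from the paper; `moe_traffic_matrix`, the expert placement, and all sizes are illustrative assumptions) of how the router's top-k decisions turn into a device-to-device traffic matrix that is different on every training step:

```python
import numpy as np

def moe_traffic_matrix(gate_logits, expert_to_device, top_k=2):
    """Count tokens each device must send to each other device for one batch.

    gate_logits: [num_tokens, num_experts] router scores (hypothetical input).
    expert_to_device: expert index -> id of the device hosting that expert.
    """
    num_devices = max(expert_to_device) + 1
    tokens_per_device = gate_logits.shape[0] // num_devices
    chosen = np.argsort(-gate_logits, axis=1)[:, :top_k]  # top-k expert choice per token
    traffic = np.zeros((num_devices, num_devices), dtype=np.int64)
    for token_id, experts in enumerate(chosen):
        src = token_id // tokens_per_device              # device holding the token
        for e in experts:
            traffic[src, expert_to_device[e]] += 1       # token dispatched to that expert's device
    return traffic

# Two steps with different router outputs give two different traffic matrices --
# exactly the dynamic all-to-all pattern a static fabric cannot adapt to.
rng = np.random.default_rng(0)
num_experts, num_devices = 8, 4
placement = [e % num_devices for e in range(num_experts)]
for step in range(2):
    logits = rng.normal(size=(64, num_experts))
    print(f"step {step}:\n{moe_traffic_matrix(logits, placement)}")
```

Because the router is learned and input-dependent, no single static topology fits every step's matrix.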

-----

🔍 Key Insights:

→ MoE's communication patterns show strong locality within expert groups

→ Expert computation phases provide windows in which network reconfiguration can be hidden (timing sketch after this list)

→ Regional reconfiguration is more practical than global reconfiguration
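
The second insight can be shown with a minimal timing sketch (the durations are assumed numbers, not measurements from the paper): if a regional optical reconfiguration completes within an expert's compute phase, overlapping the two hides the rewiring latency entirely:

```python
import threading
import time

def expert_compute(duration_s):
    """Stand-in for an expert's forward/backward pass on local tokens."""
    time.sleep(duration_s)

def reconfigure_optical_circuits(duration_s):
    """Stand-in for reconfiguring one regional optical circuit switch."""
    time.sleep(duration_s)

def step_with_overlap(compute_s=0.050, reconfig_s=0.020):
    """Overlap reconfiguration with expert computation.

    If reconfig_s <= compute_s, the step costs only the compute time:
    the rewiring is hidden, and the next all-to-all sees the new topology.
    """
    start = time.perf_counter()
    t = threading.Thread(target=reconfigure_optical_circuits, args=(reconfig_s,))
    t.start()                      # rewire in the background
    expert_compute(compute_s)      # compute proceeds over the existing links
    t.join()                       # both must finish before the next all-to-all
    return time.perf_counter() - start

print(f"step time with overlap: {step_with_overlap() * 1e3:.1f} ms")  # ~50 ms, not 70 ms
```

If reconfiguration were slower than the compute phase, the difference would show up directly as added step time, which is why fast regional switching matters more than slow global rewiring.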

-----

⚡ Solution in this Paper:

→ mFabric introduces regionally reconfigurable high-bandwidth domains using optical circuit switching

→ It monitors traffic patterns and predicts communication demands between experts

→ The system reconfigures optical circuits during expert computation phases to optimize bandwidth

→ A custom communication manager routes the different types of parallel traffic over the appropriate paths (a sketch of the overall loop follows below)
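
Putting the bullets together, here is a hedged sketch of that control loop (the function names, the greedy heuristic, and the demand numbers are illustrative assumptions, not mFabric's actual interfaces): estimate inter-node demand from recent gating decisions, program each node's regional circuits toward its hottest peers during the compute phase, and keep non-expert traffic on the static fabric:

```python
import numpy as np

def plan_regional_circuits(demand, region, circuits_per_node=2):
    """Greedy illustration: within one high-bandwidth region, give each node
    direct optical circuits to the peers it exchanges the most tokens with.

    demand: [n, n] matrix of observed token counts between nodes.
    region: list of node ids sharing one regional optical switch.
    Returns {node: [peer, ...]} -- circuits to program before the next dispatch.
    """
    plan = {}
    for node in region:
        peers = [p for p in region if p != node]
        ranked = sorted(peers, key=lambda p: demand[node, p] + demand[p, node],
                        reverse=True)              # rank peers by bidirectional demand
        plan[node] = ranked[:circuits_per_node]
    return plan

def route_collective(kind):
    """Toy stand-in for the communication manager's policy: expert-parallel
    all-to-all uses the reconfigurable regional circuits, while other parallel
    traffic stays on the static electrical fabric."""
    return "regional optical circuits" if kind == "all-to-all" else "static fabric"

demand = np.array([[0, 90,  5,  5],
                   [80,  0, 10,  5],
                   [ 5,  5,  0, 70],
                   [ 5, 10, 60,  0]])
print(plan_regional_circuits(demand, region=[0, 1, 2, 3], circuits_per_node=1))
# {0: [1], 1: [0], 2: [3], 3: [2]}  -> circuits follow the hot expert pairs
print(route_collective("all-to-all"), "|", route_collective("all-reduce"))
```

A real scheduler would solve a matching problem per optical switch; the greedy top-peer assignment here only illustrates why keeping decisions regional, one switch at a time, stays cheap as the cluster grows.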

-----

📊 Results:

→ Matches the performance of non-blocking fat-tree networks while improving cost efficiency by 1.2x-1.5x at 100 Gbps

→ Improves cost efficiency by 1.9x-2.3x at 400 Gbps

→ Scales effectively to 30K+ GPUs

→ Outperforms TopoOpt by 2.5x in training speed

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
