Democratizing diffusion model training
Decentralized Diffusion Models enable training large AI models across independent clusters without requiring expensive centralized infrastructure, making high-quality model training more accessible.
-----
https://arxiv.org/abs/2501.05450
🤔 Original Problem:
Training modern diffusion models requires thousands of synchronized GPUs connected by high-bandwidth networks, making it expensive and inaccessible for most researchers. Stable Diffusion 1.5 required roughly 6,000 A100 GPU-days, and Meta's MovieGen was trained on 6,144 H100 GPUs.
-----
🔧 Solution in this Paper:
→ The paper introduces Decentralized Diffusion Models (DDMs), which split training across independent "expert" models, each specializing in one partition of the dataset.
→ A lightweight router model learns to direct inputs to the most relevant experts at inference time.
→ Decentralized Flow Matching ensures the experts collectively optimize the same objective as a single large model (see the sketch after this list).
→ Experts can train on separate hardware without cross-communication, enabling use of scattered compute resources.
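Below is a minimal, self-contained sketch of how the router-weighted ensemble of expert flow models could work at inference. It is an illustration under assumptions, not the authors' code: the `ExpertFlow` and `Router` modules are toy MLPs, and `ensemble_flow` simply combines expert velocity predictions either by the router's soft weights or by top-1 selection.

```python
# Illustrative sketch only (assumed names and toy architectures), showing the idea of
# routing a noisy sample x_t to per-cluster expert flow models and combining their outputs.
import torch
import torch.nn as nn

class ExpertFlow(nn.Module):
    """One expert: predicts a velocity/flow field for its data partition."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

class Router(nn.Module):
    """Lightweight router: predicts which data partition a noisy sample belongs to."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, num_experts))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1)).softmax(dim=-1)

@torch.no_grad()
def ensemble_flow(x_t, t, experts, router, top1=True):
    """Combine expert predictions via router weights (soft mixture) or top-1 selection."""
    weights = router(x_t, t)                                   # (B, K) partition probabilities
    preds = torch.stack([e(x_t, t) for e in experts], dim=1)   # (B, K, D) expert velocities
    if top1:
        idx = weights.argmax(dim=-1)                           # most relevant expert per sample
        return preds[torch.arange(x_t.shape[0]), idx]
    return (weights.unsqueeze(-1) * preds).sum(dim=1)          # router-weighted mixture

# Toy usage: one Euler step of sampling with 8 experts.
dim, num_experts, batch = 16, 8, 4
experts = [ExpertFlow(dim) for _ in range(num_experts)]
router = Router(dim, num_experts)
x_t = torch.randn(batch, dim)
t = torch.full((batch, 1), 0.5)
x_next = x_t + 0.01 * ensemble_flow(x_t, t, experts, router)
```

During training, each expert would only ever see samples from its own cluster, which is what removes the need for cross-node gradient communication.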
-----
🎯 Key Insights:
→ Feature-based data clustering significantly outperforms random clustering for expert specialization (see the partitioning sketch after this list)
→ Eight experts strike the best balance between decentralization and memory requirements
→ Top-1 expert selection at inference gives the best performance while minimizing compute
→ The approach enables distillation into a single dense model for production deployment
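As a companion to the clustering insight above, here is a minimal sketch of feature-based data partitioning under stated assumptions: embed each image with a pretrained feature extractor (the `embed_images` stand-in below is a hypothetical placeholder) and run k-means over the embeddings so each of the eight experts trains on one cluster.

```python
# Illustrative sketch only: feature-based partitioning of a dataset into expert shards.
import numpy as np
from sklearn.cluster import KMeans

def embed_images(images: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pretrained feature extractor; replace with real embeddings."""
    return images.reshape(len(images), -1)  # placeholder: flattened pixels

def partition_dataset(images: np.ndarray, num_experts: int = 8):
    feats = embed_images(images)
    km = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit(feats)
    return km.labels_, km.cluster_centers_  # labels[i] = expert that trains on image i

# Toy usage: 1,000 fake "images" split into 8 expert shards.
images = np.random.rand(1000, 3, 8, 8).astype(np.float32)
labels, centers = partition_dataset(images, num_experts=8)
shards = [np.where(labels == k)[0] for k in range(8)]  # per-expert training indices
```

Random partitioning would instead assign images to shards arbitrarily; the insight here is that semantically coherent clusters let each expert specialize, which is where the quality gains come from.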
-----
📊 Results:
→ DDM achieves a 28% lower FID than monolithic models on ImageNet
→ 4x faster training on the LAION dataset while maintaining quality
→ Scaled to 24B parameters using just 8 GPU nodes in under a week
→ Maintains performance while reducing infrastructure costs