
"Decentralized Diffusion Models"

A podcast on this paper was generated with Google's Illuminate.

Democratizing diffusion model training

Decentralized Diffusion Models enable training large AI models across independent clusters without requiring expensive centralized infrastructure, making high-quality model training more accessible.

-----

https://arxiv.org/abs/2501.05450

🤔 Original Problem:

Training modern diffusion models requires thousands of synchronized GPUs on high-bandwidth networks, making it expensive and inaccessible for most researchers. Stable Diffusion 1.5 took roughly 6,000 A100 GPU-days to train, and Meta's MovieGen uses 6,144 H100 GPUs.

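For a rough sense of scale, here is a back-of-the-envelope conversion of that GPU budget into dollars; the hourly rate is an assumed, illustrative cloud price, not a figure from the paper.

```python
# Back-of-the-envelope cost of centralized diffusion training.
# The GPU-day figure comes from the post above; the hourly rate is an
# assumed illustrative cloud price, not a number from the paper.
A100_GPU_DAYS = 6_000               # Stable Diffusion 1.5 training budget
ASSUMED_A100_USD_PER_HOUR = 2.0     # hypothetical on-demand rate

cost_usd = A100_GPU_DAYS * 24 * ASSUMED_A100_USD_PER_HOUR
print(f"~${cost_usd:,.0f} of A100 time at the assumed rate")
# -> ~$288,000, before counting the high-bandwidth interconnect that
#    fully synchronized training additionally requires.
```
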
-----

🔧 Solution in this Paper:

→ The paper introduces Decentralized Diffusion Models (DDMs), which split training across independent "expert" models, each specializing in one partition of the data.

→ A lightweight router model learns to direct inputs to relevant experts during inference.

→ The system uses Decentralized Flow Matching to ensure experts collectively optimize the same objective as a single large model.

→ Experts can train on separate hardware without any cross-communication, enabling the use of scattered compute resources (a minimal training sketch follows below).

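The independence of the experts is easiest to see in code. Below is a minimal sketch of one expert's training loop, assuming a generic flow-matching (rectified-flow style) loss and a toy backbone; it illustrates the "no cross-communication" idea rather than the paper's exact Decentralized Flow Matching objective or implementation.

```python
# Minimal sketch: each expert trains alone on its own data partition.
# TinyVelocityNet and the hyperparameters are stand-ins (assumptions),
# not the paper's architecture.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy stand-in for a diffusion/flow backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def train_expert(expert, partition_batches, lr=1e-4):
    """One expert sees only its own partition; no gradients, activations,
    or parameters are exchanged with other experts."""
    opt = torch.optim.AdamW(expert.parameters(), lr=lr)
    for x1 in partition_batches:                       # clean data batch
        x0 = torch.randn_like(x1)                      # noise sample
        t = torch.rand(x1.shape[0])                    # time in [0, 1]
        x_t = (1 - t[:, None]) * x0 + t[:, None] * x1  # linear path
        target_v = x1 - x0                             # path velocity
        loss = ((expert(x_t, t) - target_v) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return expert

# Toy run: two experts trained fully independently on synthetic partitions.
experts = [
    train_expert(TinyVelocityNet(), [torch.randn(32, 64) for _ in range(10)])
    for _ in range(2)
]
```

Because the loop never communicates with other nodes, each expert can run on whatever hardware happens to be available, and only the finished checkpoints need to be gathered.
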
-----

🎯 Key Insights:

→ Feature-based data clustering significantly outperforms random clustering for expert specialization (sketched in code after this list)

→ Eight experts provide optimal balance between decentralization and memory requirements

→ Top-1 expert selection at inference gives best performance while minimizing compute

→ The approach enables distillation into a single dense model for production deployment

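The clustering and top-1 routing insights can be sketched together: cluster the dataset in a pretrained feature space, give each cluster to one expert, and at inference pick the single closest expert. In the sketch below, plain k-means stands in for the paper's clustering step and a nearest-centroid rule stands in for its learned lightweight router; the feature dimensionality, K=8, and iteration count are assumptions.

```python
# Sketch: feature-based partitioning (k-means) plus top-1 routing.
import torch

def kmeans(features, k=8, iters=25):
    """Plain k-means on (N, D) feature vectors -> (centroids, labels)."""
    centroids = features[torch.randperm(features.shape[0])[:k]].clone()
    for _ in range(iters):
        labels = torch.cdist(features, centroids).argmin(dim=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(dim=0)
    return centroids, labels

# Partition the dataset: each cluster's examples become one expert's shard.
features = torch.randn(10_000, 512)     # e.g. precomputed image embeddings
centroids, labels = kmeans(features, k=8)
shards = [torch.nonzero(labels == j).squeeze(1) for j in range(8)]

def route_top1(x_feat, centroids):
    """Top-1 routing: send the input to the single closest expert."""
    return torch.cdist(x_feat[None], centroids).argmin().item()

expert_id = route_top1(torch.randn(512), centroids)
```

Routing each input to a single expert keeps per-step inference cost comparable to running one model, which is the "minimizing compute" point above.
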
-----

📊 Results:

→ DDM achieves 28% lower FID score than monolithic models on ImageNet

→ 4x faster training on LAION dataset while maintaining quality

→ Scaled to 24B parameters using just 8 GPU nodes in under a week

→ Maintains performance while reducing infrastructure costs
