How to build a Stable Diffusion competitor for under $2000.
The paper introduces an approach for training large text-to-image diffusion models on a micro-budget, using deferred patch masking and mixture-of-experts layers to cut training cost by roughly 118× relative to Stable Diffusion.
-----
https://arxiv.org/abs/2407.15811
Solution in this Paper 💡:
→ They propose randomly masking up to 75% of image patches during transformer training.
→ A lightweight patch-mixer preprocesses all patches before masking, helping retain semantic information.
→ The deferred masking happens after patch mixing, unlike conventional approaches that mask the raw input patches directly (see the sketch after this list).
→ They incorporate mixture-of-experts layers and layer-wise scaling in the transformer architecture.
→ Training uses both real and synthetic images (37M total) with a sparse 1.16B parameter transformer.
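
For intuition, here is a minimal PyTorch-style sketch of the deferred-masking idea: all patches pass through a lightweight mixer first, and only then is a random 75% dropped before the expensive backbone. Names like `PatchMixer` and `deferred_mask` are illustrative, not from the paper's code, and the real patch-mixer is a small transformer stack rather than a single block.

```python
# Sketch of deferred patch masking (hypothetical names, simplified architecture).
import torch
import torch.nn as nn

class PatchMixer(nn.Module):
    """Lightweight stand-in for the patch-mixer: processes ALL patches before
    masking, so the surviving patches carry global semantic context."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

def deferred_mask(patches: torch.Tensor, mixer: nn.Module, mask_ratio: float = 0.75):
    """Mix all patches first, THEN drop a random subset (deferred masking).
    patches: (batch, num_patches, dim)."""
    mixed = mixer(patches)                           # every patch attends to every other patch
    b, n, d = mixed.shape
    keep = max(1, int(n * (1.0 - mask_ratio)))       # e.g. keep 25% at a 75% mask ratio
    idx = torch.rand(b, n).argsort(dim=1)[:, :keep]  # random per-sample subset to keep
    return mixed.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

# Usage: only the surviving 25% of (already-mixed) patches reach the large,
# costly diffusion-transformer backbone during training.
x = torch.randn(2, 256, 64)              # 2 images, 256 patches, 64-dim embeddings
kept = deferred_mask(x, PatchMixer(64))  # -> shape (2, 64, 64)
```

The design point: because the cheap mixer has already spread information across patches, masking 75% of them afterwards loses far less semantic signal than masking the raw input, which is what makes such aggressive masking viable.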
-----
Key Insights from this Paper 🔍:
→ Deferred masking outperforms naive masking at high ratios (75%+)
→ Synthetic data significantly improves image quality and prompt alignment
→ Smaller datasets can achieve competitive results with optimized architecture
-----
Results 📊:
→ Achieved a 12.7 FID on the COCO dataset
→ Total training cost: $1,890 for 2.6 days on a single 8×H100 node (rough cost arithmetic below)
→ 118× lower cost than Stable Diffusion models
→ 14× cheaper than the current state-of-the-art approach ($28,400)
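
A quick sanity check on the cost figure, assuming a typical on-demand H100 rental rate of about $3.75/GPU-hour (my assumption, not a number from the post):

```python
# Back-of-envelope check: 2.6 days on an 8xH100 node.
gpu_hours = 2.6 * 24 * 8            # ~499 GPU-hours
print(round(gpu_hours * 3.75))      # ~1872, consistent with the reported $1,890
```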