
"Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget"

A podcast on this paper was generated with Google's Illuminate.

How to build a Stable Diffusion competitor for under $2,000.

The paper introduces an approach for training large text-to-image diffusion models on a micro-budget: deferred patch masking plus a sparse mixture-of-experts transformer, cutting training cost by 118× relative to Stable Diffusion.

-----

https://arxiv.org/abs/2407.15811

Solution in this Paper 💡:

→ They propose randomly masking up to 75% of image patches during transformer training.

→ A lightweight patch-mixer preprocesses all patches before masking, helping retain semantic information.

→ The deferred masking happens after patch mixing, unlike standard approaches that mask the input directly (see the sketch after this list).

→ They incorporate mixture-of-experts layers and layer-wise scaling in the transformer architecture (second sketch below).

→ Training uses both real and synthetic images (37M total) with a sparse 1.16B parameter transformer.
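
A minimal PyTorch-style sketch of the deferred-masking idea. Module sizes, depths, and the absence of text/timestep conditioning are simplifications; this is not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class DeferredMaskingBackbone(nn.Module):
    """Sketch: a light patch-mixer sees ALL patches, masking happens only
    afterwards, and the heavy backbone processes the surviving ~25% of tokens.
    Sizes and depths below are illustrative, not the paper's."""

    def __init__(self, dim=1024, mixer_layers=4, backbone_layers=28, heads=16):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True)
        self.patch_mixer = nn.ModuleList([block() for _ in range(mixer_layers)])
        self.backbone = nn.ModuleList([block() for _ in range(backbone_layers)])

    def forward(self, patch_tokens, mask_ratio=0.75):
        # patch_tokens: (B, N, dim) embedded image patches with positions added
        B, N, D = patch_tokens.shape

        # 1) The lightweight patch-mixer runs on the FULL sequence, so every
        #    token can absorb context from patches that are about to be dropped.
        x = patch_tokens
        for blk in self.patch_mixer:
            x = blk(x)

        # 2) Deferred masking: randomly keep (1 - mask_ratio) of tokens per sample.
        n_keep = max(1, int(N * (1.0 - mask_ratio)))
        keep_idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :n_keep]
        x = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # 3) The expensive backbone only ever sees the surviving patches.
        for blk in self.backbone:
            x = blk(x)
        return x, keep_idx
```

With mask_ratio=0.75, the heavy backbone attends over only ~25% of the patch tokens, which is where most of the compute savings come from, while the cheap mixer still lets the surviving tokens carry context from the masked ones.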
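
And a rough sketch of a routed mixture-of-experts FFN plus a layer-wise width schedule. The expert count, top-k routing, and scaling multipliers are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-k routed mixture-of-experts FFN (illustrative routing)."""

    def __init__(self, dim, hidden, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                 # x: (B, N, dim)
        scores = self.router(x).softmax(dim=-1)           # (B, N, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)          # (B, N, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topi[..., slot]                         # (B, N)
            w = topv[..., slot].unsqueeze(-1)             # (B, N, 1)
            for e, expert in enumerate(self.experts):
                sel = idx == e                            # tokens routed to expert e
                if sel.any():
                    out[sel] += w[sel] * expert(x[sel])
        return out

def layerwise_ffn_widths(dim, n_layers, min_mult=0.5, max_mult=4.0):
    """Layer-wise scaling sketch: grow FFN width linearly with depth, so early
    layers stay cheap and later layers get more capacity. The multipliers here
    are assumptions, not the paper's schedule."""
    step = (max_mult - min_mult) / max(n_layers - 1, 1)
    return [int(dim * (min_mult + step * i)) for i in range(n_layers)]
```

Roughly speaking, sparse MoE layers are how the model reaches 1.16B parameters while per-token compute (and therefore cost) stays close to that of a much smaller dense model.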

-----

Key Insights from this Paper 🔍:

→ Deferred masking outperforms naive masking at high ratios (75%+)

→ Synthetic data significantly improves image quality and prompt alignment

→ Smaller datasets can achieve competitive results with optimized architecture

-----

Results 📊:

→ 12.7 FID in zero-shot generation on the COCO dataset

→ Total training cost: $1,890 (2.6 days on 8×H100 GPUs; see the breakdown after this list)

→ 118× lower cost than Stable Diffusion models

→ 14× cheaper than the current state-of-the-art approach, which costs $28,400
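
For context, those figures imply roughly $3.8 per GPU-hour: 2.6 days × 24 h × 8 GPUs ≈ 499 H100-hours, and $1,890 ÷ 499 ≈ $3.79 per GPU-hour (the hourly rate is back-calculated here, not quoted from the paper).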
