Scaling laws reveal optimal resource allocation for diffusion transformers in text-to-image synthesis.
Power-law relationships govern how diffusion transformers scale.
https://arxiv.org/abs/2410.08184
Original Problem 🔍:
Scaling laws for diffusion transformers in text-to-image generation were unexplored, hindering optimal resource allocation and performance prediction.
-----
Solution in this Paper 🧠:
• Conducted experiments across compute budgets from 1e17 to 6e18 FLOPs
• Established power-law relationships between compute, model size, data quantity, and loss
• Used the Rectified Flow formulation with v-prediction and Logit-Normal timestep sampling (see the sketch after this list)
• Evaluated models on a Laion5B subset and the COCO validation set
• Analyzed scaling behavior of In-Context and Cross-Attention Transformers
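To make the training setup concrete, here is a minimal sketch of a Rectified Flow objective with v-prediction and Logit-Normal timestep sampling. The model interface, tensor shapes, and the logit-normal location/scale are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: Rectified Flow training loss with v-prediction and Logit-Normal
# timestep sampling. `model` and its call signature are hypothetical.
import torch

def logit_normal_timesteps(batch_size, loc=0.0, scale=1.0, device="cpu"):
    """Sample t in (0, 1) by passing a Gaussian through a sigmoid (Logit-Normal)."""
    u = torch.randn(batch_size, device=device) * scale + loc
    return torch.sigmoid(u)

def rectified_flow_loss(model, x0, cond):
    """Interpolate linearly between data x0 and noise, then regress the model
    output onto the constant velocity of that straight path (v-prediction)."""
    noise = torch.randn_like(x0)
    t = logit_normal_timesteps(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))   # broadcast t over image dims
    x_t = (1.0 - t_) * x0 + t_ * noise         # point on the linear path
    v_target = noise - x0                      # velocity of the straight path
    v_pred = model(x_t, t, cond)               # v-prediction head
    return torch.mean((v_pred - v_target) ** 2)
```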
-----
Key Insights from this Paper 💡:
• Optimal model size and data quantity scale with the compute budget according to power laws (see the fitting sketch after this list)
• Training loss and FID follow power-law relationships with compute
• Scaling laws hold for out-of-domain datasets, with consistent trends but vertical offsets
• Cross-Attention Transformers improve performance more efficiently than In-Context Transformers
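A compute-optimal allocation of this kind can be read off by fitting power laws of the form N_opt = k · C^a (and similarly for data quantity, loss, and FID versus compute) in log-log space. The sketch below illustrates the fit; the data points and the resulting exponent are placeholders, not the paper's reported values.

```python
# Sketch: fitting a power law N_opt = k * C^a by linear regression in log-log
# space. The (compute, N_opt) pairs below are placeholder measurements.
import numpy as np

def fit_power_law(x, y):
    """Fit y = k * x^a via least squares on log-log axes; return (k, a)."""
    a, log_k = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(log_k), a

compute = np.array([1e17, 3e17, 1e18, 3e18, 6e18])      # FLOPs budgets
n_opt   = np.array([4e7, 8e7, 1.7e8, 3.5e8, 5.5e8])     # placeholder optima

k, a = fit_power_law(compute, n_opt)
print(f"N_opt ≈ {k:.3e} * C^{a:.3f}")  # extrapolate to larger budgets
```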