
"Causal Diffusion Transformers for Generative Modeling"

The podcast on this paper was generated with Google's Illuminate.

A model that speaks both sequential and diffusion languages for image generation.

CausalFusion unifies autoregressive and diffusion models by introducing dual-factorization across sequential tokens and noise levels, enabling flexible image generation with state-of-the-art quality.

-----

https://arxiv.org/abs/2412.12095

🤔 Original Problem:

→ Current image generation models use either autoregressive (AR) or diffusion approaches, but do not combine the two effectively. AR models excel at sequential generation, while diffusion models are better at quality refinement.

-----

🔧 Solution in this Paper:

→ CausalFusion introduces a dual-factorization framework that combines AR and diffusion approaches in a single model.

→ The model can predict any number of tokens at any AR step, with any sequence order and inference compute level.

→ It uses a generalized causal attention mask to maintain proper dependencies across AR steps while ensuring each step only relies on clean tokens from previous steps.

→ The framework balances training difficulties across both AR and diffusion dimensions through exponential decay sampling and loss weighting.
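
To make the attention-mask idea concrete, here is a minimal NumPy sketch of one way such a mask could be built. It is an illustration under stated assumptions, not the paper's implementation: the token layout (all clean tokens first, then all noisy tokens), the function name `generalized_causal_mask`, and the exact attention rules are hypothetical. The only property taken from the summary above is that a noisy token at a given AR step sees clean tokens from earlier steps only.

```python
import numpy as np

def generalized_causal_mask(step_sizes):
    """Boolean attention mask (True = query row may attend to key column).

    Assumed (hypothetical) layout: [clean tokens of all steps | noisy tokens of all steps].
    step_sizes[k] is the number of tokens predicted at AR step k.
    """
    # AR step index of every token, e.g. [2, 1, 3] -> [0, 0, 1, 2, 2, 2]
    step_ids = np.repeat(np.arange(len(step_sizes)), step_sizes)
    n = step_ids.size
    q_step = step_ids[:, None]  # step of the query token
    k_step = step_ids[None, :]  # step of the key token

    # Assumption: clean tokens attend to clean tokens of the same or earlier steps.
    clean_to_clean = k_step <= q_step
    # Clean tokens never look at noisy tokens.
    clean_to_noisy = np.zeros((n, n), dtype=bool)
    # Noisy tokens see only clean tokens from strictly earlier steps.
    noisy_to_clean = k_step < q_step
    # Assumption: noisy tokens of the same step attend to each other.
    noisy_to_noisy = k_step == q_step

    top = np.concatenate([clean_to_clean, clean_to_noisy], axis=1)
    bottom = np.concatenate([noisy_to_clean, noisy_to_noisy], axis=1)
    return np.concatenate([top, bottom], axis=0)

mask = generalized_causal_mask([2, 1, 3])  # three AR steps, six tokens in total
```

With step sizes `[2, 1, 3]` the mask is 12×12: rows 0 to 5 are the clean tokens and rows 6 to 11 are their noisy counterparts. Setting every step size to 1 reduces the clean-to-clean block to an ordinary lower-triangular causal mask.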

-----

💡 Key Insights:

→ Proper balancing of task difficulties across AR and diffusion axes is crucial for model performance

→ Random token ordering outperforms fixed orders by preventing over-reliance on local features

→ The number of AR steps significantly impacts training signal distribution
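
The sketch below illustrates how random token ordering and a variable number of AR steps might be sampled for a training example. The summary does not say which quantity the exponential decay sampling applies to, so this example assumes it is the number of AR steps, with smaller counts more likely. The function name `sample_ar_schedule`, the `decay` parameter, and the uniform choice of cut points are all hypothetical.

```python
import numpy as np

def sample_ar_schedule(num_tokens, rng, decay=0.5):
    """Return a list of index arrays, one per AR step, covering all tokens once.

    Assumptions (not from the paper): the number of AR steps S is drawn with
    probability proportional to decay**(S - 1), and chunk boundaries are uniform.
    """
    if num_tokens < 2:
        raise ValueError("this sketch expects at least two tokens")

    # Exponentially decaying distribution over the number of AR steps.
    candidates = np.arange(1, num_tokens + 1)
    probs = decay ** (candidates - 1)
    probs = probs / probs.sum()
    num_steps = int(rng.choice(candidates, p=probs))

    # Random token order instead of a fixed raster order.
    order = rng.permutation(num_tokens)

    # Split the permuted order into num_steps non-empty chunks.
    cuts = np.sort(rng.choice(np.arange(1, num_tokens), size=num_steps - 1, replace=False))
    return np.split(order, cuts)

rng = np.random.default_rng(0)
chunks = sample_ar_schedule(16, rng)  # e.g. a 4x4 grid of image tokens
```

Each returned chunk lists the token positions predicted at one AR step. A single chunk corresponds to pure diffusion over all tokens, while 16 chunks of one token each corresponds to token-by-token AR generation.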

-----

📊 Results:

→ Achieved state-of-the-art FID score of 1.77 on ImageNet class-conditional generation

→ Outperformed larger models like FiTv2-3B and Large-DiT-7B with fewer parameters

→ Enabled zero-shot image editing without task-specific training

