A model that speaks both sequential and diffusion languages for image generation.
CausalFusion unifies autoregressive and diffusion models by introducing a dual factorization across sequential tokens and noise levels, enabling flexible image generation with state-of-the-art quality.
-----
https://arxiv.org/abs/2412.12095
🤔 Original Problem:
→ Current image generation models use either the autoregressive (AR) or the diffusion approach, but not both effectively. AR models excel at sequential, token-by-token generation, while diffusion models are better at iterative quality refinement.
-----
🔧 Solution in this Paper:
→ CausalFusion introduces a dual-factorization framework that combines the AR and diffusion approaches in a single model.
→ The model can predict any number of tokens at each AR step, in any sequence order and at any inference compute level.
→ It uses a generalized causal attention mask that maintains proper dependencies across AR steps, ensuring each step attends only to clean (fully denoised) tokens from previous steps.
→ The framework balances training difficulty across the AR and diffusion dimensions through exponential decay sampling and loss weighting.
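The generalized causal attention pattern can be pictured as a block mask: tokens predicted in the same AR step attend to each other bidirectionally (they are denoised jointly), while across steps attention stays causal. A minimal sketch of such a mask builder — the function name and the step-size interface are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def generalized_causal_mask(step_sizes):
    """Build a boolean attention mask for AR-step-wise generation.

    step_sizes: number of tokens predicted at each AR step.
    Returns an (N, N) mask where True means attention is allowed:
    full attention within a step's block, causal attention to all
    tokens of earlier steps.
    """
    n = sum(step_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in step_sizes:
        end = start + size
        # This step's tokens see each other and every earlier token.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 3 AR steps predicting 2, 3, and 1 tokens respectively.
mask = generalized_causal_mask([2, 3, 1])
```

Varying `step_sizes` recovers both extremes: all-ones step sizes give a standard causal (AR) mask, while a single step covering every token gives the full bidirectional attention of a plain diffusion transformer.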
-----
💡 Key Insights:
→ Proper balancing of task difficulties across AR and diffusion axes is crucial for model performance
→ Random token ordering outperforms fixed orders by preventing over-reliance on local features
→ The number of AR steps significantly impacts training signal distribution
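The difficulty-balancing idea above could be realized by sampling the AR step count from a decaying distribution and compensating rare counts with a loss weight. A rough sketch — the decay rate, its direction, and the exact compensating weight are assumptions for illustration, not the paper's reported scheme:

```python
import math
import random

def sample_num_ar_steps(max_steps, decay=0.5, rng=random):
    """Sample an AR step count k in [1, max_steps] with
    exponentially decaying probability P(k) proportional to
    exp(-decay * (k - 1)), so small step counts are drawn more often.
    """
    weights = [math.exp(-decay * k) for k in range(max_steps)]
    return rng.choices(range(1, max_steps + 1), weights=weights)[0]

def ar_loss_weight(k, decay=0.5):
    """Inverse-probability weight exp(decay * (k - 1)) so that
    rarely sampled large-k configurations still contribute a
    balanced share of the training signal on average.
    """
    return math.exp(decay * (k - 1))
```

The sampler and the weight are deliberately inverse to each other: what the sampler under-represents, the loss up-weights, keeping the expected gradient contribution roughly uniform across step counts.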
-----
📊 Results:
→ Achieved state-of-the-art FID score of 1.77 on ImageNet class-conditional generation
→ Outperformed larger models like FiTv2-3B and Large-DiT-7B with fewer parameters
→ Enabled zero-shot image editing without task-specific training
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/