Vision Transformers power a simpler yet stronger normalizing flow model in the Jet architecture.
This paper revisits and simplifies coupling-based normalizing flow models for image generation, achieving strong performance with a markedly simpler architecture that uses Vision Transformer blocks instead of convolutional networks.
-----
https://arxiv.org/abs/2412.15129
Original Problem 🤔:
→ Normalizing flow models for image generation offer efficient log-likelihood computation and fast generation, but have historically lagged behind other generative models in visual quality.
→ Existing flow models were also architecturally complex, relying on components such as multiscale structures, invertible dense layers, and specialized normalization layers.
-----
Solution in this Paper 💡:
→ The Jet model simplifies the architecture by using only affine coupling layers parameterized by a Vision Transformer. The input image is split into patches and processed through repeated applications of these coupling layers.
→ A single coupling layer computes an affine transform from one half of the input dimensions (patches or features) and applies it to the other half.
→ The model is trained by maximizing the data log-likelihood, with a dequantization procedure applied to handle discrete image data.
→ The inverse transformation used for image generation is straightforward: each coupling layer is inverted in turn. Initialization and channel-splitting strategies are chosen for stable and efficient training (a minimal sketch of one coupling step follows this list).
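To make the coupling mechanism concrete, here is a minimal NumPy sketch (not the authors' implementation). The `vit_block` callable is a hypothetical stand-in for the stack of Vision Transformer blocks that predicts per-element scale and shift from the untouched half, and `dequantize` illustrates standard uniform dequantization of integer pixel values.

```python
import numpy as np

def dequantize(x_int, rng=np.random.default_rng(0)):
    # Uniform dequantization: add noise to integer pixels so the discrete
    # data has a proper density under the continuous flow.
    return (x_int.astype(np.float32) + rng.uniform(size=x_int.shape)) / 256.0

def coupling_forward(x1, x2, vit_block):
    # x1 passes through unchanged; x2 is transformed conditioned on x1.
    log_scale, shift = vit_block(x1)        # ViT predicts affine parameters
    z2 = x2 * np.exp(log_scale) + shift     # elementwise affine transform
    log_det = log_scale.sum()               # Jacobian log-determinant term
    return x1, z2, log_det

def coupling_inverse(z1, z2, vit_block):
    # Exact inverse used at generation time: recompute the affine parameters
    # from the untouched half and undo the transform.
    log_scale, shift = vit_block(z1)
    x2 = (z2 - shift) * np.exp(-log_scale)
    return z1, x2
```

Training maximizes log p(x) = log p(z) + sum of the per-layer log-determinants, where z is the output after all coupling layers and p(z) is a simple base density (e.g. a standard Gaussian).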
-----
Key Insights from this Paper 🧠:
→ Vision Transformer blocks significantly outperform convolutional neural networks in the context of normalizing flows.
→ Simplifying the architecture by removing components like multiscale structures and invertible dense layers does not hurt performance, and can even improve it.
→ Transfer learning from larger datasets (ImageNet-21k) to smaller ones (ImageNet-1k, CIFAR-10) can improve performance and mitigate overfitting.
→ Interleaving spatial and channel coupling gives the best performance (a rough illustration of the two split types follows this list).
→ Scaling up the number of coupling layers requires sufficiently deep Vision Transformer blocks (depth 4-6) to keep a favorable compute-performance trade-off.
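The sketch below, assuming tokens of shape (num_patches, dim), illustrates the two split types that are interleaved across coupling layers; the exact split schedule and any permutations used in the paper may differ.

```python
import numpy as np

def channel_split(tokens):
    # Split every token's feature vector in half along the channel axis.
    d = tokens.shape[-1] // 2
    return tokens[..., :d], tokens[..., d:]

def spatial_split(tokens):
    # Split along the patch (token) axis, e.g. even vs. odd patches.
    return tokens[0::2], tokens[1::2]

# Interleaving: alternate the split type from one coupling layer to the next,
# so every dimension is eventually transformed conditioned on the others.
split_schedule = [channel_split, spatial_split] * 4  # e.g. 8 coupling layers
```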
-----
Results 🎯:
→ Achieves state-of-the-art negative log-likelihood (3.580) on ImageNet-1k 64 × 64.
→ Shows strong performance on ImageNet-21k, with negative log-likelihood of 3.857 on 32 × 32 and 3.15 on 64 × 64.
→ Obtains state-of-the-art performance of 3.018 negative log-likelihood on CIFAR-10 after transfer learning.