Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Transfusion, great research from @AIatMeta .
Predicts the Next Token and Diffuses Images with One Multi-Modal Model
Can generate images and text on par with similar-scale diffusion models and language models
Compresses each image to as few as 16 patches
So, Transfusion unifies text and image generation in a single model, rivaling specialized architectures.
Original Problem 🔍:
The challenge lies in unifying discrete sequence modeling (next token prediction) and continuous media generation (diffusion) within a single model capable of generating both text and images effectively.
Key Insights:
• Intra-image bidirectional attention significantly boosts performance (see the mask sketch after this list)
• U-Net encoding/decoding provides inductive bias benefits
• Transfusion scales efficiently with minimal parameter sharing cost
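To make the attention insight concrete, here is a minimal sketch of how such a mask could be built. The function name, tensor layout, and span format are illustrative assumptions, not the paper's code:

```python
import torch

def transfusion_attention_mask(seq_len, image_spans):
    """Illustrative sketch: causal mask over the whole sequence, plus
    bidirectional attention inside each image's patch span.
    image_spans: list of (start, end) index pairs, end exclusive.
    Returns a bool mask where True = attention allowed."""
    # Standard lower-triangular mask for next-token prediction.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        # Patches belonging to the same image may attend to each other freely.
        mask[start:end, start:end] = True
    return mask

# Example: a 10-element sequence whose positions 3..6 are image patches.
print(transfusion_attention_mask(10, [(3, 7)]).int())
```

Text positions still see only the past, while patches of the same image attend to each other in both directions.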
Solution in this Paper🛠️:
• Transfusion: A method to train a unified multi-modal model
• Uses separate loss functions for different modalities (combined as sketched after this list):
Language modeling loss for text
Diffusion loss for images
• Single transformer architecture processes both modalities
• Modality-specific lightweight components:
Text: Embedding matrices
Images: Linear layer or U-Net up/down blocks
• Transfusion attention: Causal attention across the sequence, bidirectional attention within each image
• Unified decoding algorithm for mixed-modal generation (see the decoding sketch below)
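A rough sketch of how the two losses might be combined during training. Function and argument names are assumptions, and the balancing coefficient here is illustrative:

```python
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise_target, lam=1.0):
    """Illustrative two-part objective: next-token cross-entropy on text
    positions plus a DDPM-style noise-prediction MSE on image patches.
    lam is a balancing coefficient (value chosen here is illustrative)."""
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),  # (batch*time, vocab)
        text_targets.reshape(-1),                        # (batch*time,)
    )
    diffusion_loss = F.mse_loss(noise_pred, noise_target)
    return lm_loss + lam * diffusion_loss
```

And a minimal sketch of the mixed-modal decoding idea: sample text autoregressively, and when a begin-of-image token appears, switch to iterative denoising for a block of image patches before resuming text. All names, token ids, and shapes are illustrative assumptions:

```python
import torch

def mixed_modal_decode(lm_step, denoise_step, boi_id, eoi_id, max_len=256,
                       n_patches=16, patch_dim=64, diffusion_steps=50):
    """Illustrative unified decoding loop.
    lm_step(tokens) -> next token id (int)
    denoise_step(patches, t) -> less-noisy patches (Tensor)"""
    tokens, images = [], []
    while len(tokens) < max_len:
        next_id = lm_step(tokens)
        tokens.append(next_id)
        if next_id == boi_id:
            # Diffusion mode: start from pure noise and iteratively denoise.
            patches = torch.randn(n_patches, patch_dim)
            for t in reversed(range(diffusion_steps)):
                patches = denoise_step(patches, t)
            images.append(patches)
            tokens.append(eoi_id)  # close the image block, back to LM mode
    return tokens, images
```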
Results📊:
• Outperforms Chameleon (which discretizes images into tokens) in efficiency:
Text-to-image: 34× less compute for FID parity
Image-to-text: Matches performance at 21.8% of FLOPs
• 7B parameter model achieves:
FID score: 6.78 on MS-COCO
GenEval score: 0.63
• Comparable to specialized models:
Text generation: On par with Llama models
Image generation: Surpasses DALL-E 2, comparable to DeepFloyd