Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Transfusion, great research from @AIatMeta .
Predicts the Next Token and Diffuses Images with One Multi-Modal Model
Can generate images and text on par with similar-scale diffusion models and language models
Compresses each image to as few as 16 patches
So, Transfusion unifies text and image generation in a single model, rivaling specialized architectures.
Original Problem 🔍:
The challenge lies in unifying discrete sequence modeling (next token prediction) and continuous media generation (diffusion) within a single model capable of generating both text and images effectively.
Key Insights:
• Intra-image bidirectional attention significantly boosts performance (see the mask sketch after this list)
• U-Net encoding/decoding provides inductive bias benefits
• Transfusion scales efficiently with minimal parameter sharing cost
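To make the attention insight concrete, here is a minimal sketch of how such a mask could be built. The function name, tensor layout, and span format are illustrative assumptions, not the paper's code:

```python
import torch

def transfusion_attention_mask(seq_len, image_spans):
    """Illustrative sketch: causal mask over the whole sequence, plus
    bidirectional attention inside each image's patch span.
    image_spans: list of (start, end) index pairs, end exclusive.
    Returns a bool mask where True = attention allowed."""
    # Standard lower-triangular mask for next-token prediction.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        # Patches belonging to the same image may attend to each other freely.
        mask[start:end, start:end] = True
    return mask

# Example: a 10-element sequence whose positions 3..6 are image patches.
print(transfusion_attention_mask(10, [(3, 7)]).int())
```

Text positions still see only the past, while patches of the same image attend to each other in both directions.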
Solution in this Paper🛠️:
• Transfusion: A method to train a unified multi-modal model
• Uses separate loss functions for different modalities (combined as sketched after this list):
Language modeling loss for text
Diffusion loss for images
• Single transformer architecture processes both modalities
• Modality-specific lightweight components:
Text: Embedding matrices
Images: Linear layer or U-Net up/down blocks
• Transfusion attention: Causal attention across the sequence, bidirectional attention within each image
• Unified decoding algorithm for mixed-modal generation (see the decoding sketch below)
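A rough sketch of how the two losses might be combined during training. Function and argument names are assumptions, and the balancing coefficient here is illustrative:

```python
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise_target, lam=1.0):
    """Illustrative two-part objective: next-token cross-entropy on text
    positions plus a DDPM-style noise-prediction MSE on image patches.
    lam is a balancing coefficient (value chosen here is illustrative)."""
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),  # (batch*time, vocab)
        text_targets.reshape(-1),                        # (batch*time,)
    )
    diffusion_loss = F.mse_loss(noise_pred, noise_target)
    return lm_loss + lam * diffusion_loss
```

And a minimal sketch of the mixed-modal decoding idea: sample text autoregressively, and when a begin-of-image token appears, switch to iterative denoising for a block of image patches before resuming text. All names, token ids, and shapes are illustrative assumptions:

```python
import torch

def mixed_modal_decode(lm_step, denoise_step, boi_id, eoi_id, max_len=256,
                       n_patches=16, patch_dim=64, diffusion_steps=50):
    """Illustrative unified decoding loop.
    lm_step(tokens) -> next token id (int)
    denoise_step(patches, t) -> less-noisy patches (Tensor)"""
    tokens, images = [], []
    while len(tokens) < max_len:
        next_id = lm_step(tokens)
        tokens.append(next_id)
        if next_id == boi_id:
            # Diffusion mode: start from pure noise and iteratively denoise.
            patches = torch.randn(n_patches, patch_dim)
            for t in reversed(range(diffusion_steps)):
                patches = denoise_step(patches, t)
            images.append(patches)
            tokens.append(eoi_id)  # close the image block, back to LM mode
    return tokens, images
```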
Results📊:
• Outperforms Chameleon (which discretizes images into tokens) in efficiency:
Text-to-image: 34× less compute for FID parity
Image-to-text: Matches performance at 21.8% of FLOPs
• 7B parameter model achieves:
FID score: 6.78 on MS-COCO
GenEval score: 0.63
• Comparable to specialized models:
Text generation: On par with Llama models
Image generation: Surpasses DALL-E 2, comparable to DeepFloyd