X-Prompt squeezes multiple images into compact tokens for better AI image generation
X-Prompt introduces a novel compression method for in-context image generation in auto-regressive vision language models. It compresses in-context examples into fixed-length tokens, enabling longer effective context and better task understanding while avoiding prohibitive training context lengths.
-----
https://arxiv.org/abs/2412.01824
🤔 Original Problem:
Auto-regressive vision language models struggle with in-context image generation due to excessive context length requirements. A single image needs 1024-4096 tokens, making it impractical to include multiple context images during training.
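The blowup is easy to see with back-of-envelope arithmetic. The per-image token counts below come from the post; the compressed length of 64 tokens per example is an illustrative assumption, not a figure from the paper:

```python
def context_tokens(num_examples: int, tokens_per_image: int,
                   images_per_example: int = 2) -> int:
    """Tokens needed to hold in-context example pairs verbatim."""
    return num_examples * images_per_example * tokens_per_image

def compressed_tokens(num_examples: int, xprompt_len: int = 64) -> int:
    """Tokens needed when each example is squeezed into fixed-length tokens.
    xprompt_len is an illustrative assumption."""
    return num_examples * xprompt_len

print(context_tokens(4, 4096))   # 4 input/output pairs at 4096 tokens each -> 32768
print(compressed_tokens(4))      # the same 4 examples compressed -> 256
```

Four uncompressed example pairs already exceed typical training context budgets, which is why compressing each example to a short fixed-length representation matters.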
-----
🔧 Solution in this Paper:
→ X-Prompt compresses in-context examples into fixed-length tokens using three token types: In-Context Example Tokens, X-Prompt Tokens, and TODO Tokens.
→ The model uses attention masking so that later tokens must rely on the X-Prompt Tokens, rather than the raw example tokens, for context information.
→ A unified training approach combines text and image generation tasks to enhance understanding.
→ The Retrieval-Augmented Image Editing system automatically finds relevant examples from a database to guide editing tasks.
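The masking idea in the second bullet can be sketched as a modified causal attention mask. Below is a minimal illustration in plain Python: sequence layout `[context | X-Prompt | TODO]` and all sizes are assumptions for the sketch, not values from the paper. The key move is zeroing the entries that would let TODO positions attend to the raw in-context example tokens, so task information must flow through the X-Prompt Tokens:

```python
# Illustrative sizes (assumptions): 8 raw context tokens, 4 X-Prompt
# tokens, 6 TODO tokens, laid out as [context | X-Prompt | TODO].
n_ctx, n_xp, n_todo = 8, 4, 6
n = n_ctx + n_xp + n_todo

# Standard causal mask: position i may attend to position j iff j <= i.
mask = [[j <= i for j in range(n)] for i in range(n)]

# Block TODO positions from attending to the raw context tokens.
todo_start = n_ctx + n_xp
for i in range(todo_start, n):
    for j in range(n_ctx):
        mask[i][j] = False

# X-Prompt tokens still see the context they summarize:
assert mask[n_ctx + 1][0]
# TODO tokens see X-Prompt tokens but not raw context:
assert mask[todo_start][n_ctx] and not mask[todo_start][0]
```

In a real model this boolean mask would be applied inside attention (e.g. as additive `-inf` biases on masked entries); the sketch only shows which attention edges the scheme removes.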
-----
💡 Key Insights:
→ Compressing context examples reduces training context length while preserving task information
→ Unified text-image training improves the model's task interpretation abilities
→ Automated retrieval of similar examples enhances editing consistency
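The retrieval step in the last insight can be sketched as nearest-neighbor search over embeddings. Everything here is an assumption for illustration: the post only says relevant examples are found automatically, so the toy database, the 3-dimensional embeddings, and the CLIP-style cosine similarity are stand-ins, not the paper's implementation:

```python
import math

# Hypothetical database mapping editing-example IDs to precomputed
# embeddings (in practice these would come from an image/text encoder).
db = {
    "add_hat":   [0.9, 0.1, 0.0],
    "remove_bg": [0.1, 0.8, 0.2],
    "colorize":  [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_emb, k=1):
    """Return the k stored examples most similar to the query embedding."""
    ranked = sorted(db, key=lambda name: cosine(db[name], query_emb), reverse=True)
    return ranked[:k]

print(retrieve([0.85, 0.15, 0.05]))  # → ['add_hat']
```

The retrieved examples would then be compressed into X-Prompt Tokens and prepended to the editing request, which is how retrieval improves editing consistency.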
-----
📊 Results:
→ Improved text-to-image generation with +0.10 performance gain over baseline Chameleon
→ Achieved competitive results in dense prediction tasks like depth estimation (RMSE 0.277)
→ Enhanced image editing capabilities with CLIP_dir score of 0.097 on MagicBrush benchmark