
"X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models"

The podcast on this paper was generated with Google's Illuminate.

X-Prompt squeezes multiple images into compact tokens for better AI image generation

X-Prompt enables in-context image generation in auto-regressive vision-language models by compressing in-context examples into fixed-length tokens. This keeps training context lengths tractable while preserving the task information needed for longer context sequences and better task understanding.

-----

https://arxiv.org/abs/2412.01824

🤔 Original Problem:

Auto-regressive vision-language models struggle with in-context image generation because of excessive context-length requirements: a single image takes 1024-4096 tokens, so including multiple in-context images during training is impractical.
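
To make the bottleneck concrete, here is a back-of-the-envelope sketch of the context budget. The compressed length of 64 tokens is an illustrative assumption, not the paper's reported figure.

```python
# Rough context cost of (input image, output image) example pairs.
TOKENS_PER_IMAGE = 1024   # lower bound cited above (can reach 4096)
COMPRESSED_TOKENS = 64    # assumed fixed-length X-Prompt budget

def context_cost(num_example_pairs: int, compressed: bool) -> int:
    """Tokens consumed by the in-context example pairs alone."""
    per_image = COMPRESSED_TOKENS if compressed else TOKENS_PER_IMAGE
    return num_example_pairs * 2 * per_image

for n in (1, 2, 4):
    print(f"{n} pair(s): raw={context_cost(n, False):5d}  "
          f"compressed={context_cost(n, True):4d}")
# 1 pair(s): raw= 2048  compressed= 128
# 2 pair(s): raw= 4096  compressed= 256
# 4 pair(s): raw= 8192  compressed= 512
```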

-----

🔧 Solution in this Paper:

→ X-Prompt compresses in-context examples into fixed-length tokens using three token types: In-Context Example Tokens, X-Prompt Tokens, and TODO Tokens.

→ The model uses attention masking to force reliance on X-Prompt Tokens for context representation (sketched in the code after this list).

→ A unified training approach combines text and image generation tasks to enhance understanding.

→ The Retrieval-Augmented Image Editing system automatically finds relevant examples from a database to guide editing tasks.
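
A minimal sketch of the masking idea above, assuming a sequence laid out as [in-context example tokens | X-Prompt tokens | TODO tokens]. Segment lengths and the mask convention (True = may attend) are illustrative choices, not the paper's exact implementation.

```python
import torch

N_CTX, N_XP, N_TODO = 8, 4, 6     # example / X-Prompt / TODO token counts
total = N_CTX + N_XP + N_TODO

# Start from a standard causal mask (True = attention allowed).
mask = torch.tril(torch.ones(total, total, dtype=torch.bool))

# Block TODO tokens from attending to the raw in-context example tokens,
# so all context information must flow through the X-Prompt tokens.
todo = slice(N_CTX + N_XP, total)
mask[todo, :N_CTX] = False

# X-Prompt tokens keep causal access to the example tokens, which is how
# they learn to compress the context into a fixed-length representation.
print(mask.int())
```

In effect, the X-Prompt tokens act as a learned bottleneck: they are the only path from the in-context examples to the generation tokens.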

-----

💡 Key Insights:

→ Compressing context examples reduces training context length while preserving task information

→ Unified text-image training improves the model's task-interpretation abilities

→ Automated retrieval of similar examples enhances editing consistency (a minimal retrieval sketch follows)
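
A minimal sketch of the retrieval step, assuming pre-computed embeddings (e.g., from a CLIP-style encoder) for each stored editing example; `retrieve_examples` and its inputs are hypothetical names, not the paper's API.

```python
import numpy as np

def retrieve_examples(query_embed: np.ndarray,
                      db_embeds: np.ndarray,
                      k: int = 1) -> np.ndarray:
    """Return indices of the k database examples most similar to the query."""
    q = query_embed / np.linalg.norm(query_embed)
    db = db_embeds / np.linalg.norm(db_embeds, axis=1, keepdims=True)
    sims = db @ q                     # cosine similarity per database entry
    return np.argsort(-sims)[:k]      # highest similarity first

# Usage: retrieved pairs are prepended as in-context examples and then
# compressed into X-Prompt tokens before the model edits the new image.
rng = np.random.default_rng(0)
top = retrieve_examples(rng.normal(size=512), rng.normal(size=(100, 512)), k=2)
```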

-----

📊 Results:

→ Improved text-to-image generation, with a +0.10 gain over the Chameleon baseline

→ Achieved competitive results in dense prediction tasks like depth estimation (RMSE 0.277)

→ Enhanced image editing, with a CLIP_dir score of 0.097 on the MagicBrush benchmark
