Generate multiple related images without changing model architecture - just concatenate and tune
https://arxiv.org/abs/2410.23775
🎯 Original Problem:
Text-to-image models struggle with generating coherent sets of related images. Current solutions either require task-specific architectures or extensive computational resources, limiting their practical applications.
-----
🔧 Solution in this Paper:
→ Builds a simple pipeline, In-Context LoRA, that leverages the inherent in-context capabilities of text-to-image Diffusion Transformers
→ Concatenates multiple images into one large image instead of combining attention tokens
→ Uses joint captioning: a single merged prompt describes all images together
→ Applies task-specific LoRA tuning using very small datasets (20-100 samples)
→ Trains on a single A100 GPU for 5,000 steps with a batch size of 4 (data-prep sketch below)
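A minimal data-prep sketch of the core idea: stitch a few related images into one wide canvas and pair it with one merged caption for LoRA fine-tuning. The file names, panel size, and caption template here are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch: concatenate related images into one training image + one joint caption.
from PIL import Image

def make_training_pair(image_paths, panel_captions, panel_size=(512, 512)):
    # Resize each related image to a common panel size.
    panels = [Image.open(p).convert("RGB").resize(panel_size) for p in image_paths]

    # Stitch the panels side by side into one large training image.
    w, h = panel_size
    canvas = Image.new("RGB", (w * len(panels), h))
    for i, panel in enumerate(panels):
        canvas.paste(panel, (i * w, 0))

    # One merged prompt describes the whole set jointly (template is an assumption).
    joint_caption = (
        f"A set of {len(panels)} related images shown side by side. "
        + " ".join(f"Panel {i + 1}: {c}." for i, c in enumerate(panel_captions))
    )
    return canvas, joint_caption

# Example: one 3-panel storyboard sample for a small (20-100 sample) tuning set.
image, caption = make_training_pair(
    ["shot1.png", "shot2.png", "shot3.png"],
    ["hero enters the city", "hero meets the rival", "rooftop confrontation"],
)
image.save("storyboard_sample.png")
print(caption)
```

Because each sample is just a larger image with a longer caption, the base model and training loop stay untouched; only the data changes.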
-----
💡 Key Insights:
→ Text-to-image models already have built-in capabilities for in-context generation
→ No need to modify the original model architecture - only the training data changes
→ Small high-quality datasets sufficient instead of large-scale training
→ A single unified prompt works better than separate prompts per image (illustrated below)
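For intuition, here is a hypothetical merged prompt of the kind this setup favors, contrasted with a per-image baseline. The wording is illustrative, not taken from the paper.

```python
# One string describes every panel jointly, preserving shared context.
merged_prompt = (
    "A 4-panel visual identity set for a coffee brand, consistent colors and logo. "
    "Panel 1: logo on a storefront sign. Panel 2: logo on a paper cup. "
    "Panel 3: logo on a tote bag. Panel 4: logo on a business card."
)

# A per-image baseline prompts each panel independently and loses the
# shared context that keeps the set coherent.
separate_prompts = [
    "Coffee brand logo on a storefront sign",
    "Coffee brand logo on a paper cup",
    "Coffee brand logo on a tote bag",
    "Coffee brand logo on a business card",
]
```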
-----
📊 Results:
→ Successfully handles diverse tasks: storyboard generation, font design, portrait photography, and visual identity design (inference sketch below)
→ Maintains high visual quality and consistency across generated image sets
→ Works in both reference-free (text-only) and reference-based (text+image) generation
→ Requires minimal computational resources compared to previous approaches
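A rough reference-free inference sketch, assuming the base model is a Diffusion Transformer served through Hugging Face diffusers (FLUX.1-dev here) with a task-specific In-Context LoRA loaded on top. The LoRA path, resolution, prompt, and crop logic are assumptions for illustration, not the authors' exact setup.

```python
# Sketch: generate one wide image containing several coherent panels,
# then slice it back into individual images.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/in-context-lora-storyboard")  # placeholder LoRA path

prompt = (
    "A 3-panel storyboard of the same character in a rainy city, "
    "consistent style and lighting. Panel 1: arriving by train. "
    "Panel 2: walking under neon signs. Panel 3: entering a cafe."
)

# One generation call produces the whole set as a single wide image.
wide = pipe(
    prompt, height=512, width=512 * 3,
    num_inference_steps=28, guidance_scale=3.5,
).images[0]

# Split the composite back into individual panels.
panels = [wide.crop((i * 512, 0, (i + 1) * 512, 512)) for i in range(3)]
for i, p in enumerate(panels):
    p.save(f"panel_{i + 1}.png")
```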