Teaching LLMs to see and create with just a sprinkle of visual data
MetaMorph enables pretrained LLMs to both understand and generate visual content through a simple extension of visual instruction tuning called Visual-Predictive Instruction Tuning (VPiT).
-----
https://arxiv.org/abs/2412.14164
🤔 Original Problem:
Current unified multimodal models require extensive architectural changes and significant pretraining to handle both visual understanding and generation tasks.
-----
🔧 Solution in this Paper:
→ Visual-Predictive Instruction Tuning (VPiT) extends visual instruction tuning to predict both text and visual tokens autoregressively.
→ The model takes arbitrary interleaved sequences of images and text as input and generates both modalities through separate prediction heads (see the sketch after this list).
→ Generated visual tokens are rendered into images by a diffusion model finetuned to condition on the vision encoder's outputs.
→ The approach needs minimal additional training data: as few as 200k samples suffice for effective visual generation.
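
To make the dual-head setup concrete, here is a minimal, hypothetical PyTorch sketch. The module name `VPiTHeads`, the dimensions, the cosine-similarity loss, and the assumption that `backbone` is a HuggingFace-style causal LM returning `last_hidden_state` are all illustrative choices, not the paper's exact implementation: the backbone encodes an interleaved image/text sequence, a text head is trained with cross-entropy, and a vision head regresses continuous visual tokens that a finetuned diffusion model can later decode into pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VPiTHeads(nn.Module):
    """Sketch of VPiT-style prediction: one head for text tokens,
    one head for continuous visual tokens (vision-encoder embeddings)."""

    def __init__(self, backbone, hidden_dim=4096, vocab_size=32000, vision_dim=1152):
        super().__init__()
        self.backbone = backbone                              # pretrained LLM (assumed HF-style)
        self.text_head = nn.Linear(hidden_dim, vocab_size)    # next text-token logits
        self.vision_head = nn.Linear(hidden_dim, vision_dim)  # next visual token (continuous embedding)

    def forward(self, inputs_embeds, text_labels, visual_targets, visual_mask):
        # Autoregressively encode the interleaved image/text sequence.
        # Labels/targets are assumed to be already shifted for next-token prediction.
        hidden = self.backbone(inputs_embeds=inputs_embeds).last_hidden_state

        # Text positions: standard next-token cross-entropy.
        text_logits = self.text_head(hidden[~visual_mask])
        text_loss = F.cross_entropy(text_logits, text_labels)

        # Visual positions: regress the next visual token; a cosine-similarity
        # loss against the vision encoder's embeddings is one plausible choice.
        pred_visual = self.vision_head(hidden[visual_mask])
        visual_loss = 1 - F.cosine_similarity(pred_visual, visual_targets, dim=-1).mean()

        # At inference, predicted visual tokens would be passed to a diffusion
        # model finetuned to map vision-encoder embeddings back to images.
        return text_loss + visual_loss
```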
-----
💡 Key Insights:
→ Visual generation emerges naturally from improved visual understanding
→ Understanding and generation abilities are mutually beneficial but asymmetrical
→ Visual understanding data contributes more to both capabilities than generation data does
→ Joint training with understanding data dramatically improves generation performance (see the data-mixing sketch after this list)
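
The joint-training idea can be pictured as a simple data mixture. The sketch below is purely illustrative: the placeholder datasets, the 10% generation fraction, and the `make_joint_batch` helper are assumptions, not the paper's recipe. It only shows the shape of the idea, most samples supervise the text head (understanding), a small slice supervises the visual-token head (generation).

```python
import random

# Illustrative placeholders for the two data sources; in practice these would be
# real instruction-tuning datasets (e.g. VQA/captioning vs. text-to-visual-token pairs).
understanding_data = [{"image": f"img_{i}.jpg", "prompt": "Describe the image.", "target": "text"}
                      for i in range(1000)]
generation_data = [{"prompt": f"Generate an image of scene {i}.", "target": "visual_tokens"}
                   for i in range(200)]

def make_joint_batch(batch_size=32, gen_fraction=0.1):
    """Sample a mixed batch: most examples supervise the text head,
    a smaller fraction supervises the visual-token head."""
    batch = []
    for _ in range(batch_size):
        pool = generation_data if random.random() < gen_fraction else understanding_data
        batch.append(random.choice(pool))
    return batch

batch = make_joint_batch()
```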
-----
📊 Results:
→ Achieves competitive performance on both understanding and generation benchmarks
→ Outperforms other unified models while using significantly less training data
→ Successfully leverages LLM knowledge for specialized visual generation tasks
→ Demonstrates implicit reasoning capabilities in multimodal contexts