
"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning"

A podcast on this paper was generated with Google's Illuminate.

Teaching LLMs to see and create with just a sprinkle of visual data

MetaMorph enables pretrained LLMs to both understand and generate visual content through a simple extension of visual instruction tuning called Visual-Predictive Instruction Tuning (VPiT).

-----

https://arxiv.org/abs/2412.14164

🤔 Original Problem:

Current unified multimodal models require extensive architectural changes and significant pretraining to handle both visual understanding and generation tasks.

-----

🔧 Solution in this Paper:

→ Visual-Predictive Instruction Tuning (VPiT) extends visual instruction tuning to predict both text and visual tokens autoregressively (a minimal sketch follows this list).

→ The model processes arbitrary sequences of images and text as input and generates both modalities using separate prediction heads.

→ Generated visual tokens are rendered into images by a diffusion model finetuned to condition on vision encoder outputs.

→ The approach requires minimal additional training data: as little as 200k samples for effective visual generation.
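
To make the VPiT recipe concrete, here is a minimal sketch under stated assumptions: an LLM backbone with two output heads, one producing next-text-token logits and one regressing continuous visual tokens from a vision encoder, trained with cross-entropy on text plus a cosine-similarity regression on visual tokens (one plausible choice for continuous targets). All module names, dimensions, and the exact losses are illustrative assumptions, not the paper's code; the diffusion visualizer and causal masking are omitted for brevity.

```python
# Illustrative sketch of Visual-Predictive Instruction Tuning (VPiT).
# Assumptions: a stand-in Transformer backbone, a frozen vision encoder that
# yields continuous visual tokens, and hypothetical dimensions/losses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VPiTModel(nn.Module):
    def __init__(self, llm_dim=256, vocab_size=1000, vision_dim=128, n_layers=1):
        super().__init__()
        # Stand-in for a pretrained LLM backbone (causal masking omitted here).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=n_layers,
        )
        # Projects vision-encoder tokens into the LLM embedding space.
        self.vision_in = nn.Linear(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Two prediction heads: next-text-token logits and next visual token.
        self.text_head = nn.Linear(llm_dim, vocab_size)
        self.vision_head = nn.Linear(llm_dim, vision_dim)

    def forward(self, text_ids, vision_tokens):
        # Real inputs are arbitrary interleavings of images and text;
        # we simply concatenate [image tokens, text tokens] for brevity.
        seq = torch.cat(
            [self.vision_in(vision_tokens), self.text_embed(text_ids)], dim=1
        )
        hidden = self.backbone(seq)
        return self.text_head(hidden), self.vision_head(hidden)


def vpit_loss(text_logits, text_targets, vis_preds, vis_targets):
    # Cross-entropy on text positions, cosine-similarity regression on
    # visual-token positions (an assumed loss for continuous targets).
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    cos = 1.0 - F.cosine_similarity(vis_preds, vis_targets, dim=-1).mean()
    return ce + cos


# Toy usage: one image (16 visual tokens) followed by 8 text tokens.
model = VPiTModel()
vision_tokens = torch.randn(1, 16, 128)   # frozen vision-encoder outputs
text_ids = torch.randint(0, 1000, (1, 8))
text_logits, vis_preds = model(text_ids, vision_tokens)
print(text_logits.shape, vis_preds.shape)  # (1, 24, 1000), (1, 24, 128)
```

In the paper's pipeline, the predicted visual tokens would then be passed to the finetuned diffusion model to render images; shifted next-token targets over the interleaved sequence supply the supervision for both heads.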

-----

💡 Key Insights:

→ Visual generation emerges naturally from improved visual understanding

→ Understanding and generation abilities are mutually beneficial but asymmetrical

→ Visual understanding data contributes more significantly to both capabilities

→ Joint training with understanding data dramatically improves generation performance

-----

📊 Results:

→ Achieves competitive performance on both understanding and generation benchmarks

→ Outperforms other unified models while using significantly less training data

→ Successfully leverages LLM knowledge for specialized visual generation tasks

→ Demonstrates implicit reasoning capabilities in multimodal contexts
