Teaching language models to handle images without messing up their text abilities
LMFusion enables text-only LLMs to understand and generate both text and images while preserving their original language capabilities through modality-specific processing.
-----
https://arxiv.org/abs/2412.15188
🤔 Original Problem:
→ Training multimodal models from scratch requires massive computational resources and often leads to suboptimal performance in language tasks
→ Simply fine-tuning existing LLMs for multimodal tasks significantly degrades their language capabilities
-----
🔧 Solution in this Paper:
→ LMFusion uses separate processing paths for text and images while allowing cross-modal interactions through shared attention layers
→ Text data flows through frozen Llama-3 modules to preserve language abilities
→ Image data is processed through parallel transformer modules trained specifically for visual tasks
→ A special BOI (begin-of-image) token marks where an image span begins within the mixed sequence
→ The architecture employs modality-specific query-key-value projections and feed-forward networks (a minimal sketch follows this list)
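To make the routing concrete, here is a minimal PyTorch sketch of the idea, not the paper's actual code: the class and variable names (ModalityFusedBlock, is_image, etc.) are my own, and the dimensions, shared output projection, and causal mask are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityFusedBlock(nn.Module):
    """One transformer block with modality-specific QKV/FFN weights
    and a single self-attention shared over the joint sequence."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        # Text-side projections (conceptually copied from frozen Llama-3)
        # and parallel image-side projections trained for visual tokens.
        self.qkv_text = nn.Linear(d_model, 3 * d_model)
        self.qkv_img = nn.Linear(d_model, 3 * d_model)
        self.ffn_text = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
        self.ffn_img = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
        self.out_proj = nn.Linear(d_model, d_model)  # shared here for brevity
        # Freeze the text path so language abilities stay untouched.
        for p in list(self.qkv_text.parameters()) + list(self.ffn_text.parameters()):
            p.requires_grad = False

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (batch, seq) bool mask marking
        # image tokens (the span introduced by a BOI token).
        b, s, d = x.shape
        route = is_image.unsqueeze(-1)
        # Route each token through its own modality's QKV projection.
        # (Both paths are computed and selected for clarity, not efficiency.)
        qkv = torch.where(route, self.qkv_img(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        q = q.view(b, s, self.n_heads, -1).transpose(1, 2)
        k = k.view(b, s, self.n_heads, -1).transpose(1, 2)
        v = v.view(b, s, self.n_heads, -1).transpose(1, 2)
        # Shared attention: text and image tokens attend to each other.
        # (A plain causal mask is used here for simplicity.)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self.out_proj(attn)
        # Modality-specific feed-forward networks.
        x = x + torch.where(route, self.ffn_img(x), self.ffn_text(x))
        return x
```

The key design point this sketch illustrates: the weights are split by modality, but the attention computation runs over the full mixed sequence, which is how cross-modal interaction happens without touching the frozen text weights.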
-----
💡 Key Insights:
→ Deep modality separation outperforms shallow separation in preserving model capabilities
→ Freezing the text modules while training only the image modules prevents catastrophic forgetting (sketched after this list)
→ The approach can be extended to existing vision-language models
→ Modular design enables parallel development of language and vision capabilities
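The freezing insight can be expressed as a short training-setup sketch. This builds on the illustrative block above and uses my own naming convention ("text" vs. image parameters), not the paper's code: only the image-side parameters are handed to the optimizer, so the frozen text path cannot drift.

```python
import torch

def image_side_parameters(model: torch.nn.Module):
    """Freeze text-path weights and collect only image-path weights for training."""
    trainable = []
    for name, param in model.named_parameters():
        if "text" in name:            # frozen Llama-3-derived text modules
            param.requires_grad = False
        else:                         # parallel image QKV/FFN modules stay trainable
            trainable.append(param)
    return trainable

# Usage (hypothetical sizes): the optimizer never sees text-path weights,
# so the original language behavior is preserved by construction.
# block = ModalityFusedBlock(d_model=4096, n_heads=32)
# optimizer = torch.optim.AdamW(image_side_parameters(block), lr=1e-4)
```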
-----
📊 Results:
→ Improves image understanding by 20% compared to pretraining a comparable multimodal model from scratch
→ Enhances image generation quality by 3.6%
→ Achieves these improvements while using only 50% of the FLOPs
→ Maintains Llama-3's original language performance
-----
Are you into AI and LLMs❓ Join me on X/Twitter with 52K+ others to stay on the bleeding edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai