Domain expertise meets visual AI through smart synthetic task generation.
Its synthesized training tasks outperform tasks generated by GPT-4V on downstream domain benchmarks.
This paper introduces AdaMLLM, a method to adapt general Multimodal LLMs to specific domains like biomedicine and food. It proposes a single-stage training pipeline and a visual instruction synthesizer that generates domain-specific tasks from image-caption pairs, improving performance across various domain tasks.
-----
https://arxiv.org/abs/2411.19930
🔍 Original Problem:
General Multimodal LLMs struggle in specialized domains due to limited domain expertise and insufficient training data. Current approaches rely heavily on manual rules or closed-source models for generating domain-specific tasks.
-----
🛠️ Solution in this Paper:
→ Develops a visual instruction synthesizer that generates diverse tasks from domain-specific image-caption pairs using open-source models
→ Implements a consistency-based filter to ensure the accuracy of the synthesized instruction-response pairs (a minimal sketch follows this list)
→ Introduces a single-stage training pipeline that combines synthetic tasks with image-caption pairs
→ Enhances task diversity during training, sidestepping the limitations of the traditional two-stage pretrain-then-instruction-tune approach
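A minimal, hypothetical sketch of the consistency-based filtering step, assuming the synthesized tasks are already available as instruction-response pairs and that a second open-source MLLM is wrapped in a simple `answer_fn` callable; all names here are illustrative, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SyntheticTask:
    instruction: str
    response: str

def consistency_filter(
    tasks: List[SyntheticTask],
    answer_fn: Callable[[str], str],  # stand-in for a second open-source MLLM answering each instruction
) -> List[SyntheticTask]:
    """Keep a synthesized task only when an independent model's answer agrees with
    the synthesized response, a rough proxy for the paper's consistency check."""
    kept = []
    for task in tasks:
        prediction = answer_fn(task.instruction)
        if prediction.strip().lower() == task.response.strip().lower():
            kept.append(task)
    return kept

# Toy usage with a stubbed answer function:
tasks = [
    SyntheticTask("What organ is shown in the scan?", "liver"),
    SyntheticTask("Is a fracture visible?", "yes"),
]
stub_answer = lambda q: "liver" if "organ" in q else "no"
print([t.instruction for t in consistency_filter(tasks, stub_answer)])
# -> ['What organ is shown in the scan?']
```

Disagreements are simply dropped, which trades some recall for precision without requiring expert annotation.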
-----
💡 Key Insights:
→ Open-source models can effectively generate domain-specific tasks without relying on expensive closed-source models
→ Single-stage training avoids the catastrophic forgetting seen in two-stage pipelines, where the second stage overwrites knowledge learned in the first (see the sketch after this list)
→ Consistency-based filtering improves task accuracy while reducing expert annotation needs
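A hedged sketch of what the single-stage data mix could look like, merging image-caption pairs with the filtered synthetic tasks into one shuffled training set; the field names and the captioning prompt are assumptions for illustration, not taken from the paper:

```python
import random
from typing import Dict, List

def build_single_stage_mix(
    caption_pairs: List[Dict],    # e.g. {"image": path, "caption": text}
    synthetic_tasks: List[Dict],  # e.g. {"image": path, "instruction": q, "response": a}
    seed: int = 0,
) -> List[Dict]:
    """Combine captioning data and synthetic tasks into a single shuffled training set,
    so domain knowledge and instruction following are learned together rather than
    in two separate stages."""
    samples = []
    for ex in caption_pairs:
        samples.append({"image": ex["image"],
                        "prompt": "Describe the image.",  # assumed captioning prompt
                        "target": ex["caption"]})
    for ex in synthetic_tasks:
        samples.append({"image": ex["image"],
                        "prompt": ex["instruction"],
                        "target": ex["response"]})
    random.Random(seed).shuffle(samples)
    return samples
```

Interleaving both data types in one pass is what lets the model pick up domain knowledge from captions without a later stage erasing it.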
-----
📊 Results:
→ Outperforms task synthesis based on GPT-4V and manual rules across the biomedicine and food domains
→ Shows consistent improvements across different model scales (2B to 11B parameters)
→ Achieves up to 81.3% accuracy on the VQA-RAD medical question-answering benchmark