From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
A framework that transforms generic vision models into medical experts without losing their versatility
Original Problem 🤔:
Vision Language Models (VLMs) underperform in task-specific scenarios because of the domain gap between their general pre-training data and downstream task data. This limits their effectiveness in specialized applications such as medical diagnosis.
Solution in this Paper 🛠️:
The proposed VITask framework combines three strategies (a minimal sketch follows this list):
Exemplar Prompting: Uses Task-Specific Model (TSM) features to guide VLMs, enhancing adaptability without altering pre-trained features.
Response Distribution Alignment: Aligns response distributions between exemplar-prompted and non-exemplar-prompted models, enabling VLMs to learn from TSMs implicitly.
Contrastive Response Tuning: Optimizes response ranking by maximizing the margin between correct and incorrect image-response pairs.
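To make these components concrete, here is a minimal PyTorch sketch of the three objectives. All function names, tensor shapes, and loss weights are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the three VITask objectives (illustrative, not the official code).
import torch
import torch.nn.functional as F

def exemplar_prompting(tsm_features, projector, text_embeds):
    """Exemplar Prompting: project Task-Specific Model (TSM) features into the
    VLM embedding space and prepend them as extra prompt tokens, leaving the
    VLM's own pre-trained visual features untouched."""
    exemplar_tokens = projector(tsm_features)          # (B, k, d_model), assumed shapes
    return torch.cat([exemplar_tokens, text_embeds], dim=1)

def rda_loss(logits_with_exemplar, logits_without_exemplar):
    """Response Distribution Alignment: align the non-exemplar model's response
    distribution with the exemplar-prompted one, so TSM guidance is distilled
    in and the TSM is not needed at inference time."""
    p_teacher = F.softmax(logits_with_exemplar.detach(), dim=-1)
    log_q_student = F.log_softmax(logits_without_exemplar, dim=-1)
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")

def crt_loss(logp_correct, logp_incorrect, margin=1.0):
    """Contrastive Response Tuning: enforce a margin between the log-likelihoods
    of correct and incorrect image-response pairs so correct responses rank higher."""
    return F.relu(margin - (logp_correct - logp_incorrect)).mean()

# Illustrative combined objective (loss weights are placeholders):
# loss = ce_loss + lambda_rda * rda_loss(...) + lambda_crt * crt_loss(...)
```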
Key Insights from this Paper 💡:
VITask bridges the gap between VLMs and TSMs, enhancing task-specific performance.
Exemplar Prompting improves VLM adaptability using TSM features.
Response Distribution Alignment allows learning without TSMs during inference.
Contrastive Response Tuning refines response accuracy and ranking.
Results 📊:
VITask outperforms vanilla instruction-tuned VLMs and TSMs across 12 medical datasets.
Achieves accuracy of 0.953 on PathMNIST, 0.952 on OCTMNIST, and 0.877 on DermaMNIST.
Reaches F1 scores of 0.772 on DermaMNIST and 0.522 on RetinaMNIST.
Demonstrates robustness to incomplete instructions and flexibility in integrating different TSM architectures.