Task representations in VLMs are modality-agnostic and transferable.
i.e. VLMs encode tasks in a shared vector space that works across text and images
📚 https://arxiv.org/abs/2410.22330
🤔 Original Problem:
Vision-and-Language Models (VLMs) can be prompted to perform many tasks through text, but we don't understand how they internally represent those tasks, or whether that representation depends on the modality used to specify them.
-----
🛠️ Solution in this Paper:
→ Discovered that VLMs map inputs to abstract task representations in a shared embedding space
→ These task vectors stay consistent whether the task is specified through text examples, image examples, or instructions
→ Tokens evolve through three phases during answer generation:
- Input phase: tokens represent the literal input
- Task phase: representations shift to an abstract encoding of the task
- Answer phase: representations converge on the final answer
→ Enabled cross-modal transfer of task vectors (see the sketch after this list) between:
- Text to image queries
- Image to text queries
- Base LLMs to fine-tuned VLMs
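To make this concrete, here is a minimal sketch of extracting a task vector from in-context text examples and patching it into another query's forward pass. It assumes a HuggingFace-style decoder-only model whose blocks sit in model.model.layers; the model name, layer index, prompt template, and helper names are illustrative assumptions, not the paper's code. In the cross-modal setting, the same patch hook would be attached while the VLM processes an image query.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in your VLM's language backbone
LAYER = 15                               # middle block; the best layer varies by model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def task_vector_from_text_examples(examples):
    """Mean residual-stream state at the last prompt token, averaged over ICL examples."""
    states = []
    for x, y in examples:
        ids = tok(f"Input: {x}\nOutput: {y}\n", return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of decoder block LAYER
        states.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(states).mean(dim=0)

def patch_task_vector(vec):
    """Forward hook that overwrites the last prompt token's residual stream at block LAYER."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:  # patch only the prompt (prefill) pass, not each generated token
            hidden[:, -1, :] = vec.to(dtype=hidden.dtype, device=hidden.device)
        return output
    return model.model.layers[LAYER].register_forward_hook(hook)

# Derive the task ("country -> capital") from text examples, then apply it to a new query.
tv = task_vector_from_text_examples([("France", "Paris"), ("Japan", "Tokyo")])
handle = patch_task_vector(tv)
query = tok("Input: Brazil\nOutput:", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**query, max_new_tokens=5)[0], skip_special_tokens=True))
handle.remove()
```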
-----
💡 Key Insights:
→ Task vectors emerge in a shared space where similar tasks cluster together regardless of modality
→ The process of generating answers follows the same pattern across modalities
→ Task vectors can be specified through examples or through instructions, and the two can be combined for better results (see the sketch after this list)
→ Cross-modal transfer gives more flexibility in how a task is specified
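One plausible reading of "combined" is to average the exemplar-derived and instruction-derived task vectors before patching; the snippet below is a sketch under that assumption, with random tensors standing in for real task vectors and d_model chosen arbitrarily.

```python
import torch

def ensemble_task_vectors(t_exemplar: torch.Tensor, t_instruction: torch.Tensor) -> torch.Tensor:
    """Average two task vectors that live in the same residual-stream space."""
    return (t_exemplar + t_instruction) / 2

# Stand-ins for vectors extracted as in the previous sketch (one from ICL
# examples, one from a one-line task instruction); d_model is arbitrary here.
d_model = 4096
t_ensemble = ensemble_task_vectors(torch.randn(d_model), torch.randn(d_model))
print(t_ensemble.shape)  # the ensembled vector is patched exactly like a single task vector
```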
-----
📊 Results:
→ Transferring text-derived task vectors to image queries improved performance by 33% over text examples
→ Ensembling instruction-based and exemplar-based task vectors improved performance by 18%
→ Task vectors maintained 89-95% similarity between the base LLM and the fine-tuned VLM