
"Task Vectors are Cross-Modal"

The podcast on this paper is generated with Google's Illuminate.

Task representations in VLMs are modality-agnostic and transferable.

i.e., VLMs encode tasks in a shared vector space that works across both text and images

📚 https://arxiv.org/abs/2410.22330

🤔 Original Problem:

Vision-and-Language Models (VLMs) can handle a wide range of tasks specified through text, but we don't understand how they internally process and represent these tasks across different modalities.

-----

🛠️ Solution in this Paper:

→ Discovered that VLMs map inputs to abstract task representations in a shared embedding space

→ These task vectors remain consistent whether specified through text examples, image examples, or instructions

→ Tokens evolve through three phases during answer generation:

- Input phase: tokens represent the literal input

- Task phase: representations shift to encode the task itself

- Answer phase: representations converge to the final answer

→ Enabled cross-modal transfer of task vectors (a minimal sketch follows this list):

- from text exemplars to image queries

- from image exemplars to text queries

- from the base LLM to its fine-tuned VLM
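To make the mechanics concrete, here is a minimal, hypothetical sketch of the extract-and-patch idea: read a task vector off the last-token hidden state at a middle layer from text examples, then patch it into the same position while running an image query. It uses a toy PyTorch transformer as a stand-in for a VLM's language trunk; the model, layer index, and tensor shapes are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of "extract a task vector from text examples, patch it into
# an image query", assuming a generic hookable PyTorch transformer.
# The toy model, layer index, and shapes are illustrative assumptions,
# not the paper's exact models or settings.
import torch
import torch.nn as nn

D_MODEL, N_LAYERS, PATCH_LAYER = 64, 6, 3  # hypothetical sizes

# Toy stand-in for a VLM's language-model trunk (real use: hook the actual VLM).
block = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
model = nn.TransformerEncoder(block, num_layers=N_LAYERS, enable_nested_tensor=False)

def last_token_hidden(x: torch.Tensor, layer: int) -> torch.Tensor:
    """Run the trunk and return the final-token hidden state at `layer`."""
    h = x
    for i, blk in enumerate(model.layers):
        h = blk(h)
        if i == layer:
            break
    return h[:, -1, :]                               # (batch, d_model)

# 1) Task vector: mean last-token hidden state at a middle layer, computed
#    over a handful of text in-context examples of the task.
text_examples = torch.randn(8, 16, D_MODEL)          # 8 prompts x 16 tokens (dummy embeddings)
task_vector = last_token_hidden(text_examples, PATCH_LAYER).mean(dim=0)

# 2) Cross-modal patch: overwrite the same position while running an image
#    query, so the task is specified via text but applied to image input.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, -1, :] = task_vector                  # inject the task vector
    return patched                                   # replaces the layer's output

handle = model.layers[PATCH_LAYER].register_forward_hook(patch_hook)
image_query = torch.randn(1, 16, D_MODEL)            # dummy image-token embeddings
out = model(image_query)                             # forward pass with the patched task
handle.remove()
print(out.shape)                                     # torch.Size([1, 16, 64])
```

The same hook-based patching is what makes the cross-modal transfers above possible: the task is specified in one modality and steers generation for a query in another.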

-----

💡 Key Insights:

→ Task vectors emerge in a shared space where similar tasks cluster together regardless of modality

→ The process of generating answers follows the same three-phase pattern across modalities

→ Task vectors can be specified through examples or instructions, and the two can be combined for better results (see the sketch after this list)

→ Cross-modal transfer gives more flexible options for specifying a task
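One illustrative way to combine an instruction-derived and an exemplar-derived task vector is to average them before patching, as in the earlier sketch. The names, sizes, and plain average below are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: ensembling an instruction-derived and an exemplar-derived
# task vector by simple averaging before patching it into the model.
import torch

d_model = 64                                      # illustrative hidden size
v_instruction = torch.randn(d_model)              # task vector read from an instruction prompt
v_examples = torch.randn(d_model)                 # task vector read from in-context examples
v_combined = 0.5 * (v_instruction + v_examples)   # combined vector to patch in
```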

-----

📊 Results:

→ Text-to-image transfer improved performance by 33% over text examples

→ Ensembling text instructions with examples showed 18% better performance

→ Task vectors maintained 89-95% similarity between the base LLM and its fine-tuned VLM
