"LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation"

The podcast on this paper is generated with Google's Illuminate.

LLM2CLIP makes LLMs teach CLIP how to see the world better

Bridging the gap between vision and language using LLMs as teachers

https://arxiv.org/abs/2411.04997

🎯 Original Problem:

CLIP, while powerful for multimodal tasks, struggles with long and complex captions because of its small, shallow text encoder. Modern LLMs understand language far better, but they cannot simply replace CLIP's text encoder: their native output features have poor discriminability, so embeddings of different captions are hard to tell apart.

-----

🛠️ Solution in this Paper:

→ LLM2CLIP turns an LLM into an effective CLIP text encoder through caption contrastive fine-tuning (stage 1; see the first sketch after this list)

→ Uses LoRA to fine-tune the LLM efficiently so that its output features discriminate between captions

→ Freezes the LLM during CLIP training to preserve its knowledge, training only lightweight adapter layers on top of its features (stage 2; see the second sketch after this list)

→ Pre-extracts caption features from the frozen LLM once, keeping computational cost similar to regular CLIP training
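
A minimal sketch of stage 1, under stated assumptions: the LLM is wrapped with LoRA adapters (Hugging Face transformers + peft here) and fine-tuned with an in-batch contrastive loss that pulls two captions of the same image together. The model name, mean pooling, and loss details are illustrative, not the paper's exact recipe.

```python
# Sketch of stage 1: caption-contrastive fine-tuning of the LLM with LoRA.
# Assumptions (illustrative, not the paper's exact setup): Llama-3-8B as the
# LLM, masked mean pooling over the last hidden states as the caption
# embedding, and an in-batch symmetric InfoNCE loss over caption pairs.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "meta-llama/Meta-Llama-3-8B"                  # assumed; any decoder-only LLM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token             # Llama has no pad token by default
llm = AutoModel.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# LoRA keeps fine-tuning cheap: only low-rank updates to attention weights are trained.
llm = get_peft_model(llm, LoraConfig(r=16, lora_alpha=32,
                                     target_modules=["q_proj", "v_proj"]))

def embed(captions):
    """Encode captions and mean-pool the last hidden states into one vector each."""
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = llm(**batch).last_hidden_state            # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
    return F.normalize(pooled, dim=-1)

def caption_contrastive_loss(caps_a, caps_b, temperature=0.05):
    """In-batch InfoNCE: two captions of the same image should retrieve each other."""
    za, zb = embed(caps_a), embed(caps_b)
    logits = za @ zb.T / temperature                   # (B, B) similarity matrix
    labels = torch.arange(len(caps_a))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```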
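
And a sketch of stage 2, also under assumptions: the fine-tuned LLM is frozen, its caption embeddings are cached once, and only a small adapter (plus the vision encoder) is trained with the standard symmetric CLIP loss. Class names, dimensions, and the commented training step are hypothetical.

```python
# Sketch of stage 2: CLIP-style training with the LLM frozen.
# Text features are pre-extracted once, so the LLM never runs inside the
# training loop; only the adapter (and the vision encoder) receive gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    """Small trainable head mapping frozen LLM embeddings into the CLIP space."""
    def __init__(self, llm_dim=4096, clip_dim=1024):   # dimensions are illustrative
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, clip_dim), nn.GELU(), nn.Linear(clip_dim, clip_dim)
        )

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

@torch.no_grad()
def preextract_text_features(llm_embed_fn, captions, batch_size=256):
    """Run the frozen LLM once over all captions and cache its embeddings."""
    feats = [llm_embed_fn(captions[i:i + batch_size])
             for i in range(0, len(captions), batch_size)]
    return torch.cat(feats).cpu()                      # reused for every epoch

def clip_loss(image_feats, text_feats, temperature=0.07):
    """Standard symmetric InfoNCE between image and caption embeddings."""
    logits = image_feats @ text_feats.T / temperature
    labels = torch.arange(len(image_feats), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Hypothetical training step (vision_encoder and adapter train; cached_text does not):
# img_feats = F.normalize(vision_encoder(images), dim=-1)
# txt_feats = adapter(cached_text[batch_indices].to(img_feats.device))
# loss = clip_loss(img_feats, txt_feats)
```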

-----

💡 Key Insights:

→ Native LLM output features have poor discriminability: only 18.4% caption retrieval accuracy (see the metric sketch after this list)

→ Caption contrastive fine-tuning boosts the LLM's caption retrieval accuracy to 73%

→ Freezing LLM gradients preserves knowledge while reducing computational costs

→ Adding adapter layers enables effective vision-language alignment
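
The caption retrieval accuracy figures above come from a caption-to-caption retrieval test. Below is a minimal sketch of one such metric, assuming two row-aligned sets of embeddings for two captions of the same images; the paper's exact benchmark protocol may differ.

```python
# Sketch of a caption retrieval accuracy (CRA) style metric: for each caption
# embedding in set A, check whether its nearest neighbour in set B is the
# caption of the same image (top-1 accuracy). The exact protocol is an assumption.
import torch
import torch.nn.functional as F

def caption_retrieval_accuracy(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """emb_a, emb_b: (N, D) embeddings of two captions per image, row-aligned."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sims = a @ b.T                                     # (N, N) cosine similarities
    top1 = sims.argmax(dim=1)                          # nearest caption in set B
    return (top1 == torch.arange(len(a))).float().mean().item()
```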

-----

📊 Results:

→ Improved previous SOTA EVA02 model by 16.5% on text retrieval tasks

→ Transformed English-only CLIP into state-of-the-art cross-lingual model

→ Enhanced performance when integrated with multimodal models like LLaVA 1.5

→ Maintained training costs similar to regular CLIP fine-tuning
