
"CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering"

A podcast on this paper was generated with Google's Illuminate.

CLIP-UP enables Vision-Language Models to identify and handle unanswerable visual questions through a lightweight training approach that preserves the original model's capabilities.

-----

https://arxiv.org/abs/2501.01371

🤔 Original Problem:

Vision-Language Models often produce wrong answers to unanswerable visual questions, such as questions about objects not present in the image, which reduces their reliability.

-----

🔧 Solution in this Paper:

→ CLIP-UP extracts question-image alignment information through correlation vectors built from CLIP embeddings (first sketch after this list)

→ It projects these vectors into the model's feature space using lightweight trainable layers

→ A Mixture-of-Experts approach handles different types of unanswerable questions through specialized projections (second sketch after this list)

→ The method trains only the projection layers while keeping the original model weights frozen

→ Simple rule-based classification determines when to activate the new embedding vector
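
Here is a minimal sketch of the correlation-vector step, assuming the Hugging Face Transformers CLIP API and the openai/clip-vit-base-patch32 checkpoint. Reading the correlation vector as the elementwise product of the normalized CLIP image and text embeddings is an assumption; the paper's exact construction may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def correlation_vector(image: Image.Image, question: str) -> torch.Tensor:
    """Encode, dimension by dimension, how well the question aligns with the image."""
    inputs = processor(text=[question], images=[image], return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return img_emb * txt_emb  # summing this product gives the CLIP cosine similarity
```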
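And a sketch of the Mixture-of-Experts projection with a rule-based gate, in PyTorch. The dimensions, expert count, class names, and keyword rules below are all illustrative assumptions; the paper defines its own rule for deciding when, and with which expert, to inject the extra embedding.

```python
import torch
import torch.nn as nn

CLIP_DIM, VLM_DIM, NUM_EXPERTS = 512, 4096, 3  # illustrative sizes, not the paper's

class ExpertProjection(nn.Module):
    """Small MLP mapping a CLIP correlation vector into the VLM's embedding space."""
    def __init__(self, in_dim: int = CLIP_DIM, out_dim: int = VLM_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, corr_vec: torch.Tensor) -> torch.Tensor:
        return self.net(corr_vec)

class CLIPUPProjector(nn.Module):
    """One expert projection per unanswerability type."""
    def __init__(self):
        super().__init__()
        self.experts = nn.ModuleList(ExpertProjection() for _ in range(NUM_EXPERTS))

    def forward(self, corr_vec: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # expert_idx comes from the rule-based classifier below.
        return self.experts[expert_idx](corr_vec)

def route(question: str) -> int | None:
    """Toy rule-based gate: decide whether and which expert to activate."""
    if "how many" in question.lower():
        return 0       # e.g., counting-style questions
    if "color" in question.lower():
        return 1       # e.g., attribute questions
    return None        # rule says: do not inject the extra embedding
```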

-----

💡 Key Insights:

→ Global alignment information from CLIP helps detect unanswerability

→ Lightweight training can effectively add new capabilities without full model fine-tuning (see the sketch after this list)

→ Different unanswerability types benefit from specialized expert projections
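
A minimal sketch of the freeze-and-train setup, assuming PyTorch; `freeze_backbone` and the `projector` argument are hypothetical names, not the paper's code. Counting the trainable parameters makes the "lightweight" claim concrete.

```python
import torch

def freeze_backbone(vlm: torch.nn.Module, projector: torch.nn.Module) -> None:
    for p in vlm.parameters():
        p.requires_grad = False   # original model weights stay fixed
    for p in projector.parameters():
        p.requires_grad = True    # only the projection experts are trained

    trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable / 1e6:.1f}M")  # ~12.6M in the paper
```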

-----

📊 Results:

→ Achieves state-of-the-art results on the MM-UPD benchmark

→ Maintains the original model's performance on standard tasks

→ Requires only 12.6M trainable parameters for the projection layers
