CLIP-UP enables Vision-Language Models to identify and handle unanswerable visual questions through a lightweight training approach that preserves original model capabilities.
-----
https://arxiv.org/abs/2501.01371
🤔 Original Problem:
Vision-Language Models often give confident but wrong answers to unanswerable visual questions, such as questions about objects that are not present in the image, which undermines their reliability.
-----
🔧 Solution in this Paper:
→ CLIP-UP extracts question-image alignment information as correlation vectors derived from CLIP image and text embeddings
→ It projects these vectors into the model's feature space using lightweight trainable layers
→ A Mixture-of-Experts approach handles different types of unanswerable questions through specialized projections
→ The method only trains projection layers while keeping original model weights frozen
→ Simple rule-based classification determines when to activate the new embedding vector (see the sketch after this list)
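
To make the pipeline concrete, here is a minimal PyTorch sketch of the projection step: a CLIP-based question-image correlation vector is routed through a small mixture of expert MLPs into the VLM's embedding space, and only these projection layers are trained. All names, dimensions, and the element-wise-product construction of the correlation vector are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the CLIP-UP projection step (hypothetical names/shapes).
# Assumes precomputed CLIP image/text embeddings; the paper's exact correlation
# vector construction and expert routing may differ.
import torch
import torch.nn as nn

class CorrelationMoEProjector(nn.Module):
    """Projects a CLIP question-image correlation vector into the VLM's
    token-embedding space via a small mixture of expert MLPs."""
    def __init__(self, clip_dim=768, vlm_dim=4096, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(clip_dim, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(clip_dim, num_experts)  # soft routing over experts

    def forward(self, corr_vec):
        weights = self.router(corr_vec).softmax(dim=-1)                        # (B, E)
        expert_outs = torch.stack([e(corr_vec) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)                # (B, D)

def correlation_vector(img_emb, txt_emb):
    # Element-wise product of L2-normalized CLIP embeddings as a simple
    # question-image alignment signal (one plausible choice, not necessarily
    # the paper's exact construction).
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return img_emb * txt_emb

# Usage: only the projector's parameters are trained; the VLM stays frozen.
img_emb = torch.randn(2, 768)   # stand-in for CLIP image embeddings
txt_emb = torch.randn(2, 768)   # stand-in for CLIP text embeddings of the question
projector = CorrelationMoEProjector()
unanswerability_token = projector(correlation_vector(img_emb, txt_emb))  # (2, 4096)
# The resulting vector would be inserted into the frozen VLM's input sequence
# when a rule-based check flags the question as potentially unanswerable.
```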
-----
💡 Key Insights:
→ Global alignment information from CLIP helps detect unanswerability
→ Lightweight training can effectively add new capabilities without full model fine-tuning
→ Different unanswerability types benefit from specialized expert projections
-----
📊 Results:
→ Achieves state-of-the-art results on MM-UPD benchmark
→ Maintains original model performance on standard tasks
→ Requires only 12.6M parameters for projection layers