MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
A benchmark that tests if Vision-Language Models (VLMs) can truly understand and combine visual-text relationships like humans do.
Making sure Vision-Language Models (VLMs) don't just memorize but actually understand image-text relationships.
🤖 Original Problem:
Current benchmarks lack comprehensive evaluation of VLMs' compositionality - the ability to combine known visual and textual elements in novel ways. Existing evaluations focus mainly on basic object-attribute relationships while missing deeper aspects like object interactions and counting.
🔧 Solution in this Paper:
MMCOMPOSITION introduces a novel benchmark with 13 distinct categories:
→ Tests perception tasks (attribute, object, counting, relation)
→ Evaluates reasoning capabilities (attribute, object, counting, relation)
→ Includes probing tasks for complex compositions
→ Features 4,342 human-annotated questions
→ Combines single/multi-image scenarios with varied choice formats
→ Uses rigorous data filtering with difficulty classification (easy to super hard)
🔍 Key Insights:
→ Visual encoder design critically impacts performance
→ Preserving image resolution improves results
→ Larger language decoders show better performance
→ Training data volume directly correlates with compositionality
→ GPT-4o underperforms on basic tasks despite larger model size
📊 Results:
→ Human experts: 90.31% accuracy
→ Best model (InternVL2-40B): 67.95% accuracy
→ GPT-4o: 59.71% accuracy
→ Random choice baseline: 30.15%
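The accuracy figures above are fraction-correct scores over the benchmark's questions. A minimal sketch of how such scoring could be computed follows; the field names and the exact-set-match rule for indefinite-choice questions are assumptions for illustration, not details taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    # Hypothetical record pairing a model's selected option labels
    # with the ground-truth labels for one question.
    predicted: frozenset[str]   # e.g. {"A"} or {"A", "C"}
    answer: frozenset[str]      # ground-truth option(s)

def accuracy(preds: list[Prediction]) -> float:
    """Fraction of questions answered exactly correctly.

    Assumes an indefinite-choice question counts as correct only when
    the predicted option set matches the ground-truth set exactly.
    """
    if not preds:
        return 0.0
    correct = sum(p.predicted == p.answer for p in preds)
    return correct / len(preds)

# Example: 2 of 3 questions correct -> ~66.67% accuracy
preds = [
    Prediction(frozenset({"A"}), frozenset({"A"})),
    Prediction(frozenset({"A", "C"}), frozenset({"A", "C"})),
    Prediction(frozenset({"B"}), frozenset({"D"})),
]
print(f"{accuracy(preds):.2%}")
```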
🎯 MMCOMPOSITION benchmark design
→ Contains 13 distinct categories covering perception tasks (attribute, object, counting, relation perception), reasoning tasks (attribute, object, counting, relation reasoning), and probing tasks
→ Includes both single-image and multi-image scenarios
→ Features both single-choice and indefinite-choice formats
→ Total of 4,342 high-quality human-annotated questions
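Putting the design together, one way a single benchmark item might be represented is sketched below; the record layout, field names, and difficulty labels are illustrative assumptions rather than the actual MMCOMPOSITION data format:

```python
from dataclasses import dataclass, field
from enum import Enum

class Difficulty(Enum):
    # Assumed labels spanning the paper's "easy to super hard" range.
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"
    SUPER_HARD = "super hard"

@dataclass
class MMCompositionItem:
    question_id: str
    category: str                 # one of the 13 categories, e.g. "counting reasoning"
    images: list[str]             # one or more image paths (single- or multi-image)
    question: str
    options: dict[str, str]       # option label -> option text
    answers: set[str] = field(default_factory=set)  # one label (single-choice) or several (indefinite-choice)
    difficulty: Difficulty = Difficulty.EASY
```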