MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
A benchmark that tests if Vision-Language Models (VLMs) can truly understand and combine visual-text relationships like humans do.
Making sure Vision-Language Models (VLMs) don't just memorize but actually understand image-text relationships.
🤔 Original Problem:
Current benchmarks lack comprehensive evaluation of VLMs' compositionality - the ability to combine known visual and textual elements in novel ways. Existing evaluations focus mainly on basic object-attribute relationships while missing deeper aspects like object interactions and counting.
🔧 Solution in this Paper:
MMCOMPOSITION introduces a novel benchmark with 13 distinct categories:
→ Tests perception tasks (attribute, object, counting, relation)
→ Evaluates reasoning capabilities (attribute, object, counting, relation)
→ Includes probing tasks for complex compositions
→ Features 4,342 human-annotated questions
→ Combines single/multi-image scenarios with varied choice formats
→ Uses rigorous data filtering with difficulty classification (easy to super hard); a hypothetical question-record layout is sketched after this list
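To make the question structure concrete, here is a minimal sketch of how one benchmark item could be represented in code. The field names (category, difficulty, image_paths, choices, answer, is_indefinite_choice) are illustrative assumptions based on the description above, not the paper's released data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one MMCOMPOSITION question.
# Field names are illustrative, not the paper's actual schema.
@dataclass
class CompositionQuestion:
    category: str            # one of the 13 categories, e.g. "counting_perception"
    difficulty: str          # e.g. "easy", "medium", "hard", "super_hard"
    image_paths: List[str]   # one path for single-image items, several for multi-image items
    question: str
    choices: List[str]
    answer: List[int]        # indices of correct choices; length 1 for single-choice,
                             # possibly >1 for indefinite-choice questions
    is_indefinite_choice: bool = False

# Example single-image, single-choice item (invented for illustration)
q = CompositionQuestion(
    category="counting_perception",
    difficulty="hard",
    image_paths=["images/0001.jpg"],
    question="How many cups are on the table?",
    choices=["two", "three", "four", "five"],
    answer=[1],
)
```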
🔍 Key Insights:
→ Visual encoder design critically impacts performance
→ Preserving image resolution improves results
→ Larger language decoders show better performance
→ Training data volume directly correlates with compositionality
→ GPT-4o underperforms on basic tasks despite larger model size
📊 Results:
→ Human experts: 90.31% accuracy
→ Best model (InternVL2-40B): 67.95% accuracy
→ GPT-4o: 59.71% accuracy
→ Random choice baseline: 30.15%
🎯 MMCOMPOSITION benchmark design
→ Contains 13 distinct categories covering perception tasks (attribute, object, counting, relation perception), reasoning tasks (attribute, object, counting, relation reasoning), and probing tasks
→ Includes both single-image and multi-image scenarios
→ Features both single-choice and indefinite-choice formats
→ Total of 4,342 high-quality human-annotated questions; see the scoring sketch after this list
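Because the benchmark mixes single-choice and indefinite-choice formats, scoring has to compare sets of selected options rather than single letters. Below is a minimal, hypothetical evaluation loop; `ask_model` and the exact-match rule for indefinite-choice items are assumptions, not the paper's official scoring code.

```python
from typing import Callable, Iterable, Set

def score_prediction(predicted: Set[int], gold: Set[int]) -> float:
    # Exact-match scoring (an assumption): an item counts as correct only if the
    # predicted option set equals the gold set. This covers single-choice items
    # (sets of size 1) and indefinite-choice items (possibly larger sets).
    return 1.0 if predicted == gold else 0.0

def evaluate(questions: Iterable, ask_model: Callable[[object], Set[int]]) -> float:
    # `ask_model` is a hypothetical callable wrapping a VLM: it takes a question
    # record and returns the indices of the options the model selected.
    scores = [score_prediction(ask_model(q), set(q.answer)) for q in questions]
    return sum(scores) / len(scores) if scores else 0.0

# Usage sketch:
# accuracy = evaluate(all_questions, ask_model=my_vlm_wrapper)
# print(f"Accuracy: {accuracy:.2%}")  # compare against the 30.15% random-choice baseline
```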



