MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
A benchmark that tests if Vision-Language Models (VLMs) can truly understand and combine visual-text relationships like humans do.
Making sure Vision-Language Models (VLMs) don't just memorize but actually understand image-text relationships.
🤔 Original Problem:
Current benchmarks lack comprehensive evaluation of VLMs' compositionality - the ability to combine known visual and textual elements in novel ways. Existing evaluations focus mainly on basic object-attribute relationships while missing deeper aspects like object interactions and counting.
🔧 Solution in this Paper:
MMCOMPOSITION introduces a novel benchmark with 13 distinct categories:
→ Tests perception tasks (attribute, object, counting, relation)
→ Evaluates reasoning capabilities (attribute, object, counting, relation)
→ Includes probing tasks for complex compositions
→ Features 4,342 human-annotated questions
→ Combines single/multi-image scenarios with varied choice formats
→ Uses rigorous data filtering with difficulty classification (easy to super hard); a hypothetical question-record layout is sketched after this list
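To make the question structure concrete, here is a minimal sketch of how one benchmark item could be represented in code. The field names (category, difficulty, image_paths, choices, answer, is_indefinite_choice) are illustrative assumptions based on the description above, not the paper's released data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one MMCOMPOSITION question.
# Field names are illustrative, not the paper's actual schema.
@dataclass
class CompositionQuestion:
    category: str            # one of the 13 categories, e.g. "counting_perception"
    difficulty: str          # e.g. "easy", "medium", "hard", "super_hard"
    image_paths: List[str]   # one path for single-image items, several for multi-image items
    question: str
    choices: List[str]
    answer: List[int]        # indices of correct choices; length 1 for single-choice,
                             # possibly >1 for indefinite-choice questions
    is_indefinite_choice: bool = False

# Example single-image, single-choice item (invented for illustration)
q = CompositionQuestion(
    category="counting_perception",
    difficulty="hard",
    image_paths=["images/0001.jpg"],
    question="How many cups are on the table?",
    choices=["two", "three", "four", "five"],
    answer=[1],
)
```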
🔍 Key Insights:
→ Visual encoder design critically impacts performance
→ Preserving image resolution improves results
→ Larger language decoders show better performance
→ Training data volume directly correlates with compositionality
→ GPT-4o underperforms on basic tasks despite larger model size
📊 Results:
→ Human experts: 90.31% accuracy
→ Best model (InternVL2-40B): 67.95% accuracy
→ GPT-4o: 59.71% accuracy
→ Random choice baseline: 30.15%
🎯 MMCOMPOSITION benchmark design
→ Contains 13 distinct categories covering perception tasks (attribute, object, counting, relation perception), reasoning tasks (attribute, object, counting, relation reasoning), and probing tasks
→ Includes both single-image and multi-image scenarios
→ Features both single-choice and indefinite-choice formats
→ Total of 4,342 high-quality human-annotated questions; see the scoring sketch after this list
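Because the benchmark mixes single-choice and indefinite-choice formats, scoring has to compare sets of selected options rather than single letters. Below is a minimal, hypothetical evaluation loop; `ask_model` and the exact-match rule for indefinite-choice items are assumptions, not the paper's official scoring code.

```python
from typing import Callable, Iterable, Set

def score_prediction(predicted: Set[int], gold: Set[int]) -> float:
    # Exact-match scoring (an assumption): an item counts as correct only if the
    # predicted option set equals the gold set. This covers single-choice items
    # (sets of size 1) and indefinite-choice items (possibly larger sets).
    return 1.0 if predicted == gold else 0.0

def evaluate(questions: Iterable, ask_model: Callable[[object], Set[int]]) -> float:
    # `ask_model` is a hypothetical callable wrapping a VLM: it takes a question
    # record and returns the indices of the options the model selected.
    scores = [score_prediction(ask_model(q), set(q.answer)) for q in questions]
    return sum(scores) / len(scores) if scores else 0.0

# Usage sketch:
# accuracy = evaluate(all_questions, ask_model=my_vlm_wrapper)
# print(f"Accuracy: {accuracy:.2%}")  # compare against the 30.15% random-choice baseline
```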



