MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
The updated, more advanced MMMU-Pro exposes limitations in current multimodal models through enhanced evaluation techniques.
Model performance drops by 16.8% to 26.9% on MMMU-Pro compared to MMMU.
Original Problem 🔍:
Existing multimodal benchmarks like MMMU allow text-only models to exploit shortcuts, limiting true evaluation of multimodal understanding and reasoning capabilities.
Key Insights from this Paper 💡:
• Text-only LLMs can solve some MMMU questions without visual input (see the sketch after this list)
• Limited options enable guessing strategies
• True multimodal understanding requires seamless integration of visual and textual information
• Current models struggle with vision-only input scenarios
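The text-only shortcut noted in the first bullet can be probed with a simple filter like the sketch below. This is only a minimal illustration of the idea, not the authors' released code: the model callables, the number of models, and the threshold of 3 are hypothetical placeholders.

```python
from typing import Callable, Sequence

def is_text_only_solvable(
    question: str,
    options: Sequence[str],
    answer: str,
    text_only_models: Sequence[Callable[[str, Sequence[str]], str]],
    threshold: int = 3,
) -> bool:
    """Flag a question as text-only solvable if at least `threshold` of the
    text-only models pick the correct option without ever seeing the image.

    Each entry of `text_only_models` is a callable that takes (question, options)
    and returns the chosen option string; how it queries an LLM is up to the caller.
    """
    correct = sum(model(question, options) == answer for model in text_only_models)
    return correct >= threshold
```

Questions that most text-only models answer correctly are candidates for removal, since answering them evidently does not require the image.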
Solution in this Paper 🛠️:
• Filtered out text-only solvable questions using strong LLMs
• Augmented options from 4 to 10 to reduce guessing effectiveness (illustrated in the sketch after this list)
• Introduced vision-only input setting with questions embedded in images
• Evaluated models across standard and vision-only settings
• Explored impact of OCR prompts and Chain of Thought (CoT) reasoning
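A rough sketch of the option-augmentation step is shown below. It is not the paper's actual pipeline: here the extra distractors are simply passed in as a list, whereas MMMU-Pro derives additional plausible distractors through its own generation and review procedure described in the paper.

```python
import random
import string

def augment_options(options: list[str], answer_idx: int,
                     extra_distractors: list[str], target: int = 10,
                     seed: int = 0) -> tuple[list[str], str]:
    """Pad a 4-option question to `target` options and return the shuffled
    option list together with the new answer letter."""
    answer_text = options[answer_idx]
    # Take just enough extra distractors to reach the target count.
    pool = options + extra_distractors[: target - len(options)]
    rng = random.Random(seed)
    rng.shuffle(pool)
    new_idx = pool.index(answer_text)
    return pool, string.ascii_uppercase[new_idx]
```

With 10 options instead of 4, the expected accuracy of blind guessing falls from 25% to 10%, which is the point of this step.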
Results 📊:
• GPT-4o (0513) achieves 51.9% on MMMU-Pro vs 69.1% on MMMU
• OCR prompts show minimal impact on performance
• CoT generally improves results, but effects vary by model
• Vision-only setting poses significant challenges (e.g., LLaVA-OneVision-72B drops 14.0%)
MMMU-Pro employs a rigorous three-step construction process (as shown in the image below) that builds upon MMMU:
(1) filtering out questions answerable by text-only language models,
(2) augmenting candidate options to reduce the effectiveness of guessing based on the options, and
(3) introducing a vision-only input setting (as shown in Figure 3) where models are presented with questions embedded in a screenshot or photo.
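The vision-only setting of step (3) can be approximated with a minimal rendering sketch like the one below, which draws the question and its options into a single image so the model receives no separate text input. The real MMMU-Pro inputs are screenshots or photos of questions, so this plain-text rendering is only an illustration; the helper name and layout parameters are made up for this example.

```python
import string
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_question_image(question: str, options: list[str],
                          width: int = 800, margin: int = 20) -> Image.Image:
    """Render a question and its options into one image so that the model
    must read everything from the picture (no accompanying text prompt)."""
    lines: list[str] = textwrap.wrap(question, width=80)
    for letter, opt in zip(string.ascii_uppercase, options):
        lines.extend(textwrap.wrap(f"({letter}) {opt}", width=80))
    font = ImageFont.load_default()
    line_height = 16
    height = margin * 2 + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img
```

In this setting a model must locate, read, and interpret the question inside the image before it can reason about it, which is what makes vision-only evaluation markedly harder than the standard setting.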