
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

The podcast on this paper is generated with Google's Illuminate.

MMIE stress-tests AI models by throwing 20,000+ mixed text-and-image tasks at them, and the models mostly fall short.

It reveals that Large Vision-Language Models (LVLMs) stumble when they have to switch between understanding and generating text and images.

📚 https://arxiv.org/abs/2410.10139

Original Problem 🔍:

Existing multimodal benchmarks lack comprehensive evaluation of interleaved text-and-image comprehension and generation capabilities in Large Vision-Language Models (LVLMs).

-----

Solution in this Paper 🛠️:

• MMIE: A large-scale benchmark with 20,103 multimodal queries across 3 categories, 12 fields, and 102 subfields

• Supports both interleaved inputs and outputs

• Includes multiple-choice and open-ended question formats

• Proposes an automated evaluation metric powered by a fine-tuned scoring model (see the sketch below)
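
To make the interleaved setup concrete, here is a minimal sketch of what an MMIE-style query record and an automated scoring pass could look like. The field names, the `InterleavedQuery` class, and the `score_response` helper are illustrative assumptions, not the paper's actual data schema or scoring API; the fine-tuned scoring model is stubbed with a trivial heuristic purely for demonstration.

```python
from dataclasses import dataclass, field

# Hypothetical record for one interleaved query: text segments and image
# references alternate, and the expected answer may itself mix modalities.
@dataclass
class InterleavedQuery:
    question_parts: list                     # e.g. ["Compare", "<img:a.png>", "with", "<img:b.png>"]
    answer_format: str                       # "multiple_choice" or "open_ended"
    choices: list = field(default_factory=list)  # only used for multiple-choice items
    reference: str = ""                      # reference answer / correct choice

def score_response(query: InterleavedQuery, model_response: str) -> float:
    """Assumed scoring logic: exact match for multiple-choice items,
    a learned scoring model (stubbed here) for open-ended answers."""
    if query.answer_format == "multiple_choice":
        return 1.0 if model_response.strip() == query.reference.strip() else 0.0
    # Placeholder for the fine-tuned scoring model described in the paper;
    # a token-overlap heuristic stands in for it in this sketch.
    ref_tokens = set(query.reference.lower().split())
    resp_tokens = set(model_response.lower().split())
    return len(ref_tokens & resp_tokens) / max(len(ref_tokens), 1)

if __name__ == "__main__":
    q = InterleavedQuery(
        question_parts=["What changed between", "<img:before.png>", "and", "<img:after.png>", "?"],
        answer_format="open_ended",
        reference="The red block was moved onto the blue block.",
    )
    print(score_response(q, "The red block is now on top of the blue block."))
```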

-----

Key Insights from this Paper 💡:

• Interleaved multimodal comprehension and generation is crucial for next-generation LVLMs

• Current benchmarks are limited in scale, scope, and evaluation depth

• Automated evaluation metrics are needed to reduce bias and improve reliability

• LVLMs show significant room for improvement in interleaved multimodal tasks

-----

Results 📊:

• The best-performing model (GPT-4o + SDXL) achieved a score of only 65.47%

• Open-source interleaved LVLMs performed poorly, averaging 50.80%

• Integrated approaches outperformed interleaved LVLMs by 25.2% on average
