MMIE stress-tests AI models with over 20,000 interleaved text-and-image tasks - and they mostly fall short.
It reveals that Large Vision-Language Models (LVLMs) stumble when they have to comprehend and generate text and images together.
📚 https://arxiv.org/abs/2410.10139
Original Problem 🔍:
Existing multimodal benchmarks lack comprehensive evaluation of interleaved text-and-image comprehension and generation capabilities in Large Vision-Language Models (LVLMs).
-----
Solution in this Paper 🛠️:
• MMIE: A large-scale benchmark with 20,103 multimodal queries across 3 categories, 12 fields, and 102 subfields
• Supports both interleaved inputs and outputs
• Includes multiple-choice and open-ended question formats
• Proposes an automated evaluation metric using a fine-tuned scoring model (see the sketch after this list)
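A minimal sketch of what such scoring-model-based evaluation could look like, assuming a hypothetical fine-tuned multimodal judge served through Hugging Face transformers; the model name, rubric prompt, and score parsing here are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: scoring interleaved text-and-image answers with a fine-tuned
# multimodal judge. Model name, rubric, and parsing are illustrative assumptions.
import re
from transformers import AutoProcessor, AutoModelForVision2Seq

JUDGE = "your-org/mmie-style-scorer"  # hypothetical fine-tuned scoring model

processor = AutoProcessor.from_pretrained(JUDGE)
judge = AutoModelForVision2Seq.from_pretrained(JUDGE, device_map="auto")

RUBRIC = (
    "You are grading an interleaved text-and-image answer.\n"
    "Question: {question}\n"
    "Model answer (text): {answer_text}\n"
    "Rate the answer from 0 to 5 for correctness, image-text coherence, "
    "and completeness. Reply with a single integer."
)

def score_answer(question, answer_text, answer_images=None):
    """Return a 0-5 judge score for one interleaved answer (images optional)."""
    prompt = RUBRIC.format(question=question, answer_text=answer_text)
    # Assumes the judge's processor tolerates text-only inputs when images=None.
    inputs = processor(text=prompt, images=answer_images or None,
                       return_tensors="pt").to(judge.device)
    out = judge.generate(**inputs, max_new_tokens=8)
    reply = processor.batch_decode(out, skip_special_tokens=True)[0]
    match = re.search(r"[0-5]", reply)  # take the first digit 0-5 in the reply
    return int(match.group()) if match else 0

def benchmark_score(samples):
    """Average judge scores over (question, answer_text, images) triples,
    rescaled to a percentage like the leaderboard-style numbers below."""
    if not samples:
        return 0.0
    scores = [score_answer(q, a, imgs) for q, a, imgs in samples]
    return 100.0 * sum(scores) / (5 * len(scores))
```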
-----
Key Insights from this Paper 💡:
• Interleaved multimodal comprehension and generation is crucial for next-generation LVLMs
• Current benchmarks are limited in scale, scope, and evaluation depth
• Automated evaluation metrics are needed to reduce bias and improve reliability
• LVLMs show significant room for improvement in interleaved multimodal tasks
-----
Results 📊:
• The best-performing combination (GPT-4o + SDXL) scored only 65.47%
• Open-source interleaved LVLMs performed poorly, averaging 50.80%
• Integrated approaches outperformed interleaved LVLMs by 25.2% on average