"MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.00698
The paper addresses the problem that current benchmarks fail to properly evaluate abstract reasoning in Multimodal LLMs (MLLMs). Existing benchmarks focus on task-specific skills, not core cognitive abilities like abstraction and reasoning.
This paper introduces MM-IQ, a new benchmark to evaluate abstract visual reasoning in MLLMs. MM-IQ aims to overcome the limitations of current evaluation methods.
-----
📌 MM-IQ effectively isolates abstract visual reasoning, revealing that current Multimodal LLMs (MLLMs) fall short on core cognitive skills, not just task-specific abilities. The benchmark pinpoints a critical gap in current AI.
📌 The MM-IQ benchmark highlights a stark performance disparity: state-of-the-art MLLMs barely surpass random chance (27.49% accuracy), significantly lagging human performance (51.27%). This quantifies the abstraction ability gap.
📌 By categorizing errors into reasoning, visual understanding, and answer selection, MM-IQ provides a detailed diagnostic tool. This error analysis directs focused improvements in MLLM architecture and training methodologies.
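To make the error breakdown concrete, here is a minimal sketch of tallying failures by the three categories named above. The category labels follow the paper's error analysis, but the record format, field names, and example entries are hypothetical, not the authors' tooling.

```python
from collections import Counter

# Hypothetical per-item judgments; the three category labels follow the
# paper's error analysis, the record format is assumed for illustration.
judged_failures = [
    {"id": 17, "error_type": "reasoning"},
    {"id": 42, "error_type": "visual_understanding"},
    {"id": 91, "error_type": "answer_selection"},
    {"id": 93, "error_type": "reasoning"},
]

counts = Counter(item["error_type"] for item in judged_failures)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total:.0%} of failures)")
```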
----------
Methods Explored in this Paper 🔧:
→ The paper introduces MM-IQ, a new benchmark dataset for evaluating abstract visual reasoning in MLLMs.
→ MM-IQ contains 2,710 meticulously curated problems.
→ These problems span 8 distinct reasoning paradigms.
→ The data was collected from professional civil service examinations.
→ Human annotators performed rigorous quality control to ensure correctness and relevance to abstract reasoning.
→ Problems are classified into 8 reasoning paradigms: logical operation, mathematics, 2D geometry, 3D geometry, visual instruction, temporal movement, spatial relationship, and concrete object.
→ Dataset creation involved data collection, paradigm classification, data cleaning, and translation of the original Chinese questions into English.
→ This benchmark is designed to be free from linguistic and domain-specific biases, focusing purely on abstract visual reasoning.
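To illustrate how a benchmark like this is typically scored, here is a minimal sketch of evaluating an MLLM on MM-IQ-style multiple-choice items and reporting per-paradigm accuracy. The record fields (`image`, `question`, `answer`, `paradigm`) and the `query_model` placeholder are assumptions for illustration, not the paper's released evaluation code.

```python
from collections import defaultdict

def query_model(image_path: str, question: str) -> str:
    """Placeholder for an MLLM call; assumed to return one of 'A'-'D'."""
    raise NotImplementedError

def evaluate(items):
    # Each item is assumed to look like:
    # {"image": "puzzle_001.png", "question": "...", "answer": "B",
    #  "paradigm": "logical operation"}
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = query_model(item["image"], item["question"])
        total[item["paradigm"]] += 1
        if prediction.strip().upper() == item["answer"]:
            correct[item["paradigm"]] += 1
    # Per-paradigm accuracy, e.g. to surface which paradigms are hardest
    return {p: correct[p] / total[p] for p in total}
```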
-----
Key Insights 💡:
→ Current state-of-the-art MLLMs show significant limitations in abstract visual reasoning.
→ Even top models achieve only marginally better performance than random chance on MM-IQ.
→ There is a substantial performance gap between humans and current MLLMs in abstract visual reasoning tasks.
→ This performance gap highlights the need for advancements in MLLM architectures to bridge the cognitive divide in abstract reasoning.
→ Open-source MLLMs are approaching the performance of proprietary models, indicating the strength of community-driven development.
→ Logical operation is identified as the most challenging reasoning paradigm for MLLMs within the MM-IQ benchmark.
-----
Results 📊:
→ The top-performing model, Claude-3.5-Sonnet, achieves 27.49% accuracy on MM-IQ.
→ This is only slightly above the random chance baseline of 25%.
→ Human performance on MM-IQ is significantly higher at 51.27% accuracy.
→ Open-source models like Qwen2-VL-72B-Instruct and QVQ-72B-Preview achieve around 26% accuracy.
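For context, the 25% baseline corresponds to random guessing over four answer options (consistent with the stated random-chance figure). A quick comparison of the reported numbers, using only the values quoted above (the ~26% open-source figure is approximate):

```python
# Reported MM-IQ accuracies (values from the results above); the 25% baseline
# is assumed to reflect random guessing among 4 answer options.
random_baseline = 1 / 4
scores = {
    "Human": 0.5127,
    "Claude-3.5-Sonnet": 0.2749,
    "Qwen2-VL-72B-Instruct / QVQ-72B-Preview": 0.26,  # approximate
}
for name, acc in scores.items():
    print(f"{name}: {acc:.2%} ({(acc - random_baseline) * 100:+.2f} pts vs. chance)")
```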