"MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.00698
The paper addresses a gap in current evaluation: existing benchmarks for Multimodal LLMs (MLLMs) measure task-specific skills rather than core cognitive abilities such as abstraction and reasoning.
It introduces MM-IQ, a new benchmark that evaluates abstract visual reasoning in MLLMs and is designed to overcome these limitations.
-----
→ MM-IQ isolates abstract visual reasoning, revealing that current Multimodal LLMs (MLLMs) are deficient in core cognitive skills beyond task-specific abilities. The benchmark pinpoints a critical gap in current AI.
→ MM-IQ highlights a stark performance disparity: state-of-the-art MLLMs barely surpass random chance (27.49% accuracy), lagging far behind human performance (51.27%). This quantifies the gap in abstraction ability.
→ By categorizing errors into reasoning, visual understanding, and answer selection, MM-IQ provides a detailed diagnostic tool that directs focused improvements in MLLM architecture and training (a minimal tallying sketch follows).
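A minimal sketch of how such an error breakdown could be tallied from annotated model failures. The three category labels come from the paper; the record format, field names, and example IDs are hypothetical.

```python
from collections import Counter

# Hedged sketch: tallying an error breakdown like the paper's diagnostic,
# with categories for reasoning, visual understanding, and answer selection.
# The record structure and IDs below are illustrative, not the paper's format.
annotated_failures = [
    {"id": "mmiq_0012", "error_type": "reasoning"},
    {"id": "mmiq_0034", "error_type": "visual_understanding"},
    {"id": "mmiq_0078", "error_type": "answer_selection"},
    {"id": "mmiq_0101", "error_type": "reasoning"},
]

counts = Counter(rec["error_type"] for rec in annotated_failures)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:22s} {n:4d}  ({n / total:.1%} of failures)")
```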
-----
Methods Explored in this Paper:
→ The paper introduces MM-IQ, a new benchmark dataset for evaluating abstract visual reasoning in MLLMs.
→ MM-IQ contains 2,710 meticulously curated problems spanning 8 distinct reasoning paradigms.
→ The problems are collected from professional civil service examinations.
→ Human annotators performed rigorous quality control to ensure high quality and relevance to abstract reasoning.
→ The 8 reasoning paradigms are logical operation, mathematics, 2D geometry, 3D geometry, visual instruction, temporal movement, spatial relationship, and concrete object.
→ Dataset construction involved data collection, paradigm classification, data cleaning, and translation of the original Chinese questions into English.
→ The benchmark is designed to be free of linguistic and domain-specific biases, focusing purely on abstract visual reasoning (a minimal item/evaluation sketch follows this list).
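A minimal sketch, under stated assumptions, of how an MM-IQ-style item could be represented and scored. The eight paradigm labels and the multiple-choice answer format follow the paper's description; the field names, dataclass, and `predict` interface are illustrative assumptions, not the authors' release format.

```python
# Hedged sketch of an MM-IQ-style item record and a plain accuracy metric.
# Only the paradigm list and the multiple-choice setup come from the paper;
# everything else (field names, option labels) is an assumption.
from dataclasses import dataclass
from typing import Callable, List

PARADIGMS = [
    "logical operation", "mathematics", "2D geometry", "3D geometry",
    "visual instruction", "temporal movement", "spatial relationship",
    "concrete object",
]

@dataclass
class MMIQItem:
    image_path: str      # the visual puzzle (panel grid, sequence, etc.)
    options: List[str]   # candidate answers; the 25% chance baseline implies four
    answer: str          # ground-truth option label
    paradigm: str        # one of PARADIGMS

def accuracy(items: List[MMIQItem], predict: Callable[[MMIQItem], str]) -> float:
    """Fraction of items where the model's chosen option matches the key."""
    correct = sum(predict(it) == it.answer for it in items)
    return correct / len(items)
```

Grouping items by `paradigm` before calling `accuracy` yields the kind of per-paradigm breakdown that flags the hardest categories.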
-----
Key Insights:
→ Current state-of-the-art MLLMs show significant limitations in abstract visual reasoning.
→ Even the top models perform only marginally better than random chance on MM-IQ.
→ There is a substantial performance gap between humans and current MLLMs on abstract visual reasoning tasks.
→ This gap highlights the need for advances in MLLM architectures to bridge the cognitive divide in abstract reasoning.
→ Open-source MLLMs are approaching the performance of proprietary models, indicating the strength of community-driven development.
→ Logical operation is the most challenging reasoning paradigm for MLLMs within the MM-IQ benchmark.
-----
Results:
→ The top-performing model, Claude-3.5-Sonnet, achieves 27.49% accuracy on MM-IQ.
→ This is only slightly above the random-chance baseline of 25% (see the sketch below for a rough comparison against chance).
→ Human performance on MM-IQ is significantly higher, at 51.27% accuracy.
→ Open-source models such as Qwen2-VL-72B-Instruct and QVQ-72B-Preview achieve around 26% accuracy.
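As a rough sanity check on these numbers, the sketch below compares each reported accuracy to the 25% random-guess baseline using a normal approximation. It assumes every score was measured over all 2,710 items with four answer options per question; the exact evaluation splits are an assumption here, not stated above.

```python
import math

# Hedged sketch: how far above a uniform random-guess baseline is each
# reported accuracy, assuming 2,710 four-option items per evaluation?
def z_vs_chance(accuracy: float, n_items: int, chance: float = 0.25) -> float:
    """One-sample z-score of an observed accuracy against a chance baseline."""
    se = math.sqrt(chance * (1.0 - chance) / n_items)  # standard error under H0
    return (accuracy - chance) / se

for name, acc in [
    ("Claude-3.5-Sonnet", 0.2749),    # best MLLM score reported above
    ("Qwen2-VL-72B-Instruct", 0.26),  # approximate open-source score (~26%)
    ("Human", 0.5127),
]:
    z = z_vs_chance(acc, 2710)
    print(f"{name:24s} acc={acc:.4f}  z vs. 25% chance = {z:+.2f}")
```

Under these assumptions, the best MLLM score sits only about three standard errors above chance and the ~26% open-source scores barely clear one, while human accuracy sits dozens of standard errors above the baseline, which is the gap the paper emphasizes.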


