"MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.00698
The paper addresses the problem that current benchmarks fail to properly evaluate abstract reasoning in Multimodal LLMs (MLLMs). Existing benchmarks focus on task-specific skills, not core cognitive abilities like abstraction and reasoning.
This paper introduces MM-IQ, a new benchmark to evaluate abstract visual reasoning in MLLMs. MM-IQ aims to overcome the limitations of current evaluation methods.
-----
📌 MM-IQ effectively isolates abstract visual reasoning, revealing that current Multimodal LLMs (MLLMs) fall short on core cognitive skills, not just task-specific abilities. The benchmark pinpoints a critical gap in current AI.
📌 The MM-IQ benchmark highlights a stark performance disparity: state-of-the-art MLLMs barely surpass random chance (27.49% accuracy), significantly lagging human performance (51.27%). This quantifies the abstraction ability gap.
📌 By categorizing errors into reasoning, visual understanding, and answer selection, MM-IQ provides a detailed diagnostic tool. This error analysis directs focused improvements in MLLM architecture and training methodologies.
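To make the error breakdown concrete, here is a minimal sketch of tallying failures by the three categories named above. The category labels follow the paper's error analysis, but the record format, field names, and example entries are hypothetical, not the authors' tooling.

```python
from collections import Counter

# Hypothetical per-item judgments; the three category labels follow the
# paper's error analysis, the record format is assumed for illustration.
judged_failures = [
    {"id": 17, "error_type": "reasoning"},
    {"id": 42, "error_type": "visual_understanding"},
    {"id": 91, "error_type": "answer_selection"},
    {"id": 93, "error_type": "reasoning"},
]

counts = Counter(item["error_type"] for item in judged_failures)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total:.0%} of failures)")
```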
----------
Methods Explored in this Paper 🔧:
→ The paper introduces MM-IQ, a new benchmark dataset for evaluating abstract visual reasoning in MLLMs.
→ MM-IQ contains 2,710 meticulously curated problems.
→ These problems span 8 distinct reasoning paradigms.
→ The data was collected from professional civil service examinations.
→ Human annotators performed rigorous quality control to ensure correctness and relevance to abstract reasoning.
→ Problems are classified into 8 reasoning paradigms: logical operation, mathematics, 2D geometry, 3D geometry, visual instruction, temporal movement, spatial relationship, and concrete object.
→ Dataset creation involved data collection, paradigm classification, data cleaning, and translation of the original Chinese questions into English.
→ This benchmark is designed to be free from linguistic and domain-specific biases, focusing purely on abstract visual reasoning.
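To illustrate how a benchmark like this is typically scored, here is a minimal sketch of evaluating an MLLM on MM-IQ-style multiple-choice items and reporting per-paradigm accuracy. The record fields (`image`, `question`, `answer`, `paradigm`) and the `query_model` placeholder are assumptions for illustration, not the paper's released evaluation code.

```python
from collections import defaultdict

def query_model(image_path: str, question: str) -> str:
    """Placeholder for an MLLM call; assumed to return one of 'A'-'D'."""
    raise NotImplementedError

def evaluate(items):
    # Each item is assumed to look like:
    # {"image": "puzzle_001.png", "question": "...", "answer": "B",
    #  "paradigm": "logical operation"}
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = query_model(item["image"], item["question"])
        total[item["paradigm"]] += 1
        if prediction.strip().upper() == item["answer"]:
            correct[item["paradigm"]] += 1
    # Per-paradigm accuracy, e.g. to surface which paradigms are hardest
    return {p: correct[p] / total[p] for p in total}
```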
-----
Key Insights 💡:
→ Current state-of-the-art MLLMs show significant limitations in abstract visual reasoning.
→ Even top models achieve only marginally better performance than random chance on MM-IQ.
→ There is a substantial performance gap between humans and current MLLMs in abstract visual reasoning tasks.
→ This performance gap highlights the need for advancements in MLLM architectures to bridge the cognitive divide in abstract reasoning.
→ Open-source MLLMs are approaching the performance of proprietary models, indicating the strength of community-driven development.
→ Logical operation is identified as the most challenging reasoning paradigm for MLLMs within the MM-IQ benchmark.
-----
Results 📊:
→ The top-performing model, Claude-3.5-Sonnet, achieves 27.49% accuracy on MM-IQ.
→ This is only slightly above the random chance baseline of 25%.
→ Human performance on MM-IQ is significantly higher at 51.27% accuracy.
→ Open-source models like Qwen2-VL-72B-Instruct and QVQ-72B-Preview achieve around 26% accuracy.
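For context, the 25% baseline corresponds to random guessing over four answer options (consistent with the stated random-chance figure). A quick comparison of the reported numbers, using only the values quoted above (the ~26% open-source figure is approximate):

```python
# Reported MM-IQ accuracies (values from the results above); the 25% baseline
# is assumed to reflect random guessing among 4 answer options.
random_baseline = 1 / 4
scores = {
    "Human": 0.5127,
    "Claude-3.5-Sonnet": 0.2749,
    "Qwen2-VL-72B-Instruct / QVQ-72B-Preview": 0.26,  # approximate
}
for name, acc in scores.items():
    print(f"{name}: {acc:.2%} ({(acc - random_baseline) * 100:+.2f} pts vs. chance)")
```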