Multimodal LLMs still can't see like humans - and this paper proves it with 85K tests
HVSBench tests whether multimodal LLMs perceive visual information the way humans do, measuring their alignment with the human visual system across 85K diverse questions.
-----
https://arxiv.org/abs/2412.09603
🤔 Original Problem:
Existing benchmarks can't effectively evaluate whether multimodal LLMs process visual information the way the human visual system does, so it remains unclear how closely these models' perception of images actually matches human perception.
-----
🔍 Solution in this Paper:
→ Introduces HVSBench, a comprehensive benchmark with 85K multimodal question-answer pairs across 5 key visual processing fields
→ Tests models on Prominence (identifying visually striking objects), Subitizing (rapidly counting small numbers of objects), Prioritizing (ranking objects by importance), Free-Viewing (natural attention shifts), and Searching (goal-directed visual search)
→ Implements robust evaluation protocols with automatic answer standardization to ensure fair comparison across models (a minimal sketch of what such an evaluation loop could look like follows this list)
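
To make the benchmark structure concrete, here is a minimal Python sketch of how an HVSBench-style evaluation loop could be organized. Only the five field names come from the paper; the `BenchItem` schema, the `standardize` normalization step, and exact-match scoring are illustrative assumptions, not the paper's actual protocol (ranking and scanpath-style tasks would need their own metrics).

```python
# Minimal sketch of an HVSBench-style evaluation loop (illustrative only).
# The five field names come from the paper; the item schema, answer
# normalization, and scoring below are hypothetical assumptions.
from dataclasses import dataclass
from collections import defaultdict

FIELDS = ["Prominence", "Subitizing", "Prioritizing", "Free-Viewing", "Searching"]

@dataclass
class BenchItem:
    field: str        # one of FIELDS
    image_path: str   # input image
    question: str     # question posed to the multimodal model
    answer: str       # ground truth derived from human visual-attention data

def standardize(raw: str) -> str:
    """Toy stand-in for automatic answer standardization:
    lowercase and strip punctuation so 'Three.' matches 'three'."""
    return "".join(ch for ch in raw.lower() if ch.isalnum() or ch.isspace()).strip()

def evaluate(model_answer_fn, items: list[BenchItem]) -> dict[str, float]:
    """Score a model per field; model_answer_fn(image_path, question) -> str."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = standardize(model_answer_fn(item.image_path, item.question))
        gold = standardize(item.answer)
        correct[item.field] += int(pred == gold)
        total[item.field] += 1
    return {f: correct[f] / total[f] for f in FIELDS if total[f]}

# Example usage with a dummy model:
if __name__ == "__main__":
    items = [BenchItem("Subitizing", "img_001.jpg",
                       "How many people stand out at first glance?", "three")]
    print(evaluate(lambda img, q: "Three.", items))  # {'Subitizing': 1.0}
```

Per-field scores like these are what let the benchmark report where models diverge most from human vision, rather than a single aggregate number.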
-----
💡 Key Insights:
→ Even top MLLMs show significant gaps in matching human visual perception
→ Adding human captions helps but doesn't fully bridge the perception gap
→ Larger model sizes generally lead to better alignment with human vision
-----
📊 Results:
→ Evaluated 13 leading MLLMs including GPT-4 and Gemini
→ Most models achieved only moderate results compared to human performance
→ Qwen2 achieved the best overall performance, surpassing GPT-4 on several metrics