
"Do Multimodal Large Language Models See Like Humans?"

The podcast on this paper is generated with Google's Illuminate.

Multimodal LLMs still can't see like humans - and this paper proves it with 85K tests

HVSBench evaluates whether multimodal LLMs perceive visual information like humans by testing their alignment with the human visual system across 85K diverse questions.

-----

https://arxiv.org/abs/2412.09603

🤔 Original Problem:

Current benchmarks can't effectively evaluate whether multimodal LLMs process visual information the way human vision does. This leaves a gap in understanding whether AI models truly perceive images like humans do.

-----

🔍 Solution in this Paper:

→ Introduces HVSBench, a comprehensive benchmark with 85K multimodal question-answer pairs across 5 key visual processing fields

→ Tests models on Prominence (identifying visually striking objects), Subitizing (quick counting), Prioritizing (ranking object importance), Free-Viewing (natural attention shifts), and Searching (goal-directed visual search)

→ Implements robust evaluation protocols with automatic standardization so that different models can be compared fairly (see the sketch after this list)
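
To make the evaluation setup concrete, here is a minimal Python sketch of how a benchmark like this could be scored. Everything in it (the BenchmarkItem structure, the standardize_answer helper, the model.ask call) is a hypothetical illustration assuming multiple-choice questions, not HVSBench's actual code; tasks such as Prioritizing and Free-Viewing involve rankings and attention sequences that plain accuracy only approximates.

```python
# Hypothetical sketch of an HVSBench-style evaluation loop.
# All names (BenchmarkItem, standardize_answer, evaluate, model.ask)
# are illustrative, not the paper's actual code or API.
from dataclasses import dataclass
from collections import defaultdict
import re

# The five visual-processing fields covered by the benchmark.
FIELDS = ["Prominence", "Subitizing", "Prioritizing", "Free-Viewing", "Searching"]

@dataclass
class BenchmarkItem:
    field: str          # one of FIELDS
    image_path: str     # input image
    question: str       # e.g. "Which object attracts attention first?"
    choices: list[str]  # multiple-choice labels, e.g. ["A", "B", "C", "D"]
    answer: str         # human-derived ground truth, e.g. "B"

def standardize_answer(raw_output: str, choices: list[str]) -> str:
    """Map free-form model output to a choice label for fair comparison.

    A crude stand-in for the benchmark's 'automatic standardization':
    take the first choice letter that appears in the model's response.
    """
    for token in re.findall(r"[A-Z]", raw_output.upper()):
        if token in choices:
            return token
    return ""  # unparseable output counts as wrong

def evaluate(model, items: list[BenchmarkItem]) -> dict[str, float]:
    """Compute per-field accuracy for a model exposing model.ask(image, question)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        raw = model.ask(item.image_path, item.question)  # hypothetical MLLM API
        pred = standardize_answer(raw, item.choices)
        correct[item.field] += int(pred == item.answer)
        total[item.field] += 1
    return {f: correct[f] / total[f] for f in FIELDS if total[f]}
```

The standardization step is what makes the comparison fair: different MLLMs phrase answers differently, so normalizing free-form output to a common label before scoring keeps the per-field numbers comparable across models.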

-----

💡 Key Insights:

→ Even top MLLMs show significant gaps in matching human visual perception

→ Adding human captions helps but doesn't fully bridge the perception gap

→ Larger model sizes generally lead to better alignment with human vision

-----

📊 Results:

→ Evaluated 13 leading MLLMs, including GPT-4 and Gemini

→ Most models achieved only moderate results compared to human performance

→ Qwen2 achieved the best overall performance, surpassing GPT-4 on several metrics
