
"Do Multimodal Large Language Models See Like Humans?"

The podcast on this paper is generated with Google's Illuminate.

Multimodal LLMs still can't see like humans - and this paper proves it with 85K tests

HVSBench evaluates whether multimodal LLMs perceive visual information like humans by testing their alignment with the human visual system across 85K diverse questions.

-----

https://arxiv.org/abs/2412.09603

🤔 Original Problem:

Current benchmarks can't effectively evaluate whether multimodal LLMs process visual information the way human vision does. This leaves a gap in understanding whether AI models truly perceive images like humans do.

-----

🔍 Solution in this Paper:

→ Introduces HVSBench, a comprehensive benchmark with 85K multimodal question-answer pairs across 5 key visual processing fields

→ Tests models on Prominence (identifying visually striking objects), Subitizing (quick counting), Prioritizing (ranking object importance), Free-Viewing (natural attention shifts), and Searching (goal-directed visual search)

→ Implements robust evaluation protocols with automatic standardization so that different models can be compared fairly (see the sketch after this list)
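
To make the evaluation setup concrete, here is a minimal Python sketch of how a benchmark like this could be scored. Everything in it (the BenchmarkItem structure, the standardize_answer helper, the model.ask call) is a hypothetical illustration assuming multiple-choice questions, not HVSBench's actual code; tasks such as Prioritizing and Free-Viewing involve rankings and attention sequences that plain accuracy only approximates.

```python
# Hypothetical sketch of an HVSBench-style evaluation loop.
# All names (BenchmarkItem, standardize_answer, evaluate, model.ask)
# are illustrative, not the paper's actual code or API.
from dataclasses import dataclass
from collections import defaultdict
import re

# The five visual-processing fields covered by the benchmark.
FIELDS = ["Prominence", "Subitizing", "Prioritizing", "Free-Viewing", "Searching"]

@dataclass
class BenchmarkItem:
    field: str          # one of FIELDS
    image_path: str     # input image
    question: str       # e.g. "Which object attracts attention first?"
    choices: list[str]  # multiple-choice labels, e.g. ["A", "B", "C", "D"]
    answer: str         # human-derived ground truth, e.g. "B"

def standardize_answer(raw_output: str, choices: list[str]) -> str:
    """Map free-form model output to a choice label for fair comparison.

    A crude stand-in for the benchmark's 'automatic standardization':
    take the first choice letter that appears in the model's response.
    """
    for token in re.findall(r"[A-Z]", raw_output.upper()):
        if token in choices:
            return token
    return ""  # unparseable output counts as wrong

def evaluate(model, items: list[BenchmarkItem]) -> dict[str, float]:
    """Compute per-field accuracy for a model exposing model.ask(image, question)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        raw = model.ask(item.image_path, item.question)  # hypothetical MLLM API
        pred = standardize_answer(raw, item.choices)
        correct[item.field] += int(pred == item.answer)
        total[item.field] += 1
    return {f: correct[f] / total[f] for f in FIELDS if total[f]}
```

The standardization step is what makes the comparison fair: different MLLMs phrase answers differently, so normalizing free-form output to a common label before scoring keeps the per-field numbers comparable across models.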

-----

💡 Key Insights:

→ Even top MLLMs show significant gaps in matching human visual perception

→ Adding human captions helps but doesn't fully bridge the perception gap

→ Larger model sizes generally lead to better alignment with human vision

-----

📊 Results:

→ Evaluated 13 leading MLLMs, including GPT-4 and Gemini

→ Most models achieved only moderate results compared to human performance

→ Qwen2 achieved the best overall performance, surpassing GPT-4 on several metrics
