
"VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information"

The podcast for this paper was generated with Google's Illuminate.

Current AI models are essentially geometry-blind, scoring just above random chance

The VisOnlyQA dataset reveals that current Large Vision Language Models (LVLMs) struggle with basic visual perception, achieving only 51-54% accuracy on geometric shape recognition versus 93.5% human performance.

-----

https://arxiv.org/abs/2412.00947

🔍 Original Problem:

LVLMs show poor visual perception, especially of geometric and numerical information in scientific figures, and existing benchmarks don't effectively isolate and evaluate these fundamental perception skills.

-----

🛠️ Solution in this Paper:

→ Created VisOnlyQA, a specialized dataset with 1,200 multiple-choice questions across 12 tasks focusing on geometric shapes, chemical structures, charts, and 3D shapes

→ Provided 70,000 synthetic training instances for fine-tuning models to improve visual perception

→ Designed questions that specifically test visual perception without requiring reasoning or domain knowledge

→ Included both real and synthetic figures to combine a natural figure distribution with controlled, comprehensive evaluation (a minimal scoring sketch follows below)
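As a rough illustration of the evaluation protocol, each item pairs one figure with a multiple-choice question, and a model is scored simply on whether it picks the correct option. The sketch below assumes a JSONL file with `image`, `question`, `options`, `answer`, and `task` fields plus a placeholder `query_model()` call; these are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of scoring an LVLM on a VisOnlyQA-style multiple-choice split.
# Field names and query_model() are assumptions, not the paper's implementation.
import json
from collections import defaultdict

def query_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an LVLM call (e.g., GPT-4o or Gemini 1.5 Pro).
    Expected to return a single option label such as 'a' or 'b'."""
    raise NotImplementedError

def evaluate(dataset_path: str) -> dict[str, float]:
    """Compute per-task accuracy over a JSONL file of multiple-choice questions."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(dataset_path) as f:
        for line in f:                        # one JSON object per question
            ex = json.loads(line)
            pred = query_model(ex["image"], ex["question"], ex["options"])
            total[ex["task"]] += 1            # e.g. geometry, charts, 3D shapes
            correct[ex["task"]] += int(pred == ex["answer"])
    return {task: correct[task] / total[task] for task in total}
```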

-----

💡 Key Insights:

→ Even top models like GPT-4o and Gemini 1.5 Pro perform close to random chance on basic shape recognition

→ Chain-of-thought reasoning doesn't consistently improve visual perception performance (see the prompting sketch after this list)

→ The size of the underlying language model strongly influences visual perception capabilities

→ Fine-tuning helps, but improvements are limited to specific tasks and models
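To make the chain-of-thought comparison concrete, here is a minimal sketch of the two prompting styles one could contrast; the prompt wording and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of the two prompting styles compared when testing whether
# chain-of-thought helps visual perception. Wording is illustrative only.
DIRECT_PROMPT = (
    "Answer the question about the figure with only the option letter.\n"
    "Question: {question}\nOptions: {options}"
)
COT_PROMPT = (
    "Think step by step about the figure, then give only the option letter "
    "on the final line.\nQuestion: {question}\nOptions: {options}"
)

def build_prompt(question: str, options: list[str], use_cot: bool) -> str:
    """Return either a direct-answer or a chain-of-thought prompt."""
    template = COT_PROMPT if use_cot else DIRECT_PROMPT
    return template.format(question=question, options=", ".join(options))
```

Running the same evaluation loop with `use_cot` toggled on and off gives the with/without chain-of-thought comparison behind this insight.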

-----

📊 Results:

→ State-of-the-art models achieve only 51.4% (GPT-4o) and 54.2% (Gemini 1.5 Pro) accuracy

→ Human performance reaches 93.5% accuracy

→ Fine-tuning improves performance on specific tasks but fails to generalize across all visual tasks
