
"VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information"

The podcast for this paper was generated with Google's Illuminate.

Current AI models are essentially geometry-blind, scoring just above random chance

The VisOnlyQA dataset reveals that current Large Vision Language Models (LVLMs) struggle with basic visual perception, achieving only 51-54% accuracy on geometric shape recognition versus 93.5% human performance.

-----

https://arxiv.org/abs/2412.00947

🔍 Original Problem:

LVLMs show poor visual perception, especially of geometric and numerical information in scientific figures, and existing benchmarks don't effectively isolate and evaluate these fundamental perception skills.

-----

🛠️ Solution in this Paper:

→ Created VisOnlyQA, a specialized dataset with 1,200 multiple-choice questions across 12 tasks focusing on geometric shapes, chemical structures, charts, and 3D shapes

→ Provided 70,000 synthetic training instances for fine-tuning models to improve visual perception

→ Designed questions that specifically test visual perception without requiring reasoning or domain knowledge

→ Included both real and synthetic figures to combine a natural figure distribution with controlled, comprehensive evaluation (a minimal scoring sketch follows below)
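As a rough illustration of the evaluation protocol, each item pairs one figure with a multiple-choice question, and a model is scored simply on whether it picks the correct option. The sketch below assumes a JSONL file with `image`, `question`, `options`, `answer`, and `task` fields plus a placeholder `query_model()` call; these are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of scoring an LVLM on a VisOnlyQA-style multiple-choice split.
# Field names and query_model() are assumptions, not the paper's implementation.
import json
from collections import defaultdict

def query_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an LVLM call (e.g., GPT-4o or Gemini 1.5 Pro).
    Expected to return a single option label such as 'a' or 'b'."""
    raise NotImplementedError

def evaluate(dataset_path: str) -> dict[str, float]:
    """Compute per-task accuracy over a JSONL file of multiple-choice questions."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(dataset_path) as f:
        for line in f:                        # one JSON object per question
            ex = json.loads(line)
            pred = query_model(ex["image"], ex["question"], ex["options"])
            total[ex["task"]] += 1            # e.g. geometry, charts, 3D shapes
            correct[ex["task"]] += int(pred == ex["answer"])
    return {task: correct[task] / total[task] for task in total}
```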

-----

💡 Key Insights:

→ Even top models like GPT-4o and Gemini 1.5 Pro perform close to random chance on basic shape recognition

→ Chain-of-thought reasoning doesn't consistently improve visual perception performance (see the prompting sketch after this list)

→ The size of the underlying language model strongly influences visual perception capabilities

→ Fine-tuning helps, but improvements are limited to specific tasks and models
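To make the chain-of-thought comparison concrete, here is a minimal sketch of the two prompting styles one could contrast; the prompt wording and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of the two prompting styles compared when testing whether
# chain-of-thought helps visual perception. Wording is illustrative only.
DIRECT_PROMPT = (
    "Answer the question about the figure with only the option letter.\n"
    "Question: {question}\nOptions: {options}"
)
COT_PROMPT = (
    "Think step by step about the figure, then give only the option letter "
    "on the final line.\nQuestion: {question}\nOptions: {options}"
)

def build_prompt(question: str, options: list[str], use_cot: bool) -> str:
    """Return either a direct-answer or a chain-of-thought prompt."""
    template = COT_PROMPT if use_cot else DIRECT_PROMPT
    return template.format(question=question, options=", ".join(options))
```

Running the same evaluation loop with `use_cot` toggled on and off gives the with/without chain-of-thought comparison behind this insight.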

-----

📊 Results:

→ State-of-the-art models achieve only 51.4% (GPT-4o) and 54.2% (Gemini 1.5 Pro) accuracy

→ Human performance reaches 93.5% accuracy

→ Fine-tuning improves performance on specific tasks but fails to generalize across all visual tasks
