Current AI models are essentially geometry-blind, scoring just above random chance
The VisOnlyQA dataset reveals that current large vision language models (LVLMs) struggle with basic visual perception tasks, achieving only 51-54% accuracy on geometric shape recognition compared to 93.5% human performance.
-----
https://arxiv.org/abs/2412.00947
🔍 Original Problem:
LVLMs show poor visual perception capabilities, especially in understanding geometric and numerical information in scientific figures, and existing benchmarks don't effectively isolate and evaluate these fundamental visual skills.
-----
🛠️ Solution in this Paper:
→ Created VisOnlyQA, a specialized dataset with 1,200 multiple-choice questions across 12 tasks focusing on geometric shapes, chemical structures, charts, and 3D shapes
→ Developed 70,000 synthetic training instances to help improve visual perception capabilities (a generation sketch follows this list)
→ Designed questions that specifically test visual perception without requiring reasoning or domain knowledge
→ Included both real and synthetic figures, covering the natural figure distribution while allowing controlled, comprehensive evaluation
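
To make the synthetic-data idea concrete, below is a minimal sketch of how one perception-only training instance could be generated. The helper name `make_triangle_question`, the question wording, and the answer format are illustrative assumptions, not the paper's actual pipeline.

```python
import math
import random
import matplotlib.pyplot as plt

def make_triangle_question(path: str) -> dict:
    """Render a random triangle and ask a perception-only question about an angle."""
    # Sample three vertices; near-degenerate triangles are rare enough for a sketch.
    pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)]

    # Angle at vertex A via the law of cosines, clamped against float error.
    ab = math.dist(pts[0], pts[1])
    ac = math.dist(pts[0], pts[2])
    bc = math.dist(pts[1], pts[2])
    cos_a = max(-1.0, min(1.0, (ab**2 + ac**2 - bc**2) / (2 * ab * ac)))
    angle_a = math.degrees(math.acos(cos_a))

    # Draw the figure with labeled vertices; nothing in the image states the answer.
    fig, axis = plt.subplots()
    xs, ys = zip(*(pts + [pts[0]]))
    axis.plot(xs, ys, "k-")
    for label, (x, y) in zip("ABC", pts):
        axis.annotate(label, (x, y))
    axis.set_axis_off()
    fig.savefig(path)
    plt.close(fig)

    # Distractors stay inside (0, 180) degrees and differ from the true angle,
    # so only visual perception, not reasoning, decides the answer.
    options = sorted({round(angle_a), round(angle_a / 2), round((angle_a + 180) / 2)})
    return {
        "image": path,
        "question": "What is the approximate measure of angle A in the figure?",
        "options": [f"{o} degrees" for o in options],
        "answer": f"{round(angle_a)} degrees",
    }
```

Because the ground-truth label is computed from the same coordinates used to render the figure, annotations are exact by construction, which is what makes generating 70,000 instances cheap.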
-----
💡 Key Insights:
→ Even top models like GPT-4o and Gemini 1.5 Pro perform near randomly on basic shape recognition
→ Chain-of-thought reasoning doesn't consistently improve visual perception performance (see the prompting sketch after this list)
→ Language model size significantly impacts visual perception capabilities
→ Fine-tuning helps but improvements are limited to specific tasks and models
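
As an illustration of that CoT comparison, here is a minimal sketch of querying a model with and without a chain-of-thought instruction, using the OpenAI Python SDK. The prompt wording and answer handling are assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(image_path: str, question: str, use_cot: bool) -> str:
    """Send one figure plus question, with or without a chain-of-thought instruction."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    instruction = (
        "Think step by step, then end with 'Answer: <option>'."
        if use_cot
        else "Reply with only the chosen option."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{question}\n{instruction}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Scoring both variants over the benchmark gives the comparison; per the paper's finding, the CoT variant does not consistently come out ahead on perception questions.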
-----
📊 Results:
→ State-of-the-art models achieve only 51.4% (GPT-4o) and 54.2% (Gemini 1.5 Pro) accuracy, barely above random chance on many tasks (see the scoring sketch below)
→ Human performance reaches 93.5% accuracy
→ Fine-tuning improves performance on specific tasks but fails to generalize across all visual tasks
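
To put those accuracies in context, here is a short sketch that scores predictions against a random-guessing baseline; the `options` and `answer` field names are assumed, not taken from the paper's data release.

```python
def score(questions: list[dict], predictions: list[str]) -> tuple[float, float]:
    """Return (model accuracy, expected accuracy of uniform random guessing)."""
    correct = sum(pred == q["answer"] for q, pred in zip(questions, predictions))
    accuracy = correct / len(questions)
    # A random guesser is right with probability 1/num_options per question.
    chance = sum(1 / len(q["options"]) for q in questions) / len(questions)
    return accuracy, chance
```

Reporting accuracy next to this baseline is what makes "near random" claims checkable: the gap between a model's score and the chance rate, not the raw percentage, shows whether the model actually perceives the figure.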