This survey analyzes Vision Language Models (VLMs), examining their architectures, evaluation methods, and real-world applications across robotics, healthcare, and web interfaces.
https://arxiv.org/abs/2501.02189
🔧 Methods explored in this paper:
→ Provides a categorization of 38 VLM benchmarks across 10 categories, from visual reasoning to multilingual understanding, covering 5 years of work (2019-2024).
→ Analyzes three key VLM architectural approaches: training from scratch, using pre-trained LLMs as backbones, and newer fusion techniques that treat all modalities as tokens (a minimal sketch follows this list).
→ Examines data collection methods for benchmarks, including human annotation, synthetic generation, and simulator-based approaches.
→ Explores VLM applications in robotics, healthcare, web agents, and autonomous driving.
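
As an illustration of the "LLM as backbone" / "everything as tokens" pattern, here is a minimal PyTorch sketch (not taken from the paper; class names, dimensions, and the MLP projector are illustrative assumptions): image features from a frozen vision encoder are projected into the LLM's embedding space and simply concatenated with the text token embeddings.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Toy sketch of the 'LLM as backbone' pattern: image features are
    projected into the LLM's token-embedding space and concatenated with
    text embeddings, so the LLM sees one unified token sequence."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Lightweight projector (often a small MLP in open VLMs) mapping
        # the vision encoder's feature size to the LLM's hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's embedding table
        image_tokens = self.proj(image_feats)            # (batch, num_patches, llm_dim)
        # Prepend visual tokens; the backbone then attends over both
        # modalities with ordinary self-attention, no cross-attention layers.
        return torch.cat([image_tokens, text_embeds], dim=1)

# Illustrative shapes only
connector = VisionLanguageConnector()
img = torch.randn(2, 256, 1024)    # e.g. 256 patch embeddings per image
txt = torch.randn(2, 32, 4096)     # 32 text-token embeddings
print(connector(img, txt).shape)   # torch.Size([2, 288, 4096])
```

Because the backbone then processes one unified token sequence with ordinary self-attention, a pre-trained LLM can be reused largely unchanged, which helps explain why backbone reuse dominates over training from scratch.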
-----
💡 Key Insights:
→ Most current state-of-the-art VLMs use pre-trained LLMs as backbones rather than training from scratch
→ Benchmark creation methods often sacrifice visual reasoning depth for dataset size
→ VLMs face challenges in hallucination, safety, and fairness across applications
→ Cross-attention mechanisms aren't universally necessary for strong VLM performance (see the sketch below)
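
For contrast with the token-concatenation sketch above, the following is a hypothetical gated cross-attention fusion block of the kind the last insight refers to, where text hidden states attend to image features through a dedicated cross-attention layer. Names, dimensions, and the gating scheme are illustrative assumptions, not details from the survey.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Toy sketch of the alternative fusion style: text hidden states
    attend to image features via a dedicated cross-attention layer with
    a learned gate on the residual (Flamingo-style). Dimensions are
    illustrative, not taken from the survey."""

    def __init__(self, llm_dim: int = 4096, vision_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.to_kv = nn.Linear(vision_dim, llm_dim)    # map image feats into the LLM's space
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # tanh(0) = 0, so fusion starts "off"

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, llm_dim); image_feats: (batch, patches, vision_dim)
        kv = self.to_kv(image_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=kv, value=kv)
        # Gated residual lets training gradually blend in the visual signal.
        return text_hidden + torch.tanh(self.gate) * attended

# Illustrative shapes only
fusion = GatedCrossAttentionFusion()
txt = torch.randn(2, 32, 4096)
img = torch.randn(2, 256, 1024)
print(fusion(txt, img).shape)      # torch.Size([2, 32, 4096])
```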