VSI-Bench evaluates how well MLLMs understand and reason about spatial information from video: a benchmark of 5,000+ question-answer pairs testing visual-spatial intelligence across 288 real indoor-scene videos.
-----
https://arxiv.org/abs/2412.14171
🤔 Original Problem:
While MLLMs excel at language and general visual understanding, their ability to reason about spatial relationships, distances, and room layouts from video has remained largely unexplored and untested.
-----
🔧 Methods in this Paper:
→ Created VSI-Bench, a comprehensive benchmark testing MLLMs' spatial understanding through 8 different tasks including object counting, distance estimation, and route planning
→ Developed evaluation metrics combining Multiple-Choice Answer (MCA) accuracy for categorical questions with Mean Relative Accuracy (MRA) for numerical predictions (a sketch of the numerical metric follows this list)
→ Analyzed models' spatial reasoning through linguistic self-explanations and cognitive map generation
→ Tested both proprietary and open-source MLLMs on their ability to build mental maps from video input
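Here is a minimal sketch of how a Mean-Relative-Accuracy-style score for numerical answers can be computed. The threshold sweep (0.50 to 0.95 in steps of 0.05), the relative-error test, and the function name are assumptions for illustration, not the paper's exact code.

```python
import numpy as np

def mean_relative_accuracy(pred: float, gt: float,
                           thresholds=np.linspace(0.50, 0.95, 10)) -> float:
    """Assumed MRA-style metric: a numerical prediction counts as correct at
    confidence threshold theta when its relative error is below 1 - theta;
    averaging this indicator over the sweep gives graded partial credit."""
    rel_error = abs(pred - gt) / abs(gt)
    return float(np.mean([rel_error < (1.0 - theta) for theta in thresholds]))

# Example: estimating a distance of 2.3 m when the ground truth is 2.0 m
print(mean_relative_accuracy(2.3, 2.0))  # passes only the looser thresholds
```

Averaging over thresholds rewards near-misses instead of scoring numerical answers as all-or-nothing.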
-----
🔍 Key Insights:
→ MLLMs show emerging but still subhuman visual-spatial intelligence, trailing average human accuracy by roughly 33 percentage points
→ Spatial reasoning, rather than visual perception or language ability, is the primary bottleneck
→ Prevailing linguistic prompting techniques such as chain-of-thought and self-consistency actually hurt performance on these spatial tasks
→ Models build strong local spatial awareness but struggle to form a globally consistent model of the scene (see the sketch after this list)
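The local-vs-global gap can be probed by comparing a model's predicted cognitive map against the ground-truth layout. The sketch below assumes maps are dictionaries from object names to grid coordinates; the one-cell tolerance and the local-radius split are illustrative assumptions, not the paper's exact protocol.

```python
import itertools
import math

def map_consistency(pred_map, gt_map, local_radius=3.0, tol=1.0):
    """Compare pairwise object distances in a predicted cognitive map against
    a ground-truth map, split into nearby ("local") and far-apart ("global")
    object pairs. Returns (local_accuracy, global_accuracy)."""
    local_hits, global_hits = [], []
    for a, b in itertools.combinations(sorted(set(pred_map) & set(gt_map)), 2):
        d_gt = math.dist(gt_map[a], gt_map[b])
        d_pred = math.dist(pred_map[a], pred_map[b])
        bucket = local_hits if d_gt <= local_radius else global_hits
        bucket.append(abs(d_pred - d_gt) <= tol)  # within one grid cell (assumed)
    avg = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return avg(local_hits), avg(global_hits)

gt = {"bed": (2, 2), "lamp": (3, 2), "door": (9, 8)}
pred = {"bed": (2, 3), "lamp": (3, 3), "door": (5, 5)}
print(map_consistency(pred, gt))  # nearby pairs score higher than distant ones
```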
-----
📊 Results:
→ The best proprietary models reach about 46% average accuracy across tasks
→ Top open-source models trail the best proprietary ones by only 4-5 percentage points on average
→ Most models perform below chance level on the most complex spatial tasks
→ Explicitly prompting models to generate cognitive maps before answering improves relative-distance reasoning by about 10% (see the prompt sketch below)
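A hedged sketch of that two-step idea: ask the model to lay out a cognitive map first, then answer conditioned on it. The `ask` helper, the 10x10 grid format, and the prompt wording are hypothetical; only the map-then-answer structure comes from the paper.

```python
# `ask(model, content_parts) -> str` is a hypothetical chat helper; swap in
# whichever MLLM API you use. Prompt wording below is illustrative only.
MAP_PROMPT = ("First build a cognitive map of the scene: on a 10x10 grid, "
              "list each object's approximate (row, col) center, one per line.")

def answer_with_cognitive_map(ask, model, video, question):
    # Step 1: have the model externalize its spatial memory as an explicit map.
    cog_map = ask(model, [video, MAP_PROMPT])
    # Step 2: answer the distance question conditioned on that map.
    followup = f"Here is your map of the scene:\n{cog_map}\n\nNow answer: {question}"
    return ask(model, [video, followup])
```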