LLMs with vision capabilities (LLVMs) show a substantial gap between their complex-reasoning performance and their basic perception abilities.
This paper investigates how LLVMs actually perceive and process images.
📚 https://arxiv.org/abs/2410.04751
Results 📊:
• LLaVA 1.5 shows a minimal performance drop (0.19%) when the order of visual patch tokens is shuffled (see the sketch after this list)
• LLaVA 1.5 performance drops only 1.8% on a synthetic version of the MathVista dataset
• Up to a 20% drop on image classification tasks after alignment and visual instruction tuning
• Lower layers (bottom 20%) primarily process visual information, while higher layers focus on text interpretation
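As a rough illustration, the permutation test can be reproduced by shuffling the sequence order of the projected visual patch tokens before they reach the language model. This is a minimal sketch, not the paper's code; `encode_image` and `generate_answer` are hypothetical placeholders, not LLaVA APIs.

```python
import torch

def shuffle_patch_tokens(visual_tokens: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the sequence order of visual patch tokens.

    visual_tokens: (batch, num_patches, hidden_dim) embeddings produced by the
    vision encoder + projector, before concatenation with the text tokens.
    """
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(visual_tokens.shape[1], generator=g)
    return visual_tokens[:, perm, :]

# Hypothetical usage (encode_image / generate_answer stand in for model-specific calls):
# tokens = encode_image(image)                        # e.g. (1, 576, 4096) for LLaVA 1.5
# answer_original = generate_answer(tokens, question)
# answer_shuffled = generate_answer(shuffle_patch_tokens(tokens), question)
# A small score gap between the two runs indicates permutation invariance.
```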
-----
Solution in this Paper 🛠️:
• Evaluated LLaVA-series models across 10 diverse benchmarks
• Analyzed permutation invariance, robustness to occlusion, and handling of synthetic data (an occlusion sketch follows this list)
• Examined cross-modal alignment preservation and importance of model layers
• Investigated how LLVMs process visual information globally
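A minimal sketch of one way to run the occlusion check, assuming occlusion is simulated by zeroing out a random subset of visual patch tokens; the paper's exact masking strategy (token- vs. pixel-level) may differ.

```python
import torch

def occlude_patches(visual_tokens: torch.Tensor, drop_ratio: float = 0.3,
                    seed: int = 0) -> torch.Tensor:
    """Zero out a random subset of visual patch tokens to simulate occlusion.

    visual_tokens: (batch, num_patches, hidden_dim) patch embeddings.
    drop_ratio: fraction of patches to occlude.
    """
    _, n, _ = visual_tokens.shape
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(n, generator=g)[: int(n * drop_ratio)]
    occluded = visual_tokens.clone()
    occluded[:, idx, :] = 0.0
    return occluded
```

Running a benchmark with the occluded tokens and comparing accuracy against the unmodified run gives the robustness gap this kind of analysis reports.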
-----
Key Insights from this Paper 💡:
• LLVMs exhibit permutation-invariant properties for visual patch tokens
• They can solve some problems without fully perceiving detailed image information
• Cross-modal alignment is overfitted to complex reasoning tasks, causing loss of original perceptual capabilities
• Lower layers (<25%) play a crucial role in visual understanding and performance (a layer-wise attention probe is sketched below)
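One way to probe this layer-wise division of labor is to measure, per layer, how much attention the sequence places on the image-token positions. The sketch below assumes Hugging Face-style attention outputs (`output_attentions=True`) and that the position of the visual tokens in the sequence is known; both are assumptions, not details from the paper.

```python
import torch

def image_attention_share(attn_maps, image_token_slice: slice):
    """Per-layer fraction of attention mass directed at image tokens.

    attn_maps: list/tuple of per-layer attention tensors of shape
        (batch, heads, seq, seq), e.g. from a forward pass with
        output_attentions=True.
    image_token_slice: slice marking where the visual patch tokens
        sit in the input sequence (assumed layout).
    """
    shares = []
    for layer_attn in attn_maps:
        # attention from every query position onto the image-token keys
        to_image = layer_attn[..., image_token_slice].sum(dim=-1)  # (batch, heads, seq)
        shares.append(to_image.mean().item())
    return shares  # one value per layer; larger values expected in lower layers
```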