
Intriguing Properties of Large Language and Vision Models

This podcast was generated with Google's Illuminate.

LLMs with vision capabilities (LLVMs) exhibit a substantial gap between their complex reasoning abilities and their basic perception capabilities.

This paper investigates how LLVMs truly perceive and process images.

📚 https://arxiv.org/abs/2410.04751

Results 📊:

• LLaVA 1.5 shows a minimal performance drop (0.19%) when the order of its visual patch tokens is shuffled (see the permutation sketch after this list)

• LLaVA 1.5's performance drops only 1.8% on a synthetic version of the MathVista dataset

• Up to a 20% drop on image classification tasks after alignment and visual instruction tuning

• Lower layers (bottom 20%) primarily process visual information, while higher layers focus on text interpretation
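
To make the shuffling experiment concrete, here is a minimal sketch of how one could permute the visual patch tokens before they are concatenated with the text tokens. The tensor shapes (576 patches, hidden size 4096, roughly matching LLaVA-1.5's projector output) and the function name are illustrative assumptions, not the authors' code.

```python
import torch

def shuffle_patch_tokens(visual_tokens: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the sequence order of visual patch tokens.

    visual_tokens: (batch, num_patches, hidden_dim) embeddings produced by the
    vision encoder + projector, before they are prepended to the text tokens.
    """
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(visual_tokens.size(1), generator=g)
    return visual_tokens[:, perm, :]

# Toy example: 576 patch tokens (24x24 grid) with hidden size 4096.
tokens = torch.randn(1, 576, 4096)
shuffled = shuffle_patch_tokens(tokens)
# Feeding `shuffled` in place of `tokens` and re-running a benchmark
# quantifies how permutation-invariant the model's answers are.
```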

-----

Solution in this Paper 🛠️:

• Evaluated LLaVA-series models across 10 diverse benchmarks

• Analyzed permutation invariance, robustness to occlusion (see the occlusion sketch after this list), and handling of synthetic data

• Examined cross-modal alignment preservation and importance of model layers

• Investigated how LLVMs process visual information globally
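
A similarly minimal sketch of an occlusion test, assuming occlusion is simulated by zeroing out a random subset of patch tokens (the paper may instead mask image regions directly); the function name and drop ratio are illustrative only.

```python
import torch

def occlude_patches(visual_tokens: torch.Tensor, drop_ratio: float = 0.5,
                    seed: int = 0) -> torch.Tensor:
    """Zero out a random subset of visual patch tokens to simulate occlusion."""
    g = torch.Generator().manual_seed(seed)
    num_patches = visual_tokens.size(1)
    num_drop = int(drop_ratio * num_patches)
    drop_idx = torch.randperm(num_patches, generator=g)[:num_drop]
    occluded = visual_tokens.clone()
    occluded[:, drop_idx, :] = 0.0
    return occluded

# Comparing accuracy with and without occlusion (at several drop ratios)
# indicates how much the model actually relies on detailed visual content.
```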

-----

Key Insights from this Paper 💡:

• LLVMs are largely permutation-invariant with respect to the order of visual patch tokens

• They can solve some problems without fully perceiving detailed image information

• Cross-modal alignment becomes overfitted to complex reasoning tasks, causing a loss of the models' original perceptual capabilities

• Lower layers (the bottom ~25%) play a crucial role in visual understanding and overall performance (a simple layer-wise probe is sketched below)
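
One simple way to probe this claim is to measure, layer by layer, how much attention text positions place on image-token positions. The sketch below assumes a list of per-layer attention matrices from a forward pass and a boolean mask marking image-token positions; it is an illustrative probe, not the paper's exact analysis.

```python
import torch

def image_attention_share(attentions, image_token_mask):
    """Per-layer fraction of attention mass that text queries place on image tokens.

    attentions: list of (batch, heads, seq, seq) attention weights, one per layer
                (e.g. collected from a forward pass with attention outputs enabled).
    image_token_mask: (seq,) bool tensor, True at image-token positions.
    """
    shares = []
    for attn in attentions:
        # Average over batch and heads -> (seq, seq)
        avg = attn.mean(dim=(0, 1))
        # Rows are query positions; keep only text queries.
        text_queries = avg[~image_token_mask]
        # Fraction of those queries' attention mass landing on image keys.
        share = text_queries[:, image_token_mask].sum() / text_queries.sum()
        shares.append(share.item())
    return shares

# A profile where early layers show high shares and later layers show low
# shares would be consistent with lower layers doing the visual processing.
```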
