LLMs with vision capabilities (LLVMs) show a substantial gap between their complex-reasoning performance and their basic perception abilities.
This paper investigates how LLVMs actually perceive and process images.
📚 https://arxiv.org/abs/2410.04751
Results 📊:
• LLaVA 1.5 shows a minimal performance drop (0.19%) when the order of visual patch tokens is shuffled (see the sketch after this list)
• LLaVA 1.5 performance drops only 1.8% on a synthetic version of the MathVista dataset
• Up to a 20% drop on image classification tasks after alignment and visual instruction tuning
• Lower layers (bottom 20%) primarily process visual information, while higher layers focus on text interpretation
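As a rough illustration, the permutation test can be reproduced by shuffling the sequence order of the projected visual patch tokens before they reach the language model. This is a minimal sketch, not the paper's code; `encode_image` and `generate_answer` are hypothetical placeholders, not LLaVA APIs.

```python
import torch

def shuffle_patch_tokens(visual_tokens: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the sequence order of visual patch tokens.

    visual_tokens: (batch, num_patches, hidden_dim) embeddings produced by the
    vision encoder + projector, before concatenation with the text tokens.
    """
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(visual_tokens.shape[1], generator=g)
    return visual_tokens[:, perm, :]

# Hypothetical usage (encode_image / generate_answer stand in for model-specific calls):
# tokens = encode_image(image)                        # e.g. (1, 576, 4096) for LLaVA 1.5
# answer_original = generate_answer(tokens, question)
# answer_shuffled = generate_answer(shuffle_patch_tokens(tokens), question)
# A small score gap between the two runs indicates permutation invariance.
```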
-----
Solution in this Paper 🛠️:
• Evaluated LLaVA-series models across 10 diverse benchmarks
• Analyzed permutation invariance, robustness to occlusion, and handling of synthetic data (an occlusion sketch follows this list)
• Examined cross-modal alignment preservation and importance of model layers
• Investigated how LLVMs process visual information globally
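A minimal sketch of one way to run the occlusion check, assuming occlusion is simulated by zeroing out a random subset of visual patch tokens; the paper's exact masking strategy (token- vs. pixel-level) may differ.

```python
import torch

def occlude_patches(visual_tokens: torch.Tensor, drop_ratio: float = 0.3,
                    seed: int = 0) -> torch.Tensor:
    """Zero out a random subset of visual patch tokens to simulate occlusion.

    visual_tokens: (batch, num_patches, hidden_dim) patch embeddings.
    drop_ratio: fraction of patches to occlude.
    """
    _, n, _ = visual_tokens.shape
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(n, generator=g)[: int(n * drop_ratio)]
    occluded = visual_tokens.clone()
    occluded[:, idx, :] = 0.0
    return occluded
```

Running a benchmark with the occluded tokens and comparing accuracy against the unmodified run gives the robustness gap this kind of analysis reports.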
-----
Key Insights from this Paper 💡:
• LLVMs exhibit permutation-invariant properties for visual patch tokens
• They can solve some problems without fully perceiving detailed image information
• Cross-modal alignment is overfitted to complex reasoning tasks, causing loss of original perceptual capabilities
• Lower layers (<25%) play a crucial role in visual understanding and performance (a layer-wise attention probe is sketched below)
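One way to probe this layer-wise division of labor is to measure, per layer, how much attention the sequence places on the image-token positions. The sketch below assumes Hugging Face-style attention outputs (`output_attentions=True`) and that the position of the visual tokens in the sequence is known; both are assumptions, not details from the paper.

```python
import torch

def image_attention_share(attn_maps, image_token_slice: slice):
    """Per-layer fraction of attention mass directed at image tokens.

    attn_maps: list/tuple of per-layer attention tensors of shape
        (batch, heads, seq, seq), e.g. from a forward pass with
        output_attentions=True.
    image_token_slice: slice marking where the visual patch tokens
        sit in the input sequence (assumed layout).
    """
    shares = []
    for layer_attn in attn_maps:
        # attention from every query position onto the image-token keys
        to_image = layer_attn[..., image_token_slice].sum(dim=-1)  # (batch, heads, seq)
        shares.append(to_image.mean().item())
    return shares  # one value per layer; larger values expected in lower layers
```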