"LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token"

A podcast on this paper was generated with Google's Illuminate.

LLaVA-Mini squeezes visual understanding into one token, making multimodal AI blazing fast.

Enables real-time multimodal interactions by compressing visual inputs into just 1 vision token while maintaining strong performance.

-----

https://arxiv.org/abs/2501.03895

🤔 Original Problem:

LLMs with visual capabilities must process hundreds of vision tokens per image, causing high computational cost and slow inference. This severely limits real-time applications and the processing of high-resolution images or long videos.

-----

🔍 Key Insights:

→ Vision tokens matter most in the early layers of the LLM, where visual information gets fused into text tokens (a measurement sketch follows this list)

→ Later layers focus mainly on instruction tokens that have already incorporated visual context

→ Pre-fusing visual information before compression preserves understanding better than direct token reduction

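The analysis behind these insights can be reproduced in spirit with a few lines of code. Below is a minimal sketch (not the paper's analysis code) of how one could measure the share of attention each LLM layer assigns to vision tokens; the paper's observation is that this share is high in early layers and drops sharply in later ones. The toy shapes and the Hugging Face `output_attentions=True` usage are assumptions for illustration.

```python
# Minimal sketch: per-layer fraction of attention mass landing on vision tokens.
import torch

def vision_attention_share(attentions, vision_mask):
    """
    attentions : list of [batch, heads, seq, seq] attention maps, one per layer
                 (e.g. from model(..., output_attentions=True) in HF Transformers).
    vision_mask: [seq] bool tensor, True at vision-token positions.
    """
    shares = []
    for attn in attentions:
        # Attention received by vision tokens, averaged over batch, heads, and query positions.
        to_vision = attn[..., vision_mask].sum(dim=-1)   # [batch, heads, seq]
        shares.append(to_vision.mean().item())
    return shares

# Toy example: 32 layers, 576 vision tokens (LLaVA-v1.5's count) + 64 text tokens.
seq, n_vision, n_layers = 576 + 64, 576, 32
fake_attn = [torch.softmax(torch.randn(1, 8, seq, seq), dim=-1) for _ in range(n_layers)]
mask = torch.zeros(seq, dtype=torch.bool); mask[:n_vision] = True
print(vision_attention_share(fake_attn, mask))
```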
-----

⚡ Solution in this Paper:

→ Introduces modality pre-fusion to integrate visual information into text tokens before the LLM backbone

→ Uses query-based compression to reduce hundreds of vision tokens to a single token (sketched in code after this list)

→ Employs learnable compression queries that interact with vision tokens through cross-attention

→ Preserves spatial information using 2D positional encoding during compression

→ Handles high-resolution images by splitting them into sub-images, and long videos by sampling frames at 1 frame per second

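The two core mechanisms are small enough to sketch. Below is a minimal PyTorch illustration (not the released LLaVA-Mini code) of query-based compression into one vision token via cross-attention with a learnable query, and of modality pre-fusion that lets text tokens absorb visual information before the LLM backbone. The hidden size, layer counts, and the random stand-in for the 2D positional encoding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Compress N vision tokens into `num_queries` tokens (LLaVA-Mini uses 1)."""
    def __init__(self, dim=1024, num_queries=1, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learnable compression queries
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens, pos_2d):
        # vision_tokens: [B, N, dim]; pos_2d: [N, dim] 2D positional encoding that
        # preserves the spatial layout of the ViT patch grid during compression.
        kv = vision_tokens + pos_2d
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, kv, kv)                   # [B, num_queries, dim]
        return compressed

class PreFusion(nn.Module):
    """A few transformer layers that fuse visual information into text tokens before the LLM."""
    def __init__(self, dim=1024, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, vision_tokens, text_tokens):
        fused = self.blocks(torch.cat([vision_tokens, text_tokens], dim=1))
        return fused[:, vision_tokens.size(1):]                      # keep only the (now vision-aware) text tokens

# Toy forward pass: 576 ViT patch tokens -> 1 vision token, 32 pre-fused text tokens.
B, N, T, D = 2, 576, 32, 1024
vision, text = torch.randn(B, N, D), torch.randn(B, T, D)
pos_2d = torch.randn(N, D)                                           # stand-in for a sine-cosine 2D encoding
one_token = QueryCompressor(D)(vision, pos_2d)
fused_text = PreFusion(D)(vision, text)
llm_input = torch.cat([one_token, fused_text], dim=1)                # [B, 1 + 32, D] fed to the LLM backbone
print(llm_input.shape)
```

The point of this split is that the LLM backbone only ever sees 1 vision token plus text tokens that already carry visual context, which is where the FLOP, latency, and memory savings below come from.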
-----

📊 Results:

→ Matches LLaVA-v1.5 performance using 0.17% of vision tokens (1 vs 576)

→ Reduces FLOPs by 77% and latency from 100ms to 40ms

→ Decreases per-image memory from 360MB to 0.6MB

→ Processes 3+ hour videos on consumer GPUs
