"VisionZip: Longer is Better but Not Necessary in Vision Language Models"

The podcast discussion of this paper was generated with Google's Illuminate.

VisionZip squeezes out redundant visual tokens while keeping VLMs sharp and fast.

VisionZip introduces a method to reduce redundant visual tokens in Vision Language Models (VLMs) while maintaining performance. The paper shows that current VLMs use far more visual tokens than text tokens, leading to computational inefficiency. VisionZip keeps a small set of informative tokens and merges the remaining ones into contextual tokens to preserve information.

-----

https://arxiv.org/abs/2412.04467

🔍 Original Problem:

Current VLMs use far more visual tokens (576-2880) than text tokens (typically a few dozen), causing high computational costs and limiting practical applications. Many of these visual tokens carry redundant information, making the process inefficient; the rough calculation below illustrates how much of the inference cost they account for.
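
To make the cost concrete, here is a back-of-the-envelope comparison in Python. The token counts follow the numbers above; the quadratic-attention cost model and the 10% keep-ratio are simplifying assumptions for illustration, not measured figures.

```python
# Rough illustration of why visual tokens dominate VLM inference cost.
# Numbers follow the summary above; the quadratic cost model is a simplification.
visual_tokens = 576          # e.g., a 24x24 CLIP patch grid, as in LLaVA-1.5
text_tokens = 40             # "dozens" of text tokens in a typical prompt

total = visual_tokens + text_tokens
print(f"Visual share of the input sequence: {visual_tokens / total:.1%}")

# Self-attention cost grows roughly with the square of sequence length,
# so trimming visual tokens shrinks prefill cost super-linearly.
kept_visual = int(0.10 * visual_tokens)      # keep ~10% of visual tokens
reduced = kept_visual + text_tokens
print(f"Approximate attention-cost ratio after reduction: {(reduced / total) ** 2:.3f}")
```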

-----

🛠️ Solution in this Paper:

→ VisionZip first selects dominant tokens based on the attention scores they receive inside the vision encoder

→ The remaining tokens are then merged by semantic similarity into a small set of contextual tokens that preserve the image's overall context (both steps are sketched in the code after this list)

→ The method works in training-free mode or with minimal fine-tuning (30 minutes)

→ Implementation is text-agnostic and compatible with various VLM architectures
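
The two stages could look roughly like the minimal PyTorch sketch below. The function name, the token budgets (54 dominant + 10 contextual), and the way target tokens are chosen for merging are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def visionzip_sketch(features, cls_attn, num_dominant=54, num_contextual=10):
    """Reduce N visual tokens to num_dominant + num_contextual tokens.

    features: (N, D) patch-token features from the vision encoder.
    cls_attn: (N,)   attention each patch token receives (e.g., from the [CLS] query).
    """
    N, _ = features.shape

    # 1) Dominant tokens: keep the patches that receive the most attention.
    dom_idx = cls_attn.topk(num_dominant).indices
    dominant = features[dom_idx]

    # 2) Contextual tokens: merge the rest by semantic similarity.
    mask = torch.ones(N, dtype=torch.bool)
    mask[dom_idx] = False
    remaining = features[mask]

    # Pick a few "target" tokens among the remainder (a simple choice for this
    # sketch) and assign every other remaining token to its most similar target.
    targets = remaining[:num_contextual]
    others = remaining[num_contextual:]
    sim = F.normalize(others, dim=-1) @ F.normalize(targets, dim=-1).T
    assign = sim.argmax(dim=-1)

    # Average each group into a single contextual token.
    contextual = targets.clone()
    for k in range(num_contextual):
        group = others[assign == k]
        if len(group) > 0:
            contextual[k] = torch.cat([targets[k:k + 1], group]).mean(dim=0)

    return torch.cat([dominant, contextual], dim=0)   # (num_dominant + num_contextual, D)

# Toy usage with random stand-ins for encoder outputs.
feats = torch.randn(576, 1024)     # 576 patch tokens, hidden size 1024
attn = torch.rand(576)             # stand-in for [CLS]-to-patch attention
print(visionzip_sketch(feats, attn).shape)   # torch.Size([64, 1024])
```

Because selection relies only on the vision encoder's own attention rather than on the text prompt, the same reduced token set works for any question about the image, which is what makes the approach text-agnostic.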

-----

💡 Key Insights:

→ Visual tokens in popular encoders like CLIP and SigLIP contain significant redundancy

→ Only a few tokens receive high attention and carry most of the information (a quick way to measure this is sketched after this list)

→ A text-agnostic selection approach performs better than text-relevant token selection

→ The method is particularly effective in multi-turn conversations, since the reduced visual tokens do not depend on any specific question and can be reused across turns
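
One way to check the redundancy claim yourself is to inspect how concentrated the [CLS] attention is in a CLIP vision tower. The sketch below uses the Hugging Face transformers CLIP implementation; the checkpoint choice, the use of the last attention layer, and the top-10% threshold are assumptions for illustration, and a real photo should replace the blank placeholder image.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# CLIP ViT-L/14-336 is the vision tower used by many LLaVA-style VLMs
# (checkpoint chosen for illustration; any CLIP ViT behaves similarly).
name = "openai/clip-vit-large-patch14-336"
model = CLIPVisionModel.from_pretrained(name).eval()
processor = CLIPImageProcessor.from_pretrained(name)

# A blank image keeps the sketch self-contained; use a real photo in practice.
inputs = processor(images=Image.new("RGB", (336, 336), "gray"), return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention from the [CLS] query to the patch tokens in the last layer,
# averaged over heads: shape (num_patches,).
cls_attn = out.attentions[-1][0].mean(dim=0)[0, 1:]

# How much of the attention mass do the top 10% of patches capture?
k = max(1, int(0.10 * cls_attn.numel()))
share = cls_attn.topk(k).values.sum() / cls_attn.sum()
print(f"Top {k} of {cls_attn.numel()} patches hold {share:.1%} of [CLS] attention")
```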

-----

📊 Results:

→ Achieves roughly 95% of the original performance with only 10% of the visual tokens

→ Reduces prefilling time by 8x for LLaVA-Next 7B

→ Enables LLaVA-Next 13B to run faster than the original 7B model while achieving better performance

→ Outperforms previous methods by 5% across all settings
