"[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster"

FasterVLM is a training-free method that speeds up Vision-Language Model inference by pruning visual tokens according to [CLS] attention, retaining about 90% of the original performance while reducing inference FLOPs by more than 95%.

Paper - https://arxiv.org/abs/2412.01818

Original Problem 🤔:

Vision-Language Models are slow at inference because the LLM has to process a large number of visual tokens, and that number grows quickly with high-resolution images and video inputs.

-----

Key Insights 🧠:

→ Text-visual attention inside the LLM is an inaccurate importance signal due to attention shift and attention dispersion

→ [CLS] token attention from the visual encoder provides a more reliable importance signal (see the sketch after this list)

→ Visual tokens containing global information significantly impact VLM performance
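
As a concrete illustration of that signal, here is a minimal sketch (not the paper's released code) of reading the [CLS]-to-patch attention out of a CLIP-style visual encoder with Hugging Face Transformers; the checkpoint name, the last-layer choice, and head-averaging are assumptions made for illustration.

```python
# Sketch: extract [CLS]-to-patch attention from a CLIP-style visual encoder.
# Checkpoint, layer choice, and head-averaging are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14-336"          # the encoder used by LLaVA-1.5
model = CLIPVisionModel.from_pretrained(name).eval()
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.open("example.jpg")                   # placeholder input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = model(pixel_values, output_attentions=True)

# Last-layer attention map: (batch, heads, 1 + num_patches, 1 + num_patches);
# index 0 is the [CLS] token, the remaining positions are patch tokens.
last_attn = out.attentions[-1]
cls_attn = last_attn[:, :, 0, 1:].mean(dim=1)       # (batch, num_patches) importance scores
```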

-----

Solution in this Paper 💡:

→ FasterVLM evaluates visual token importance using [CLS] attention from the visual encoder

→ It removes the less important tokens before they reach the LLM decoder (a top-k selection sketch follows this list)

→ The method requires no additional training and works across different VLM architectures

→ The implementation is simple yet effective, making it easy to adapt to various VLMs
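
A minimal sketch of the pruning step itself, assuming the `cls_attn` scores from the previous snippet and a `visual_tokens` tensor produced by the multimodal projector; the function name and default keep ratio are illustrative, not the paper's code.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,    # (batch, num_patches, dim)
                        cls_attn: torch.Tensor,          # (batch, num_patches)
                        keep_ratio: float = 0.05) -> torch.Tensor:
    """Keep only the visual tokens with the highest [CLS] attention scores."""
    num_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    keep_idx = cls_attn.topk(num_keep, dim=1).indices    # most important tokens
    keep_idx = keep_idx.sort(dim=1).values               # restore spatial order
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(1)
    return visual_tokens[batch_idx, keep_idx]            # (batch, num_keep, dim)

# The pruned sequence is what gets concatenated with the text embeddings and fed
# to the LLM decoder, so the decoder never sees the discarded visual tokens.
```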

-----

Results 📊:

→ Maintains 90% of LLaVA-1.5-7B performance while pruning 95% of visual tokens (see the token-count arithmetic after this list)

→ Reduces inference FLOPs by over 95%

→ Outperforms existing pruning methods based on text-visual attention across 10 benchmarks

→ Performs well with high-resolution images and video inputs
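
As back-of-the-envelope arithmetic, here is what a 95% pruning ratio means for LLaVA-1.5, which encodes a 336x336 image into 576 visual tokens (24x24 patches):

```python
num_visual_tokens = 24 * 24      # 576 patch tokens for CLIP ViT-L/14 at 336 px
keep_ratio = 0.05                # 95% of visual tokens pruned
kept = round(num_visual_tokens * keep_ratio)
print(kept)                      # ~29 tokens reach the LLM decoder instead of 576
```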
