FasterVLM introduces a training-free method to speed up Vision Language Models by pruning visual tokens using [CLS] attention, maintaining 90% performance while reducing computation by 95%.
Paper - https://arxiv.org/abs/2412.01818
Original Problem 🤔:
Vision Language Models are slow because the LLM has to process hundreds or even thousands of visual tokens per image, and that count grows sharply with high-resolution images or video.
-----
Key Insights 🧠:
→ Text-to-visual attention inside the LLM is an unreliable importance signal because of attention shift and attention dispersion
→ [CLS] attention from the visual encoder gives a more faithful measure of visual token importance (a sketch of how to extract it follows this list)
→ Visual tokens containing global information significantly impact VLM performance
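To make "[CLS] attention" concrete, the sketch below pulls the [CLS]-to-patch attention map from a CLIP-style visual encoder via Hugging Face transformers. The checkpoint name, the choice of the last layer, and averaging over heads are illustrative assumptions, not details confirmed by the paper.

```python
# A minimal sketch, assuming a CLIP-style ViT loaded through Hugging Face
# transformers; checkpoint, layer choice, and head-averaging are assumptions.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

ckpt = "openai/clip-vit-large-patch14-336"  # assumed checkpoint (LLaVA-1.5's vision tower)
model = CLIPVisionModel.from_pretrained(ckpt).eval()
processor = CLIPImageProcessor.from_pretrained(ckpt)

@torch.no_grad()
def cls_attention_scores(image):
    """Importance score per image patch: the [CLS] token's attention to each
    patch token, averaged over heads in the last encoder layer."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    out = model(pixel_values=pixel_values, output_attentions=True)
    last_attn = out.attentions[-1]            # (batch, heads, seq, seq), [CLS] at index 0
    cls_to_patches = last_attn[:, :, 0, 1:]   # [CLS] as query, patch tokens as keys
    return cls_to_patches.mean(dim=1)         # (batch, num_patches)
```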
-----
Solution in this Paper 💡:
→ FasterVLM evaluates visual token importance using [CLS] attention from the visual encoder
→ It prunes the lowest-scoring visual tokens before they ever reach the LLM decoder (see the pruning sketch after this list)
→ The method requires no additional training and works across different VLM architectures
→ Implementation is simple yet effective, making it easily adaptable to various VLMs
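The pruning step itself reduces to a top-k selection over the [CLS] scores. Below is a minimal sketch assuming projected visual-token embeddings and the scores from the earlier snippet; the function name, tensor shapes, and order-preserving selection are illustrative, and the 95% ratio comes from the results quoted in this post.

```python
# A minimal sketch of the pruning step; names and shapes are assumptions.
import torch

def prune_visual_tokens(visual_tokens, cls_scores, prune_ratio=0.95):
    """Keep only the visual tokens with the highest [CLS] attention.

    visual_tokens: (batch, num_patches, hidden) projected visual embeddings
    cls_scores:    (batch, num_patches) [CLS]-to-patch attention scores
    """
    num_keep = max(1, int(visual_tokens.size(1) * (1.0 - prune_ratio)))
    # Take the top-scoring tokens, then re-sort the indices so the surviving
    # patches keep their original spatial order.
    keep_idx = cls_scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(1)
    return visual_tokens[batch_idx, keep_idx]  # (batch, num_keep, hidden)
```

The surviving tokens are then concatenated with the text embeddings and fed to the LLM decoder as usual, with no weights updated, which is what makes the method training-free.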
-----
Results 📊:
→ Maintains 90% of LLaVA-1.5-7B performance while pruning 95% of visual tokens
→ Reduces inference FLOPs by over 95%
→ Outperforms existing pruning methods based on text-visual attention across 10 benchmarks
→ Performs well with high-resolution images and video inputs