Small VLMs can efficiently guide large VLMs by identifying essential visual tokens
Leveraging a small VLM's attention patterns lets large VLMs prune up to 91% of their visual tokens
This paper introduces a method to accelerate large Vision Language Models (VLMs) by using a small VLM to guide visual token pruning and to enable early exiting. The approach prunes up to 91% of visual tokens while maintaining competitive performance, substantially reducing inference cost.
-----
https://arxiv.org/abs/2412.03324
🔍 Original Problem:
Large VLMs face efficiency challenges due to processing numerous visual tokens. Current pruning methods using single-layer attention maps struggle to maintain accuracy at low token retention ratios.
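To make the baseline concrete, here is a minimal sketch of the single-layer pruning strategy described above (not this paper's method): rank visual tokens by the attention they receive in one chosen decoder layer and keep only the top fraction. The function name, tensor shapes, and the 0.09 retention ratio (i.e., pruning 91%) are illustrative assumptions.

```python
import torch

def prune_by_single_layer(attn_layer, visual_idx, keep_ratio=0.09):
    """
    attn_layer: [heads, seq_len, seq_len] attention weights from ONE decoder layer.
    visual_idx: LongTensor with the positions of the visual tokens in the sequence.
    Returns the positions of the visual tokens to keep.
    """
    # Average over heads, then over query positions -> importance score per key token.
    importance = attn_layer.mean(dim=0).mean(dim=0)   # [seq_len]
    visual_scores = importance[visual_idx]            # [num_visual]
    k = max(1, int(keep_ratio * visual_idx.numel()))
    keep = visual_scores.topk(k).indices
    return visual_idx[keep]
```

Because this view comes from a single layer, tokens that only become important in later layers can be discarded, which is the accuracy problem at low retention ratios.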
-----
🛠️ Solution in this Paper:
→ The paper introduces Small VLM-Guided visual token Pruning (SGP), which uses attention maps from all layers of a small VLM to guide token pruning in large VLMs.
→ A complementary Small VLM Early Exiting (SEE) mechanism determines when to skip the large VLM entirely based on confidence scores.
→ The system aggregates attention maps across the pre-filling and decoding stages to identify essential visual tokens (see the sketch after this list).
→ When the small VLM's confidence exceeds a threshold, inference terminates without activating the large VLM.
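A hedged sketch of the SGP idea under simple assumptions: run the small VLM once, sum its attention to visual tokens over all layers and all generation steps (pre-filling plus decoding), and keep only the highest-scoring visual tokens as input to the large VLM. The function names, the attention container layout, and the 0.09 retention ratio are hypothetical stand-ins, not the paper's actual API.

```python
import torch

def aggregate_visual_attention(attentions, visual_idx):
    """
    attentions: list over generation steps, each a list over layers of
                [heads, q_len, seq_len] attention tensors (e.g., collected
                from a HuggingFace-style `output_attentions=True` run).
    visual_idx: LongTensor of the visual-token positions in the sequence.
    Returns one importance score per visual token, summed over layers and steps.
    """
    scores = None
    for step_attn in attentions:          # pre-filling step + each decoding step
        for layer_attn in step_attn:      # every transformer layer of the small VLM
            # Attention mass each visual token receives at this step/layer,
            # averaged over heads and query positions.
            recv = layer_attn.mean(dim=0).mean(dim=0)[visual_idx]
            scores = recv if scores is None else scores + recv
    return scores

def sgp_prune(attentions, visual_idx, keep_ratio=0.09):
    """Select the visual tokens the large VLM should keep."""
    scores = aggregate_visual_attention(attentions, visual_idx)
    k = max(1, int(keep_ratio * visual_idx.numel()))
    keep = scores.topk(k).indices
    return visual_idx[keep].sort().values  # preserve the original token order
```

The aggregation over all layers and stages is what distinguishes this from the single-layer baseline sketched earlier: a token that matters anywhere in the small VLM's computation accumulates score and is retained.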
-----
💡 Key Insights:
→ Global attention information from all layers better preserves essential tokens than single-layer attention
→ Small VLMs exhibit token retention patterns similar to large VLMs
→ Most "easy" questions can be correctly answered by small VLMs alone (see the early-exit sketch below)
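An illustrative sketch of the SEE early-exit decision, assuming confidence is measured as the mean max-probability of the small VLM's generated tokens; the paper's exact scoring rule and threshold may differ, and `small_out`, `run_large_vlm`, and their fields are hypothetical.

```python
import torch

def answer_with_early_exit(small_out, run_large_vlm, threshold=0.9):
    """
    small_out: object with `.answer` (str) and `.token_logits`
               (list of [vocab] tensors, one per generated token) -- hypothetical.
    run_large_vlm: callable that runs the (token-pruned) large VLM as a fallback.
    """
    # Confidence = average of the top softmax probability at each generated token.
    probs = [logits.softmax(dim=-1).max().item() for logits in small_out.token_logits]
    confidence = sum(probs) / len(probs)
    if confidence >= threshold:      # "easy" question: trust the small VLM's answer
        return small_out.answer
    return run_large_vlm()           # otherwise, invoke the large VLM on pruned tokens
```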
-----
📊 Results:
→ Achieves 91% visual token pruning while maintaining competitive performance
→ Works across 11 benchmarks and multiple VLM architectures
→ Reduces computation costs significantly with minimal accuracy loss