
"A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs"

The podcast on this paper is generated with Google's Illuminate.

Small VLMs can efficiently guide large VLMs by identifying essential visual tokens

Leveraging a small VLM's attention patterns lets a large VLM discard up to 91% of its visual tokens with minimal accuracy loss

This paper introduces a method to accelerate large Vision Language Models (VLMs) by using a small VLM to guide visual token pruning and early exiting. The approach maintains high performance while pruning up to 91% of visual tokens, sharply reducing computational cost.

-----

https://arxiv.org/abs/2412.03324

🔍 Original Problem:

Large VLMs are expensive to run because they process a large number of visual tokens. Existing pruning methods that rely on a single layer's attention map struggle to maintain accuracy at low token retention ratios.

-----

🛠️ Solution in this Paper:

→ The paper introduces Small VLM-Guided visual token Pruning (SGP), which uses attention maps from all layers of a small VLM to guide token pruning in large VLMs.

→ A complementary Small VLM Early Exiting (SEE) mechanism determines when to skip the large VLM entirely based on confidence scores.

→ The system aggregates attention maps across pre-filling and decoding stages to identify essential visual tokens.

→ When the small VLM's confidence exceeds a threshold, inference terminates without activating the large VLM (a sketch of the combined pipeline follows this list).
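To make the flow concrete, here is a minimal sketch of how SEE and SGP could fit together. It is not the authors' implementation: the model wrappers, the method `generate_with_attentions`, the attribute names (`prefill_attentions`, `decode_attentions`, `visual_token_slice`, `confidence`), and the default thresholds are assumptions introduced purely for illustration.

```python
# Minimal sketch of a small-VLM-guided pipeline (SEE early exit + SGP pruning).
# Assumptions: attention tensors are shaped [layers, heads, queries, keys], and the
# small/large VLMs are hypothetical wrappers exposing the methods used below.

import torch


def score_visual_tokens(attentions: torch.Tensor, visual_slice: slice) -> torch.Tensor:
    """Importance of each visual token = total attention it receives,
    summed over all layers, heads, and query positions (global, not single-layer)."""
    received = attentions[..., visual_slice]   # [layers, heads, queries, visual_tokens]
    return received.sum(dim=(0, 1, 2))         # [visual_tokens]


def select_tokens(scores: torch.Tensor, retention_ratio: float) -> torch.Tensor:
    """Indices of the top-scoring visual tokens, kept in their original order."""
    k = max(1, int(round(scores.numel() * retention_ratio)))
    return scores.topk(k).indices.sort().values


def answer(question, image, small_vlm, large_vlm,
           retention_ratio: float = 0.09, exit_threshold: float = 0.9):
    # 1) One pass through the small VLM: its answer, a confidence score, and
    #    attention maps from both the pre-filling and decoding stages.
    out = small_vlm.generate_with_attentions(question, image)   # hypothetical API

    # 2) SEE: if the small VLM is confident enough, return its answer and
    #    never activate the large VLM.
    if out.confidence >= exit_threshold:
        return out.text

    # 3) SGP: aggregate attention from all layers of both stages to rank the
    #    visual tokens, then keep only the top fraction (9% kept = 91% pruned).
    scores = (score_visual_tokens(out.prefill_attentions, out.visual_token_slice)
              + score_visual_tokens(out.decode_attentions, out.visual_token_slice))
    keep = select_tokens(scores, retention_ratio)

    # 4) The large VLM runs on the retained visual tokens only.
    return large_vlm.generate(question, image, visual_token_indices=keep)


if __name__ == "__main__":
    # Tiny demo of the scoring/selection helpers on random attention maps:
    # 4 layers, 8 heads, 20 queries, 64-token sequence with visual tokens at 5..36.
    attn = torch.rand(4, 8, 20, 64)
    scores = score_visual_tokens(attn, slice(5, 37))
    print(select_tokens(scores, retention_ratio=0.09))   # ~3 of 32 visual tokens kept
```

The key design choice mirrors the paper's insight: token scores come from attention aggregated over every layer and both inference stages, rather than a single layer, before the top fraction of visual tokens is handed to the large model.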

-----

💡 Key Insights:

→ Global attention information from all layers better preserves essential tokens than single-layer attention

→ Small VLMs exhibit token retention patterns similar to large VLMs

→ Most "easy" questions can be correctly answered by small VLMs

-----

📊 Results:

→ Achieves 91% visual token pruning while maintaining competitive performance

→ Works across 11 benchmarks and multiple VLM architectures

→ Reduces computation costs significantly with minimal accuracy loss
