"Learning Free Token Reduction for Multi-Modal LLM"

The podcast on this paper is generated with Google's Illuminate.

Multimodal LLMs (MLLMs) face deployment challenges due to high computational costs and long inference times, primarily from lengthy visual token sequences. This paper proposes a learning-free token compression method to address this, operating on both spatial and temporal dimensions of visual data to reduce redundancy.

-----

https://arxiv.org/abs/2501.17391

📌 This paper's approach cleverly utilizes inherent video data properties. Temporal redundancy is addressed via frame differencing. Spatial sparsity is handled by focusing on important regions guided by text prompts.

📌 The plug-and-play nature of this method offers a significant advantage. Integration into existing Multimodal LLM frameworks is straightforward. No model retraining is needed, preserving original capabilities.

📌 The dual compression strategy targets distinct data characteristics. Frame-to-frame similarity is used to compress temporally redundant information, while text-guided attention drives spatial compression, enhancing computational efficiency.

----------

Methods Explored in this Paper 🔧:

→ Temporal Compression: Reduces redundancy by merging similar adjacent frame tokens. It uses differential-based scores (calculating differences between neighboring frames) and similarity-based scores (calculating cosine similarity between neighboring frames).

→ Spatial Compression: Prunes irrelevant tokens by identifying key regions. It uses topic-based scores (using the [CLS] token to find subject-related information) and text-based scores (calculating a similarity matrix between visual tokens and text prompts).

→ Text-based Compression: calculates correlations between aligned visual tokens and the text prompt to retain only question-related visual tokens and prune the irrelevant ones (a minimal code sketch of the temporal and text-based stages follows below).
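
The sketch below illustrates the two stages in PyTorch under simplifying assumptions: visual tokens arrive as a (frames, tokens-per-frame, dim) tensor, text prompt embeddings are already aligned to the visual space, and the function names, similarity threshold, and keep ratio are illustrative rather than the paper's exact implementation.

```python
# Minimal, learning-free sketch of temporal (similarity-based) and spatial
# (text-based) token compression. Thresholds and ratios are assumptions.
import torch
import torch.nn.functional as F


def temporal_compress(frame_tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Merge tokens of adjacent frames whose similarity-based score is high.

    frame_tokens: (T, N, D) visual tokens for T frames.
    Returns (T', N, D) with redundant neighboring frames averaged together.
    """
    kept = [frame_tokens[0]]
    for t in range(1, frame_tokens.shape[0]):
        prev, curr = kept[-1], frame_tokens[t]
        # Similarity-based score: mean cosine similarity between matching tokens
        # of neighboring frames (a differential score would use (curr - prev) norms).
        sim = F.cosine_similarity(prev, curr, dim=-1).mean()
        if sim > sim_threshold:
            kept[-1] = (prev + curr) / 2  # merge the redundant frame into the previous one
        else:
            kept.append(curr)
    return torch.stack(kept)


def spatial_text_compress(visual_tokens: torch.Tensor,
                          text_tokens: torch.Tensor,
                          keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the visual tokens most correlated with the text prompt.

    visual_tokens: (M, D) flattened visual tokens, aligned to the text space.
    text_tokens:   (L, D) prompt embeddings.
    """
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    # Text-based score: each visual token's max similarity to any prompt token.
    scores = (v @ t.T).max(dim=-1).values            # shape (M,)
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values  # preserve original token order
    return visual_tokens[keep_idx]


# Toy usage: 8 frames x 196 tokens x 768 dims, 16 prompt tokens.
frames = torch.randn(8, 196, 768)
text = torch.randn(16, 768)
frames = temporal_compress(frames)                           # fewer frames kept
tokens = spatial_text_compress(frames.flatten(0, 1), text)   # fewer tokens overall
print(frames.shape, tokens.shape)
```

Because both stages rely only on similarity scores over existing embeddings, they can be dropped in front of a frozen MLLM without any retraining, which is the plug-and-play property highlighted above.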

-----

Key Insights 💡:

→ Visual representations in MLLMs have redundancy in the temporal dimension and sparsity in the spatial dimension.

→ Compressing visual tokens across both dimensions improves inference capability and reduces computational cost, without sacrificing performance significantly.

→ Text-based spatial compression is more effective than topic-based spatial compression.

-----

Results 📊:

→ Joint temporal (similarity-based) and spatial (text-based) compression (T=25%, S=25%) achieves an accuracy of 0.626, exceeding the baseline model, while offering considerable computation savings.

→ Temporal compression alone (differential-based, 25% rate) reduces inference time to 1.289s from the baseline's 2.049s.
