Multimodal LLMs (MLLMs) face deployment challenges due to high computational costs and long inference times, primarily caused by lengthy visual token sequences. This paper proposes a learning-free token compression method that addresses this by reducing redundancy along both the spatial and temporal dimensions of the visual data.
-----
https://arxiv.org/abs/2501.17391
📌 This paper's approach cleverly exploits inherent properties of video data. Temporal redundancy is addressed via frame differencing. Spatial sparsity is handled by focusing on important regions guided by text prompts.
📌 The plug-and-play nature of this method offers a significant advantage. Integration into existing Multimodal LLM frameworks is straightforward. No model retraining is needed, preserving original capabilities.
📌 The dual compression strategy targets distinct data characteristics. Frame-to-frame similarity is used to compress temporally redundant information, while text-based attention guides spatial compression, enhancing computational efficiency.
----------
Methods Explored in this Paper 🔧:
→ Temporal Compression: Reduces redundancy by merging tokens of similar adjacent frames. It uses differential-based scores (differences between neighboring frames) and similarity-based scores (cosine similarity between neighboring frames); see the temporal sketch after this list.
→ Spatial Compression: Prunes irrelevant tokens by identifying key regions. It uses topic-based scores (using the [CLS] token to locate subject-related information) and text-based scores (a similarity matrix between visual tokens and the text prompt); see the spatial sketch after this list.
→ Text-based Compression: Calculates correlations between aligned visual tokens and the text prompt to retain only question-related visual tokens and prune the rest.
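
A minimal sketch of the temporal scoring and merging described above, assuming visual tokens arrive as a (T, N, D) tensor (T frames, N tokens per frame, D channels). Function names, the pooling choice, and the keep ratio are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def temporal_scores(tokens: torch.Tensor, mode: str = "similarity") -> torch.Tensor:
    # tokens: (T, N, D). Returns (T-1,) redundancy scores for each
    # adjacent frame pair; higher score => more redundant (safer to merge).
    a, b = tokens[:-1], tokens[1:]
    if mode == "differential":
        # Differential-based score: a small frame-to-frame change means the
        # pair is redundant, so negate the difference norm to rank pairs.
        return -(b - a).flatten(1).norm(dim=1)
    # Similarity-based score: cosine similarity between mean-pooled frames.
    return F.cosine_similarity(a.mean(dim=1), b.mean(dim=1), dim=-1)

def merge_redundant_frames(tokens: torch.Tensor, keep_ratio: float = 0.25,
                           mode: str = "similarity") -> torch.Tensor:
    # Greedily average the most redundant adjacent frame pairs until
    # roughly keep_ratio of the frames remain (an illustrative policy).
    T = tokens.shape[0]
    n_merge = min(T - max(1, int(T * keep_ratio)), T - 1)
    merge_starts = set(temporal_scores(tokens, mode).topk(n_merge).indices.tolist())
    out, t = [], 0
    while t < T:
        if t in merge_starts and t + 1 < T:
            out.append((tokens[t] + tokens[t + 1]) / 2)  # merge the pair
            t += 2
        else:
            out.append(tokens[t])
            t += 1
    return torch.stack(out)  # (T', N, D) with T' <= T
```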
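And a companion sketch of the two spatial scores, under the assumption that visual and text tokens live in a shared (aligned) embedding space; the max-over-prompt-tokens reduction and the helper names are my own illustrative choices:

```python
import torch
import torch.nn.functional as F

def topic_based_scores(patch_tokens: torch.Tensor, cls_token: torch.Tensor) -> torch.Tensor:
    # Topic-based score: similarity of each patch token to the vision
    # encoder's [CLS] token, which summarizes the image's subject.
    v = F.normalize(patch_tokens, dim=-1)   # (N, D)
    c = F.normalize(cls_token, dim=-1)      # (D,)
    return v @ c                            # (N,)

def text_based_scores(visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    # Text-based score: similarity matrix between visual tokens and the
    # text-prompt tokens, reduced to each visual token's best match
    # against any prompt token.
    v = F.normalize(visual_tokens, dim=-1)  # (N, D)
    q = F.normalize(text_tokens, dim=-1)    # (L, D)
    return (v @ q.T).max(dim=-1).values     # (N,)

def prune_spatial(visual_tokens: torch.Tensor, scores: torch.Tensor,
                  keep_ratio: float = 0.25) -> torch.Tensor:
    # Keep only the top-scoring visual tokens, preserving spatial order.
    k = max(1, int(visual_tokens.shape[0] * keep_ratio))
    keep = scores.topk(k).indices.sort().values
    return visual_tokens[keep]
```

The paper's reported T=25%, S=25% setting would correspond to keep_ratio=0.25 in both sketches.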
-----
Key Insights 💡:
→ Visual representations in MLLMs have redundancy in the temporal dimension and sparsity in the spatial dimension.
→ Compressing visual tokens across both dimensions reduces computational cost and speeds up inference, without significantly sacrificing performance.
→ Text-based spatial compression is more effective than topic-based spatial compression.
-----
Results 📊:
→ Joint temporal (similarity-based) and spatial (text-based) compression at T=25%, S=25% achieves 0.626 accuracy, exceeding the baseline model while offering considerable computation savings.
→ Temporal compression alone (differential, 25% rate) cuts inference time from the baseline's 2.049s to 1.289s.