
"Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration"

The podcast on this paper is generated with Google's Illuminate.

This framework turns messy token reduction techniques into a clean, three-stage pipeline.

This paper introduces a unified "filter-correlate-compress" framework to standardize token reduction methods in Multimodal LLMs (MLLMs). It addresses the fragmentation of existing token reduction approaches by decomposing each method into three distinct stages, making them easier to understand, compare, and extend.

-----

https://arxiv.org/abs/2411.17686

🤔 Original Problem:

MLLMs suffer from high computational costs because self-attention scales quadratically with sequence length, and visual inputs contribute hundreds of extra tokens. Various token reduction methods exist, but they lack a clear structure and a common basis for comparison, making it hard to understand what drives their effectiveness or to build upon them.
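To make the cost argument concrete: attention cost grows with the square of the token count, so pruning visual tokens pays off superlinearly. A back-of-the-envelope sketch; the token counts and the n²·d cost model are illustrative assumptions, not figures from the paper:

```python
# Rough illustration: self-attention cost scales quadratically with
# sequence length, so trimming visual tokens pays off superlinearly.
# Token counts below are illustrative, not taken from the paper.

def attention_cost(n_tokens: int, d_model: int = 4096) -> float:
    """Per-layer attention cost, proportional to n^2 * d
    (QK^T and AV each scale this way; constants cancel in ratios)."""
    return n_tokens**2 * d_model

full = attention_cost(576 + 64)    # e.g., 576 visual tokens + 64 text tokens
reduced = attention_cost(64 + 64)  # visual tokens pruned down to 64

print(f"relative attention cost: {reduced / full:.1%}")  # ~4% of original
```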

-----

🔧 Solution in this Paper:

→ The paper introduces a three-stage paradigm: filter (identify tokens to discard), correlate (determine where the discarded information should be preserved), and compress (fuse tokens efficiently); a minimal sketch of one such step follows this list.

→ The FiCoCo method instantiates this paradigm in three variants: FiCoCo-V (reduction in the visual encoder), FiCoCo-L (reduction in the LLM decoder), and FiCoCo-VL (reduction in both phases).

→ Each stage maintains a consistent design objective while allowing flexible implementations, making the paradigm adaptable across different approaches.
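Here is a minimal sketch of one filter-correlate-compress step, assuming precomputed per-token importance scores for filtering, cosine similarity for correlation, and simple averaging for compression. These are simplified stand-ins for the paper's concrete design choices, and all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def filter_correlate_compress(tokens, importance, keep_ratio=0.5):
    """One illustrative reduction step over visual tokens.

    tokens:     (N, D) visual token features
    importance: (N,) per-token scores, e.g. attention mass received
    """
    n = tokens.size(0)
    n_keep = max(1, int(n * keep_ratio))

    # 1) Filter: rank tokens by importance; the lowest-scoring are discarded.
    order = importance.argsort(descending=True)
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]

    # 2) Correlate: for each discarded token, find the most similar
    #    kept token via cosine similarity.
    sim = F.cosine_similarity(
        tokens[drop_idx].unsqueeze(1),  # (n_drop, 1, D)
        tokens[keep_idx].unsqueeze(0),  # (1, n_keep, D)
        dim=-1,
    )                                   # (n_drop, n_keep)
    target = sim.argmax(dim=-1)         # kept token each dropped one maps to

    # 3) Compress: fuse each dropped token into its target by averaging.
    out = tokens[keep_idx].clone()
    counts = torch.ones(n_keep, 1)
    out.index_add_(0, target, tokens[drop_idx])
    counts.index_add_(0, target, torch.ones(drop_idx.numel(), 1))
    return out / counts

# Toy usage
feats = torch.randn(16, 8)
scores = torch.rand(16)
reduced = filter_correlate_compress(feats, scores, keep_ratio=0.25)
print(reduced.shape)  # torch.Size([4, 8])
```

The paradigm fixes what each stage must decide while leaving the scoring and fusion rules open, which is what lets FiCoCo-V, FiCoCo-L, and FiCoCo-VL share the same skeleton.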

-----

💡 Key Insights:

→ Token reduction methods can be unified under a common framework

→ Visual tokens need a balanced assessment of local and task-specific redundancy (illustrated in the sketch after this list)

→ Token-adaptive compression outperforms fixed token merging strategies

→ Reducing tokens in the LLM decoder phase yields better performance than reducing them in the visual encoder
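To illustrate the second insight, one hedged way a per-token redundancy score could blend a local signal (similarity to other visual tokens) with a task-specific signal (attention received from text tokens); the blend and its weighting are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def redundancy_scores(vis_tokens, text_attn, alpha=0.5):
    """Blend two redundancy signals per visual token (illustrative only):
    - local: a token highly similar to other visual tokens is redundant
    - task:  a token that text tokens rarely attend to is redundant
    vis_tokens: (N, D) features; text_attn: (N,) attention mass from text.
    """
    # Local redundancy: mean cosine similarity to the other visual tokens.
    x = F.normalize(vis_tokens, dim=-1)
    sim = x @ x.T                      # (N, N) pairwise cosine similarity
    sim.fill_diagonal_(0.0)
    local = sim.mean(dim=-1)

    # Task redundancy: low attention from the text side -> more redundant.
    task = 1.0 - text_attn / (text_attn.max() + 1e-6)

    return alpha * local + (1 - alpha) * task

scores = redundancy_scores(torch.randn(16, 8), torch.rand(16))
print(scores.topk(4).indices)  # 4 most redundant tokens, candidates to drop
```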

-----

📊 Results:

→ Achieves 82.4% FLOPs reduction with minimal performance impact

→ Requires only 17.6% of the base model's computational cost (the complement of the 82.4% FLOPs reduction)

→ Uses approximately 67.6% of the GPU memory of the original LLaVA-1.5-7B
