LLaVA-Mini squeezes visual understanding into one token, making multimodal AI blazing fast.
Enables real-time multimodal interactions by compressing visual inputs into just 1 vision token while maintaining strong performance.
-----
https://arxiv.org/abs/2501.03895
🤔 Original Problem:
Multimodal LLMs like LLaVA must process hundreds of vision tokens per image (576 for LLaVA-v1.5), driving up computational cost and inference latency. This severely limits real-time applications and the handling of high-resolution images or long videos.
-----
🔍 Key Insights:
→ Vision tokens matter most in the early layers of the LLM, where visual information is fused into text tokens (a way to probe this is sketched after this list)
→ Later layers focus mainly on instruction tokens that have already incorporated visual context
→ Pre-fusing visual information before compression preserves understanding better than direct token reduction
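As a rough illustration of how this layer-wise insight could be probed, here is a minimal PyTorch sketch that measures, per LLM layer, how much attention mass text/instruction tokens place on the vision tokens. The function name, the assumption that vision tokens occupy the first positions of the sequence, and the toy attention maps are illustrative, not the paper's analysis code.

```python
import torch

def vision_attention_share(attn_maps, num_vision_tokens):
    """For each layer, return the average fraction of attention that
    text/instruction tokens place on the vision tokens.

    attn_maps: list of [batch, heads, seq_len, seq_len] attention tensors,
               one per layer. Vision tokens are assumed to occupy positions
               [0, num_vision_tokens) in the sequence (illustrative layout).
    """
    shares = []
    for layer_attn in attn_maps:
        # Queries restricted to the text tokens that follow the vision block.
        text_to_all = layer_attn[:, :, num_vision_tokens:, :]
        # Attention mass landing on the vision-token positions.
        text_to_vision = text_to_all[..., :num_vision_tokens].sum(dim=-1)
        # Average over batch, heads, and text positions -> one scalar per layer.
        shares.append(text_to_vision.mean().item())
    return shares

# Toy usage with random attention maps (rows normalized to sum to 1).
layers, B, H, V, T = 4, 1, 8, 576, 64
maps = [torch.softmax(torch.randn(B, H, V + T, V + T), dim=-1) for _ in range(layers)]
print(vision_attention_share(maps, num_vision_tokens=V))
```

With real attention maps, the per-layer shares would show the pattern the paper reports: high attention to vision tokens early, dropping off in later layers.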
-----
⚡ Solution in this Paper:
→ Introduces modality pre-fusion to integrate visual information into text tokens before the LLM backbone
→ Uses query-based compression to reduce hundreds of vision tokens into just 1 token
→ Employs learnable compression queries that interact with vision tokens through cross-attention (see the sketch after this list)
→ Preserves spatial information using 2D positional encoding during compression
→ Handles high-resolution images by splitting them into sub-images, and processes videos by sampling frames at 1 frame per second
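Below is a minimal PyTorch sketch of the two mechanisms described above: query-based compression via cross-attention with learnable queries, and a modality pre-fusion stack that lets text tokens absorb visual information before the LLM backbone. Module names, hidden sizes, the learnable 2D position table, and the simplified fusion stack are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Compress a grid of vision tokens into a handful of learnable query
    tokens via cross-attention (just 1 query in the extreme LLaVA-Mini setting)."""

    def __init__(self, dim=1024, num_queries=1, num_heads=8, grid=24):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Simplified stand-in for 2D positional encoding: one learnable
        # embedding per cell of the (grid x grid) vision-token map.
        self.pos_2d = nn.Parameter(torch.randn(grid * grid, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens):
        # vision_tokens: [batch, grid*grid, dim] from the vision encoder
        b = vision_tokens.size(0)
        keys = vision_tokens + self.pos_2d                 # inject spatial position
        q = self.queries.unsqueeze(0).expand(b, -1, -1)    # broadcast learnable queries
        compressed, _ = self.cross_attn(q, keys, keys)
        return compressed                                  # [batch, num_queries, dim]

class ModalityPreFusion(nn.Module):
    """A few transformer layers that let text tokens attend to the full set of
    uncompressed vision tokens *before* the LLM backbone, so visual information
    is already fused into the text stream when only the compressed token is passed on."""

    def __init__(self, dim=1024, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)

    def forward(self, vision_tokens, text_tokens):
        x = torch.cat([vision_tokens, text_tokens], dim=1)
        fused = self.fusion(x)
        # Keep only the (now vision-aware) text tokens for the LLM.
        return fused[:, vision_tokens.size(1):, :]

# Toy forward pass: 576 vision tokens + 32 text tokens
# -> 1 compressed vision token + 32 vision-aware text tokens.
vision = torch.randn(2, 576, 1024)
text = torch.randn(2, 32, 1024)
one_token = QueryCompressor()(vision)
fused_text = ModalityPreFusion()(vision, text)
print(one_token.shape, fused_text.shape)  # [2, 1, 1024] and [2, 32, 1024]
```

The LLM backbone would then receive the single compressed vision token concatenated with the fused text tokens, which is where the large FLOP and memory savings come from.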
-----
📊 Results:
→ Matches LLaVA-v1.5 performance using 0.17% of vision tokens (1 vs 576)
→ Reduces FLOPs by 77% and latency from 100ms to 40ms
→ Decreases per-image memory from 360MB to 0.6MB
→ Processes 3+ hour videos on consumer GPUs