Single visual token + largest possible LLM = optimal Vision Language Model (VLM) performance
Maximize LLM size, minimize visual tokens - the secret to faster VLMs
https://arxiv.org/abs/2411.03312
Original Problem 🤔:
Vision Language Models (VLMs) suffer high latency in real-world deployment because the LLM must process hundreds of image tokens per query. Existing remedies either shrink the LLM or cut the number of visual tokens, but the optimal trade-off between the two remains unclear.
-----
Key Insights 🔍:
→ Downstream error falls roughly 5x faster when scaling LLM parameters than when scaling the visual token count
→ For visual reasoning tasks, using just one visual token with the largest LLM the inference budget allows gives the best performance (see the budget sweep sketched after this list)
→ For OCR and other text-heavy tasks the trend reverses completely: more visual tokens become crucial
→ The optimal token count increases with text input length, but a larger LLM still remains beneficial
→ Token compression research should shift its focus from moderate compression (e.g., 576→144 tokens) to extreme compression (e.g., 576→1 or 4 tokens)
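To make the trade-off concrete, here is a minimal Python sketch of a fixed-inference-budget sweep. It assumes an illustrative power-law error model in LLM size and visual token count, with the LLM exponent set 5x larger than the token exponent; the functional form and all constants are placeholders for illustration, not the paper's fitted values.

```python
# Hedged sketch: fixed-inference-budget sweep under an assumed power-law
# error model. All constants are illustrative placeholders, not fitted values.
def predicted_error(n_params_b, n_tokens,
                    a=0.6, alpha=0.3, b=0.25, beta=0.06, e_inf=0.15):
    # alpha / beta = 5 encodes "error falls ~5x faster with LLM size than
    # with visual token count" (the visual-reasoning regime).
    return e_inf + a * n_params_b ** (-alpha) + b * n_tokens ** (-beta)

# Inference cost is roughly proportional to LLM size times tokens processed;
# visual tokens dominate the prefill, so cost ~ N * T in this toy setup.
budget = 7 * 576  # e.g. a 7B LLM reading all 576 visual tokens

for n_tokens in [576, 144, 36, 4, 1]:
    # Largest LLM the budget allows (uncapped here; in practice capped by
    # the biggest model actually available).
    n_params_b = budget / n_tokens
    err = predicted_error(n_params_b, n_tokens)
    print(f"tokens={n_tokens:>3d}  LLM={n_params_b:7.1f}B  predicted error={err:.3f}")
```

Under these assumed exponents the predicted error keeps dropping as tokens are traded for parameters, bottoming out at a single visual token; the reversal on OCR-heavy tasks corresponds to visual tokens mattering far more there.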
-----
Solution in this Paper 🛠️:
→ Developed scaling laws that model VLM performance as a function of both LLM size and visual token count
→ Experimented with Qwen LLM backbones at 0.5B, 1.8B, 4B, 7B, and 14B parameters, evaluating on 9 visual reasoning benchmarks
→ Introduced QueCC (Query-based Convolutional Cross-attention), a token compression technique designed for high-compression regimes
→ Replaced bilinear interpolation with a learnable depth-wise 2D convolution filter for better token downsampling (a sketch follows this list)
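A minimal PyTorch sketch of the depth-wise convolutional downsampling idea is below. This is not the paper's QueCC implementation; the module name, the 576 = 24x24 token-grid assumption, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseTokenDownsampler(nn.Module):
    """Hedged sketch: downsample a grid of visual tokens with a learnable
    depth-wise 2D convolution instead of fixed bilinear interpolation.
    Names and hyperparameters are illustrative, not the paper's code."""
    def __init__(self, dim: int, stride: int):
        super().__init__()
        # groups=dim makes the convolution depth-wise: one spatial filter per
        # channel, so the layer learns how to pool without mixing channels.
        self.pool = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride, groups=dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from the vision encoder, with
        # num_tokens forming a square grid (e.g. 576 = 24 x 24 patches).
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, side, side)  # (b, d, H, W)
        x = self.pool(x)                                       # (b, d, H/s, W/s)
        return x.flatten(2).transpose(1, 2)                    # (b, n/s^2, d)

# Example: compress 576 tokens to 4 (stride 12); stride 24 would give 1 token.
down = DepthwiseTokenDownsampler(dim=1024, stride=12)
out = down(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 4, 1024])
```

Because the group count equals the channel count, each channel learns its own pooling filter, making this a learnable drop-in replacement for bilinear downsampling at negligible parameter cost.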
-----
Results 📊:
→ At the one-token level, QueCC narrowed the gap between the next-best compression method and the original (uncompressed) LLaVA-1.5 by 12% on MME and 19% on MMB
→ At the four-token level, the gap narrowed by 26% on MME and 21% on MMB
→ Maintained strong performance on GQA, MME, ScienceQA, and VQAv2 across compression rates