"Inference Optimal VLMs Need Only One Visual Token but Larger Models"

The podcast on this paper is generated with Google's Illuminate.

Single visual token + largest possible LLM = optimal VLM (Vision Language Model) performance

Maximize LLM size, minimize visual tokens - the secret to faster VLMs

https://arxiv.org/abs/2411.03312

Original Problem 🤔:

Vision Language Models (VLMs) face high inference latency in real-world deployment because they process hundreds of visual tokens per image. Existing remedies either shrink the LLM or reduce the number of visual tokens, but the optimal trade-off between the two has remained unclear.

-----

Key Insights 🔍:

→ Error changes roughly 5x faster with LLM parameter count than with the number of visual tokens

→ For visual reasoning tasks, a single visual token paired with the largest possible LLM gives the best performance at a fixed inference cost (see the cost sketch after this list)

→ For OCR and other text-heavy tasks the trend reverses completely: more visual tokens become crucial

→ The optimal visual token count grows with the length of the text input, but a larger LLM still remains beneficial

→ Token compression research should shift its focus from moderate compression (576→144 tokens) to extreme compression (576→1 or 4 tokens)
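
To make the inference-cost trade-off concrete, here is a minimal back-of-the-envelope sketch (not from the paper): it approximates per-query prefill compute as 2 * parameters * tokens and compares a few configurations. The model sizes, token counts, and the 50-token text prompt are illustrative assumptions.

# Rough inference-cost comparison. Prefill FLOPs for one query are
# approximated as 2 * N_params * n_tokens; attention terms are ignored.
# All model sizes and token counts below are illustrative assumptions.

def approx_prefill_flops(n_params: float, n_visual: int, n_text: int) -> float:
    """Approximate prefill FLOPs for one query."""
    return 2.0 * n_params * (n_visual + n_text)

configs = [
    ("7B LLM, 576 visual tokens (LLaVA-1.5-like)", 7e9, 576, 50),
    ("7B LLM, 1 visual token", 7e9, 1, 50),
    ("14B LLM, 1 visual token", 14e9, 1, 50),
]

for name, n_params, n_vis, n_txt in configs:
    tflops = approx_prefill_flops(n_params, n_vis, n_txt) / 1e12
    print(f"{name:45s} ~{tflops:.2f} TFLOPs")

Under this crude cost model, the 14B model reading a single visual token is still several times cheaper per query than the 7B model reading the full 576-token image representation, which is the intuition behind "fewest tokens, biggest LLM" for visual reasoning.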

-----

Solution in this Paper 🛠️:

→ Developed scaling laws that model VLM performance as a function of both LLM size and visual token count (an illustrative fitting sketch follows this list)

→ Ran experiments with Qwen-{0.5,1.8,4,7,14}B LLM backbones, evaluated on 9 visual reasoning benchmarks

→ Proposed QueCC (Query-based Convolutional Cross-attention), a token compression technique for high-compression regimes (a hedged architecture sketch follows this list)

→ Used a learnable depth-wise 2D convolution filter instead of bilinear interpolation for better token downsampling
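
As a rough illustration of what fitting such a scaling law might look like, the sketch below fits an additive power law, error(N, T) = e_inf + A * N^(-alpha) + B * T^(-beta), to synthetic data points. The functional form, the synthetic numbers, and the fitted values are all assumptions made for illustration; they are not the paper's exact parameterization or results.

# Illustration only: fitting a VLM scaling law of the assumed form
#   error(N, T) = e_inf + A * N**(-alpha) + B * T**(-beta)
# where N is the LLM parameter count and T is the number of visual tokens.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, e_inf, A, alpha, B, beta):
    N, T = X
    return e_inf + A * N**(-alpha) + B * T**(-beta)

# Synthetic (N, T, error) observations standing in for benchmark measurements.
rng = np.random.default_rng(0)
N = np.array([0.5e9, 1.8e9, 4e9, 7e9, 14e9] * 4)   # Qwen-style model sizes
T = np.repeat([1.0, 16.0, 144.0, 576.0], 5)        # visual token counts
err = scaling_law((N, T), 0.25, 40.0, 0.30, 0.08, 0.35) + rng.normal(0, 0.002, N.size)

params, _ = curve_fit(scaling_law, (N, T), err,
                      p0=[0.2, 10.0, 0.3, 0.1, 0.3], maxfev=20000)
e_inf, A, alpha, B, beta = params
print(f"fitted alpha (LLM-size exponent)     = {alpha:.3f}")
print(f"fitted beta  (visual-token exponent) = {beta:.3f}")

In the paper's fits, error changes roughly 5x faster with LLM parameters than with visual tokens; in a law of this shape, that asymmetry shows up in the fitted coefficients and exponents.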
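
For concreteness, here is a hedged PyTorch sketch of what a query-based convolutional cross-attention compressor could look like: a learnable depth-wise 2D convolution downsamples the visual token grid, and the downsampled tokens attend back to the full-resolution tokens. This is a guess based only on the summary above; layer choices, shapes, and the role of the user's text query in the real QueCC module may differ.

# Hedged sketch of a query-based convolutional cross-attention compressor.
# NOT the paper's QueCC implementation; hyperparameters, layer choices, and
# wiring are illustrative assumptions based on the summary above.
import torch
import torch.nn as nn

class ConvCrossAttnCompressor(nn.Module):
    def __init__(self, dim: int = 1024, grid: int = 24, out_tokens: int = 4, heads: int = 8):
        super().__init__()
        self.grid = grid
        out_side = int(out_tokens ** 0.5)
        stride = grid // out_side
        # Learnable depth-wise 2D conv downsamples the token grid
        # (in place of bilinear interpolation) to produce the queries.
        self.down = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, grid*grid, dim), e.g. 576 CLIP patch tokens.
        B, L, D = vis_tokens.shape
        x = vis_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        q = self.down(x).flatten(2).transpose(1, 2)   # (B, out_tokens, D)
        # Downsampled tokens query the full-resolution tokens via cross-attention.
        kv = self.norm_kv(vis_tokens)
        out, _ = self.attn(self.norm_q(q), kv, kv)
        return out                                    # compressed visual tokens

# Example: compress 576 visual tokens down to 4 before feeding the LLM.
tokens = torch.randn(2, 576, 1024)
compressed = ConvCrossAttnCompressor(out_tokens=4)(tokens)
print(compressed.shape)  # torch.Size([2, 4, 1024])

A natural extension, and possibly what the "query" in QueCC refers to, would be to condition this compression on the user's text query; the sketch above deliberately leaves that out.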

-----

Results 📊:

→ At the one-token level, QueCC reduced the gap to the original LLaVA-1.5 by 12% on MME and 19% on MMB relative to the next-best method

→ At the four-token level, it reduced the gap by 26% on MME and 21% on MMB

→ Maintained strong performance on GQA, MME, ScienceQA, and VQAv2 across compression rates
