Single visual token + largest possible LLM = optimal Vision Language Model (VLM) performance
Maximize LLM size, minimize visual tokens - the secret to faster VLMs
https://arxiv.org/abs/2411.03312
Original Problem 🤔:
Vision Language Models (VLMs) suffer high latency in real-world deployment because the LLM must process hundreds of image tokens per query. Existing remedies either shrink the LLM or cut the number of visual tokens, but the optimal trade-off between the two remains unclear.
-----
Key Insights 🔍:
→ Downstream error falls roughly 5x faster when scaling LLM parameters than when scaling the visual token count
→ For visual reasoning tasks, using just one visual token with the largest LLM the inference budget allows gives the best performance (see the budget sweep sketched after this list)
→ For OCR and other text-heavy tasks the trend reverses completely: more visual tokens become crucial
→ The optimal token count increases with text input length, but a larger LLM still remains beneficial
→ Token compression research should shift its focus from moderate compression (e.g., 576→144 tokens) to extreme compression (e.g., 576→1 or 4 tokens)
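To make the trade-off concrete, here is a minimal Python sketch of a fixed-inference-budget sweep. It assumes an illustrative power-law error model in LLM size and visual token count, with the LLM exponent set 5x larger than the token exponent; the functional form and all constants are placeholders for illustration, not the paper's fitted values.

```python
# Hedged sketch: fixed-inference-budget sweep under an assumed power-law
# error model. All constants are illustrative placeholders, not fitted values.
def predicted_error(n_params_b, n_tokens,
                    a=0.6, alpha=0.3, b=0.25, beta=0.06, e_inf=0.15):
    # alpha / beta = 5 encodes "error falls ~5x faster with LLM size than
    # with visual token count" (the visual-reasoning regime).
    return e_inf + a * n_params_b ** (-alpha) + b * n_tokens ** (-beta)

# Inference cost is roughly proportional to LLM size times tokens processed;
# visual tokens dominate the prefill, so cost ~ N * T in this toy setup.
budget = 7 * 576  # e.g. a 7B LLM reading all 576 visual tokens

for n_tokens in [576, 144, 36, 4, 1]:
    # Largest LLM the budget allows (uncapped here; in practice capped by
    # the biggest model actually available).
    n_params_b = budget / n_tokens
    err = predicted_error(n_params_b, n_tokens)
    print(f"tokens={n_tokens:>3d}  LLM={n_params_b:7.1f}B  predicted error={err:.3f}")
```

Under these assumed exponents the predicted error keeps dropping as tokens are traded for parameters, bottoming out at a single visual token; the reversal on OCR-heavy tasks corresponds to visual tokens mattering far more there.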
-----
Solution in this Paper 🛠️:
→ Developed scaling laws that model VLM performance as a function of both LLM size and visual token count
→ Experimented with Qwen LLM backbones at 0.5B, 1.8B, 4B, 7B, and 14B parameters, evaluating on 9 visual reasoning benchmarks
→ Introduced QueCC (Query-based Convolutional Cross-attention), a token compression technique designed for high-compression regimes
→ Replaced bilinear interpolation with a learnable depth-wise 2D convolution filter for better token downsampling (a sketch follows this list)
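A minimal PyTorch sketch of the depth-wise convolutional downsampling idea is below. This is not the paper's QueCC implementation; the module name, the 576 = 24x24 token-grid assumption, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseTokenDownsampler(nn.Module):
    """Hedged sketch: downsample a grid of visual tokens with a learnable
    depth-wise 2D convolution instead of fixed bilinear interpolation.
    Names and hyperparameters are illustrative, not the paper's code."""
    def __init__(self, dim: int, stride: int):
        super().__init__()
        # groups=dim makes the convolution depth-wise: one spatial filter per
        # channel, so the layer learns how to pool without mixing channels.
        self.pool = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride, groups=dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from the vision encoder, with
        # num_tokens forming a square grid (e.g. 576 = 24 x 24 patches).
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, side, side)  # (b, d, H, W)
        x = self.pool(x)                                       # (b, d, H/s, W/s)
        return x.flatten(2).transpose(1, 2)                    # (b, n/s^2, d)

# Example: compress 576 tokens to 4 (stride 12); stride 24 would give 1 token.
down = DepthwiseTokenDownsampler(dim=1024, stride=12)
out = down(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 4, 1024])
```

Because the group count equals the channel count, each channel learns its own pooling filter, making this a learnable drop-in replacement for bilinear downsampling at negligible parameter cost.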
-----
Results 📊:
→ At the one-token level, QueCC narrowed the gap between the next-best compression method and the original (uncompressed) LLaVA-1.5 by 12% on MME and 19% on MMB
→ At the four-token level, the gap narrowed by 26% on MME and 21% on MMB
→ Maintained strong performance on GQA, MME, ScienceQA, and VQAv2 across compression rates