"CAT: Content-Adaptive Image Tokenization"

A podcast on this paper was generated with Google's Illuminate.

Smart image compression that adapts to content - more tokens for faces, fewer for landscapes.

CAT introduces dynamic image compression based on content complexity, using LLMs to analyze image captions and determine optimal compression ratios for better efficiency and quality.

-----

https://arxiv.org/abs/2501.03120

🤔 Original Problem:

Current image tokenizers use fixed compression ratios regardless of image content, leading to quality loss in complex images and wasted computation on simple ones.

-----

🔍 Solution in this Paper:

→ CAT uses LLMs to analyze image captions and predict content complexity scores on a 1-9 scale.

→ Based on the complexity score, images are assigned one of three compression ratios (8x, 16x, or 32x); a minimal score-to-ratio mapping is sketched after this list.

→ A nested VAE architecture with skip connections supports all three compression levels within a single model; a toy encoder illustrating the idea also follows below.

→ The system processes simpler images (like landscapes) with higher compression and complex ones (like faces/text) with lower compression.
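
For intuition, here is a minimal sketch of the caption-scoring step, assuming a generic chat-completion callable (`llm_chat`) and illustrative score thresholds; the paper's exact prompt and cutoffs may differ:

```python
# Sketch of CAT-style caption-based complexity scoring.
# `llm_chat` is an assumed stand-in for any chat-completion API;
# the prompt wording and thresholds below are illustrative only.

SCORING_PROMPT = (
    "Rate the visual complexity of the image described below on a 1-9 scale "
    "(1 = simple texture/landscape, 9 = dense text, faces, or fine detail). "
    "Reply with a single integer.\n\nCaption: {caption}"
)

def complexity_score(caption: str, llm_chat) -> int:
    """Ask an LLM to rate image complexity from its caption (1-9)."""
    reply = llm_chat(SCORING_PROMPT.format(caption=caption))
    return max(1, min(9, int(reply.strip())))

def compression_ratio(score: int) -> int:
    """Map a 1-9 complexity score to a spatial compression ratio.
    Cutoffs are assumed for illustration, not taken from the paper."""
    if score <= 3:
        return 32  # simple content: compress aggressively
    if score <= 6:
        return 16  # medium complexity
    return 8       # faces / text / fine detail: keep more tokens
```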
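
And a toy PyTorch encoder showing how shared stages can emit latents at 8x, 16x, or 32x downsampling within one model. Skip connections and the decoder are omitted, and the channel sizes and latent heads are assumptions, so treat this as an illustration of the nested idea rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class NestedEncoder(nn.Module):
    """Toy nested encoder: one backbone, three exit depths (8x/16x/32x)."""
    def __init__(self, in_ch=3, base=64, latent_ch=16):
        super().__init__()
        # Five stride-2 stages: after stage k the feature map is 2^k smaller.
        chs = [base, base * 2, base * 4, base * 4, base * 4]
        stages, prev = [], in_ch
        for ch in chs:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, stride=2, padding=1),
                nn.SiLU(),
            ))
            prev = ch
        self.stages = nn.ModuleList(stages)
        # One 1x1 latent head per supported compression ratio.
        self.heads = nn.ModuleDict({
            "8":  nn.Conv2d(chs[2], latent_ch, 1),
            "16": nn.Conv2d(chs[3], latent_ch, 1),
            "32": nn.Conv2d(chs[4], latent_ch, 1),
        })

    def forward(self, x, ratio: int):
        # Run only as deep as the requested ratio needs:
        # 8x -> 3 stages, 16x -> 4 stages, 32x -> 5 stages.
        depth = {8: 3, 16: 4, 32: 5}[ratio]
        for stage in self.stages[:depth]:
            x = stage(x)
        return self.heads[str(ratio)](x)

# A 256x256 image becomes a 32x32 latent at 8x compression.
z = NestedEncoder()(torch.randn(1, 3, 256, 256), ratio=8)
print(z.shape)  # torch.Size([1, 16, 32, 32])
```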

-----

💡 Key Insights:

→ Text descriptions and LLMs can effectively predict optimal image compression ratios

→ Complex images with faces or text need more tokens for quality preservation

→ Natural scenes can be compressed more aggressively without visible quality loss

→ Adaptive compression improves both reconstruction quality and computational efficiency

-----

📊 Results:

→ Reduced rFID by 12% on CelebA and 39% on ChartQA datasets

→ Achieved FID of 4.56 on ImageNet generation, outperforming fixed-ratio baselines

→ Improved inference throughput by 18.5%

→ Used 16% fewer tokens while maintaining quality on natural images

-----

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
