Do image patches follow language patterns? Not quite as we thought.
This paper digs into the statistics behind image patches and why they suggest vision models need different architectures than language models.
Visual tokens: more chaotic and less structured than language tokens.
https://arxiv.org/abs/2411.05001
🎯 Original Problem:
Transformer models like LLaVA treat image patches as discrete tokens, much like words, but it has been unclear whether these visual tokens follow the same statistical patterns as natural language.
-----
🔧 Solution in this Paper:
→ Analyzed visual token distributions across multiple datasets using VQ-VAE tokenizers (a toy tokenize-and-count sketch follows this list)
→ Compared frequency statistics, token innovation rates, and entropy patterns between visual and natural languages
→ Examined token granularity using correlation analysis with object parts
→ Used Compound Probabilistic Context-Free Grammars to study grammatical structures
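To make the tokenize-and-count step concrete, here's a minimal, hypothetical sketch in plain NumPy: it quantizes image patches against a random codebook (a stand-in for a trained VQ-VAE, which is what the paper actually uses) and tallies how often each token appears. The patch size, codebook size, and random "dataset" are placeholder assumptions, not the paper's setup.

```python
# Minimal, hypothetical sketch (not the paper's code): quantize image patches
# against a codebook, nearest-neighbour style, then tally token frequencies.
# The random codebook stands in for a trained VQ-VAE; a real tokenizer learns it.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 1024          # number of discrete visual "words" (assumed value)
PATCH = 16                    # assumed patch size
PATCH_DIM = 3 * PATCH * PATCH # flattened RGB patch dimension

codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH_DIM))

def tokenize(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image into patches and map each to its nearest code index."""
    h, w, _ = image.shape
    patches = (
        image[: h - h % PATCH, : w - w % PATCH]
        .reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, PATCH_DIM)
    )
    # Squared Euclidean distance to every codebook entry; the argmin is the token id.
    d2 = (patches**2).sum(1, keepdims=True) - 2 * patches @ codebook.T + (codebook**2).sum(1)
    return d2.argmin(axis=1)

# Random images stand in for a real dataset.
images = [rng.normal(size=(64, 64, 3)) for _ in range(100)]
counts = Counter()
for img in images:
    counts.update(tokenize(img).tolist())

# Rank-frequency view: a Zipfian corpus would show freq roughly ~ 1 / rank^s.
freqs = sorted(counts.values(), reverse=True)
print("distinct tokens used:", len(counts), "of", CODEBOOK_SIZE)
print("top-5 token frequencies:", freqs[:5])
```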
-----
💡 Key Insights:
→ Visual tokens follow Zipfian distributions but with more uniform token usage
→ Almost all tokens appear within the first 100 images, indicating a much higher token innovation rate than natural language
→ Visual tokens operate at intermediate granularity, representing object parts
→ Visual languages have higher entropy and lower compression ratios than natural languages (see the entropy/compression sketch after this list)
→ Visual tokens lack the cohesive grammatical structure found in natural language
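The entropy and compression claims can be illustrated with a small, self-contained sketch (not the paper's pipeline): it computes Shannon entropy over the empirical unigram distribution and a gzip-based compression gain for two toy token streams, one near-uniform (visual-like) and one Zipfian (text-like).

```python
# Minimal sketch (not the paper's pipeline): Shannon entropy of the empirical
# unigram distribution and a gzip compression gain for two toy token streams.
# A flatter, higher-entropy stream (like visual tokens) compresses less well.
import gzip
import math
import random
from collections import Counter

def shannon_entropy(tokens):
    """Entropy in bits of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compression_gain(tokens):
    """Fraction by which gzip shrinks the serialized stream (higher = more structure)."""
    raw = " ".join(map(str, tokens)).encode()
    return 1 - len(gzip.compress(raw)) / len(raw)

random.seed(0)
vocab = range(1024)
visual_like = random.choices(vocab, k=10_000)                                       # near-uniform usage
text_like = random.choices(vocab, weights=[1 / (r + 1) for r in vocab], k=10_000)   # Zipfian usage

for name, seq in [("visual-like", visual_like), ("text-like", text_like)]:
    print(f"{name:12s} entropy={shannon_entropy(seq):5.2f} bits  "
          f"compression={compression_gain(seq):.1%}")
```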
-----
📊 Results:
→ Visual languages show a 2.9% compression rate vs. 34.9% for natural languages
→ Visual tokens have an entropy of 10.7±1.9 vs. 9.0±0.9 for natural language
→ Visual bi-grams show the most natural-looking distribution curves, suggesting a potential correspondence with text uni-grams (a toy comparison is sketched below)
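As a rough illustration of comparing distribution curves (purely toy data, not the paper's measurements), the sketch below builds uni-gram and bi-gram counts from a synthetic token stream and summarizes each rank-frequency curve by its log-log Zipf slope.

```python
# Illustrative only (toy data, not the paper's measurements): uni-gram vs.
# bi-gram rank-frequency curves, summarized by their log-log Zipf slope
# (a classic Zipfian corpus has a slope close to -1).
import random
from collections import Counter
import numpy as np

random.seed(0)
tokens = [random.randrange(256) for _ in range(50_000)]   # toy visual token stream

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))                # adjacent token pairs

def zipf_slope(counter):
    """Slope of log(frequency) vs. log(rank) for the rank-frequency curve."""
    freqs = np.array(sorted(counter.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    return np.polyfit(np.log(ranks), np.log(freqs), 1)[0]

print("uni-gram Zipf slope:", round(zipf_slope(unigrams), 2))
print("bi-gram  Zipf slope:", round(zipf_slope(bigrams), 2))
```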