The paper constructs a series of autoregressive video models called Toto.
Toto turns videos into sequences of visual tokens and pre-trains transformers on next-token prediction, achieving competitive performance across diverse vision tasks.
-----
https://arxiv.org/abs/2501.05453
🤔 Original Problem:
→ Current vision models rely heavily on curated datasets and task-specific architectures, limiting their generalization ability
→ While LLMs excel with next-token prediction on text, similar approaches for video understanding remain underexplored
-----
🔧 Solution in this Paper:
→ Toto transforms videos into sequences of discrete visual tokens using a dVAE tokenizer
→ It trains a causal transformer with a LLaMA-style architecture to predict the next token in the sequence (see the training sketch after this list)
→ The model processes both images and videos in a unified format through tokenization
→ Pre-training happens on over 1 trillion visual tokens from diverse sources
→ Uses attention pooling over intermediate layers of the frozen model to extract representations for downstream tasks
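A minimal PyTorch sketch of the pre-training objective (my own illustration, not the authors' code: a vanilla transformer stands in for the LLaMA-style backbone, and the 8192-entry codebook and 256-token clips are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192   # assumed dVAE codebook size
D, LAYERS, HEADS = 512, 8, 8
SEQ = 256      # assumed tokens per clip

class CausalVideoLM(nn.Module):
    """Plain causal transformer standing in for the LLaMA-style backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.pos = nn.Embedding(SEQ, D)
        layer = nn.TransformerEncoderLayer(D, HEADS, 4 * D, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, LAYERS)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):  # tokens: (B, T) int64 dVAE codes
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (B, T, VOCAB) logits

model = CausalVideoLM()
tokens = torch.randint(0, VOCAB, (2, SEQ))  # stand-in for tokenized frames
logits = model(tokens)
# Shift by one so position t predicts token t+1, then standard cross-entropy.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                       tokens[:, 1:].reshape(-1))
loss.backward()
```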
-----
💡 Key Insights:
→ Middle layers of the decoder-only model yield the best representations for downstream tasks, outperforming the final layer (probe sketch below)
→ Video models scale with compute, but more slowly than text models, following the power law L(C) = 7.32·C^(-0.0378) (worked example below)
→ Frame redundancy in videos makes next-token prediction easier than in text, which may limit what the model learns
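A hedged sketch of attention-pooling probing (class name and dimensions are mine, not the paper's API): one learned query cross-attends over frozen activations taken from an intermediate layer, and a linear head classifies the pooled vector.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Single-query attention pooling over frozen features, plus a linear head."""
    def __init__(self, dim=512, num_classes=1000):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, features):  # features: (B, T, dim), frozen activations
        q = self.query.expand(features.size(0), -1, -1)  # one query per sample
        pooled, _ = self.attn(q, features, features)     # (B, 1, dim)
        return self.head(pooled.squeeze(1))              # (B, num_classes)

# Features would come from a middle block of the frozen model, where the
# paper finds probing works best, rather than from the final layer.
probe = AttentionProbe()
feats = torch.randn(4, 256, 512)  # stand-in activations
logits = probe(feats)             # (4, 1000)
```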
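Plugging numbers into the reported scaling law makes the slow decay concrete (compute units follow the paper's convention; the snippet is just arithmetic):

```python
# L(C) = 7.32 * C^(-0.0378): validation loss as a function of training compute.
def toto_loss(C):
    return 7.32 * C ** -0.0378

for C in (1e18, 1e20, 1e22):
    print(f"C = {C:.0e}  ->  L = {toto_loss(C):.3f}")

# A 100x increase in compute multiplies the loss by 100**-0.0378 ≈ 0.84,
# i.e. only a ~16% reduction: video models scale, but more slowly than text.
```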
-----
📊 Results:
→ Achieves 75.3% top-1 accuracy on ImageNet with a 1B-parameter model
→ Matches the state-of-the-art with 74.4% accuracy on Kinetics-400 action recognition
→ Scores 62.4 (J&F) on the DAVIS video object tracking benchmark
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/