The paper constructs a series of autoregressive video models, called Toto.
Toto turns videos into sequences of visual tokens and trains transformers to predict the next token, an autoregressive pre-training recipe that achieves competitive performance across diverse vision tasks.
-----
https://arxiv.org/abs/2501.05453
🤔 Original Problem:
→ Current vision models rely heavily on curated datasets and task-specific architectures, limiting their generalization ability
→ While LLMs excel at next-token prediction on text, similar approaches for video understanding remain underexplored
-----
🧠 Solution in this Paper:
→ Toto transforms videos into sequences of visual tokens using a dVAE tokenizer
→ It trains a causal transformer with the LLaMA architecture to predict the next token in the sequence
→ The model processes both images and videos in a unified format through tokenization
→ Pre-training happens on over 1 trillion visual tokens from diverse sources
→ Uses attention pooling to extract representations from the most relevant model layers (see the sketch after this list)
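
Here is a minimal PyTorch-style sketch of how those pieces fit together. All names in it (AttentionPool, next_token_loss, dvae, model, mid_layer) are hypothetical stand-ins for illustration, not the paper's released code:

```python
# Illustrative sketch only; component names are hypothetical, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Learned-query attention pooling over one layer's token representations."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> pooled: (batch, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # Causal LM objective: predict token t+1 from tokens <= t.
    # logits: (batch, seq_len, vocab); token_ids: (batch, seq_len)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

# Hypothetical training step:
#   token_ids = dvae.encode(video)       # each frame -> grid of discrete codes,
#                                        # flattened into one long sequence
#   logits, hiddens = model(token_ids)   # LLaMA-style causal transformer
#   loss = next_token_loss(logits, token_ids)
# Downstream, probe an intermediate layer rather than the last one:
#   features = AttentionPool(dim=1024)(hiddens[mid_layer])
```

The key design point is that the same next-token objective covers images and videos alike: a single frame and a long clip both reduce to one flat token sequence.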
-----
💡 Key Insights:
→ Middle layers of decoder-only models contain the best representations for downstream tasks
→ Video models scale with compute, but more slowly than text models, following the power law L(C) = 7.32 · C^(-0.0378) (see the quick check after this list)
→ Frame redundancy in videos makes next-token prediction easier than in text, potentially limiting learning
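
To make the scaling bullet concrete, a quick numerical check of that power law (the constants are taken directly from the fit above; the compute values are arbitrary examples):

```python
# Loss as a function of compute, per the reported fit L(C) = 7.32 * C^(-0.0378)
def video_loss(compute_flops: float) -> float:
    return 7.32 * compute_flops ** -0.0378

print(video_loss(1e20))                     # ~1.284
print(video_loss(1e21))                     # ~1.177
print(video_loss(1e21) / video_loss(1e20))  # ~0.917
```

A 10x increase in compute cuts the loss by only about 8%, which is the quantitative version of the "scales slower than text" claim.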
-----
📊 Results:
→ Achieves 75.3% top-1 accuracy on ImageNet with 1B parameters
→ Matches the state-of-the-art 74.4% accuracy on Kinetics-400 action recognition
→ Demonstrates 62.4% accuracy on the DAVIS video tracking benchmark
------
Are you into AI and LLMs? Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments.
🔗 https://rohanpaul.substack.com/