
"An Empirical Study of Autoregressive Pre-training from Videos"

The podcast below on this paper was generated with Google's Illuminate.

The paper constructs a series of autoregressive video models called Toto.

Toto turns videos into token sequences

Toto introduces autoregressive pre-training from videos: videos are treated as sequences of visual tokens, and transformers are trained to predict future tokens, achieving competitive performance across diverse vision tasks.

-----

https://arxiv.org/abs/2501.05453

🤔 Original Problem:

→ Current vision models rely heavily on curated datasets and task-specific architectures, limiting their generalization ability

→ While LLMs excel with next-token prediction on text, similar approaches for video understanding remain underexplored

-----

🔧 Solution in this Paper:

→ Toto transforms videos into sequences of visual tokens using a dVAE tokenizer

→ It trains a causal transformer with a LLaMA-style architecture to predict the next token in the sequence (a minimal training sketch follows this list)

→ The model processes both images and videos in a unified format through tokenization

→ Pre-training happens on over 1 trillion visual tokens from diverse sources

→ Uses attention pooling to extract representations from relevant model layers
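Here is a minimal sketch of that pipeline, assuming frames have already been mapped to discrete codes by a dVAE-style tokenizer. The transformer below is PyTorch's generic encoder with a causal mask, standing in for the paper's LLaMA stack (no RoPE or RMSNorm here), and the vocabulary and model sizes are illustrative, not the paper's exact configuration.

```python
# Sketch of Toto-style pre-training: flatten video frames into discrete
# tokens, then train a causal transformer on next-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192      # dVAE-style codebook size; illustrative
D_MODEL = 512     # illustrative; the paper's 1B model is far larger

class CausalVideoLM(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=D_MODEL, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, tokens):                       # tokens: (B, T) int64
        T = tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)                          # (B, T, VOCAB)

def next_token_loss(model, tokens):
    # Predict token t+1 from tokens <= t: shift logits and targets by one.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))

# Toy usage: a batch of 2 clips, each flattened to 256 codebook indices
# (in the paper, the dVAE maps each frame to a grid of such indices).
fake_tokens = torch.randint(0, VOCAB, (2, 256))
model = CausalVideoLM()
loss = next_token_loss(model, fake_tokens)
loss.backward()
```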

-----

💡 Key Insights:

→ Middle layers of decoder-only models contain the best representations for downstream tasks (see the probing sketch after this list)

→ Video models scale with compute, but more slowly than text models, following the power law L(C) = 7.32·C^(-0.0378) (a quick numeric check also follows below)

→ Frame redundancy in videos makes next-token prediction easier than in text, potentially limiting what the model learns
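The middle-layer finding is measured by probing frozen features with attention pooling. A minimal sketch of such a probe is below: a learned query cross-attends over token features taken from one intermediate layer, and the pooled vector feeds a linear classifier. The layer choice, dimensions, and head design are illustrative, not the paper's exact setup.

```python
# Attention-pooling probe over intermediate-layer features of a frozen
# pretrained model (e.g. grabbed with a forward hook on a middle block).
import torch
import torch.nn as nn

class AttentionPoolProbe(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_classes=1000):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, features):                     # features: (B, T, d)
        q = self.query.expand(features.size(0), -1, -1)
        pooled, _ = self.attn(q, features, features)  # (B, 1, d)
        return self.classifier(pooled.squeeze(1))     # (B, n_classes)

# Usage: probe middle-layer features for, say, ImageNet classification.
mid_features = torch.randn(4, 256, 512)              # (batch, tokens, dim)
probe = AttentionPoolProbe()
logits = probe(mid_features)                         # (4, 1000)
```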
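And a quick numeric check of the reported scaling law, to make "slower than text" concrete. The compute values C here are arbitrary examples; only the formula comes from the paper.

```python
# L(C) = 7.32 * C^(-0.0378): with an exponent this small, each doubling
# of compute cuts the loss by only ~2.6% (2^-0.0378 ≈ 0.974).
def loss_at_compute(C):
    return 7.32 * C ** -0.0378

for C in (1e18, 2e18, 1e20):
    print(f"C={C:.0e}  L={loss_at_compute(C):.3f}")

print(2 ** -0.0378)   # ~0.974 -> ~2.6% loss reduction per doubling
```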

-----

📊 Results:

→ Achieves 75.3% top-1 accuracy on ImageNet with 1B parameters

→ Matches the state-of-the-art with 74.4% accuracy on Kinetics-400 action recognition

→ Scores 62.4% on the DAVIS video tracking benchmark

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
