
"An Empirical Study of Autoregressive Pre-training from Videos"

The podcast below on this paper was generated with Google's Illuminate.

The paper constructs a series of autoregressive video models called Toto.

Toto turns videos into token sequences

Toto introduces autoregressive pre-training from videos: it treats videos as sequences of visual tokens and trains transformers to predict future tokens, achieving competitive performance across diverse vision tasks.
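To make "videos as sequences of visual tokens" concrete, here is a minimal sketch of how frames could be flattened into one sequence and paired up for next-token prediction. Everything in it is an illustrative assumption rather than the paper's code: `toy_tokenize_frame` is a hypothetical stand-in for a real discrete tokenizer, and the vocabulary size and tokens-per-frame values are toy numbers.

```python
import math
import random

VOCAB_SIZE = 8192      # assumed codebook size, chosen for illustration only
TOKENS_PER_FRAME = 4   # toy value; a real tokenizer emits many more per frame


def toy_tokenize_frame(frame_index):
    """Hypothetical stand-in for a discrete tokenizer: one frame -> list of ids."""
    rng = random.Random(frame_index)  # deterministic fake ids per frame
    return [rng.randrange(VOCAB_SIZE) for _ in range(TOKENS_PER_FRAME)]


def video_to_sequence(num_frames):
    """Concatenate per-frame tokens in temporal order.

    An image is treated as a one-frame video, so both share one format.
    """
    sequence = []
    for t in range(num_frames):
        sequence.extend(toy_tokenize_frame(t))
    return sequence


def next_token_pairs(sequence):
    """Shift by one: the model sees tokens[:-1] and must predict tokens[1:]."""
    return sequence[:-1], sequence[1:]


def uniform_baseline_loss(targets):
    """Average cross-entropy (in nats) of a model that guesses uniformly."""
    return sum(-math.log(1.0 / VOCAB_SIZE) for _ in targets) / len(targets)


video_tokens = video_to_sequence(num_frames=3)
image_tokens = video_to_sequence(num_frames=1)
inputs, targets = next_token_pairs(video_tokens)
baseline = uniform_baseline_loss(targets)
```

The uniform baseline equals `log(VOCAB_SIZE)`, which is the loss any trained model should beat; a causal transformer would replace the uniform guess with a learned distribution over the next token.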

-----

https://arxiv.org/abs/2501.05453

🤔 Original Problem:

→ Current vision models rely heavily on curated datasets and task-specific architectures, limiting their generalization ability

→ While LLMs excel with next-token prediction on text, similar approaches for video understanding remain underexplored

-----

🔧 Solution in this Paper:

→ Toto transforms videos into sequences of visual tokens using a dVAE tokenizer

→ It trains a causal transformer with the LLaMA architecture to predict the next token in the sequence

→ The model processes both images and videos in a unified format through tokenization

→ Pre-training happens on over 1 trillion visual tokens from diverse sources

→ It uses attention pooling to extract representations from relevant model layers
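The attention pooling step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a single query vector and omits the learned key/value projections that a real attention-pooling head would typically include. The names `attention_pool`, `features`, and `query` are hypothetical.

```python
import math


def attention_pool(token_features, query):
    """Pool per-token features into one vector with a single query.

    token_features: list of d-dimensional lists (e.g. one layer's outputs).
    query: d-dimensional list; learned in practice, fixed here for the demo.
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    # Scaled dot-product score between the query and each token feature.
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(d)
        for feat in token_features
    ]
    # Numerically stable softmax over tokens.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of token features.
    pooled = [
        sum(w * feat[j] for w, feat in zip(weights, token_features))
        for j in range(d)
    ]
    return pooled, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query gives uniform weights, so pooling reduces to a plain average.
mean_pooled, mean_weights = attention_pool(features, [0.0, 0.0])

# A query aligned with the first dimension favors tokens 0 and 2.
focused_pooled, focused_weights = attention_pool(features, [10.0, 0.0])
```

Unlike average pooling, the weights here depend on the query, so a trained probe can learn to emphasize the tokens that matter for a given downstream task.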

-----

💡 Key Insights:

→ Middle layers of decoder-only models contain the best representations for downstream tasks

→ Video models scale with compute, but more slowly than text models, following the power law L(C) = 7.32 · C^(-0.0378)

→ Frame redundancy in videos makes next-token prediction easier than in text, potentially limiting learning
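To get a feel for how shallow the reported power law is, the short sketch below plugs numbers into L(C) = 7.32 · C^(-0.0378). The post does not state the units of C, so the code only works with ratios of compute, which do not depend on units. The function names are hypothetical.

```python
A = 7.32        # coefficient from the reported fit
ALPHA = 0.0378  # magnitude of the reported exponent


def predicted_loss(compute):
    """Loss predicted by the fit, L(C) = A * C^(-ALPHA)."""
    return A * compute ** (-ALPHA)


def loss_ratio(compute_multiplier):
    """Factor by which loss changes when compute is multiplied by k.

    L(k*C) / L(C) = k^(-ALPHA), independent of C and of A.
    """
    return compute_multiplier ** (-ALPHA)


def compute_multiplier_for(loss_fraction):
    """Compute multiplier needed to scale the loss by `loss_fraction`.

    Solves k^(-ALPHA) = loss_fraction for k.
    """
    return loss_fraction ** (-1.0 / ALPHA)


ratio_10x = loss_ratio(10.0)                   # loss factor after 10x compute
needed_for_10pct = compute_multiplier_for(0.9)  # compute for 10% lower loss
```

Under this fit, 10x more compute lowers the predicted loss by only about 8%, and a 10% reduction needs roughly 16x more compute.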

-----

📊 Results:

→ Achieves 75.3% top-1 accuracy on ImageNet with 1B parameters

→ Matches state-of-the-art 74.4% accuracy on Kinetics-400 action recognition

→ Demonstrates 62.4% accuracy on the DAVIS video tracking benchmark

------

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
