"Don't Look Twice: Faster Video Transformers with Run-Length Tokenization"

The podcast on this paper is generated with Google's Illuminate.

Train video AI 30% faster by skipping the parts of frames that don't change.

Run-Length Tokenization (RLT) makes video transformers faster by removing patches that repeat across frames, much like video compression.

Basically, why process the same content twice when nothing has changed?

https://arxiv.org/abs/2411.05222

🎯 Original Problem:

Video transformers are extremely slow to train because they process many redundant tokens from repeated frames, forcing researchers to work with short, low-resolution videos.

-----

🔧 Solution in this Paper:

→ Run-Length Tokenization (RLT) compares patches at the same spatial position across consecutive frames and identifies content that repeats over time

→ RLT removes the redundant patches and replaces each run of repeats with a single patch plus a run-length encoding (see the sketch after this list)

→ The method requires no dataset-specific tuning and works with hardware optimizations like Flash Attention

→ RLT uses a learnable length positional encoding to help the model understand temporal relationships

→ Example packing is used to efficiently handle the varying sequence lengths that RLT produces during training
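
Here is a minimal sketch of the tokenization step in PyTorch (not the authors' code): `patches` is assumed to be a (T, N, D) tensor with T frames, N patches per frame, and D values per patch, and the difference `threshold` is a hypothetical hyperparameter standing in for whatever static-content test the paper uses.

```python
import torch

def run_length_tokenize(patches: torch.Tensor, threshold: float = 0.1):
    """Drop temporally repeated patches, keeping one patch plus its run length."""
    T, N, D = patches.shape
    kept_tokens = []   # surviving patches, one per run
    run_lengths = []   # how many consecutive frames each kept patch covers

    run_start = patches[0].clone()             # patch that opened the current run, per position
    lengths = torch.ones(N, dtype=torch.long)  # current run length, per position

    for t in range(1, T):
        # Compare each patch with the same patch position in the previous frame.
        diff = (patches[t] - patches[t - 1]).abs().mean(dim=-1)   # (N,)
        changed = diff > threshold

        for n in torch.nonzero(changed).flatten().tolist():
            # The run at this position ends: emit its first patch and its length.
            kept_tokens.append(run_start[n])
            run_lengths.append(int(lengths[n]))
            run_start[n] = patches[t, n]
            lengths[n] = 1

        # Unchanged positions simply extend their current run.
        lengths[~changed] += 1

    # Flush the runs that are still open after the last frame.
    for n in range(N):
        kept_tokens.append(run_start[n])
        run_lengths.append(int(lengths[n]))

    return torch.stack(kept_tokens), torch.tensor(run_lengths)
```

The surviving tokens and their run lengths are what the transformer actually processes; because different clips keep different numbers of tokens, example packing (last bullet above) batches these variable-length sequences efficiently.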

-----

💡 Key Insights:

→ Most video frames contain redundant information that can be safely compressed

→ Content-aware token reduction is more effective than fixed reduction schemes

→ Simple patch comparison is sufficient for identifying redundant content

→ Length encoding preserves temporal information while reducing computation (a minimal sketch follows this list)
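
One way to picture the length encoding is as a learnable embedding table indexed by each token's run length, added on top of the usual positional embeddings. The module below is a hypothetical sketch under that reading, not the paper's implementation; `max_run_length` and the clamping of longer runs are assumptions.

```python
import torch
import torch.nn as nn

class LengthEncoding(nn.Module):
    """Learnable embedding indexed by a token's run length (hypothetical sketch)."""

    def __init__(self, max_run_length: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(max_run_length + 1, dim)

    def forward(self, tokens: torch.Tensor, run_lengths: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, dim); run_lengths: (num_tokens,) integers >= 1
        idx = run_lengths.clamp(max=self.embed.num_embeddings - 1)
        return tokens + self.embed(idx)
```

Paired with the tokenizer sketch above, each kept patch would get this encoding added before entering the transformer, so a token covering ten static frames is distinguishable from one covering a single frame.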

-----

📊 Results:

→ Reduces fine-tuning time by 30% while matching baseline performance

→ Increases model throughput by 35% with only 0.1% accuracy drop

→ Speeds up training on 30 FPS video by over 100%

→ Reduces token count by up to 80% on longer video datasets
