Train video AI 30% faster by skipping frames that don't change.
Run-Length Tokenization (RLT) speeds up video transformers by removing patches that repeat across frames, much like video compression.
Basically, why process the same frame twice when nothing changed?
https://arxiv.org/abs/2411.05222
🎯 Original Problem:
Video transformers are extremely slow to train because they process redundant tokens from repeated frames, which forces researchers to work with short, low-resolution videos.
-----
🔧 Solution in this Paper:
→ Run-Length Tokenization (RLT) compares temporally consecutive video patches to find content that repeats over time
→ RLT removes each run of repeated patches and replaces it with a single patch plus a length encoding (see the sketch after this list)
→ The method requires no dataset-specific tuning and works with hardware optimizations like Flash Attention
→ RLT uses a learnable length positional encoding to help the model understand temporal relationships
→ Example packing efficiently batches the resulting variable-length token sequences during training
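To make the patch-comparison step concrete, here is a minimal PyTorch sketch of the idea (not the authors' implementation): it assumes patches are already extracted and normalized, uses a hypothetical L1 threshold `tau`, and keeps the first patch of each temporal run together with the run's length. The paper's exact difference measure, threshold, and batching details may differ.

```python
import torch

def run_length_tokenize(patches: torch.Tensor, tau: float = 0.05):
    """Group each spatial position's patches into runs of (near-)identical
    content over time, keep one patch per run, and record the run length.

    patches: (T, N, D) — T frames, N spatial patches per frame, D flattened
             pixel values per patch, assumed normalized to [0, 1].
    tau:     static L1 threshold below which a patch counts as a repeat of
             the previous frame (0.05 is a placeholder value).
    """
    T, N, _ = patches.shape
    # A patch is "new" if it differs enough from the same position one frame earlier.
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=-1)            # (T-1, N)
    is_new = torch.cat(
        [torch.ones(1, N, dtype=torch.bool), diff > tau], dim=0)      # frame 0 always kept

    tokens, coords, lengths = [], [], []
    for n in range(N):
        starts = torch.nonzero(is_new[:, n]).flatten().tolist()
        ends = starts[1:] + [T]
        for s, e in zip(starts, ends):
            tokens.append(patches[s, n])   # keep the first patch of the run
            coords.append((s, n))          # (frame, spatial position) for positional encoding
            lengths.append(e - s)          # how many frames the run covers

    return torch.stack(tokens), torch.tensor(coords), torch.tensor(lengths)

# Toy check: a static segment collapses to one token per spatial position.
video = torch.rand(16, 196, 768)           # 16 frames of 14x14 patches
video[5:12] = video[4]                     # frames 5-11 repeat frame 4 exactly
toks, pos, lens = run_length_tokenize(video)
print(toks.shape[0], "tokens kept out of", 16 * 196)
```

On a clip with a static segment, the repeated patches at every spatial position collapse into a single token plus a length, which is where the token-count savings come from.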
-----
💡 Key Insights:
→ Most video frames contain redundant information that can be safely compressed
→ Content-aware token reduction is more effective than fixed reduction schemes
→ Simple patch comparison is sufficient for identifying redundant content
→ Length encoding preserves temporal information while reducing computation (sketched below)
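A hedged sketch of how a learnable length positional encoding could work: each kept token gets an extra embedding indexed by its run length, so the model still knows how many frames the token stands in for. The class name, `max_len` cap, and additive combination are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RunLengthEmbedding(nn.Module):
    """Learnable embedding of how many frames each kept token represents."""

    def __init__(self, dim: int, max_len: int = 64):
        super().__init__()
        self.length_embed = nn.Embedding(max_len + 1, dim)

    def forward(self, tokens: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # tokens: (M, dim) embedded patches; lengths: (M,) run lengths in frames
        lengths = lengths.clamp(max=self.length_embed.num_embeddings - 1)
        return tokens + self.length_embed(lengths)

# Example: add length information to 500 kept tokens before the transformer.
embed = RunLengthEmbedding(dim=768)
tokens = torch.randn(500, 768)
lengths = torch.randint(1, 16, (500,))
out = embed(tokens, lengths)               # (500, 768), ready for attention layers
```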
-----
📊 Results:
→ Reduces fine-tuning time by 30% while matching baseline performance
→ Increases model throughput by 35% with only 0.1% accuracy drop
→ Speeds up training on 30 FPS video by more than 100%
→ Reduces token count by up to 80% on longer video datasets