State Space Models enable full video processing without frame sampling or memory bottlenecks.
Video-Ma²mba introduces a novel architecture that swaps the attention mechanism of a video LLM for State Space Models. It achieves linear memory scaling and, with Multi-Axis Gradient Checkpointing (MA-GC), can process 2-hour videos at 1 FPS on a single GPU.
-----
https://arxiv.org/abs/2411.19460
🎯 Original Problem:
Current video-LLMs struggle with long videos because attention memory grows quadratically with sequence length, limiting them to short clips or sparsely sampled frames.
-----
🛠️ Solution in this Paper:
→ Video-Ma²mba replaces attention with State Space Models from the Mamba-2 framework, giving linear scaling with sequence length (see the scan sketch after this list).
→ The Multi-Axis Gradient Checkpointing method stores activations strategically across layer and sequence dimensions.
→ MA-GC reduces memory complexity from O(L·S) to O(S), enabling processing of million-token sequences.
→ The model uses a three-stage training pipeline: cross-modal alignment, long video knowledge learning, and supervised fine-tuning.
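
To make the linear-scaling claim concrete, here is a minimal sketch of the kind of state-space recurrence Mamba-style blocks compute, written sequentially for clarity (Mamba-2 actually uses an efficient parallel/chunked scan). The function name, shapes, and toy usage are my assumptions, not the paper's code.

```python
# Minimal sketch of a selective state-space recurrence (illustrative only).
import torch

def ssm_scan(x, A, B, C):
    """Sequential SSM recurrence: h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t.

    x: (S, D) input tokens; A, B, C: (S, D, N) per-token ("selective") parameters.
    The running state h has a fixed size (D, N), so memory stays constant in S,
    whereas attention materializes an S x S score matrix.
    """
    S, D = x.shape
    N = A.shape[-1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(S):
        h = A[t] * h + B[t] * x[t].unsqueeze(-1)   # (D, N) state update
        ys.append((h * C[t]).sum(-1))              # (D,) readout per token
    return torch.stack(ys)                          # (S, D)

# Toy usage: a 1,024-token sequence with model dim 16 and state size 8.
S, D, N = 1024, 16, 8
y = ssm_scan(torch.randn(S, D),
             torch.rand(S, D, N) * 0.9,   # decay factors < 1 keep the state stable
             torch.randn(S, D, N),
             torch.randn(S, D, N))
```

Because the only thing carried across time is the fixed-size state h, memory per token stays constant, which is what makes hour-scale sequences feasible in the first place.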
-----
💡 Key Insights:
→ State Space Models can effectively replace attention for long sequence processing
→ Checkpointing along both the layer and sequence axes significantly reduces memory requirements (see the sketch after this list)
→ Full frame processing at 1 FPS captures better temporal information than sparse sampling
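
Below is a minimal sketch of the two-axis checkpointing idea, assuming a hypothetical recurrent-layer API (`layer(chunk, state)` and `layer.init_state()`). It illustrates the concept of storing activations only at grid boundaries; it is not the paper's MA-GC implementation.

```python
# Sketch of checkpointing along both the layer axis and the sequence axis.
import torch
from torch.utils.checkpoint import checkpoint

def forward_two_axis(layers, x, chunk_len=4096):
    """Run a stack of recurrent (SSM-style) layers, keeping only the tensors at
    (layer, chunk) grid boundaries; everything inside a grid cell is recomputed
    during the backward pass instead of being stored."""
    chunks = x.split(chunk_len, dim=0)            # sequence-axis boundaries
    for layer in layers:                           # layer-axis boundaries
        state = layer.init_state()                 # hypothetical API: fixed-size SSM state
        outs = []
        for c in chunks:
            # checkpoint() saves only the inputs/outputs of this cell for backward.
            c_out, state = checkpoint(layer, c, state, use_reentrant=False)
            outs.append(c_out)
        chunks = outs                              # feed this layer's output to the next
    return torch.cat(chunks, dim=0)
```

The design point is that checkpointing on one axis alone still leaves memory growing with the other axis; placing boundaries on both axes is what keeps the stored activations close to the size of the sequence itself.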
-----
📊 Results:
→ Processes sequences up to 2^19 tokens (2+ hours at 1 FPS)
→ With MA-GC, a 2^19-token sequence needs only 42.2GB of memory, roughly what a 2^14-token sequence needs (42.6GB) without MA-GC
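
As a back-of-envelope check on those numbers (the implied per-frame token budget below is my own arithmetic, not a figure from the paper):

```python
# 2 hours of video at 1 FPS vs. the reported maximum sequence length.
max_tokens = 2 ** 19          # 524,288 tokens
frames     = 2 * 60 * 60 * 1  # 7,200 frames at 1 FPS
print(max_tokens, max_tokens / frames)   # ~72.8 tokens of budget per frame
```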