"Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing"

The podcast on this paper is generated with Google's Illuminate.

State Space Models enable full video processing without frame sampling or memory bottlenecks.

Video-Ma^2mba replaces the attention mechanism in the LLM backbone with State Space Models, so memory scales linearly with sequence length. Combined with Multi-Axis Gradient Checkpointing (MA-GC), it can process 2-hour videos at 1 FPS on a single GPU.

-----

https://arxiv.org/abs/2411.19460

🎯 Original Problem:

Current video-LLMs struggle with long videos due to quadratic memory growth from attention mechanisms, limiting them to processing only short clips or sparse frames.
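To make the quadratic-vs-linear gap concrete, here is a back-of-envelope sketch. The tokens-per-frame budget and fp16 assumption are illustrative choices for this sketch, not values taken from the paper.

```python
# Back-of-envelope memory estimate: quadratic attention vs. linear-state processing.
# Assumptions (illustrative, not from the paper): ~72 visual tokens per frame, fp16 activations.

FPS = 1
HOURS = 2
TOKENS_PER_FRAME = 72            # hypothetical per-frame token budget
BYTES_PER_VALUE = 2              # fp16

frames = HOURS * 3600 * FPS                      # 7,200 frames
seq_len = frames * TOKENS_PER_FRAME              # ~518k tokens, close to 2**19

# Dense attention materializes an S x S score matrix per head: memory grows with S^2.
attn_scores_gb = seq_len**2 * BYTES_PER_VALUE / 1e9   # ~537 GB for a single S x S matrix

print(f"sequence length ~ {seq_len:,} tokens (2**19 = {2**19:,})")
print(f"one S x S attention score matrix ~ {attn_scores_gb:,.0f} GB")
# An SSM instead carries a fixed-size recurrent state, so activation memory
# grows only linearly with S, which is what makes 2-hour videos tractable.
```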

-----

🛠️ Solution in this Paper:

→ Video-Ma^2mba replaces attention with State Space Models from the Mamba-2 framework for linear scaling.

→ The Multi-Axis Gradient Checkpointing (MA-GC) method places activation checkpoints strategically along both the layer and sequence axes and recomputes intermediate activations during the backward pass (see the sketch after this list).

→ MA-GC reduces memory complexity from O(L·S) to O(S), enabling processing of million-token sequences.

→ The model uses a three-stage training pipeline: cross-modal alignment, long video knowledge learning, and supervised fine-tuning.
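Below is a minimal PyTorch sketch of the two-axis checkpointing idea: checkpoints are placed along both the layer axis and the sequence axis, and everything in between is recomputed during the backward pass. This is a simplified illustration, not the authors' implementation; `SSMBlock`, `MAGCStack`, `layer_stride`, and `seq_chunk` are hypothetical names, and the real MA-GC additionally exploits the SSM's recurrent structure to keep only a sparse grid of checkpoints and reach the O(S) bound.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SSMBlock(nn.Module):
    """Stand-in for a Mamba-style block: maps (B, S, D) -> (B, S, D)."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return x + torch.tanh(self.proj(x))

class MAGCStack(nn.Module):
    def __init__(self, d, n_layers, layer_stride=4, seq_chunk=4096):
        super().__init__()
        self.layers = nn.ModuleList(SSMBlock(d) for _ in range(n_layers))
        self.layer_stride = layer_stride   # checkpoint every `layer_stride` layers
        self.seq_chunk = seq_chunk         # checkpoint every `seq_chunk` tokens

    def _layer_group(self, x, start):
        # Run one contiguous group of layers; activations inside the group are
        # recomputed during backward instead of being stored.
        for layer in self.layers[start:start + self.layer_stride]:
            x = layer(x)
        return x

    def forward(self, x):                  # x: (B, S, D)
        # Sequence axis: process the sequence in chunks (a real SSM would carry
        # its recurrent state across chunk boundaries; this toy block is token-local).
        outs = []
        for chunk in x.split(self.seq_chunk, dim=1):
            h = chunk
            # Layer axis: only the input of every `layer_stride`-th group is kept.
            for start in range(0, len(self.layers), self.layer_stride):
                h = checkpoint(self._layer_group, h, start, use_reentrant=False)
            outs.append(h)
        return torch.cat(outs, dim=1)

model = MAGCStack(d=256, n_layers=8)
y = model(torch.randn(1, 8192, 256, requires_grad=True))
y.sum().backward()   # intermediate activations are recomputed here, not stored
```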

-----

💡 Key Insights:

→ State Space Models can effectively replace attention for long-sequence processing (a minimal recurrence sketch follows this list)

→ Checkpointing along both the layer and sequence axes significantly reduces memory requirements

→ Full frame processing at 1 FPS captures better temporal information than sparse sampling
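As a companion to the first insight, here is a tiny recurrence sketch showing why state space models scale linearly: each step updates a fixed-size hidden state, so memory does not grow with sequence length. This is a generic diagonal linear recurrence for illustration only, not the selective/gated recurrence used by Mamba-2; `ssm_scan` is a hypothetical name.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (S, D) inputs; A, B, C: (D, N) parameters; returns (S, D) outputs."""
    S, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                 # fixed-size state, independent of S
    y = np.empty_like(x)
    for t in range(S):                   # single pass over the sequence: O(S) time
        h = A * h + B * x[t][:, None]    # elementwise (diagonal) state update
        y[t] = (C * h).sum(axis=1)       # read out each channel from its state
    return y

S, D, N = 10_000, 8, 16
rng = np.random.default_rng(0)
y = ssm_scan(rng.standard_normal((S, D)),
             A=np.full((D, N), 0.9),
             B=rng.standard_normal((D, N)) * 0.1,
             C=rng.standard_normal((D, N)) * 0.1)
print(y.shape)   # (10000, 8); memory held during the scan stays O(D*N), never O(S^2)
```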

-----

📊 Results:

→ Processes sequences up to 2^19 tokens (2+ hours at 1 FPS)

→ With MA-GC, memory use is only 42.2GB at sequence length 2^19, roughly matching the 42.6GB required at 2^14 without MA-GC, i.e. 32× longer sequences at comparable memory