State Space Models enable full video processing without frame sampling or memory bottlenecks.
Video-Ma²mba introduces a novel architecture that swaps the attention mechanism of a video LLM for State Space Models. It achieves linear memory scaling and, with Multi-Axis Gradient Checkpointing (MA-GC), can process 2-hour videos at 1 FPS on a single GPU.
-----
https://arxiv.org/abs/2411.19460
🎯 Original Problem:
Current video-LLMs struggle with long videos because attention memory grows quadratically with sequence length, limiting them to short clips or sparsely sampled frames.
-----
🛠️ Solution in this Paper:
→ Video-Ma²mba replaces attention with State Space Models from the Mamba-2 framework, giving linear scaling with sequence length (see the scan sketch after this list).
→ The Multi-Axis Gradient Checkpointing method stores activations strategically across layer and sequence dimensions.
→ MA-GC reduces memory complexity from O(L·S) to O(S), enabling processing of million-token sequences.
→ The model uses a three-stage training pipeline: cross-modal alignment, long video knowledge learning, and supervised fine-tuning.
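
To make the linear-scaling claim concrete, here is a minimal sketch of the kind of state-space recurrence Mamba-style blocks compute, written sequentially for clarity (Mamba-2 actually uses an efficient parallel/chunked scan). The function name, shapes, and toy usage are my assumptions, not the paper's code.

```python
# Minimal sketch of a selective state-space recurrence (illustrative only).
import torch

def ssm_scan(x, A, B, C):
    """Sequential SSM recurrence: h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t.

    x: (S, D) input tokens; A, B, C: (S, D, N) per-token ("selective") parameters.
    The running state h has a fixed size (D, N), so memory stays constant in S,
    whereas attention materializes an S x S score matrix.
    """
    S, D = x.shape
    N = A.shape[-1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(S):
        h = A[t] * h + B[t] * x[t].unsqueeze(-1)   # (D, N) state update
        ys.append((h * C[t]).sum(-1))              # (D,) readout per token
    return torch.stack(ys)                          # (S, D)

# Toy usage: a 1,024-token sequence with model dim 16 and state size 8.
S, D, N = 1024, 16, 8
y = ssm_scan(torch.randn(S, D),
             torch.rand(S, D, N) * 0.9,   # decay factors < 1 keep the state stable
             torch.randn(S, D, N),
             torch.randn(S, D, N))
```

Because the only thing carried across time is the fixed-size state h, memory per token stays constant, which is what makes hour-scale sequences feasible in the first place.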
-----
💡 Key Insights:
→ State Space Models can effectively replace attention for long sequence processing
→ Checkpointing along both the layer and sequence axes significantly reduces memory requirements (see the sketch after this list)
→ Full frame processing at 1 FPS captures better temporal information than sparse sampling
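
Below is a minimal sketch of the two-axis checkpointing idea, assuming a hypothetical recurrent-layer API (`layer(chunk, state)` and `layer.init_state()`). It illustrates the concept of storing activations only at grid boundaries; it is not the paper's MA-GC implementation.

```python
# Sketch of checkpointing along both the layer axis and the sequence axis.
import torch
from torch.utils.checkpoint import checkpoint

def forward_two_axis(layers, x, chunk_len=4096):
    """Run a stack of recurrent (SSM-style) layers, keeping only the tensors at
    (layer, chunk) grid boundaries; everything inside a grid cell is recomputed
    during the backward pass instead of being stored."""
    chunks = x.split(chunk_len, dim=0)            # sequence-axis boundaries
    for layer in layers:                           # layer-axis boundaries
        state = layer.init_state()                 # hypothetical API: fixed-size SSM state
        outs = []
        for c in chunks:
            # checkpoint() saves only the inputs/outputs of this cell for backward.
            c_out, state = checkpoint(layer, c, state, use_reentrant=False)
            outs.append(c_out)
        chunks = outs                              # feed this layer's output to the next
    return torch.cat(chunks, dim=0)
```

The design point is that checkpointing on one axis alone still leaves memory growing with the other axis; placing boundaries on both axes is what keeps the stored activations close to the size of the sequence itself.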
-----
📊 Results:
→ Processes sequences up to 2^19 tokens (2+ hours at 1 FPS)
→ With MA-GC, a 2^19-token sequence needs only 42.2GB of memory, roughly what a 2^14-token sequence needs (42.6GB) without MA-GC
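
As a back-of-envelope check on those numbers (the implied per-frame token budget below is my own arithmetic, not a figure from the paper):

```python
# 2 hours of video at 1 FPS vs. the reported maximum sequence length.
max_tokens = 2 ** 19          # 524,288 tokens
frames     = 2 * 60 * 60 * 1  # 7,200 frames at 1 FPS
print(max_tokens, max_tokens / frames)   # ~72.8 tokens of budget per frame
```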