MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

MarDini splits video generation into low-res planning and high-res details for efficient, high-quality results.

Rohan Paul

Nov 05, 2024

Asymmetric network design makes complex video generation computationally feasible at high resolutions

MarDini combines masked auto-regression with diffusion: it plans at low resolution and generates at high resolution

🎯 Original Problem:

Video generation models face challenges in handling both temporal consistency and high-resolution spatial details while being computationally efficient. Current approaches either rely heavily on image pre-training or struggle with computational costs at high resolutions.

🔧 Solution in this Paper:

• Introduces MarDini, which combines masked auto-regression (MAR) with diffusion models (DM) in an asymmetric design

• Uses heavy MAR planning model for temporal dependencies at low resolution

• Employs lightweight DM for spatial details at high resolution

• Implements Identity Attention to handle the disparity between noisy tokens and reference tokens

• Uses progressive training strategy with gradually increasing task difficulty

• Introduces cross-attention mechanism between planning and generation models
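
The bullets above name Identity Attention without saying how it works. The sketch below assumes one plausible form, not confirmed by this summary: reference tokens attend only to themselves, while noisy tokens attend to every token. The function name and the boolean-mask representation are hypothetical and chosen for illustration.

```python
def identity_attention_mask(is_reference: list[bool]) -> list[list[bool]]:
    """Build mask[q][k], True when query token q may attend to key token k.

    Assumption (not from the summary): reference tokens see only themselves,
    noisy tokens see all tokens.
    """
    n = len(is_reference)
    mask = []
    for q in range(n):
        if is_reference[q]:
            row = [k == q for k in range(n)]  # reference token: itself only
        else:
            row = [True] * n                  # noisy token: every token
        mask.append(row)
    return mask


# Token 0 is a reference token; tokens 1 and 2 are noisy.
mask = identity_attention_mask([True, False, False])
```

In a real model, a mask like this would be passed to the attention layer so that disallowed query-key pairs are excluded before the softmax.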

💡 Key Insights:

• Asymmetric resolution design significantly reduces computational costs

• Direct video training without image pre-training is possible with progressive strategy

• Spatio-temporal attention becomes feasible at scale through asymmetric design

• Planning signals effectively guide temporal consistency
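
A rough way to see why the asymmetric design saves compute: full spatio-temporal self-attention compares every token with every other token, so its cost grows with the square of the token count. The numbers below are a back-of-the-envelope sketch. The 12 frames and 512 resolution come from the results reported here, while the 256 planning resolution and the patch size of 16 are assumptions made only for this example.

```python
def num_tokens(frames: int, resolution: int, patch: int) -> int:
    """Tokens in a clip when each square frame is cut into patch x patch tiles."""
    per_side = resolution // patch
    return frames * per_side * per_side


def attention_pairs(tokens: int) -> int:
    """Full self-attention scores every token against every token."""
    return tokens * tokens


# Hypothetical patch size and low resolution, for illustration only.
FRAMES, PATCH = 12, 16
high = num_tokens(FRAMES, 512, PATCH)  # tokens at generation resolution
low = num_tokens(FRAMES, 256, PATCH)   # tokens at an assumed planning resolution

# How many times more query-key pairs the high-res clip needs.
ratio = attention_pairs(high) / attention_pairs(low)
print(high, low, ratio)
```

Under these assumptions, halving the side length cuts the token count by 4x and the attention pairs by 16x, which is why running the heavy model at low resolution matters.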

📊 Results:

• Achieves state-of-the-art FVD scores on VIDIM-Bench: 102.87 on DAVIS and 197.69 on UCF101

• Generates 12-frame clips at 512 resolution in just 6.05 seconds

• Requires only 42.57 GB of GPU memory for high-resolution generation

• Performs competitively on VBench with a score of 90.95, without any image pre-training

🔧 How is MarDini trained?

The training process involves three progressive stages:

• Initial Stage: Separate training of planning and generation models

• Joint-Model Stage: Combined end-to-end training using masked diffusion loss

• Joint-Task Stage: Gradually decreasing mask ratios to handle more challenging generation tasks
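
The Joint-Task stage can be pictured as a schedule that changes the mask ratio over training steps and then picks which frames to mask. This is a minimal sketch, not the paper's recipe: the linear schedule, the start and end ratios, the step counts, and both function names are assumptions.

```python
import random


def mask_ratio(step: int, total_steps: int, start: float, end: float) -> float:
    """Move the mask ratio linearly from `start` to `end` over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_frame_mask(num_frames: int, ratio: float, rng: random.Random) -> list[bool]:
    """Pick which frames are masked; True marks a masked frame."""
    # Always keep at least one masked frame so there is something to learn from.
    num_masked = max(1, min(num_frames, round(num_frames * ratio)))
    masked = set(rng.sample(range(num_frames), num_masked))
    return [i in masked for i in range(num_frames)]


rng = random.Random(0)
for step in (0, 500, 1000):
    # Placeholder values: 0.9 down to 0.1 over 1000 steps.
    ratio = mask_ratio(step, total_steps=1000, start=0.9, end=0.1)
    mask = sample_frame_mask(12, ratio, rng)
    print(step, round(ratio, 2), sum(mask))
```

Because `start` and `end` are plain parameters, the same helper covers a schedule running in either direction.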

🔍 The asymmetric architecture consists of two main components:

• A heavy-weight MAR planning model that processes low-resolution frames and generates planning signals

• A lightweight DM generation model that uses these planning signals to produce high-resolution frames via diffusion denoising
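
The way the two components fit together can be sketched as a control-flow skeleton. It assumes the planner runs once per clip while the denoiser runs at every diffusion step, which this summary does not state. The models here are toy stand-ins that only count calls, and every name is hypothetical.

```python
from typing import Callable

Tensor = list[float]  # stand-in for a real tensor type


def generate(
    low_res: Tensor,
    noisy_high_res: Tensor,
    planner: Callable[[Tensor], Tensor],
    denoiser: Callable[[Tensor, Tensor, int], Tensor],
    num_steps: int,
) -> Tensor:
    """Run the planner once, then refine the high-res sample step by step."""
    signals = planner(low_res)            # heavy model, low resolution
    x = noisy_high_res
    for t in reversed(range(num_steps)):  # light model, high resolution
        x = denoiser(x, signals, t)
    return x


# Toy stand-ins; real models would be neural networks.
calls = {"planner": 0, "denoiser": 0}


def toy_planner(low_res: Tensor) -> Tensor:
    calls["planner"] += 1
    return [sum(low_res) / len(low_res)]  # a single "planning" value


def toy_denoiser(x: Tensor, signals: Tensor, t: int) -> Tensor:
    calls["denoiser"] += 1
    # Move each value halfway toward the planning value.
    return [0.5 * (v + signals[0]) for v in x]


result = generate([1.0, 3.0], [10.0, -6.0, 2.0, 2.0],
                  toy_planner, toy_denoiser, num_steps=4)
print(calls, result)
```

The point of the skeleton is the call pattern: the expensive model is invoked once, and only the cheap model sits inside the denoising loop.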

Rohan's Bytes