"Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06155
Generating high-quality videos with Diffusion Transformer (DiT) models is slow due to expensive attention computation and many sampling steps. This paper addresses slow video generation in 3D DiTs, introducing methods that speed up generation without significantly sacrificing video quality.
This paper proposes the Efficient Video Diffusion Transformer (EFFICIENT-VDIT), which combines sparse attention with a reduced number of sampling steps to make video generation faster.
-----
📌 Exploits inherent video data redundancy via "Attention Tile" discovery. This allows for targeted sparse attention, reducing compute without significant quality loss. Achieves up to 7.8x speedup.
📌 Employs a refined three-stage training: Multi-Step Consistency Distillation, layer-wise sparsity search, and knowledge distillation. This pipeline efficiently transforms a pre-trained 3D Diffusion Transformer into a fast, performant model.
📌 EFFICIENT-VDIT is immediately deployable. It offers substantial inference acceleration, up to 7.8x faster, while preserving video quality. It also integrates well with distributed inference systems for further speedup.
----------
Methods Explored in this Paper 🔧:
→ This paper introduces "Attention Tile," a repetitive block pattern in 3D DiT attention maps. The pattern shows that not every frame needs to attend to every other frame during processing.
→ Based on Attention Tile, the paper proposes sparse 3D attention. Sparse attention reduces computational complexity from quadratic to linear with respect to the number of video frames. Each frame attends to only a limited set of other frames.
→ Multi-step Consistency Distillation (MCD) is used to shorten the sampling process. MCD divides the video generation steps into segments. Consistency distillation is applied within each segment to enable faster generation.
→ EFFICIENT-VDIT combines sparse attention and MCD in a three-stage training process. The stages are Multi-Step Consistency Distillation, Layer-wise Sparsity Search, and Knowledge Distillation. This pipeline refines a pre-trained 3D DiT model to be efficient without losing much quality.
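As a rough illustration of the sparse attention idea (not the paper's actual kernel), a frame-level mask in the spirit of the Attention Tile pattern could keep each frame's diagonal tile plus a small set of globally attended frames. The `global_frames` parameter and the "attend to the first k frames" layout are assumptions for this sketch:

```python
import numpy as np

def sparse_frame_mask(num_frames: int, tokens_per_frame: int,
                      global_frames: int = 2) -> np.ndarray:
    """Build a boolean attention mask where each frame attends to itself
    (diagonal tiles) and to the first `global_frames` frames (a shared
    set of off-diagonal tiles). Layout is illustrative only."""
    f, t = num_frames, tokens_per_frame
    frame_mask = np.eye(f, dtype=bool)      # diagonal tiles: frame -> itself
    frame_mask[:, :global_frames] = True    # every frame -> first k frames
    # Expand the frame-level mask to token level (each tile is t x t).
    return np.kron(frame_mask, np.ones((t, t), dtype=bool))

mask = sparse_frame_mask(num_frames=6, tokens_per_frame=4, global_frames=2)
density = mask.mean()  # fraction of attention entries actually computed
```

Because each frame attends to a fixed number of frames, the number of kept tiles grows linearly with frame count, instead of quadratically as in full attention.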
-----
Key Insights 💡:
→ Attention maps in 3D DiTs exhibit a repetitive "Attention Tile" structure. This redundancy can be exploited for efficiency.
→ Attention scores on the diagonal tiles of the attention map are significantly higher than off-diagonal tiles. This suggests focusing computation on diagonal tiles.
→ The "Attention Tile" pattern is consistent across different video inputs. This allows using a fixed sparse attention mask for various videos.
→ Different layers in the DiT model can tolerate different levels of sparsity. Layer-wise sparsity search optimizes performance by adapting sparsity per layer.
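A quick way to check for the Attention Tile pattern on an attention map is to compare the average score inside diagonal tiles against off-diagonal tiles. A minimal sketch on a synthetic, row-normalized map (the tile size and synthetic data are assumptions, not the paper's measurement protocol):

```python
import numpy as np

def tile_scores(attn: np.ndarray, tile: int) -> tuple[float, float]:
    """Return mean attention inside diagonal tiles vs. off-diagonal tiles."""
    n = attn.shape[0] // tile
    diag, off = [], []
    for i in range(n):
        for j in range(n):
            block = attn[i*tile:(i+1)*tile, j*tile:(j+1)*tile]
            (diag if i == j else off).append(block.mean())
    return float(np.mean(diag)), float(np.mean(off))

# Synthetic attention map with boosted diagonal tiles, row-normalized.
rng = np.random.default_rng(0)
attn = rng.random((12, 12)) + np.kron(np.eye(3), np.ones((4, 4))) * 2.0
attn /= attn.sum(axis=1, keepdims=True)

d, o = tile_scores(attn, tile=4)
assert d > o  # diagonal tiles dominate, as the Attention Tile pattern predicts
```

If `d` is consistently much larger than `o` across inputs, a fixed sparse mask that keeps mainly diagonal tiles is justified, which is the insight the paper exploits.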
-----
Results 📊:
→ Achieves 7.4× to 7.8× speedup when generating 29- and 93-frame 720p videos, with only a marginal performance trade-off on the VBench benchmark.
→ Reduces computation time using sparse attention kernels. For example, a 2:6 attention mask achieves a 1.86× speedup.
→ Maintains video quality on VBench, with scores within 1% of the original model, even at higher speedup ratios.
→ Demonstrates up to 3.91× additional speedup with distributed inference on 4 GPUs using sequence parallelism.
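The attention-cost saving behind these speedups can be sanity-checked arithmetically: full attention over f frames costs on the order of f² frame-to-frame tiles, while the sparse pattern costs roughly f·k for k attended frames per query frame. A rough sketch of that ratio (the mask semantics are an assumption, not the paper's exact 2:6 kernel accounting, and real gains are smaller due to kernel overheads):

```python
def attention_tiles(frames: int, attended_per_frame: int) -> tuple[int, int]:
    """Tile counts for full vs. sparse frame-level attention."""
    full = frames * frames                  # quadratic in frame count
    sparse = frames * attended_per_frame    # linear in frame count
    return full, sparse

full, sparse = attention_tiles(frames=24, attended_per_frame=6)
theoretical_speedup = full / sparse  # upper bound on attention-only speedup
```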