"History-Guided Video Diffusion"
https://arxiv.org/abs/2502.06764
The challenge in video diffusion models lies in effectively using past frames (history) for guidance, especially with varying history lengths. Current models are limited by fixed-size conditioning and ineffective history dropout methods.
This paper proposes the Diffusion Forcing Transformer (DFoT) and History Guidance. DFoT enables flexible conditioning on any history portion. History Guidance enhances video quality and consistency using DFoT's capabilities.
-----
📌 DFoT's core innovation lies in per-frame independent noise levels. This training method enables a single model to handle variable history lengths during video generation, unlike fixed-conditioning models.
📌 History Guidance leverages DFoT's flexibility to significantly boost video quality and temporal consistency. Vanilla History Guidance alone achieves state-of-the-art results, particularly for long video generation.
📌 DFoT is practically advantageous due to its architectural compatibility. Existing video diffusion models can be fine-tuned into DFoT, readily adopting History Guidance without major modifications.
-----
Methods Explored in this Paper 🔧:
→ The paper introduces the Diffusion Forcing Transformer (DFoT), a video diffusion framework that enables flexible conditioning on any portion of the input history.
→ DFoT extends the "noising-as-masking" concept to video diffusion. It trains models by applying independent noise levels to each frame in a video sequence.
→ Unlike conventional methods treating history as separate conditioning, DFoT unifies history and generation target frames. This allows varying noise levels within a sequence.
→ DFoT training minimizes noise prediction loss across all frames. This includes frames with varying independent noise levels.
→ At sampling, DFoT can condition on arbitrary history portions by selectively masking history with noise. This allows flexible history lengths and guidance.
→ History Guidance (HG) is a family of guidance methods enabled by DFoT. It enhances video generation using history conditioning. HG includes Vanilla History Guidance (HG-v), Temporal History Guidance (HG-t), and Fractional History Guidance (HG-f).
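The training and sampling ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the simple variance-preserving noise schedule, the flattened `(T, D)` frame layout, and the placeholder denoiser are all assumptions made for the example.

```python
import numpy as np

def add_noise(frames, noise_levels, rng):
    """Noise each frame with its own level.

    frames: (T, D) array, one row per frame (flattened for simplicity).
    noise_levels: (T,) array in [0, 1]; 0 = clean, 1 = pure noise.
    Assumes a simple variance-preserving schedule (illustrative choice).
    """
    eps = rng.standard_normal(frames.shape)
    k = noise_levels[:, None]
    noisy = np.sqrt(1.0 - k) * frames + np.sqrt(k) * eps
    return noisy, eps

def training_loss(denoiser, frames, rng):
    """Noise-prediction loss averaged over ALL frames, each noised independently."""
    noise_levels = rng.uniform(0.0, 1.0, size=frames.shape[0])
    noisy, eps = add_noise(frames, noise_levels, rng)
    eps_pred = denoiser(noisy, noise_levels)
    return float(np.mean((eps_pred - eps) ** 2))

def sampling_noise_levels(num_history, num_future, keep_idx, k):
    """Per-frame noise levels at one sampling step.

    History frames in keep_idx stay clean (0.0), dropped history frames are
    fully masked (1.0), and frames being generated sit at the current level k.
    """
    levels = np.ones(num_history + num_future)  # start with everything masked
    levels[list(keep_idx)] = 0.0                # kept history stays clean
    levels[num_history:] = k                    # frames being generated
    return levels

# Usage with a placeholder denoiser that always predicts zero noise.
rng = np.random.default_rng(0)
video = rng.standard_normal((6, 16))            # 6 frames, 16 features each
dummy_denoiser = lambda x, k: np.zeros_like(x)  # hypothetical stand-in for the model
loss = training_loss(dummy_denoiser, video, rng)
levels = sampling_noise_levels(4, 2, keep_idx=[2, 3], k=0.5)
```

Because history and generated frames differ only in their noise level, conditioning on a shorter or sparser history is just a different `levels` vector passed to the same model.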
-----
Key Insights 💡:
→ DFoT improves token utilization during training. It computes loss on all frames, not just a subset.
→ DFoT makes variable history lengths "in-distribution" during training. This leads to more flexible history usage during sampling.
→ Vanilla History Guidance (HG-v) significantly enhances video quality and consistency. It applies classifier-free guidance with the history as the conditioning signal.
→ Temporal History Guidance (HG-t) improves robustness to out-of-distribution history. It composes scores from different history subsequences.
→ Fractional History Guidance (HG-f) enhances motion dynamics. It conditions on history windows corrupted by varying noise levels, acting as a "low-pass filter" on history.
→ Combining HG-t and HG-f creates History Guidance across Time and Frequency (HG-tf). HG-tf further improves motion and enables compositional generalization.
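The guidance variants above can be read as different ways of combining noise predictions from the same model. The sketch below is a hedged illustration: the function names, the `uncond + w * (cond - uncond)` guidance convention, and the toy arrays are assumptions, and the paper's exact formulation and weighting may differ.

```python
import numpy as np

def vanilla_history_guidance(eps_cond, eps_uncond, w):
    """HG-v style: classifier-free guidance with history as the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def composed_history_guidance(eps_uncond, eps_conds, weights):
    """HG-t / HG-f style: add one weighted guidance term per history variant.

    eps_conds holds predictions conditioned on different history
    subsequences (HG-t) or on partially noised history (HG-f).
    """
    uncond = np.asarray(eps_uncond, dtype=float)
    guided = uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        guided = guided + w * (np.asarray(eps_c, dtype=float) - uncond)
    return guided

def history_levels(num_history, keep_idx, k_hist=0.0):
    """Noise levels for the history frames of one conditioning variant.

    Frames in keep_idx get level k_hist (0.0 = clean, a fractional value
    gives a blurred, "low-pass" history); all others are fully masked (1.0).
    """
    levels = np.ones(num_history)
    levels[list(keep_idx)] = k_hist
    return levels

# Toy usage: two conditional predictions composed with different weights.
eps_uncond = np.zeros(3)
eps_a = np.array([1.0, 0.0, 0.0])  # e.g. conditioned on recent frames
eps_b = np.array([0.0, 2.0, 0.0])  # e.g. conditioned on partially noised history
guided = composed_history_guidance(eps_uncond, [eps_a, eps_b], [1.5, 0.5])
```

With a single conditional term, the composed form reduces to vanilla history guidance, which is why the variants can share one sampling loop.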
-----
Results 📊:
→ DFoT achieves state-of-the-art performance on par with industry models. This is achieved with significantly less compute.
→ DFoT outperforms generic diffusion baselines under the same architecture. For example, DFoT (scratch) achieves an FVD of 4.3.
→ Vanilla History Guidance improves frame quality and consistency with increasing guidance scale.
→ Fractional History Guidance further lowers FVD to 170.4, surpassing DFoT without guidance (FVD 208.0) and other baselines.
→ DFoT with History Guidance enables stable generation of extremely long videos, exceeding 800 frames.