Meet NOVA: The video generator that thinks in frames, not pixels.
NOVA generates videos frame by frame without vector quantization, making it faster and more efficient than existing autoregressive models.
It introduces a non-quantized autoregressive model that reformulates video generation as temporal frame-by-frame prediction plus spatial set-by-set prediction within each frame, achieving high fidelity at lower training cost.
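The two-level decomposition can be sketched at a high level as follows. This is a minimal illustration, not the paper's implementation: `predict_token_set` is a hypothetical stand-in for the model's masked set prediction, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_token_set(frame_tokens, set_mask, context):
    """Hypothetical stand-in: fill in continuous token embeddings for the
    masked positions, conditioned on the rest of the frame and prior frames."""
    frame_tokens[set_mask] = rng.standard_normal(
        (set_mask.sum(), frame_tokens.shape[-1])
    )
    return frame_tokens

def generate_video(num_frames=4, tokens_per_frame=16, dim=8, sets_per_frame=4):
    frames = []
    for t in range(num_frames):                  # temporal: frame-by-frame (causal)
        frame = np.zeros((tokens_per_frame, dim))
        order = rng.permutation(tokens_per_frame)
        for s in np.array_split(order, sets_per_frame):  # spatial: set-by-set
            set_mask = np.zeros(tokens_per_frame, dtype=bool)
            set_mask[s] = True
            frame = predict_token_set(frame, set_mask, context=frames)
        frames.append(frame)
    return np.stack(frames)

video = generate_video()
```

The outer loop is strictly causal over frames, while the inner loop unmasks random token sets within a frame, which is what allows bidirectional spatial modeling.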
-----
https://arxiv.org/abs/2412.14169
🤔 Original Problem:
Existing autoregressive video models rely on vector-quantized tokenizers and face a fidelity/compression trade-off: higher quality demands more discrete tokens, which sharply increases cost. Diffusion models, meanwhile, lack the flexibility for variable video lengths and the in-context abilities of autoregressive approaches.
-----
🔧 Solution in this Paper:
→ NOVA splits video generation into temporal frame-by-frame prediction and spatial set-by-set prediction within frames
→ Uses block-wise causal masking attention for temporal modeling, allowing frames to only attend to text prompts and preceding frames
→ Implements a Scaling and Shift Layer to handle cross-frame motion changes by learning relative distribution variations
→ Employs a diffusion procedure for per-token prediction in continuous-valued space
→ Integrates a pre-trained language model for text encoding and OpenCV for optical-flow computation
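The block-wise causal masking from the second bullet can be sketched as a boolean attention mask. This is a sketch under the assumption that attention is allowed at the frame-block granularity: every token sees the text prompt, and a frame's tokens see all tokens of preceding frames plus their own frame.

```python
import numpy as np

def blockwise_causal_mask(num_text, num_frames, tokens_per_frame):
    """Build a boolean attention mask (True = attention allowed)."""
    n = num_text + num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_text] = True                 # everything attends to the text prompt
    for f in range(num_frames):
        start = num_text + f * tokens_per_frame
        end = start + tokens_per_frame
        # frame f attends to text, all earlier frames, and itself (bidirectionally)
        mask[start:end, num_text:end] = True
    return mask

m = blockwise_causal_mask(num_text=2, num_frames=3, tokens_per_frame=4)
```

Within a frame block the mask is fully bidirectional, which matches the set-by-set spatial prediction; causality is enforced only across frame boundaries.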
-----
💡 Key Insights:
→ Non-quantized approach enables high fidelity while maintaining compact compression
→ Frame-by-frame prediction preserves temporal causality while set-by-set prediction allows bidirectional modeling
→ Unified framework handles multiple visual generation tasks through in-context abilities
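The non-quantized approach above means each token is predicted in continuous space with a small diffusion objective instead of a codebook classification. A minimal DDPM-style per-token training loss, assuming a cosine noise schedule and a hypothetical `denoiser` network, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_token_loss(z, cond, denoiser, num_steps=1000):
    """Per-token denoising loss in continuous space (no discrete codebook).
    z: clean token embeddings (num_tokens, dim); cond: conditioning features."""
    t = rng.integers(1, num_steps)                          # random diffusion step
    alpha_bar = np.cos(0.5 * np.pi * t / num_steps) ** 2    # cosine schedule
    eps = rng.standard_normal(z.shape)
    z_t = np.sqrt(alpha_bar) * z + np.sqrt(1 - alpha_bar) * eps  # noised tokens
    eps_hat = denoiser(z_t, t, cond)                        # predict the added noise
    return np.mean((eps_hat - eps) ** 2)                    # simple MSE objective

# toy denoiser that predicts zero noise, just to exercise the loss
loss = diffusion_token_loss(
    z=rng.standard_normal((16, 8)),
    cond=np.zeros(8),
    denoiser=lambda z_t, t, c: np.zeros_like(z_t),
)
```

Because the targets are continuous embeddings rather than codebook indices, fidelity is not capped by a quantizer's vocabulary size.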
-----
📊 Results:
→ Achieves 2.75 FPS on single A100-40G GPU with batch size 24
→ Reaches VBench score of 80.1 for text-to-video generation
→ Attains GenEval score of 0.75 for text-to-image generation
→ Training requires only 342 GPU days on A100-40G