Meet NOVA: The video generator that thinks in frames, not pixels.
NOVA generates videos frame by frame without vector quantization, making it faster and more efficient than existing autoregressive models.
It introduces a non-quantized autoregressive model that reformulates video generation as temporal frame-by-frame prediction plus spatial set-by-set prediction within each frame, achieving high fidelity at lower training cost.
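The two-level decomposition can be sketched at a high level as follows. This is a minimal illustration, not the paper's implementation: `predict_token_set` is a hypothetical stand-in for the model's masked set prediction, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_token_set(frame_tokens, set_mask, context):
    """Hypothetical stand-in: fill in continuous token embeddings for the
    masked positions, conditioned on the rest of the frame and prior frames."""
    frame_tokens[set_mask] = rng.standard_normal(
        (set_mask.sum(), frame_tokens.shape[-1])
    )
    return frame_tokens

def generate_video(num_frames=4, tokens_per_frame=16, dim=8, sets_per_frame=4):
    frames = []
    for t in range(num_frames):                  # temporal: frame-by-frame (causal)
        frame = np.zeros((tokens_per_frame, dim))
        order = rng.permutation(tokens_per_frame)
        for s in np.array_split(order, sets_per_frame):  # spatial: set-by-set
            set_mask = np.zeros(tokens_per_frame, dtype=bool)
            set_mask[s] = True
            frame = predict_token_set(frame, set_mask, context=frames)
        frames.append(frame)
    return np.stack(frames)

video = generate_video()
```

The outer loop is strictly causal over frames, while the inner loop unmasks random token sets within a frame, which is what allows bidirectional spatial modeling.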
-----
https://arxiv.org/abs/2412.14169
🤔 Original Problem:
Existing autoregressive video models rely on vector-quantized tokenizers and face a fidelity/compression trade-off: higher quality demands more discrete tokens, which sharply increases cost. Diffusion models, meanwhile, lack the flexibility for variable video lengths and the in-context abilities of autoregressive approaches.
-----
🔧 Solution in this Paper:
→ NOVA splits video generation into temporal frame-by-frame prediction and spatial set-by-set prediction within frames
→ Uses block-wise causal masking attention for temporal modeling, allowing frames to only attend to text prompts and preceding frames
→ Implements a Scaling and Shift Layer to handle cross-frame motion changes by learning relative distribution variations
→ Employs a diffusion procedure for per-token prediction in continuous-valued space
→ Integrates a pre-trained language model for text encoding and OpenCV for optical-flow computation
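The block-wise causal masking from the second bullet can be sketched as a boolean attention mask. This is a sketch under the assumption that attention is allowed at the frame-block granularity: every token sees the text prompt, and a frame's tokens see all tokens of preceding frames plus their own frame.

```python
import numpy as np

def blockwise_causal_mask(num_text, num_frames, tokens_per_frame):
    """Build a boolean attention mask (True = attention allowed)."""
    n = num_text + num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_text] = True                 # everything attends to the text prompt
    for f in range(num_frames):
        start = num_text + f * tokens_per_frame
        end = start + tokens_per_frame
        # frame f attends to text, all earlier frames, and itself (bidirectionally)
        mask[start:end, num_text:end] = True
    return mask

m = blockwise_causal_mask(num_text=2, num_frames=3, tokens_per_frame=4)
```

Within a frame block the mask is fully bidirectional, which matches the set-by-set spatial prediction; causality is enforced only across frame boundaries.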
-----
💡 Key Insights:
→ Non-quantized approach enables high fidelity while maintaining compact compression
→ Frame-by-frame prediction preserves temporal causality while set-by-set prediction allows bidirectional modeling
→ Unified framework handles multiple visual generation tasks through in-context abilities
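The non-quantized approach above means each token is predicted in continuous space with a small diffusion objective instead of a codebook classification. A minimal DDPM-style per-token training loss, assuming a cosine noise schedule and a hypothetical `denoiser` network, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_token_loss(z, cond, denoiser, num_steps=1000):
    """Per-token denoising loss in continuous space (no discrete codebook).
    z: clean token embeddings (num_tokens, dim); cond: conditioning features."""
    t = rng.integers(1, num_steps)                          # random diffusion step
    alpha_bar = np.cos(0.5 * np.pi * t / num_steps) ** 2    # cosine schedule
    eps = rng.standard_normal(z.shape)
    z_t = np.sqrt(alpha_bar) * z + np.sqrt(1 - alpha_bar) * eps  # noised tokens
    eps_hat = denoiser(z_t, t, cond)                        # predict the added noise
    return np.mean((eps_hat - eps) ** 2)                    # simple MSE objective

# toy denoiser that predicts zero noise, just to exercise the loss
loss = diffusion_token_loss(
    z=rng.standard_normal((16, 8)),
    cond=np.zeros(8),
    denoiser=lambda z_t, t, c: np.zeros_like(z_t),
)
```

Because the targets are continuous embeddings rather than codebook indices, fidelity is not capped by a quantizer's vocabulary size.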
-----
📊 Results:
→ Achieves 2.75 FPS on single A100-40G GPU with batch size 24
→ Reaches VBench score of 80.1 for text-to-video generation
→ Attains GenEval score of 0.75 for text-to-image generation
→ Training requires only 342 GPU days on A100-40G