"Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching"

A podcast on this paper was generated with Google's Illuminate.

Distilled Decoding: The recipe for making autoregressive models run at warp speed

This paper enables one-step image generation from autoregressive models, dramatically reducing generation time while maintaining quality through flow matching and distillation techniques.

-----

https://arxiv.org/abs/2412.17153

Original Problem 🤔:

Autoregressive models generate high-quality images but are extremely slow due to token-by-token generation. For example, LlamaGen needs 256 steps (~5 seconds) for one 256×256 image.
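
To make the latency source concrete, here is a minimal sketch of standard token-by-token sampling; `ar_model` and `vq_decoder` are hypothetical stand-ins rather than LlamaGen's actual interface, and the point is simply the 256 sequential forward passes.

```python
import torch

@torch.no_grad()
def generate_ar_image(ar_model, vq_decoder, num_tokens=256):
    """Standard autoregressive sampling: one forward pass per token."""
    tokens = torch.empty(0, dtype=torch.long)
    for _ in range(num_tokens):                     # 256 sequential steps for a 256x256 image
        logits = ar_model(tokens)                   # conditions on all tokens generated so far
        probs = torch.softmax(logits[-1], dim=-1)   # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token])
    return vq_decoder(tokens)                       # map discrete tokens back to pixels
```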

-----

Solution in this Paper 🔧:

→ The paper introduces Distilled Decoding (DD), which maps noise tokens to generated tokens using flow matching

→ DD creates a deterministic mapping between Gaussian noise and the output distribution of pre-trained autoregressive models

→ The method trains a network to distill this mapping, enabling single-step generation (see the sketch after this list)

→ DD doesn't need the original training data, making it more practical for deployment

→ The solution allows flexible trade-off between quality and speed by supporting multi-step generation
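
As a rough illustration of the one-step idea, the sketch below replaces the sequential loop with a single forward pass of a distilled network over a whole sequence of Gaussian noise tokens; `dd_model`, `vq_decoder`, and the tensor shapes are assumptions for illustration, not the paper's actual code.

```python
import torch

@torch.no_grad()
def generate_dd_image(dd_model, vq_decoder, num_tokens=256, token_dim=8):
    """One-step Distilled Decoding sketch: noise tokens in, generated tokens out."""
    noise = torch.randn(num_tokens, token_dim)   # one Gaussian noise vector per token position
    tokens = dd_model(noise)                     # single forward pass maps noise to output tokens
    return vq_decoder(tokens)                    # decode the token sequence into an image
```

Running more than one such pass is what exposes the flexible quality-speed trade-off mentioned above; the exact multi-step schedule is the paper's design choice.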

-----

Key Insights 🎯:

→ Existing parallel token generation methods fundamentally cannot match the original model's distribution

→ Flow matching enables deterministic trajectories from noise to the final output (see the sketch after this list)

→ The synergy between autoregressive generation and flow matching enables flexible quality-speed trade-offs

→ Pre-trained models can be efficiently distilled without original training data
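
The deterministic-trajectory insight can be written down in a few lines; the sketch below assumes the common linear-interpolation (rectified-flow) form of flow matching for a single token embedding, which may differ from the paper's exact per-token construction.

```python
import torch

def flow_path(x0: torch.Tensor, x1: torch.Tensor, t: float) -> torch.Tensor:
    """Point at time t on the deterministic straight-line path from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1

def flow_velocity(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """Target velocity field along that path (constant: x1 - x0)."""
    return x1 - x0

# Example: one Gaussian noise token flowing toward one (stand-in) data token embedding.
x0 = torch.randn(8)
x1 = torch.randn(8)
x_half = flow_path(x0, x1, 0.5)   # midpoint of the deterministic trajectory
```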

-----

Results 📊:

→ For VAR: 6.3× speedup (10 steps to 1) with FID increase from 4.19 to 9.96

→ For LlamaGen: 217.8× speedup (256 steps to 1) with FID from 4.11 to 11.35

→ Text-to-image: 92.9× speedup (256 to 2 steps) with a minimal FID increase, from 25.70 to 28.95
