Distilled Decoding: The recipe for making autoregressive models run at warp speed
This paper enables one-step image generation from autoregressive models via flow matching and distillation, dramatically reducing generation time while largely preserving quality.
-----
https://arxiv.org/abs/2412.17153
Original Problem 🤔:
Autoregressive models generate high-quality images but are extremely slow due to token-by-token generation. For example, LlamaGen needs 256 steps (~5 seconds) for one 256×256 image.
-----
Solution in this Paper 🔧:
→ The paper introduces Distilled Decoding (DD), which maps noise tokens to generated tokens using flow matching
→ DD creates a deterministic mapping between Gaussian noise and the output distribution of pre-trained autoregressive models
→ The method trains a network to distill this mapping, enabling single-step generation (see the sketch after this list)
→ DD doesn't need the original training data, making it more practical for deployment
→ The solution allows a flexible quality-speed trade-off by also supporting multi-step generation
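To make the distillation idea concrete, here is a minimal toy sketch (my own illustration, not the authors' code): a frozen teacher defines a deterministic mapping from Gaussian noise tokens to output tokens, and a student is trained to reproduce that mapping in a single forward pass, using only sampled noise rather than the original training data. All module names, shapes, and the regression loss are illustrative assumptions.

```python
# Toy sketch of distilling a deterministic noise -> tokens mapping (assumptions throughout).
import torch
import torch.nn as nn

SEQ_LEN, DIM = 16, 64  # toy sequence length and token-embedding size

class TeacherAR(nn.Module):
    """Stand-in for a frozen, pre-trained AR model rolled out token by token."""
    def __init__(self):
        super().__init__()
        self.step = nn.GRUCell(DIM, DIM)

    @torch.no_grad()
    def rollout(self, noise):              # noise: (B, SEQ_LEN, DIM)
        h = torch.zeros(noise.size(0), DIM)
        outs = []
        for t in range(SEQ_LEN):           # slow token-by-token generation
            h = self.step(noise[:, t], h)  # each noise token deterministically yields the next output
            outs.append(h)
        return torch.stack(outs, dim=1)    # deterministic noise -> token map

teacher = TeacherAR()
student = nn.Sequential(                   # one-shot generator to be distilled
    nn.Flatten(),
    nn.Linear(SEQ_LEN * DIM, SEQ_LEN * DIM),
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):                       # distillation loop: no real images needed
    noise = torch.randn(8, SEQ_LEN, DIM)   # sample Gaussian noise tokens
    target = teacher.rollout(noise)        # teacher's deterministic output
    pred = student(noise).view(8, SEQ_LEN, DIM)
    loss = (pred - target).pow(2).mean()   # regress onto the teacher's mapping
    opt.zero_grad()
    loss.backward()
    opt.step()
```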
-----
Key Insights 🎯:
→ Existing parallel token generation methods fundamentally cannot match the original model's output distribution
→ Flow matching enables deterministic trajectories from noise to the final output (see the sketch after this list)
→ Combining autoregressive generation with flow matching allows flexible quality-speed trade-offs
→ Pre-trained models can be efficiently distilled without original training data
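A hedged illustration of the deterministic-trajectory insight (a toy example, not taken from the paper): with the standard linear flow-matching path x_t = (1 − t)·x0 + t·x1, the conditional velocity field is constant along the path, so Euler integration from a given noise sample lands on the same endpoint whether you take 256 steps or 1. In practice the velocity field is a learned, approximate network, which is where the trade-off between few steps (fast) and many steps (higher fidelity) comes from. All names below are hypothetical.

```python
# Toy demo: a deterministic flow-matching trajectory from noise to output.
import torch

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (output)."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy target embedding and the exact conditional velocity field that points at it.
x1 = torch.tensor([2.0, -1.0])
velocity = lambda x, t: (x1 - x) / (1.0 - t)  # equals (x1 - x0) along the linear path

x0 = torch.randn(2)                           # Gaussian noise start
print(euler_sample(velocity, x0, 256))        # many small steps
print(euler_sample(velocity, x0, 1))          # one big step: same endpoint here
```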
-----
Results 📊:
→ For VAR: 6.3× speedup (10 steps to 1), with FID rising from 4.19 to 9.96
→ For LlamaGen: 217.8× speedup (256 steps to 1), with FID rising from 4.11 to 11.35
→ Text-to-image: 92.9× speedup (256 steps to 2), with a minimal FID increase from 25.70 to 28.95