Parallel token generation at up to 9.5x speed by understanding which image tokens play nice together.
A novel method to speed up autoregressive image generation by identifying and grouping weakly dependent tokens for parallel generation while maintaining sequential generation for strongly dependent ones.
-----
https://arxiv.org/abs/2412.15119
🤖 Original Problem:
→ Autoregressive models excel at visual generation but are painfully slow due to sequential token-by-token prediction.
→ Simply generating all tokens in parallel leads to inconsistent and distorted outputs.
-----
🔍 Key Insights:
→ Token dependencies naturally correlate with spatial distance - adjacent tokens depend strongly on each other, while distant tokens are only weakly correlated
→ Initial tokens in each region are crucial for global structure and must be generated sequentially
→ Tokens from distant spatial regions can be generated in parallel without quality loss
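The grouping implied by these insights can be sketched in a few lines: split the token grid into regions, then collect the tokens sitting at the same offset inside each region into one group, so every group's members are spatially far apart. This is an illustrative sketch under assumed parameters (a square grid, a square region layout), not the paper's exact configuration.

```python
def parallel_groups(grid, regions_per_side):
    """Group token positions for parallel decoding.

    Split a grid x grid token map into regions_per_side^2 regions and
    collect the tokens at the same within-region offset into one group.
    Each group has one token per region, so its members are spatially
    distant and (per the paper's insight) weakly dependent - candidates
    for parallel prediction. Sketch only; the paper's exact region
    layout and ordering may differ.
    """
    rh = rw = grid // regions_per_side   # region height/width in tokens
    groups = []
    for dy in range(rh):                 # offset inside a region
        for dx in range(rw):
            group = [(ry * rh + dy, rx * rw + dx)
                     for ry in range(regions_per_side)
                     for rx in range(regions_per_side)]
            groups.append(group)
    return groups

groups = parallel_groups(grid=4, regions_per_side=2)
# 4 groups of 4 token coordinates; e.g. the first group pulls one token
# from each of the 4 regions: (0,0), (0,2), (2,0), (2,2)
```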
-----
⚡ Solution in this Paper:
→ First divide image into regions and generate initial tokens sequentially to establish global context.
→ Then identify and group weakly dependent tokens from distant regions for parallel generation.
→ Use bi-directional attention within parallel token groups while maintaining causal attention between groups.
→ Implement using standard transformer architecture with learnable transition tokens and 2D position embeddings.
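The attention pattern in the third step can be expressed as a single boolean mask: standard causal attention everywhere, with bi-directional blocks carved out for each parallel group. The sketch below assumes a simple layout (a run of sequentially generated initial tokens followed by fixed-size parallel groups); the group sizes and ordering are illustrative, not the paper's exact schedule.

```python
import numpy as np

def par_attention_mask(n_initial, n_groups, group_size):
    """Attention mask for parallelized autoregressive decoding.

    Tokens 0..n_initial-1 are generated one by one under ordinary causal
    attention. The remaining tokens are emitted in groups of group_size:
    attention is bi-directional within a group but stays causal between
    groups. mask[i, j] = True means token i may attend to token j.
    Illustrative sketch; layout parameters are assumptions.
    """
    n = n_initial + n_groups * group_size
    mask = np.tril(np.ones((n, n), dtype=bool))   # fully causal baseline
    for g in range(n_groups):
        start = n_initial + g * group_size
        mask[start:start + group_size, start:start + group_size] = True
    return mask

mask = par_attention_mask(n_initial=2, n_groups=2, group_size=3)
# Token 2 (first of group 0) can attend to tokens 3 and 4 in its own
# group, but not to token 5, which belongs to the next group.
```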
-----
📊 Results:
→ 3.6x speedup with comparable quality (FID 2.29 vs 2.18)
→ Up to 9.5x speedup with minimal quality drop (0.7 FID increase)
→ Works for both image and video generation
→ Reduces generation steps from 576 to 147 for images
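The 576-to-147 step count is consistent with a simple accounting, assuming 4 regions whose first tokens are generated sequentially (an assumption for illustration, not a figure stated in this post):

```python
# Back-of-the-envelope step count for a 24x24 token grid (576 tokens),
# assuming 4 regions. First token of each region is sequential; the rest
# are generated 4 at a time, one token per region per step.
tokens = 24 * 24                        # 576
regions = 4
sequential_steps = regions              # 4 initial tokens, one by one
parallel_steps = tokens // regions - 1  # 143 steps of 4 tokens each
total = sequential_steps + parallel_steps
print(total)                            # 147
print(round(tokens / total, 1))         # ~3.9x fewer forward passes
```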