"Parallelized Autoregressive Visual Generation"

The podcast below was generated from this paper with Google's Illuminate.

Parallel token generation at up to 9x speed by understanding which tokens play nice together.

A novel method to speed up autoregressive image generation by identifying and grouping weakly dependent tokens for parallel generation while maintaining sequential generation for strongly dependent ones.

-----

https://arxiv.org/abs/2412.15119

🤖 Original Problem:

→ Autoregressive models excel at visual generation but are painfully slow due to sequential token-by-token prediction.

→ Simply generating all tokens in parallel leads to inconsistent and distorted outputs.

-----

🔍 Key Insights:

→ Token dependencies correlate with spatial distance: adjacent tokens are strongly dependent, while distant tokens are only weakly correlated

→ Initial tokens in each region are crucial for global structure and must be generated sequentially

→ Tokens from distant spatial regions can be generated in parallel without quality loss
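The insight above suggests a simple grouping rule: pick tokens at the same local offset in different regions, so every parallel group spans spatially distant positions. A minimal sketch, assuming a 4x4 token grid split into four 2x2 regions (illustrative sizes, not the paper's exact configuration):

```python
# Sketch: group tokens from distant spatial regions for parallel decoding.
# Grid and region sizes are illustrative assumptions.

def parallel_groups(grid=4, region=2):
    """Return groups of (row, col) token positions; tokens in one group
    come from different regions, so they are only weakly dependent."""
    n = grid // region  # regions per side
    groups = []
    for r in range(region):          # local offset inside a region
        for c in range(region):
            # one token at the same local offset from each region
            group = [(br * region + r, bc * region + c)
                     for br in range(n) for bc in range(n)]
            groups.append(group)
    return groups

groups = parallel_groups()
# first group: the (0, 0) offset in each of the four regions
print(groups[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Each group's tokens are separated by at least one region width, so generating them in one step exploits exactly the weak long-range dependencies the paper identifies.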

-----

⚡ Solution in this Paper:

→ First divide image into regions and generate initial tokens sequentially to establish global context.

→ Then identify and group weakly dependent tokens from distant regions for parallel generation.

→ Use bi-directional attention within parallel token groups while maintaining causal attention between groups.

→ Implement using standard transformer architecture with learnable transition tokens and 2D position embeddings.
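The mixed attention pattern in the steps above can be sketched as a mask: tokens attend bi-directionally within their own parallel group and causally to all earlier groups. Group sizes here are illustrative, not taken from the paper:

```python
# Sketch: attention mask with bi-directional attention inside each
# parallel group and causal attention between groups (1 = may attend).
import numpy as np

def group_causal_mask(group_sizes):
    """Token i attends to every token in its own group (bi-directional)
    and to every token in earlier groups (causal)."""
    total = sum(group_sizes)
    mask = np.zeros((total, total), dtype=int)
    start = 0
    for size in group_sizes:
        end = start + size
        mask[start:end, :end] = 1  # own group + all earlier groups
        start = end
    return mask

# e.g. two sequential initial tokens, then two parallel groups of four
print(group_causal_mask([1, 1, 4, 4]))
```

Sequential tokens are just groups of size one, which recovers the standard causal mask for the initial context-setting phase.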

-----

📊 Results:

→ 3.6x speedup with comparable quality (FID 2.29 vs 2.18)

→ Up to 9.5x speedup with minimal quality drop (0.7 FID increase)

→ Works for both image and video generation

→ Reduces generation steps from 576 to 147 for images
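One way the step count may break down, as hedged arithmetic: assuming a 24x24 grid (576 tokens), 4 regions whose initial tokens are generated sequentially, and the remaining tokens decoded 4-at-a-time. This breakdown is consistent with the reported numbers but is an assumption, not verified against the paper:

```python
# Hedged arithmetic: one breakdown consistent with 576 -> 147 steps.
# The 4-region, 4-way-parallel split is an illustrative assumption.
total_tokens = 24 * 24        # 576 tokens for a 24x24 grid
regions = 4
sequential_steps = regions    # one initial token per region
parallel_steps = (total_tokens - regions) // regions
print(sequential_steps + parallel_steps)  # 147
```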
