Parallel token generation at up to 9.5x speed by understanding which image tokens play nice together.
A novel method to speed up autoregressive image generation by identifying and grouping weakly dependent tokens for parallel generation while maintaining sequential generation for strongly dependent ones.
-----
https://arxiv.org/abs/2412.15119
🤖 Original Problem:
→ Autoregressive models excel at visual generation but are painfully slow due to sequential token-by-token prediction.
→ Simply generating all tokens in parallel leads to inconsistent and distorted outputs.
-----
🔍 Key Insights:
→ Token dependencies naturally correlate with spatial distance - adjacent tokens depend strongly on each other, while distant tokens are only weakly correlated
→ Initial tokens in each region are crucial for global structure and must be generated sequentially
→ Tokens from distant spatial regions can be generated in parallel without quality loss
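The grouping implied by these insights can be sketched in a few lines: split the token grid into regions, then collect the tokens sitting at the same offset inside each region into one group, so every group's members are spatially far apart. This is an illustrative sketch under assumed parameters (a square grid, a square region layout), not the paper's exact configuration.

```python
def parallel_groups(grid, regions_per_side):
    """Group token positions for parallel decoding.

    Split a grid x grid token map into regions_per_side^2 regions and
    collect the tokens at the same within-region offset into one group.
    Each group has one token per region, so its members are spatially
    distant and (per the paper's insight) weakly dependent - candidates
    for parallel prediction. Sketch only; the paper's exact region
    layout and ordering may differ.
    """
    rh = rw = grid // regions_per_side   # region height/width in tokens
    groups = []
    for dy in range(rh):                 # offset inside a region
        for dx in range(rw):
            group = [(ry * rh + dy, rx * rw + dx)
                     for ry in range(regions_per_side)
                     for rx in range(regions_per_side)]
            groups.append(group)
    return groups

groups = parallel_groups(grid=4, regions_per_side=2)
# 4 groups of 4 token coordinates; e.g. the first group pulls one token
# from each of the 4 regions: (0,0), (0,2), (2,0), (2,2)
```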
-----
⚡ Solution in this Paper:
→ First divide image into regions and generate initial tokens sequentially to establish global context.
→ Then identify and group weakly dependent tokens from distant regions for parallel generation.
→ Use bi-directional attention within parallel token groups while maintaining causal attention between groups.
→ Implement using standard transformer architecture with learnable transition tokens and 2D position embeddings.
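The attention pattern in the third step can be expressed as a single boolean mask: standard causal attention everywhere, with bi-directional blocks carved out for each parallel group. The sketch below assumes a simple layout (a run of sequentially generated initial tokens followed by fixed-size parallel groups); the group sizes and ordering are illustrative, not the paper's exact schedule.

```python
import numpy as np

def par_attention_mask(n_initial, n_groups, group_size):
    """Attention mask for parallelized autoregressive decoding.

    Tokens 0..n_initial-1 are generated one by one under ordinary causal
    attention. The remaining tokens are emitted in groups of group_size:
    attention is bi-directional within a group but stays causal between
    groups. mask[i, j] = True means token i may attend to token j.
    Illustrative sketch; layout parameters are assumptions.
    """
    n = n_initial + n_groups * group_size
    mask = np.tril(np.ones((n, n), dtype=bool))   # fully causal baseline
    for g in range(n_groups):
        start = n_initial + g * group_size
        mask[start:start + group_size, start:start + group_size] = True
    return mask

mask = par_attention_mask(n_initial=2, n_groups=2, group_size=3)
# Token 2 (first of group 0) can attend to tokens 3 and 4 in its own
# group, but not to token 5, which belongs to the next group.
```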
-----
📊 Results:
→ 3.6x speedup with comparable quality (FID 2.29 vs 2.18)
→ Up to 9.5x speedup with minimal quality drop (0.7 FID increase)
→ Works for both image and video generation
→ Reduces generation steps from 576 to 147 for images
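The 576-to-147 step count is consistent with a simple accounting, assuming 4 regions whose first tokens are generated sequentially (an assumption for illustration, not a figure stated in this post):

```python
# Back-of-the-envelope step count for a 24x24 token grid (576 tokens),
# assuming 4 regions. First token of each region is sequential; the rest
# are generated 4 at a time, one token per region per step.
tokens = 24 * 24                        # 576
regions = 4
sequential_steps = regions              # 4 initial tokens, one by one
parallel_steps = tokens // regions - 1  # 143 steps of 4 tokens each
total = sequential_steps + parallel_steps
print(total)                            # 147
print(round(tokens / total, 1))         # ~3.9x fewer forward passes
```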