Sparse autoencoders (SAEs) crack open SDXL Turbo's black box, revealing how individual transformer blocks control image generation
Discover how SDXL Turbo's neural blocks collaborate to turn text into stunning images
📚 https://arxiv.org/abs/2410.22366
🤖 Original Problem:
Text-to-image models like SDXL Turbo are black boxes: we don't understand how they work internally. Sparse autoencoders (SAEs) have helped interpret LLMs by decomposing their internal representations into human-interpretable features, but no comparable analysis tools exist for image generation models.
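For readers new to SAEs, here is a minimal sketch of the idea the paper ports from LLM interpretability: an overcomplete ReLU encoder, a linear decoder, and an L1 penalty that keeps most features silent. The dimensions and `l1_coeff` are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden >> d_model (e.g. 4-16x wider).
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages few active features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```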
-----
🔍 Solution in this Paper:
→ Applied SAEs to 4 key transformer blocks in SDXL Turbo's denoising U-Net
→ Created SDLens, a library for capturing and manipulating intermediate feature maps during image generation (a hook-based sketch of the capture step follows this list)
→ Trained SAEs on feature maps collected from 1.5M LAION-COCO prompts, decomposing them into interpretable components
→ Developed visualization techniques: spatial activation heatmaps, feature modulation, and empty-context activation
→ Built an automated feature annotation pipeline using GPT-4V
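SDLens's own API is not shown in this post, but the capture step it performs can be approximated with plain PyTorch forward hooks on the diffusers SDXL Turbo pipeline. The module path for "down.2.1" below is an assumption about how the block names map onto the diffusers U-Net; adjust it to the block you want to record.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

captured = []

def save_output(module, args, output):
    # Transformer blocks may return a tuple; keep the hidden-state tensor.
    out = output[0] if isinstance(output, tuple) else output
    captured.append(out.detach().cpu())

# Assumption: "down.2.1" corresponds to the second transformer of the third
# down block; the paper's naming may map onto the modules differently.
handle = pipe.unet.down_blocks[2].attentions[1].register_forward_hook(save_output)

image = pipe(
    prompt="a cinematic photo of a fox in a forest",
    num_inference_steps=1,   # SDXL Turbo generates in a single step
    guidance_scale=0.0,
).images[0]
handle.remove()
```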
-----
💡 Key Insights:
→ Different transformer blocks have specialized roles:
- down.2.1: Controls overall image composition
- up.0.0: Adds fine-grained local details
- up.0.1: Manages color, illumination and style
- mid.0: Handles spatial relationships
→ The learned features are highly interpretable and causally influence generation (see the intervention sketch after this list)
→ Features show high specificity (0.71 for down.2.1) versus a random baseline of 0.50
→ up.0.1 features have the strongest texture effects (0.20 vs the 0.18 baseline)
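The causal claim can be made concrete with a hedged sketch of feature modulation: encode a captured feature map with the trained SAE, rescale a single feature, decode, and write the result back into the U-Net. The names `sae`, `feature_idx`, and `scale` are illustrative; the paper's exact modulation procedure may differ in detail.

```python
import torch

def modulate_feature(feature_map: torch.Tensor, sae, feature_idx: int, scale: float):
    # feature_map: (batch, channels, h, w) activations captured at one block.
    b, c, h, w = feature_map.shape
    # Treat each spatial position as one activation vector for the SAE.
    x = feature_map.permute(0, 2, 3, 1).reshape(-1, c)
    f = torch.relu(sae.encoder(x))
    f[:, feature_idx] *= scale        # amplify (>1) or suppress (<1) one feature
    x_hat = sae.decoder(f)
    return x_hat.reshape(b, h, w, c).permute(0, 3, 1, 2)
```

Under this scheme, amplifying an up.0.1 feature annotated with, say, "warm lighting" should shift color and illumination while leaving the composition, which is set earlier at down.2.1, intact.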
-----
📊 Results:
→ Feature specificity scores significantly higher than random baseline across all blocks
→ Causality analysis shows strong feature effects, with interventions reaching 0.19 CLIP similarity against a 0.21 ground-truth reference (the similarity computation is sketched below)
→ Local intervention tests confirm specialized roles of different blocks
→ Color sensitivity analysis validates style/color specialization of up.0.1 block
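The CLIP-similarity numbers above compare generated images against feature annotations. The core computation looks roughly like this; the paper's full scoring protocol is not reproduced here, and the checkpoint choice is an assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, annotation: str) -> float:
    # Embed the image and the feature's text annotation in CLIP space.
    inputs = processor(text=[annotation], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())   # cosine similarity in [-1, 1]
```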