"SliderSpace: Decomposing the Visual Capabilities of Diffusion Models"
https://arxiv.org/abs/2502.01639
Text-to-image diffusion models generate diverse images from a single prompt, but users lack intuitive control over this variation. This paper addresses the challenge of discovering and exposing the hidden visual capabilities within diffusion models.
This paper introduces SliderSpace, a framework for unsupervised discovery of controllable directions within diffusion models. It decomposes the model's visual capabilities into semantically meaningful and orthogonal directions, enabling users to explore diverse image variations.
-----
📌 SliderSpace leverages CLIP's semantic space to disentangle the variation in a diffusion model's outputs. PCA on CLIP embeddings of generated images reveals orthogonal directions, which then guide LoRA training toward targeted, interpretable controls.
📌 The unsupervised nature of SliderSpace is a key strength. By discovering directions directly from model outputs, it avoids reliance on pre-defined attributes, unlocking truly emergent and diverse variations.
📌 By training low-rank adaptors, SliderSpace achieves efficient control without full fine-tuning. This allows for compositional manipulation and rapid exploration of diffusion models' generative capacity with minimal overhead.
----------
Methods Explored in this Paper 🔧:
→ SliderSpace discovers controllable directions in diffusion models using a three-step process.
→ First, it samples a diverse set of images for a given prompt by varying random seeds.
→ Then, it extracts CLIP embeddings of the generated images and performs Principal Component Analysis (PCA) to find the principal directions of variation in semantic space (see the discovery sketch after this list).
→ Finally, for each principal direction, it trains a Low-Rank Adaptor (LoRA) to control image generation along that direction.
→ The training objective aligns the transformation induced by each LoRA with its corresponding principal component in CLIP embedding space, keeping the sliders semantically orthogonal to one another (see the loss sketch after this list).
→ This approach enables unsupervised discovery of semantically meaningful and diverse control directions without predefined attributes.
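Taken together, the first two steps can be sketched roughly as below. This is a minimal illustration, not the authors' code: the SDXL checkpoint, the OpenCLIP ViT-B-32 encoder, and constants such as `N_SAMPLES` and `N_DIRECTIONS` are assumptions chosen for the sketch.

```python
# Sketch of SliderSpace's discovery step: sample seed variations of one prompt,
# embed them with CLIP, and run PCA to get orthogonal directions of variation.
import torch
from diffusers import StableDiffusionXLPipeline
import open_clip

PROMPT = "a monster"   # concept to decompose (illustrative)
N_SAMPLES = 256        # number of seed variations (illustrative)
N_DIRECTIONS = 8       # number of principal directions to keep (illustrative)

device = "cuda"
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model = clip_model.to(device).eval()

# Step 1: sample a diverse image set for one prompt by varying the random seed.
embeddings = []
for seed in range(N_SAMPLES):
    generator = torch.Generator(device).manual_seed(seed)
    image = pipe(PROMPT, generator=generator, num_inference_steps=30).images[0]

    # Step 2: embed each generated image in CLIP's semantic space.
    with torch.no_grad():
        x = clip_preprocess(image).unsqueeze(0).to(device)
        emb = clip_model.encode_image(x)
        embeddings.append(torch.nn.functional.normalize(emb, dim=-1).cpu())

E = torch.cat(embeddings)                      # (N_SAMPLES, clip_dim)
E_centered = E - E.mean(dim=0, keepdim=True)

# PCA via SVD: rows of Vt are orthogonal principal directions in CLIP space;
# each kept row becomes the target direction for one slider (LoRA) in step 3.
_, _, Vt = torch.linalg.svd(E_centered, full_matrices=False)
principal_directions = Vt[:N_DIRECTIONS]
```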
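The alignment objective in the final step can be sketched as a simple cosine loss in CLIP space. The helper `clip_embed` and the batches `images_lora` / `images_base` are hypothetical names, and the sketch omits the diffusion training loop and any additional terms used in the paper; it shows only the core idea of pushing the LoRA-induced shift toward its assigned principal direction.

```python
# Simplified sketch of the slider training objective (not the paper's exact loss).
import torch
import torch.nn.functional as F

def slider_alignment_loss(images_lora, images_base, direction_k, clip_embed):
    """Encourage the CLIP-space shift induced by one LoRA slider to align with
    its assigned principal direction `direction_k` (shape: (clip_dim,))."""
    e_lora = F.normalize(clip_embed(images_lora), dim=-1)   # adapted outputs
    e_base = F.normalize(clip_embed(images_base), dim=-1)   # unadapted outputs
    shift = F.normalize(e_lora - e_base, dim=-1)            # LoRA-induced transformation
    # Maximize cosine similarity between the shift and the target component.
    cosine = (shift * direction_k.unsqueeze(0)).sum(dim=-1)
    return (1.0 - cosine).mean()
```

Because each slider is trained against a different (mutually orthogonal) principal component, the resulting controls stay semantically disentangled and can be composed.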
-----
Key Insights 💡:
→ SliderSpace uncovers interpretable visual attributes within diffusion models, revealing the model's internal concept representation.
→ The discovered directions are semantically orthogonal and controllable, allowing users to manipulate specific image attributes independently.
→ SliderSpace enhances the diversity of generated images without compromising text-image alignment.
→ The method is architecture-agnostic and applicable to various diffusion models, including distilled and transformer-based models.
-----
Results 📊:
→ SliderSpace significantly increases inter-image diversity over baseline models, as measured by DreamSim distance (see the metric sketch after this list). For example, for the concept "Monster", diversity increased from 0.266 to 0.472.
→ SliderSpace maintains CLIP scores comparable to the baseline models, indicating preserved text alignment. For the same "Monster" concept, the CLIP score with SliderSpace remained at 28.14.
→ User studies show that SliderSpace-generated images are consistently preferred over baselines for diversity, utility, and creative potential. For diversity, SliderSpace achieved a 72.4% win rate against SDXL-DMD.
→ In art style exploration, SliderSpace achieves a lower FID score of 21.82 compared to LLM prompt baselines (38.84), indicating better alignment with manually curated artist styles.
→ SliderSpace improves the FID score of distilled models like SDXL-DMD from 15.52 to 12.12, approaching the performance of the undistilled SDXL model (11.72), demonstrating diversity enhancement and mode collapse reversal.
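The diversity numbers above are DreamSim distances. One plausible way to compute such a score, assuming the publicly released `dreamsim` package and its documented two-image interface (an assumption about the tooling, not the paper's evaluation code), is to average the pairwise distances within a batch of generations:

```python
# Mean pairwise DreamSim distance as an inter-image diversity score.
# Assumes the `dreamsim` pip package; call signatures are based on its README
# and should be treated as assumptions.
import itertools
import torch
from PIL import Image
from dreamsim import dreamsim

device = "cuda"
model, preprocess = dreamsim(pretrained=True, device=device)

def mean_pairwise_dreamsim(image_paths):
    """Average DreamSim distance over all image pairs; higher = more diverse."""
    tensors = [preprocess(Image.open(p)).to(device) for p in image_paths]
    distances = []
    with torch.no_grad():
        for a, b in itertools.combinations(tensors, 2):
            distances.append(model(a, b).item())
    return sum(distances) / len(distances)

# Example (paths illustrative): compare a baseline batch against a SliderSpace batch.
# print(mean_pairwise_dreamsim(["baseline_0.png", "baseline_1.png", "baseline_2.png"]))
```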