"SliderSpace: Decomposing the Visual Capabilities of Diffusion Models"
https://arxiv.org/abs/2502.01639
Text-to-image diffusion models generate diverse images from a single prompt, but users lack intuitive control over this variation. This paper addresses the challenge of discovering and exposing the hidden visual capabilities within diffusion models.
This paper introduces SliderSpace, a framework for unsupervised discovery of controllable directions within diffusion models. It decomposes the model's visual capabilities into semantically meaningful and orthogonal directions, enabling users to explore diverse image variations.
-----
📌 SliderSpace leverages CLIP's semantic space to disentangle the variation in a diffusion model's outputs. PCA on CLIP embeddings of generated images reveals orthogonal directions, which then guide LoRA training toward targeted, interpretable controls.
📌 The unsupervised nature of SliderSpace is a key strength. By discovering directions directly from model outputs, it avoids reliance on pre-defined attributes, unlocking truly emergent and diverse variations.
📌 By training low-rank adaptors, SliderSpace achieves efficient control without full fine-tuning. This allows for compositional manipulation and rapid exploration of diffusion models' generative capacity with minimal overhead.
----------
Methods Explored in this Paper 🔧:
→ SliderSpace discovers controllable directions in diffusion models using a three-step process.
→ First, it samples a diverse set of images for a given prompt by varying random seeds.
→ Then, it extracts CLIP embeddings of the generated images and performs Principal Component Analysis (PCA) to find the principal directions of variation in semantic space (see the discovery sketch after this list).
→ Finally, for each principal direction, it trains a Low-Rank Adaptor (LoRA) to control image generation along that direction.
→ The training objective aligns the transformation induced by each LoRA with its corresponding principal component in CLIP embedding space, keeping the sliders semantically orthogonal to one another (see the loss sketch after this list).
→ This approach enables unsupervised discovery of semantically meaningful and diverse control directions without predefined attributes.
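Taken together, the first two steps can be sketched roughly as below. This is a minimal illustration, not the authors' code: the SDXL checkpoint, the OpenCLIP ViT-B-32 encoder, and constants such as `N_SAMPLES` and `N_DIRECTIONS` are assumptions chosen for the sketch.

```python
# Sketch of SliderSpace's discovery step: sample seed variations of one prompt,
# embed them with CLIP, and run PCA to get orthogonal directions of variation.
import torch
from diffusers import StableDiffusionXLPipeline
import open_clip

PROMPT = "a monster"   # concept to decompose (illustrative)
N_SAMPLES = 256        # number of seed variations (illustrative)
N_DIRECTIONS = 8       # number of principal directions to keep (illustrative)

device = "cuda"
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model = clip_model.to(device).eval()

# Step 1: sample a diverse image set for one prompt by varying the random seed.
embeddings = []
for seed in range(N_SAMPLES):
    generator = torch.Generator(device).manual_seed(seed)
    image = pipe(PROMPT, generator=generator, num_inference_steps=30).images[0]

    # Step 2: embed each generated image in CLIP's semantic space.
    with torch.no_grad():
        x = clip_preprocess(image).unsqueeze(0).to(device)
        emb = clip_model.encode_image(x)
        embeddings.append(torch.nn.functional.normalize(emb, dim=-1).cpu())

E = torch.cat(embeddings)                      # (N_SAMPLES, clip_dim)
E_centered = E - E.mean(dim=0, keepdim=True)

# PCA via SVD: rows of Vt are orthogonal principal directions in CLIP space;
# each kept row becomes the target direction for one slider (LoRA) in step 3.
_, _, Vt = torch.linalg.svd(E_centered, full_matrices=False)
principal_directions = Vt[:N_DIRECTIONS]
```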
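The alignment objective in the final step can be sketched as a simple cosine loss in CLIP space. The helper `clip_embed` and the batches `images_lora` / `images_base` are hypothetical names, and the sketch omits the diffusion training loop and any additional terms used in the paper; it shows only the core idea of pushing the LoRA-induced shift toward its assigned principal direction.

```python
# Simplified sketch of the slider training objective (not the paper's exact loss).
import torch
import torch.nn.functional as F

def slider_alignment_loss(images_lora, images_base, direction_k, clip_embed):
    """Encourage the CLIP-space shift induced by one LoRA slider to align with
    its assigned principal direction `direction_k` (shape: (clip_dim,))."""
    e_lora = F.normalize(clip_embed(images_lora), dim=-1)   # adapted outputs
    e_base = F.normalize(clip_embed(images_base), dim=-1)   # unadapted outputs
    shift = F.normalize(e_lora - e_base, dim=-1)            # LoRA-induced transformation
    # Maximize cosine similarity between the shift and the target component.
    cosine = (shift * direction_k.unsqueeze(0)).sum(dim=-1)
    return (1.0 - cosine).mean()
```

Because each slider is trained against a different (mutually orthogonal) principal component, the resulting controls stay semantically disentangled and can be composed.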
-----
Key Insights 💡:
→ SliderSpace uncovers interpretable visual attributes within diffusion models, revealing the model's internal concept representation.
→ The discovered directions are semantically orthogonal and controllable, allowing users to manipulate specific image attributes independently.
→ SliderSpace enhances the diversity of generated images without compromising text-image alignment.
→ The method is architecture-agnostic and applicable to various diffusion models, including distilled and transformer-based models.
-----
Results 📊:
→ SliderSpace significantly increases inter-image diversity over baseline models, as measured by DreamSim distance (see the metric sketch after this list). For example, for the concept "Monster", diversity increased from 0.266 to 0.472.
→ SliderSpace maintains CLIP scores comparable to the baseline models, indicating preserved text alignment. For the same "Monster" concept, the CLIP score with SliderSpace remained at 28.14.
→ User studies show that SliderSpace-generated images are consistently preferred over baselines for diversity, utility, and creative potential. For diversity, SliderSpace achieved a 72.4% win rate against SDXL-DMD.
→ In art style exploration, SliderSpace achieves a lower FID score of 21.82 compared to LLM prompt baselines (38.84), indicating better alignment with manually curated artist styles.
→ SliderSpace improves the FID score of distilled models like SDXL-DMD from 15.52 to 12.12, approaching the performance of the undistilled SDXL model (11.72), demonstrating diversity enhancement and mode collapse reversal.
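The diversity numbers above are DreamSim distances. One plausible way to compute such a score, assuming the publicly released `dreamsim` package and its documented two-image interface (an assumption about the tooling, not the paper's evaluation code), is to average the pairwise distances within a batch of generations:

```python
# Mean pairwise DreamSim distance as an inter-image diversity score.
# Assumes the `dreamsim` pip package; call signatures are based on its README
# and should be treated as assumptions.
import itertools
import torch
from PIL import Image
from dreamsim import dreamsim

device = "cuda"
model, preprocess = dreamsim(pretrained=True, device=device)

def mean_pairwise_dreamsim(image_paths):
    """Average DreamSim distance over all image pairs; higher = more diverse."""
    tensors = [preprocess(Image.open(p)).to(device) for p in image_paths]
    distances = []
    with torch.no_grad():
        for a, b in itertools.combinations(tensors, 2):
            distances.append(model(a, b).item())
    return sum(distances) / len(distances)

# Example (paths illustrative): compare a baseline batch against a SliderSpace batch.
# print(mean_pairwise_dreamsim(["baseline_0.png", "baseline_1.png", "baseline_2.png"]))
```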