EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
A clever way to train open-source image models by learning from API-only models using vision-language feedback
Original Problem:
Text-to-image models with exceptional capabilities are often restricted to API access, limiting their widespread use.
Solution in this Paper:
• Introduces the EvolveDirector framework to train open-source text-to-image models
• Uses advanced models' APIs to obtain training images
• Leverages Vision-Language Models (VLMs) to guide dynamic dataset curation
• Implements discrimination, expansion, deletion, and mutation operations on the training set (see the training-loop sketch at the end of this section)
• Incorporates layer normalization after the Q and K projections in cross-attention blocks (sketched below)
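The normalization trick is easy to picture in code. Below is a minimal sketch in PyTorch of a cross-attention block with LayerNorm applied right after the Q and K projections; it is not the paper's actual implementation, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormCrossAttention(nn.Module):
    """Cross-attention with LayerNorm after the Q and K projections (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # The normalization described in the paper: applied to the
        # projected queries and keys to stabilize training.
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.q_norm(self.to_q(x))        # normalize queries
        k = self.k_norm(self.to_k(context))  # normalize keys
        v = self.to_v(context)
        # Reshape each tensor to (batch, heads, seq_len, head_dim).
        q, k, v = (t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```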
Key Insights from this Paper:
• The generation abilities of closed models can be approximated by training on their generated data
• VLM guidance significantly reduces the data volume required for efficient training
• Learning from multiple advanced models can surpass the performance of any individual one
Results:
• 100k training samples were sufficient for the base model to match the target model's performance
• Edgen (the final trained model) outperforms DeepFloyd IF, Playground 2.5, Stable Diffusion 3, and Ideogram
• Demonstrates superior capabilities in human generation, text generation, and multi-object generation
EvolveDirector uses a three-part process:
• It interacts with advanced T2I models through their APIs to obtain training images.
• It maintains a dynamic training set, curated under the guidance of a Vision-Language Model (VLM).
• It trains the base model on this continually evolving training set.
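The loop below sketches how these three parts could fit together with the discrimination, expansion, deletion, and mutation operations from the paper. Every helper here (`vlm.prefers_target`, `vlm.expand_prompt`, `vlm.mutate_prompt`, `api.generate`, `base_model.generate`, `base_model.train_on`) is a hypothetical stand-in for the paper's actual API calls, VLM prompts, and training code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    image: object  # image obtained from an advanced T2I model's API

def evolve_director_round(train_set, base_model, vlm, api):
    """One round of VLM-guided dataset curation and training (sketch)."""
    updated = []
    for sample in train_set:
        base_img = base_model.generate(sample.prompt)
        # Discrimination: the VLM judges whether the advanced model's
        # image still beats the base model's output for this prompt.
        if vlm.prefers_target(sample.image, base_img, sample.prompt):
            updated.append(sample)
            # Expansion: generate related prompts to reinforce an area
            # where the base model is still behind.
            for p in vlm.expand_prompt(sample.prompt):
                updated.append(Sample(p, api.generate(p)))
        else:
            # Deletion: drop samples the base model has already mastered.
            # Mutation: replace them with a novel variation of the prompt.
            p = vlm.mutate_prompt(sample.prompt)
            updated.append(Sample(p, api.generate(p)))
    # Train the base model on the refreshed dynamic training set.
    base_model.train_on(updated)
    return updated
```

Because weak prompts are expanded while mastered ones are deleted or mutated, the training set keeps concentrating on the base model's remaining gaps, which is how the paper gets away with roughly 100k samples.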



