EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
A clever way to train open-source image models by learning from API-only models using vision-language feedback
Original Problem:
Text-to-image models with exceptional capabilities are often restricted to API access, limiting their widespread use.
Solution in this Paper:
• Introduces the EvolveDirector framework to train open-source text-to-image models
• Uses advanced models' APIs to obtain training images
• Leverages Vision-Language Models (VLMs) to guide dynamic dataset curation
• Implements discrimination, expansion, deletion, and mutation operations on the training set (see the training-loop sketch at the end of this section)
• Incorporates layer normalization after the Q and K projections in cross-attention blocks (sketched below)
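The normalization trick is easy to picture in code. Below is a minimal sketch in PyTorch of a cross-attention block with LayerNorm applied right after the Q and K projections; it is not the paper's actual implementation, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormCrossAttention(nn.Module):
    """Cross-attention with LayerNorm after the Q and K projections (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # The normalization described in the paper: applied to the
        # projected queries and keys to stabilize training.
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.q_norm(self.to_q(x))        # normalize queries
        k = self.k_norm(self.to_k(context))  # normalize keys
        v = self.to_v(context)
        # Reshape each tensor to (batch, heads, seq_len, head_dim).
        q, k, v = (t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```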
Key Insights from this Paper:
• The generation abilities of closed models can be approximated by training on their generated data
• VLM guidance significantly reduces the data volume required for efficient training
• Learning from multiple advanced models can surpass the performance of any individual one
Results:
• 100k training samples were sufficient for the base model to match the target model's performance
• Edgen (the final trained model) outperforms DeepFloyd IF, Playground 2.5, Stable Diffusion 3, and Ideogram
• Demonstrates superior capabilities in human generation, text generation, and multi-object generation
EvolveDirector uses a three-part process:
• It interacts with advanced T2I models through their APIs to obtain training images.
• It maintains a dynamic training set, curated under the guidance of a Vision-Language Model (VLM).
• It trains the base model on this continually evolving training set.
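The loop below sketches how these three parts could fit together with the discrimination, expansion, deletion, and mutation operations from the paper. Every helper here (`vlm.prefers_target`, `vlm.expand_prompt`, `vlm.mutate_prompt`, `api.generate`, `base_model.generate`, `base_model.train_on`) is a hypothetical stand-in for the paper's actual API calls, VLM prompts, and training code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    image: object  # image obtained from an advanced T2I model's API

def evolve_director_round(train_set, base_model, vlm, api):
    """One round of VLM-guided dataset curation and training (sketch)."""
    updated = []
    for sample in train_set:
        base_img = base_model.generate(sample.prompt)
        # Discrimination: the VLM judges whether the advanced model's
        # image still beats the base model's output for this prompt.
        if vlm.prefers_target(sample.image, base_img, sample.prompt):
            updated.append(sample)
            # Expansion: generate related prompts to reinforce an area
            # where the base model is still behind.
            for p in vlm.expand_prompt(sample.prompt):
                updated.append(Sample(p, api.generate(p)))
        else:
            # Deletion: drop samples the base model has already mastered.
            # Mutation: replace them with a novel variation of the prompt.
            p = vlm.mutate_prompt(sample.prompt)
            updated.append(Sample(p, api.generate(p)))
    # Train the base model on the refreshed dynamic training set.
    base_model.train_on(updated)
    return updated
```

Because weak prompts are expanded while mastered ones are deleted or mutated, the training set keeps concentrating on the base model's remaining gaps, which is how the paper gets away with roughly 100k samples.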



