"Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis"

The podcast on this paper is generated with Google's Illuminate.

No more AI image-generation fashion disasters - right accessories on right animals. Token Merging (ToMe), proposed in this paper, solves this problem.

Simple token combination solves the attribute confusion problem in AI image generation

Token Merging (ToMe) fixes mixed-up attributes in AI image generation by merging related text tokens

https://arxiv.org/abs/2411.07132

Original Problem 🤔:

Text-to-image models often fail at semantic binding - correctly associating objects with their attributes or related sub-objects. For example, when generating "a dog with a hat and a cat with sunglasses", models frequently mix up which accessory belongs to which animal.

-----

Solution in this Paper 🛠️:

→ Token Merging (ToMe) combines relevant tokens into a single composite token by adding their CLIP text embeddings

→ End Token Substitution (ETS) replaces end tokens to reduce semantic interference

→ Two auxiliary losses guide token updating: entropy loss ensures focused attention, while semantic binding loss maintains consistency

→ The method requires no LLMs, layout information, or additional training
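The core merging step can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `merge_tokens` and the `(start, end)` span format are invented here, and the embeddings would come from a CLIP text encoder in practice.

```python
import numpy as np

def merge_tokens(embeddings, merge_spans):
    """Merge each span of related token embeddings into one composite
    token by summing them (a sketch of ToMe's core idea).

    embeddings:  (seq_len, dim) array of per-token text embeddings
    merge_spans: list of (start, end) pairs; tokens in [start, end)
                 are replaced by the sum of their embeddings
    """
    span_end = {s: e for s, e in merge_spans}
    merged = []
    i = 0
    while i < len(embeddings):
        if i in span_end:
            e = span_end[i]
            # e.g. "dog" + "hat" embeddings become one composite token
            merged.append(embeddings[i:e].sum(axis=0))
            i = e
        else:
            merged.append(embeddings[i])
            i += 1
    return np.stack(merged)
```

For "a dog with a hat", the span covering the "dog ... hat" tokens would collapse into a single composite token, so the diffusion model's cross-attention binds the accessory to the right animal.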

-----

Key Insights from this Paper 💡:

→ Text embeddings exhibit semantic additivity - combining token embeddings preserves their collective meaning

→ End tokens [EOT] often contain entangled semantic information that can cause attribute confusion

→ Early denoising steps are crucial for determining object layouts and relationships
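The additivity insight can be illustrated numerically: in high dimensions, the sum of two embeddings stays close to both of its constituents, so a composite token retains each original meaning. A minimal sketch with random unit vectors standing in for CLIP embeddings (the ~0.7 cosine figure is a property of random high-dimensional vectors, not a number from the paper):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
a = rng.standard_normal(512); a /= np.linalg.norm(a)  # stand-in for one token embedding
b = rng.standard_normal(512); b /= np.linalg.norm(b)  # stand-in for another
c = a + b  # composite token

# Random 512-dim unit vectors are nearly orthogonal, so the composite
# keeps roughly 1/sqrt(2) ~ 0.71 cosine similarity to each constituent.
print(cosine(c, a), cosine(c, b))
```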

-----

Results 📊:

→ Outperforms existing methods on T2I-CompBench benchmark for attribute binding

→ Achieves superior results on GPT-4o benchmark for object binding

→ Shows particular effectiveness in complex scenarios with multiple objects/attributes

→ Requires no model fine-tuning or additional training data
