The paper addresses the challenge of consistent character generation in text-to-image models for storytelling. It introduces a novel method that achieves consistency without extra training or complex architectures.
-----
Paper - https://arxiv.org/abs/2501.13554
Original Problem 😟:
→ Current text-to-image models struggle to maintain consistent character identity across multiple images for storytelling.
→ Existing methods for consistent generation require extensive training or model modifications.
→ These training-heavy methods limit applicability and introduce risks like language drift.
-----
Solution in this Paper 🤔:
→ This paper proposes "One-Prompt-One-Story", a training-free approach for consistent text-to-image generation.
→ Their method uses a single concatenated prompt that combines an identity description with all frame descriptions (see the prompt-construction sketch after this list).
→ It leverages the inherent "context consistency" of language models.
→ The method refines generation with "Singular-Value Reweighting" (SVR) and "Identity-Preserving Cross-Attention" (IPCA).
→ SVR enhances the current frame's prompt embedding and weakens the other frames' embeddings by reweighting them via Singular Value Decomposition (sketched below).
→ IPCA strengthens identity consistency in the cross-attention layers by focusing attention on the identity prompt (sketched below).
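To make the single-prompt idea concrete, here is a minimal sketch of building the combined prompt and sampling with an off-the-shelf SDXL pipeline via diffusers. The identity and frame texts are illustrative, and this naive call omits the paper's SVR/IPCA refinements (outlined in the next two sketches); it assumes a CUDA GPU is available.

```python
# Minimal sketch: build the single "one-prompt" input described in the paper,
# then sample with a stock SDXL pipeline. Illustrative only — the paper
# generates one image per frame from this same prompt, applying SVR/IPCA
# inside the model rather than calling the pipeline naively like this.
import torch
from diffusers import StableDiffusionXLPipeline

identity = "a watercolor painting of a small red fox wearing a blue scarf"
frames = [
    "exploring a snowy forest",
    "warming up by a campfire",
    "curling up to sleep in a den",
]

# One prompt = identity description + all frame descriptions.
one_prompt = identity + ", " + ", ".join(frames)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt=one_prompt, num_inference_steps=30).images[0]
image.save("story_frame.png")
```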
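A hedged sketch of the SVR idea: rescale the singular values of the token-embedding sub-matrix belonging to one frame description, amplifying the active frame and suppressing the others. The uniform scaling factor, the mean-centering, and the hard-coded token spans are simplifying assumptions, not the paper's exact formulation.

```python
# Hedged sketch of Singular-Value Reweighting: enhance the active frame's
# tokens and weaken the other frames' tokens by rescaling singular values
# of their embedding sub-matrices. Token spans would come from the tokenizer.
import torch

def reweight_span(prompt_emb: torch.Tensor, span: slice, factor: float) -> torch.Tensor:
    """Rescale singular values of the token sub-matrix covered by `span`.

    prompt_emb: (seq_len, dim) text-encoder output for the combined prompt.
    span:       token indices of one frame description.
    factor:     >1 enhances the current frame, <1 weakens inactive frames.
    """
    sub = prompt_emb[span]                                 # (n_tokens, dim)
    mean = sub.mean(dim=0, keepdim=True)
    U, S, Vh = torch.linalg.svd(sub - mean, full_matrices=False)
    out = prompt_emb.clone()
    out[span] = U @ torch.diag(S * factor) @ Vh + mean     # reconstruct span
    return out

# Example with hypothetical token spans in a 77-token SDXL embedding.
emb = torch.randn(77, 2048)
emb = reweight_span(emb, slice(20, 30), factor=1.5)  # enhance current frame
emb = reweight_span(emb, slice(30, 40), factor=0.3)  # weaken another frame
```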
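And a hedged sketch of the IPCA idea: concatenate the identity tokens' keys and values onto each frame's keys and values inside cross-attention, so every frame attends to the same identity representation. The tensor shapes and the plain scaled-dot-product attention are generic placeholders for the UNet's actual cross-attention layers, not the paper's exact injection scheme.

```python
# Hedged sketch of Identity-Preserving Cross-Attention: append the identity
# prompt's keys/values to each frame's keys/values so identity information
# is shared across all frames during cross-attention.
import torch
import torch.nn.functional as F

def identity_preserving_attn(q, k_frame, v_frame, k_id, v_id):
    """q: (B, heads, Nq, d); k_*/v_*: (B, heads, Nk, d) key/value tensors."""
    k = torch.cat([k_frame, k_id], dim=2)  # append identity keys
    v = torch.cat([v_frame, v_id], dim=2)  # append identity values
    return F.scaled_dot_product_attention(q, k, v)
```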
-----
Key Insights from this Paper 💡:
→ Language models inherently understand identity through context within a single prompt.
→ Simply concatenating all frame prompts into a single prompt already preserves character identity to a large degree, even before any refinement.
→ Reweighting prompt embeddings and refining cross-attention further improves consistency and text-image alignment.
→ Training-free methods can achieve strong consistent generation by exploiting language model properties.