Fine-tuning text-to-image models can strip away their safety filters, even with completely safe training data - this paper has a clever fix.
https://arxiv.org/abs/2412.00357
🔍 Original Problem:
→ Text-to-image models lose their safety alignment during fine-tuning, causing previously filtered harmful content to resurface
→ This happens even when using completely safe training datasets, creating serious risks for companies and users
-----
🛠️ Solution in this Paper:
→ The paper introduces "Modular LoRA" which separates safety and fine-tuning components
→ Safety filters are implemented as a separate Low-Rank Adaptation (LoRA) module
→ During fine-tuning, the safety LoRA is temporarily detached to prevent interference
→ After fine-tuning completes, the safety module is reattached for inference
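A minimal sketch of this detach/reattach workflow, written against the multi-adapter LoRA API in Hugging Face diffusers. The checkpoint paths and adapter names are hypothetical placeholders, and this is an illustration of the idea, not the paper's released code:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1) Attach the safety filter as its own LoRA module
#    (trained separately to suppress unsafe concepts).
pipe.load_lora_weights("path/to/safety_lora", adapter_name="safety")

# 2) Detach it before task fine-tuning, so the new task LoRA is trained
#    against the base weights only and cannot overwrite the safety module.
pipe.disable_lora()
# ... run standard LoRA fine-tuning here, producing e.g. "style_lora" ...

# 3) After fine-tuning, load the task LoRA and reattach the safety module,
#    composing both adapters at inference time.
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")
pipe.set_adapters(["safety", "style"], adapter_weights=[1.0, 1.0])

image = pipe("a watercolor painting of a cat").images[0]
```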
-----
💡 Key Insights:
→ Standard fine-tuning can undo safety measures in as few as 1,500-2,000 steps
→ Larger training datasets cause more severe safety degradation
→ Safety breakdown happens in early fine-tuning stages
→ The fine-tuning process actively relearns restricted content rather than just forgetting filters
-----
📊 Results:
→ Modular LoRA outperforms traditional fine-tuning in maintaining safety alignment
→ NSFW content drops from 20.02% with standard LoRA fine-tuning to 8.92% with Modular LoRA
→ Nude content drops from 5.78% to 2.28%