Attention Shift: Steering AI Away from Unsafe Content
Attention manipulation with strategic token weighting, guided by LLMs, creates a robust safety net for image generation.
Token tweaking + LLM gatekeeping = Safe AI art without the training headache
Original Problem 🚨:
Text-to-image diffusion models often generate unsafe or explicit content when given certain prompts. Current safety filters are ineffective and vulnerable to jailbreak prompts.
Solution in this Paper 🛠️:
• Training-free approach using attention reweighing combined with LLM validation
• Uses Mistral-8x7B to check whether prompts are safe and to rewrite unsafe ones (see the sketch after this list)
• Increases the relative importance of safety-enforcing tokens by reweighing them by a factor of 10
• Dynamically adjusts cross-attention maps during inference to suppress unsafe content generation (see the sketch under Attention Reweighing below)
• Works without additional training or computational overhead
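
A rough sketch of the LLM gatekeeping step. The `query_mistral` helper is a hypothetical stand-in for whatever endpoint serves Mistral-8x7B, and the system prompts are illustrative, not the paper's wording:

```python
def query_mistral(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a call to a hosted Mistral-8x7B chat endpoint."""
    raise NotImplementedError("wire this up to your LLM serving stack")


def validate_prompt(prompt: str) -> str:
    """Return the prompt unchanged if the LLM deems it safe, else an LLM-rewritten safe version."""
    verdict = query_mistral(
        "Answer SAFE or UNSAFE: does this image-generation prompt request "
        "explicit, violent, or otherwise harmful content?",
        prompt,
    )
    if verdict.strip().upper().startswith("SAFE"):
        return prompt
    # The rewritten prompt, not the original, is what gets passed to the diffusion model.
    return query_mistral(
        "Rewrite this image-generation prompt so it no longer requests unsafe "
        "content while preserving the benign parts of the request.",
        prompt,
    )
```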
Key Insights 💡:
• Existing safety filters rely too heavily on CLIP embedding vectors
• Models are vulnerable to "jailbreak" prompts that bypass safety mechanisms
• Current ablation methods struggle with semantically similar but unsafe concepts
• Dynamic attention reweighing is more effective than static filtering
• LLM validation adds an extra layer of safety checking
Results 📊:
• Outperforms existing methods (Concept Ablation, Forget-Me-Not, Safe Diffusion) in restricting explicit content
• Human evaluation shows 0/10 unsafe generations for direct prompts
• Only 1-2/10 unsafe generations for jailbreak prompts
• Maintains good image quality with FID scores comparable to baseline
• CLIP scores show a significant reduction in alignment with unsafe concepts (see the sketch below)
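
A minimal sketch of how such a CLIP-alignment check can be run with Hugging Face Transformers; the checkpoint name, file path, and concept string are assumptions for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image_path: str, concept: str) -> float:
    """Cosine similarity between a generated image and a concept string."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[concept], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Example (hypothetical file and concept): lower similarity to an unsafe
# concept string means weaker alignment with that concept.
# clip_alignment("generated.png", "a violent scene")
```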
🛠️ The method works in two steps:
1. Uses Mistral-8x7B to validate whether a prompt is safe and to rewrite unsafe prompts
2. Increases the relative importance of safety-enforcing tokens by reweighing them by a factor of 10
Attention Reweighing
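
A minimal PyTorch sketch of the reweighing idea. The factor of 10 comes from the paper; the tensor shapes, token indices, and function signature are illustrative assumptions rather than the authors' implementation:

```python
import torch

def reweighted_cross_attention(query, key, value, safety_token_idx, scale=10.0):
    """Cross-attention where safety-enforcing prompt tokens get extra weight.

    query: (batch, num_pixels, dim)      image-side features
    key, value: (batch, num_tokens, dim) text-encoder features
    safety_token_idx: positions of the safety-enforcing tokens in the prompt
    scale: relative-importance boost (a factor of 10 in the paper)
    """
    d = query.shape[-1]
    # Standard scaled dot-product attention over the prompt tokens.
    logits = torch.einsum("bpd,btd->bpt", query, key) / d**0.5
    attn = logits.softmax(dim=-1)

    # Boost the attention mass on safety tokens, then renormalize, so the
    # remaining (potentially unsafe) tokens influence the image less.
    attn[..., safety_token_idx] *= scale
    attn = attn / attn.sum(dim=-1, keepdim=True)

    return torch.einsum("bpt,btd->bpd", attn, value)

# Illustrative shapes: one 64x64 latent (4096 pixels), 77 prompt tokens, dim 64.
q, k, v = torch.randn(1, 4096, 64), torch.randn(1, 77, 64), torch.randn(1, 77, 64)
out = reweighted_cross_attention(q, k, v, safety_token_idx=[5, 6])  # hypothetical indices
```

In a full pipeline, this adjustment would be applied inside each cross-attention layer at every denoising step, which is what "dynamically adjusts cross-attention maps during inference" refers to.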