Using text to guide visual attention makes CLIP more robust to adversarial examples
Aligning attention between clean and adversarial images improves model robustness
📚 https://arxiv.org/abs/2410.21802
🎯 Original Problem:
Pre-trained vision-language models like CLIP are vulnerable to adversarial attacks where small image perturbations cause misclassification. Current solutions either sacrifice clean accuracy or provide limited robustness improvement.
-----
🔧 Solution in this Paper:
→ Introduces TGA-ZSR (Text-Guided Attention for Zero-Shot Robustness) with two key components:
→ Attention Refinement module: Aligns text-guided attention from adversarial examples with clean examples
→ Attention-based Model Constraint module: Maintains clean sample performance while improving robustness
→ Uses text embeddings from the frozen text encoder to guide visual attention during adversarial training (see the sketch below)
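
A minimal sketch of the idea, not the authors' code: text-guided attention is computed as patch-to-text similarity, then one loss pulls the adversarial attention map toward the clean one (Attention Refinement) and another keeps the tuned model's clean attention close to the original frozen model's (Attention-based Model Constraint). The method `image_patch_features` and all names here are hypothetical stand-ins for a CLIP-like encoder exposing patch-level features.

```python
import torch
import torch.nn.functional as F

def text_guided_attention(patch_feats, text_emb):
    """Cosine similarity between each image patch and the text embedding,
    softmaxed over patches to form an attention map."""
    patch_feats = F.normalize(patch_feats, dim=-1)   # (B, P, D)
    text_emb = F.normalize(text_emb, dim=-1)         # (B, D)
    sim = torch.einsum('bpd,bd->bp', patch_feats, text_emb)
    return sim.softmax(dim=-1)                       # (B, P)

def tga_losses(model, frozen_model, clean_imgs, adv_imgs, text_emb):
    # Text-guided attention on adversarial and clean images (tuned model)
    attn_adv = text_guided_attention(model.image_patch_features(adv_imgs), text_emb)
    attn_clean = text_guided_attention(model.image_patch_features(clean_imgs), text_emb)

    # Attention Refinement: align adversarial attention with clean attention
    l_refine = F.mse_loss(attn_adv, attn_clean)

    # Attention-based Model Constraint: keep clean attention close to the
    # original frozen model's attention, preserving clean accuracy
    with torch.no_grad():
        attn_orig = text_guided_attention(
            frozen_model.image_patch_features(clean_imgs), text_emb)
    l_constrain = F.mse_loss(attn_clean, attn_orig)

    return l_refine, l_constrain
```

Both terms would be added to the standard adversarial training objective; the exact weighting and distance measure are the paper's to specify.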
-----
💡 Key Insights:
→ Adversarial perturbations cause shifts in text-guided attention patterns
→ Text guidance can help filter irrelevant information during adversarial attacks
→ Simple attention alignment between clean and adversarial samples improves robustness
-----
📊 Results:
→ 9.58% improvement in zero-shot robust accuracy over state-of-the-art across 16 datasets
→ Maintains high clean accuracy (56.48%) while achieving 41.96% robust accuracy
→ Effective against multiple attack types: PGD, AutoAttack, and CW attacks