"Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models"

The podcast on this paper was generated with Google's Illuminate.

Using text to guide visual attention makes CLIP more robust to adversarial examples

Aligning attention between clean and adversarial images improves model robustness

📚 https://arxiv.org/abs/2410.21802

🎯 Original Problem:

Pre-trained vision-language models like CLIP are vulnerable to adversarial attacks: small image perturbations can cause misclassification. Existing defenses either sacrifice clean accuracy or provide only limited robustness gains.

-----

🔧 Solution in this Paper:

→ Introduces TGA-ZSR (Text-Guided Attention for Zero-Shot Robustness) with two key components (sketched in code after this list):

→ Attention Refinement module: aligns the text-guided attention of adversarial examples with that of clean examples

→ Attention-based Model Constraint module: maintains clean-sample performance while robustness improves

→ Uses text embeddings from the frozen text encoder to guide visual attention during adversarial fine-tuning
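
Below is a minimal PyTorch-style sketch of how these pieces might combine into a single training objective. The attention formulation (softmax over patch-text cosine similarities), the L1 distances, the loss weights `alpha`/`beta`, and the helper methods `patch_features` and `logits` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def tga_zsr_loss(tuned_enc, frozen_enc, text_emb, x_clean, x_adv,
                 labels, alpha=1.0, beta=1.0):
    """Illustrative sketch of a TGA-ZSR-style objective.

    tuned_enc:  image encoder being adversarially fine-tuned
    frozen_enc: frozen copy of the original image encoder
    text_emb:   C x D class embeddings from the frozen text encoder
    """
    def text_guided_attention(encoder, images):
        # Hypothetical helper: B x P x D patch features from the encoder.
        patches = F.normalize(encoder.patch_features(images), dim=-1)
        sims = patches @ F.normalize(text_emb, dim=-1).T  # B x P x C
        return sims.softmax(dim=1)                        # attention over patches

    attn_adv = text_guided_attention(tuned_enc, x_adv)
    attn_clean_tuned = text_guided_attention(tuned_enc, x_clean)
    with torch.no_grad():
        attn_clean_frozen = text_guided_attention(frozen_enc, x_clean)

    # Attention Refinement: pull adversarial attention toward clean attention.
    l_refine = F.l1_loss(attn_adv, attn_clean_frozen)

    # Attention-based Model Constraint: keep the tuned model's clean-image
    # attention close to the original model's, preserving clean accuracy.
    l_constrain = F.l1_loss(attn_clean_tuned, attn_clean_frozen)

    # Standard adversarial task loss on image-text similarity logits.
    logits = tuned_enc.logits(x_adv, text_emb)  # hypothetical helper
    l_task = F.cross_entropy(logits, labels)

    return l_task + alpha * l_refine + beta * l_constrain
```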

-----

💡 Key Insights:

→ Adversarial perturbations cause shifts in text-guided attention patterns (quantified in the sketch after this list)

→ Text guidance can help filter irrelevant information during adversarial attacks

→ Simple attention alignment between clean and adversarial samples improves robustness
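
The first insight can be checked empirically: compute a text-guided attention map for the same image before and after perturbation and measure how far it moves. A minimal sketch, assuming precomputed B x P x D patch features; the cosine-similarity attention and the L1 shift metric are assumptions, not the paper's exact measurement.

```python
import torch.nn.functional as F

def attention_shift(patch_clean, patch_adv, text_emb):
    """Mean absolute shift of the text-guided attention map under attack.

    patch_clean, patch_adv: B x P x D patch features from the image encoder
    text_emb:               C x D class embeddings from the text encoder
    """
    def attn(patches):
        sims = F.normalize(patches, dim=-1) @ F.normalize(text_emb, dim=-1).T
        return sims.softmax(dim=1)  # B x P x C attention over patches

    # A large value means the perturbation has dragged attention away
    # from the text-relevant regions of the image.
    return (attn(patch_clean) - attn(patch_adv)).abs().mean()
```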

-----

📊 Results:

→ 9.58% improvement in zero-shot robust accuracy over the state of the art across 16 datasets

→ Maintains high clean accuracy (56.48%) while achieving 41.96% robust accuracy

→ Effective against multiple attack types: PGD, AutoAttack, and C&W (a minimal PGD sketch follows below)
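
For context, a generic L-infinity PGD attack of the kind used in such robustness evaluations looks roughly like this. The step size, perturbation budget, and the assumption that `model(images)` returns class logits (e.g. CLIP image-text similarities over class prompts) are illustrative, not the paper's exact evaluation settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=1/255, step=1/255, steps=10):
    """Untargeted L-infinity PGD; hyperparameters here are illustrative."""
    # Random start inside the eps-ball, clipped to the valid pixel range.
    x_adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1)

    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)  # model returns logits (assumed)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball around images.
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = images + (x_adv - images).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)

    return x_adv.detach()
```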
