Vision Transformers don't need pre-trained features - just knowing where to look is enough.
In other words, teaching vision models where to pay attention can matter as much as teaching them what to see.
This paper challenges the common belief that pre-training helps Vision Transformers (ViTs) by teaching them useful features. Instead, it shows that transferring only the attention patterns from a pre-trained model is enough to reach comparable downstream performance.
-----
https://arxiv.org/abs/2411.09702
🤔 Original Problem:
Pre-training has long been the go-to recipe for vision models, on the assumption that its value comes from the useful features it learns. This paper asks whether that is really the whole story.
-----
🔧 Solution in this Paper:
→ The paper introduces "attention transfer": only the attention patterns of a pre-trained ViT teacher are transferred to a new student model, not its features or weights.
→ Two variants are proposed: Attention Copy plugs the teacher's attention maps directly into the student's forward pass, while Attention Distillation trains the student to reproduce the teacher's attention maps via a distillation loss (see the sketch after this list).
→ In both cases the student learns its own features from scratch and reuses only the teacher's token-routing patterns.
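Here is a minimal PyTorch-style sketch of the two variants, assuming per-layer attention maps of shape (batch, heads, tokens, tokens) can be read out of both teacher and student; the function names and the cross-entropy form of the distillation loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def attention_copy_forward(student_values, teacher_attn):
    # Attention Copy: aggregate the student's own value vectors with the
    # teacher's frozen attention map instead of the student's own scores.
    # student_values: (batch, heads, tokens, head_dim)
    # teacher_attn:   (batch, heads, tokens, tokens), rows sum to 1
    return teacher_attn.detach() @ student_values

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    # Attention Distillation: push the student's attention maps toward the
    # teacher's. A per-query cross-entropy with the teacher's map as a soft
    # target, averaged over layers, heads, and positions (assumed form).
    loss = 0.0
    for s, t in zip(student_attn, teacher_attn):  # one map per layer
        loss = loss - (t.detach() * torch.log(s + eps)).sum(dim=-1).mean()
    return loss / len(student_attn)

# Toy check: 2 layers, batch 4, 6 heads, 197 tokens (196 patches + CLS)
s_maps = [torch.softmax(torch.randn(4, 6, 197, 197), dim=-1) for _ in range(2)]
t_maps = [torch.softmax(torch.randn(4, 6, 197, 197), dim=-1) for _ in range(2)]
print(attention_distillation_loss(s_maps, t_maps))  # scalar tensor
```

Note the practical difference: Attention Copy needs the teacher running alongside the student at inference time to supply the maps, whereas distillation only needs the teacher during training.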
-----
💡 Key Insights:
→ Pre-trained features aren't essential - attention patterns alone can achieve similar performance
→ Later layers' attention patterns are more important than earlier ones
→ Performance improves with more attention heads but saturates at 12 out of 16 heads
→ The method works best when pre-training and downstream datasets are similar
-----
📊 Results:
→ Attention Copy achieves 85.1% accuracy on ImageNet-1K, recovering 77.8% of the gap between training from scratch (83.0%) and fine-tuning (85.7%) - see the quick arithmetic below
→ Attention Distillation matches fine-tuning performance at 85.7%
→ Ensemble with teacher model improves accuracy to 86.3%
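The 77.8% figure is just the fraction of the scratch-to-fine-tuning gap that Attention Copy closes:

$$\frac{85.1 - 83.0}{85.7 - 83.0} = \frac{2.1}{2.7} \approx 77.8\%$$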