"On the Surprising Effectiveness of Attention Transfer for Vision Transformers"

The podcast on this paper is generated with Google's Illuminate.

Vision Transformers don't need pre-trained features - just knowing where to look is enough.

In other words, teaching vision models where to pay attention can match teaching them what to see.

This paper challenges the common belief that pre-training Vision Transformers (ViTs) works by learning useful features. Instead, it shows that transferring only the attention patterns from a pre-trained model is enough to achieve comparable performance.

-----

https://arxiv.org/abs/2411.09702

🤔 Original Problem:

Pre-training has been the go-to approach for vision models, on the assumption that it helps them learn useful features. But this paper questions whether that is really the whole story.

-----

🔧 Solution in this Paper:

→ The paper introduces "attention transfer", a method that transfers only the attention patterns of a pre-trained ViT to a new model, rather than its weights or features.

→ Two variants are proposed: Attention Copy feeds the teacher's attention maps directly into the student's forward pass, while Attention Distillation trains the student to match the teacher's attention maps with a distillation loss (both are sketched after this list).

→ The student model learns its own features from scratch, relying on the teacher only for how information is routed between tokens (its attention patterns).
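
Conceptually, the two variants differ only in where the attention map that routes tokens comes from. Below is a minimal PyTorch sketch of a single self-attention layer that supports both; the class and function names, and the exact form of the matching loss, are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn


class SimpleSelfAttention(nn.Module):
    """A standard ViT self-attention layer that can either compute its own
    attention map, run on an externally supplied one (Attention Copy), or
    expose its map for a matching loss (Attention Distillation)."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, copied_attn=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (B, heads, N, head_dim)
        if copied_attn is None:
            # Normal path: the student computes its own attention pattern.
            attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        else:
            # Attention Copy: route the student's values with the frozen
            # teacher's attention map (the student's q/k go unused here).
            attn = copied_attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn               # attn has shape (B, heads, N, N)


def attention_match_loss(student_attn, teacher_attn, eps=1e-8):
    """Attention Distillation: push the student's attention rows (probability
    distributions over tokens) toward the teacher's. Cross-entropy between the
    two distributions is an assumed choice of matching loss."""
    return -(teacher_attn * (student_attn + eps).log()).sum(dim=-1).mean()
```

In either variant the teacher is a frozen pre-trained ViT whose attention maps are read out layer by layer, while the student is trained from scratch on the downstream task with a standard classification loss; the only thing carried over from pre-training is where each token attends.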

-----

💡 Key Insights:

→ Pre-trained features aren't essential - attention patterns alone can achieve similar performance

→ Later layers' attention patterns are more important than earlier ones

→ Performance improves with more attention heads but saturates at 12 out of 16 heads

→ The method works best when pre-training and downstream datasets are similar

-----

📊 Results:

→ Attention Copy achieves 85.1% accuracy on ImageNet-1K, recovering 77.8% of the gap between training from scratch (83.0%) and fine-tuning (85.7%); see the calculation after this list

→ Attention Distillation matches fine-tuning performance at 85.7%

→ Ensembling the student with the teacher model further improves accuracy to 86.3%
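
For reference, the 77.8% figure in the first result is simply the fraction of the scratch-to-fine-tuning gap that Attention Copy closes: (85.1 - 83.0) / (85.7 - 83.0) = 2.1 / 2.7 ≈ 0.778.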
