Vision Transformers don't need pre-trained features - just knowing where to look is enough.
In other words, teaching vision models where to pay attention can matter as much as teaching them what to see.
This paper challenges the common belief that pre-training helps Vision Transformers (ViTs) by teaching them useful features. Instead, it shows that transferring only the attention patterns from a pre-trained model is enough to reach comparable downstream performance.
-----
https://arxiv.org/abs/2411.09702
🤔 Original Problem:
Pre-training has long been the go-to recipe for vision models, on the assumption that its value comes from the useful features it learns. This paper asks whether that is really the whole story.
-----
🔧 Solution in this Paper:
→ The paper introduces "attention transfer": only the attention patterns of a pre-trained ViT teacher are transferred to a new student model, not its features or weights.
→ Two variants are proposed: Attention Copy plugs the teacher's attention maps directly into the student's forward pass, while Attention Distillation trains the student to reproduce the teacher's attention maps via a distillation loss (see the sketch after this list).
→ In both cases the student learns its own features from scratch and reuses only the teacher's token-routing patterns.
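Here is a minimal PyTorch-style sketch of the two variants, assuming per-layer attention maps of shape (batch, heads, tokens, tokens) can be read out of both teacher and student; the function names and the cross-entropy form of the distillation loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def attention_copy_forward(student_values, teacher_attn):
    # Attention Copy: aggregate the student's own value vectors with the
    # teacher's frozen attention map instead of the student's own scores.
    # student_values: (batch, heads, tokens, head_dim)
    # teacher_attn:   (batch, heads, tokens, tokens), rows sum to 1
    return teacher_attn.detach() @ student_values

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    # Attention Distillation: push the student's attention maps toward the
    # teacher's. A per-query cross-entropy with the teacher's map as a soft
    # target, averaged over layers, heads, and positions (assumed form).
    loss = 0.0
    for s, t in zip(student_attn, teacher_attn):  # one map per layer
        loss = loss - (t.detach() * torch.log(s + eps)).sum(dim=-1).mean()
    return loss / len(student_attn)

# Toy check: 2 layers, batch 4, 6 heads, 197 tokens (196 patches + CLS)
s_maps = [torch.softmax(torch.randn(4, 6, 197, 197), dim=-1) for _ in range(2)]
t_maps = [torch.softmax(torch.randn(4, 6, 197, 197), dim=-1) for _ in range(2)]
print(attention_distillation_loss(s_maps, t_maps))  # scalar tensor
```

Note the practical difference: Attention Copy needs the teacher running alongside the student at inference time to supply the maps, whereas distillation only needs the teacher during training.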
-----
💡 Key Insights:
→ Pre-trained features aren't essential - attention patterns alone can achieve similar performance
→ Later layers' attention patterns are more important than earlier ones
→ Performance improves with more attention heads but saturates at 12 out of 16 heads
→ The method works best when pre-training and downstream datasets are similar
-----
📊 Results:
→ Attention Copy achieves 85.1% accuracy on ImageNet-1K, recovering 77.8% of the gap between training from scratch (83.0%) and fine-tuning (85.7%) - see the quick arithmetic below
→ Attention Distillation matches fine-tuning performance at 85.7%
→ Ensemble with teacher model improves accuracy to 86.3%
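The 77.8% figure is just the fraction of the scratch-to-fine-tuning gap that Attention Copy closes:

$$\frac{85.1 - 83.0}{85.7 - 83.0} = \frac{2.1}{2.7} \approx 77.8\%$$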