This paper introduces Scala, a framework that lets Vision Transformers (ViTs) dynamically scale down in size while maintaining performance.
It addresses the challenge of adapting ViT models to resource-constrained environments by enabling flexible inference through network slicing.
-----
https://arxiv.org/abs/2412.04786
🎯 Original Problem:
While ViTs excel in scalability, they lack flexibility to adapt to dynamic resource constraints in real-world scenarios. Current solutions either train multiple separate models or use inefficient slicing methods that significantly degrade performance.
-----
🛠️ Solution in this Paper:
→ Scala enables a single ViT to represent multiple smaller variants through uniform weight-matrix slicing at each layer (see the slicing sketch after this list).
→ It introduces Isolated Activation to disentangle the smallest sub-network's representation from other subnets while preserving performance.
→ Scale Coordination ensures each subnet receives simplified, steady, and accurate learning objectives through Progressive Knowledge Transfer.
→ The framework requires only one-shot training without modifying the original ViT structure.
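A minimal sketch of the width-slicing idea, assuming each linear layer in the ViT accepts a `ratio` argument that selects the leading sub-matrix of its weights. The class name `SlicedLinear` and the `ratio` parameter are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedLinear(nn.Linear):
    """Linear layer whose weight matrix can be uniformly sliced at inference time.

    The sub-network at width ratio r reuses the leading rows/columns of the full
    weight matrix, so every smaller variant is embedded inside the largest ViT.
    """

    def forward(self, x, ratio: float = 1.0):
        out_dim = max(1, int(self.out_features * ratio))
        in_dim = x.shape[-1]  # the previous sliced layer already shrank the input
        weight = self.weight[:out_dim, :in_dim]           # leading sub-matrix
        bias = self.bias[:out_dim] if self.bias is not None else None
        return F.linear(x, weight, bias)

# Usage: the same layer serves the full model and a half-width subnet.
layer = SlicedLinear(768, 3072)
full = layer(torch.randn(2, 197, 768))              # ratio 1.0 -> 3072-dim output
half = layer(torch.randn(2, 197, 384), ratio=0.5)   # sliced   -> 1536-dim output
```

Because every subnet shares the same underlying tensors, only one set of weights is stored, which is what makes one-shot training of all variants possible.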
-----
💡 Key Insights:
→ Smaller ViTs are intrinsically sub-networks of larger ones with different widths
→ ViTs display minimal interpolation ability compared to CNNs
→ Constantly activating the smallest subnet negatively impacts the other subnets' performance
→ Progressive knowledge transfer improves the optimization of smaller networks (see the training-step sketch after this list)
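To make the progressive-transfer idea concrete, here is a hedged training-step sketch. It assumes `model(x, ratio=r)` runs the sub-network sliced to width ratio r, that the full-width network is supervised by the ground-truth labels, and that each smaller subnet distills from the logits of the next-larger one. The function `progressive_kd_step` and the ratio schedule are hypothetical, based on the description above rather than the paper's released code:

```python
import torch.nn.functional as F

def progressive_kd_step(model, images, labels, optimizer,
                        ratios=(1.0, 0.75, 0.5, 0.25)):
    """One training step sketching Progressive Knowledge Transfer."""
    optimizer.zero_grad()

    # Full-width network: standard cross-entropy against the ground truth.
    teacher_logits = model(images, ratio=ratios[0])
    loss = F.cross_entropy(teacher_logits, labels)

    # Smaller subnets: each learns from the subnet one step wider than itself,
    # which gives it a simpler, steadier target than the raw labels.
    for r in ratios[1:]:
        student_logits = model(images, ratio=r)
        loss = loss + F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits.detach(), dim=-1),
            reduction="batchmean",
        )
        teacher_logits = student_logits  # the next smaller subnet distills from this one

    loss.backward()
    optimizer.step()
    return loss.item()
```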
-----
📊 Results:
→ Achieves 1.6% average improvement on ImageNet-1K with fewer parameters
→ Matches Separate Training performance while reducing storage costs
→ Successfully transfers to downstream tasks like semantic segmentation
→ Outperforms prior methods at small computational budgets