
"Slicing Vision Transformer for Flexible Inference"

The podcast for this paper was generated with Google's Illuminate.

This paper introduces Scala, a framework that lets a Vision Transformer (ViT) dynamically scale down its size at inference time while maintaining performance.

It addresses the challenge of deploying ViT models in resource-constrained environments by enabling flexible inference through network slicing.

-----

https://arxiv.org/abs/2412.04786

🎯 Original Problem:

While ViTs excel in scalability, they lack the flexibility to adapt to dynamic resource constraints in real-world scenarios. Existing solutions either train multiple separate models or use inefficient slicing methods that significantly degrade performance.

-----

🛠️ Solution in this Paper:

→ Scala enables a single ViT to represent multiple smaller variants through uniform slicing of the weight matrices at each layer (see the code sketch after this list).

→ It introduces Isolated Activation to disentangle the smallest sub-network's representation from the other subnets while preserving performance.

→ Scale Coordination ensures each subnet receives simplified, steady, and accurate learning objectives through Progressive Knowledge Transfer.

→ The framework requires only one-shot training without modifying the original ViT structure.
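
Width slicing is easier to see in code. Below is a minimal PyTorch sketch of the idea, not the paper's implementation: the `SlicedLinear` class, the head/tail indexing, and the `isolate` flag are illustrative assumptions about how one shared weight matrix could serve several widths, with the smallest subnet sliced from the opposite end in the spirit of Isolated Activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedLinear(nn.Module):
    """Linear layer whose weight can be sliced to a fraction of its width.

    Larger subnets reuse the leading rows/columns of the shared weight
    (uniform slicing). The smallest subnet can instead take its slice
    from the trailing end, keeping its representation disentangled from
    the others (a rough stand-in for Isolated Activation).
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)

    def forward(self, x, ratio: float = 1.0, isolate: bool = False):
        out_w = max(1, int(self.full.out_features * ratio))
        in_w = x.shape[-1]  # input already sliced by the previous layer
        if isolate:
            # smallest subnet: slice from the tail instead of the head
            w = self.full.weight[-out_w:, -in_w:]
            b = self.full.bias[-out_w:]
        else:
            w = self.full.weight[:out_w, :in_w]
            b = self.full.bias[:out_w]
        return F.linear(x, w, b)

layer = SlicedLinear(768, 768)
x = torch.randn(1, 197, 768)                              # ViT-B token sequence
y_full  = layer(x, ratio=1.0)                             # full-width network
y_half  = layer(x[..., :384], ratio=0.5)                  # 50% subnet, leading slice
y_small = layer(x[..., -192:], ratio=0.25, isolate=True)  # smallest subnet, tail slice
```

Because every subnet is a view into the same weight tensor, storing the full model is enough to serve all widths, which is where the storage savings reported below come from.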

-----

💡 Key Insights:

→ Smaller ViTs are intrinsically sub-networks of larger ones with different widths

→ ViTs show much weaker interpolation ability than CNNs: subnets at widths not activated during training perform poorly

→ Constantly activating the smallest subnet during training degrades the other subnets' performance

→ Progressive Knowledge Transfer improves the optimization of smaller networks (see the training-loop sketch below)
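
The sketch below gives one hedged reading of that training recipe: the full-width network learns from the labels, and each narrower subnet distills from the next-larger one, so every scale receives a simplified, steady target. The `scala_training_step` name, the sampled ratios, the equal loss weighting, and the `model(x, ratio=r)` interface are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def scala_training_step(model, images, labels, ratios=(1.0, 0.75, 0.5, 0.25)):
    """One training step covering all scales at once.

    The full network learns from the hard labels; each smaller subnet
    matches the soft predictions of the next-larger one, passing the
    teacher signal one scale down at a time (Progressive Knowledge
    Transfer, roughly). `model(x, ratio=r)` is assumed to run the
    r-width subnet, as in the SlicedLinear sketch above.
    """
    loss = 0.0
    teacher_logits = None
    for r in sorted(ratios, reverse=True):  # largest subnet first
        logits = model(images, ratio=r)
        if teacher_logits is None:
            loss = loss + F.cross_entropy(logits, labels)  # full net: hard labels
        else:
            # smaller subnet distills from the next-larger subnet
            loss = loss + F.kl_div(
                F.log_softmax(logits, dim=-1),
                F.softmax(teacher_logits, dim=-1),
                reduction="batchmean",
            )
        teacher_logits = logits.detach()  # teacher signal flows one scale down
    return loss
```

Since all subnets share one set of weights, a single backward pass through this summed loss updates every scale together, which is what lets one training run stand in for training each scale separately.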

-----

📊 Results:

→ Achieves 1.6% average improvement on ImageNet-1K with fewer parameters

→ Matches the performance of the Separate Training baseline while reducing storage costs

→ Successfully transfers to downstream tasks like semantic segmentation

→ Outperforms prior methods at small computational budgets
