UniForm makes Vision Transformers edge-friendly by cleverly reusing attention computations across heads.
UniForm introduces a Reuse Attention mechanism that shares a single attention matrix across all heads in a layer, sharply reducing the memory and compute demands of Vision Transformers and making them efficient on edge devices while maintaining high accuracy.
-----
https://arxiv.org/abs/2412.02344
🔍 Original Problem:
→ Vision Transformers excel at computer vision tasks, but their high memory and computational demands make them impractical for edge devices with limited resources.
→ Traditional multi-head attention computes a separate attention matrix for every head, redundant work that inflates memory overhead and slows inference on resource-constrained hardware.
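To make the redundancy concrete, here is a tiny PyTorch illustration (shapes are assumed for the example, not taken from the paper): standard multi-head attention materializes one N×N attention map per head, so the score tensor, and the memory traffic it causes, grows linearly with the head count.

```python
# Illustrative only: standard multi-head attention builds one N x N map per head.
import torch

B, H, N, d = 1, 8, 196, 64              # batch, heads, tokens, head dim (assumed values)
q = torch.randn(B, H, N, d)
k = torch.randn(B, H, N, d)
attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
print(attn.shape)                       # torch.Size([1, 8, 196, 196]) -- one N x N map per head
```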
-----
🛠️ Solution in this Paper:
→ UniForm consolidates attention into a single attention matrix per layer that is shared by all heads (see the sketch after this list).
→ Multi-scale value processing: each head keeps its own value projection, refined by a depthwise convolution with a head-specific kernel size.
→ Memory efficiency is achieved by reusing the unified attention matrix for all heads, eliminating redundant computations.
→ The model follows a progressive design with three stages, incrementally increasing channel dimensions, depth, and attention heads.
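Below is a minimal PyTorch sketch of the Reuse Attention idea as summarized above. It is an illustrative reconstruction, not the paper's code: the class name `ReuseAttention`, the dimensions, and the kernel sizes (3/5/7/9) are assumptions; only the structure (one shared Q/K attention matrix, per-head values refined by head-specific depthwise convolutions) follows the description.

```python
import torch
import torch.nn as nn


class ReuseAttention(nn.Module):
    """Sketch of Reuse Attention: one shared attention matrix, multi-scale values."""

    def __init__(self, dim, num_heads=4, head_dim=16, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert len(kernel_sizes) == num_heads
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.scale = head_dim ** -0.5

        # Single Q/K projection shared by every head (instead of per-head Q/K).
        self.q = nn.Linear(dim, head_dim)
        self.k = nn.Linear(dim, head_dim)
        # Per-head value projections, kept separate because values carry the
        # crucial information.
        self.v = nn.Linear(dim, num_heads * head_dim)
        # Head-specific depthwise convolutions for multi-scale value processing.
        self.dw_convs = nn.ModuleList([
            nn.Conv1d(head_dim, head_dim, k, padding=k // 2, groups=head_dim)
            for k in kernel_sizes
        ])
        self.proj = nn.Linear(num_heads * head_dim, dim)

    def forward(self, x):                      # x: (B, N, dim)
        B, N, _ = x.shape
        # One attention matrix for the whole layer: (B, N, N).
        attn = (self.q(x) @ self.k(x).transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        # Split values into heads and refine each at its own scale.
        v = self.v(x).reshape(B, N, self.num_heads, self.head_dim)
        outs = []
        for h, conv in enumerate(self.dw_convs):
            v_h = conv(v[:, :, h].transpose(1, 2)).transpose(1, 2)  # (B, N, head_dim)
            outs.append(attn @ v_h)            # reuse the same attention matrix
        return self.proj(torch.cat(outs, dim=-1))
```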
-----
💡 Key Insights:
→ Query and Key components show high redundancy across attention heads
→ Value projections encode more crucial information than Query/Key projections
→ Multi-scale processing enhances feature diversity without increasing memory demands
→ Memory bandwidth is the primary bottleneck for edge deployment
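A rough back-of-envelope count shows why reusing the attention matrix eases the memory-bandwidth bottleneck. The numbers below are assumed (8 heads, 196 tokens) and only count attention-map elements, so they will not reproduce the paper's measured 93.94% figure, which reflects its own configurations and measurement.

```python
# Toy count of attention-map elements moved per layer (assumed shapes):
# standard MHA materializes H maps of size N x N; Reuse Attention shares one.
H, N = 8, 196                      # heads, tokens -- illustrative values only
standard = H * N * N
reuse = N * N
print(f"standard MHA: {standard} elements, Reuse Attention: {reuse} elements")
print(f"reduction: {(1 - reuse / standard) * 100:.1f}%")   # 87.5% for H = 8
```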
-----
📊 Results:
→ UniForm-l achieves 76.7% Top-1 accuracy on ImageNet-1K with 21.8 ms inference latency on a Jetson AGX Orin
→ Demonstrates 5x speedup over competing methods on edge devices
→ Reduces memory movement by up to 93.94% compared to standard multi-head attention