Why use heavy transformers when convolutions can do the job better?
ConvMixFormer replaces the transformer's self-attention with convolutions to make dynamic gesture recognition faster and lighter.
https://arxiv.org/abs/2411.07118
Original Problem 🎯:
Transformers rely on self-attention, whose cost scales quadratically with sequence length. For the long frame sequences in dynamic hand gesture recognition, this makes them resource-intensive and slow.
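To see the scaling gap concretely, here is a back-of-the-envelope multiply-accumulate (MAC) comparison between self-attention and a depthwise convolutional token mixer. The token count, embedding dimension, and kernel size are illustrative numbers, not figures from the paper:

```python
# Illustrative cost comparison (hypothetical sizes, not from the paper):
# self-attention over N tokens of dimension d costs O(N^2 * d) MACs for the
# QK^T and attention-weighted V products, while a depthwise conv token mixer
# with kernel size k costs O(N * k * d) -- linear in N.

def attention_macs(n_tokens: int, dim: int) -> int:
    # QK^T (n*n*d) + softmax(...)V (n*n*d); projection layers omitted
    return 2 * n_tokens * n_tokens * dim

def conv_mixer_macs(n_tokens: int, dim: int, kernel: int) -> int:
    # depthwise conv along the token axis: each of n*d outputs needs k MACs
    return n_tokens * dim * kernel

N, D, K = 197, 512, 3  # hypothetical token count, embed dim, kernel size
print(f"attention:  {attention_macs(N, D):,} MACs")     # ~39.7M
print(f"conv mixer: {conv_mixer_macs(N, D, K):,} MACs")  # ~0.3M
```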
-----
Solution in this Paper 🛠️:
→ Introduces ConvMixFormer, which replaces self-attention with a simple convolutional token mixer (a minimal sketch of one block follows this list)
→ Uses standard convolutions with batch normalization to mix spatial tokens
→ Adds a Gated Depthwise Feed-Forward Network (GDFN) to control information flow between stages
→ Uses a ResNet-18 backbone for feature extraction
→ Combines different input modalities with a multi-modal late-fusion approach
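A minimal PyTorch sketch of one such stage, assuming tokens are kept as a 2D feature map of shape (B, C, H, W). The kernel sizes, expansion ratio, and exact GDFN wiring here are illustrative guesses, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTokenMixer(nn.Module):
    """Replaces self-attention: a plain conv + batch norm mixes spatial tokens."""
    def __init__(self, dim: int, kernel: int = 3):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(dim, dim, kernel, padding=kernel // 2),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.mix(x)  # residual connection around the mixer

class GDFN(nn.Module):
    """Gated Depthwise Feed-Forward Network: one branch gates the other."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.proj_in = nn.Conv2d(dim, hidden * 2, 1)           # pointwise expand
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, 3,
                                padding=1, groups=hidden * 2)  # depthwise mix
        self.proj_out = nn.Conv2d(hidden, dim, 1)              # pointwise project

    def forward(self, x):
        a, b = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        return x + self.proj_out(F.gelu(a) * b)  # GELU branch gates the other

class ConvMixFormerBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mixer = ConvTokenMixer(dim)
        self.ffn = GDFN(dim)

    def forward(self, x):
        return self.ffn(self.mixer(x))

block = ConvMixFormerBlock(dim=64)
print(block(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```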
-----
Key Insights from this Paper 💡:
→ Convolution-based token mixing captures local spatial features more efficiently than self-attention
→ The gating mechanism improves feature selection and suppresses noise
→ Late fusion of multiple modalities (RGB, depth, IR) improves recognition accuracy (see the fusion sketch after this list)
→ The model matches transformer accuracy with nearly half the parameters
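A minimal sketch of score-level late fusion, assuming one network branch per modality, each emitting per-class logits. Averaging softmax scores is one common late-fusion choice and may differ from the paper's exact scheme:

```python
import torch

def late_fusion(logits_per_modality: list[torch.Tensor]) -> torch.Tensor:
    # Convert each modality's logits to probabilities, then average them.
    probs = [l.softmax(dim=-1) for l in logits_per_modality]
    return torch.stack(probs).mean(dim=0)

# Hypothetical logits from three modality branches (NVGesture has 25 classes).
rgb, depth, ir = (torch.randn(1, 25) for _ in range(3))
fused = late_fusion([rgb, depth, ir])
print(fused.argmax(dim=-1))  # predicted gesture class
```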
-----
Results 📊:
→ NVGesture dataset: 80.83% accuracy with single modality (depth), 85.49% with multi-modal inputs
→ Briareo dataset: 98.26% accuracy with single modality (color), 98.64% with multi-modal inputs
→ Parameters reduced from 24.30M (standard transformer) to 13.57M
→ MACs reduced from 62.92G to 59.98G