Transformer attention reimagined: Element-wise computation beats dot products at their own game
Element-wise attention replaces traditional dot-product attention with a squared Euclidean distance computation, making transformer models faster and more memory-efficient.
-----
https://arxiv.org/abs/2501.05730
Original Problem 🤔:
Self-attention in transformers scales quadratically with sequence length, making both training and inference expensive for long sequences. Existing alternatives such as linear attention and RNNs trade this efficiency for lower accuracy.
-----
Solution in this Paper 💡:
→ The paper introduces element-wise attention that computes similarity using squared Euclidean distance instead of dot products
→ It approximates the exponential term using Taylor polynomials to achieve linear complexity
→ The mechanism can be reformulated as an RNN during inference for constant memory usage
→ Training complexity reduces to O(tLD), where t is the Taylor polynomial order, L the sequence length, and D the feature dimension
→ Inference complexity becomes O(tD), independent of sequence length (a sketch of this factorization follows this list)
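To make the factorization concrete, here is a minimal NumPy sketch of the general idea (my own illustration, not the paper's code): the per-dimension weight exp(-(q-k)²) is split into exp(-q²)·exp(-k²)·exp(2qk), and exp(2qk) is Taylor-expanded so all key-side sums can be computed once, turning O(L²D) into O(tLD). Function names and the non-causal normalization here are assumptions for illustration.

```python
# Hypothetical sketch of element-wise (per-dimension) distance attention with a
# Taylor-approximated exponential. Not the paper's exact formulation.
import numpy as np
from math import factorial

def elementwise_attention_naive(Q, K, V):
    """O(L^2 * D) reference: weight[i, j, d] = exp(-(Q[i, d] - K[j, d])^2)."""
    _, D = Q.shape
    out = np.empty_like(V)
    for d in range(D):
        w = np.exp(-(Q[:, d, None] - K[None, :, d]) ** 2)   # (L, L) weights for dim d
        out[:, d] = (w @ V[:, d]) / w.sum(axis=1)
    return out

def elementwise_attention_taylor(Q, K, V, order=6):
    """O(t * L * D): exp(-(q-k)^2) = exp(-q^2) * exp(-k^2) * exp(2qk),
    with exp(2qk) ~= sum_t (2qk)^t / t!, so key-side sums are computed once."""
    L, D = Q.shape
    eq = np.exp(-Q ** 2)                                     # (L, D) query-side factor
    ek = np.exp(-K ** 2)                                     # (L, D) key-side factor
    num = np.zeros_like(V)
    den = np.zeros((L, D))
    for t in range(order + 1):
        coef = (2.0 ** t) / factorial(t)
        q_t = eq * Q ** t                                    # per-query term of order t
        S_t = (ek * K ** t * V).sum(axis=0)                  # (D,) key-value sum
        Z_t = (ek * K ** t).sum(axis=0)                      # (D,) normalizer sum
        num += coef * q_t * S_t
        den += coef * q_t * Z_t
    return num / den

# quick check: both paths should roughly agree for small-magnitude inputs
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 64, 8)) * 0.3
print(np.abs(elementwise_attention_naive(Q, K, V) -
             elementwise_attention_taylor(Q, K, V, order=6)).max())
```

A nice side effect of the even truncation order in this sketch: the degree-6 Taylor polynomial of the exponential is strictly positive for all real inputs, so the normalizer cannot vanish or change sign.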
-----
Key Insights 🔍:
→ Element-wise operations preserve the "spikiness" property that linear attention loses
→ Higher-order Taylor approximations improve performance while maintaining efficiency
→ Memory usage scales linearly with sequence length, unlike the quadratic scaling of self-attention
→ Constant-size caches during inference enable efficient long-sequence processing (see the recurrent sketch below)
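The constant-cache claim can be illustrated with a recurrent version of the same sketch (again my own assumption-laden illustration, not the paper's exact formulation): each Taylor order keeps one running key statistic per feature dimension, so the decode-time state is O(tD) no matter how long the stream gets. The class name and update rule are hypothetical.

```python
# Hypothetical causal / recurrent view: the per-order key statistics become running
# sums, so the decode-time cache is O(t * D) regardless of how many tokens came before.
import numpy as np
from math import factorial

class ElementwiseAttentionCache:
    def __init__(self, dim, order=6):
        self.order = order
        self.S = np.zeros((order + 1, dim))   # running sums of exp(-k^2) * k^t * v
        self.Z = np.zeros((order + 1, dim))   # running sums of exp(-k^2) * k^t

    def step(self, q, k, v):
        """Consume one token's (q, k, v), update the constant-size cache,
        and return that token's attention output over all tokens so far."""
        ek = np.exp(-k ** 2)
        eq = np.exp(-q ** 2)
        num = np.zeros_like(q)
        den = np.zeros_like(q)
        for t in range(self.order + 1):
            self.S[t] += ek * k ** t * v      # fold the new key/value into the cache
            self.Z[t] += ek * k ** t
            coef = (2.0 ** t) / factorial(t)
            num += coef * eq * q ** t * self.S[t]
            den += coef * eq * q ** t * self.Z[t]
        return num / den

# usage: process a stream token by token with constant memory
rng = np.random.default_rng(1)
cache = ElementwiseAttentionCache(dim=8, order=6)
for q, k, v in zip(*(rng.normal(size=(3, 16, 8)) * 0.3)):
    y = cache.step(q, k, v)                   # output for the current token
```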
-----
Results 📊:
→ EA-6 outperforms standard self-attention on multiple time series datasets
→ Memory usage remains constant with sequence length during inference
→ Maintains high throughput at longer sequence lengths, where self-attention's throughput degrades
→ Achieves comparable or better accuracy on both causal and non-causal tasks
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/