Apple just gave us something more exciting than the iPhone 16
Replace the traditional softmax in attention with a sigmoid plus a constant (not learned) scalar bias based on the sequence length.
That alone gives a 17% inference kernel speed-up over FlashAttention-2 on H100 GPUs.
📚 https://arxiv.org/pdf/2409.04431
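In code, the core change is tiny. Here is a minimal PyTorch sketch of the idea (my own illustration, not the paper's optimized FLASHSIGMOID kernel; function names and tensor shapes are assumptions):

```python
import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v        # row-wise normalization

def sigmoid_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    n = q.shape[-2]                              # sequence length
    b = -math.log(n)                             # constant, non-learned bias
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.sigmoid(scores + b) @ v         # element-wise, no row-wise normalization

# Quick shape check
q = k = v = torch.randn(1, 8, 128, 64)
assert sigmoid_attention(q, k, v).shape == softmax_attention(q, k, v).shape
```

The only differences from standard attention are the element-wise sigmoid in place of the row-wise softmax and the constant bias b = -log(n).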
Original Problem 🔍:
Softmax attention in transformers has limitations: its normalization can over-concentrate attention on a few features, and the row-wise reductions it requires add cost and complexity to efficient attention kernels.
-----
Key Insights from this Paper 💡:
• SigmoidAttn is a universal function approximator for seq-to-seq tasks
• SigmoidAttn has improved regularity compared to SoftmaxAttn
• Stabilizing large initial attention norms is crucial for successful training
• FLASHSIGMOID implementation offers significant inference speed-up
-----
Solution in this Paper 🛠️:
• Replace softmax with sigmoid activation in attention mechanism
• Introduce a bias term b = -log(n), where n is the sequence length, to mitigate large initial attention norms
• Implement FLASHSIGMOID, a hardware-aware, memory-efficient version
• Apply LayerScale and QK norm for improved stability (see the sketch after this list)
• Use ALiBi or RoPE positional embeddings for language modeling tasks
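As a rough sketch of how those stability pieces might fit together, here is an illustrative PyTorch block combining QK norm, sigmoid attention with the -log(n) bias, and LayerScale. The class name, the LayerNorm-based QK norm, the LayerScale init value, and the residual placement are my assumptions, not the paper's exact implementation:

```python
import math
import torch
import torch.nn as nn

class SigmoidAttentionBlock(nn.Module):
    """Illustrative block: QK norm + sigmoid attention + LayerScale.
    Names, init values, and structure are assumptions for illustration."""
    def __init__(self, dim, num_heads, layerscale_init=1e-4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # QK norm: normalize queries and keys per head before the dot product
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        # LayerScale: learnable per-channel scale on the residual branch, small init
        self.gamma = nn.Parameter(layerscale_init * torch.ones(dim))

    def forward(self, x):                        # x: (batch, seq_len, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        attn = torch.sigmoid(scores - math.log(N))   # sigmoid with b = -log(n)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return x + self.gamma * self.proj(out)   # residual with LayerScale

# Usage: SigmoidAttentionBlock(dim=512, num_heads=8)(torch.randn(2, 128, 512))
```

For causal language modeling you would also add an attention mask and ALiBi or RoPE position handling, which are omitted here.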
-----
Results 📊:
• SigmoidAttn matches SoftmaxAttn performance across various tasks:
- Language modeling: Comparable train/validation NLL at 85M and 1B parameter scales
- Vision: Equivalent performance in supervised and self-supervised learning
- ASR: Similar WER on LibriSpeech and TED-LIUM v3 datasets
• Improved stability and generalization to longer sequences in some cases
-----
Are you into AI and LLMs❓ Join me on Twitter with 31.7K others to stay on the bleeding edge every day.
𝕏/🐦 https://x.com/rohanpaul_ai