"Scalable-Softmax Is Superior for Attention"
→ Softmax in Transformer attention layers causes attention fading as context size grows.
→ Attention fading reduces the model's ability to focus on key information in long contexts.
This paper introduces Scalable-Softmax (SSMax) to fix attention fading in Transformer LLMs: standard Softmax flattens the attention distribution as context length increases, which hinders focus on key information, and SSMax counteracts this flattening.
→ SSMax can be easily integrated into existing Transformer architectures with minimal code changes.
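To make the fading concrete, here is a small illustrative sketch (not the authors' code): it applies standard Softmax and an SSMax-style variant, which multiplies the logits by s·log(n) before normalizing, to one distinguished logit among n zeros. The value s = 0.43 and the toy logits are assumptions chosen purely for illustration; with standard Softmax the top weight collapses as n grows, while the SSMax-style variant keeps it close to 1.

```python
import math

import torch
import torch.nn.functional as F

def ssmax(logits: torch.Tensor, s: float = 0.43) -> torch.Tensor:
    # SSMax-style normalization: multiply logits by s * log(n) before Softmax,
    # where n is the number of entries normalized over (the context length).
    # s is a learnable parameter in the paper; 0.43 is an arbitrary placeholder here.
    n = logits.size(-1)
    return F.softmax(s * math.log(n) * logits, dim=-1)

# Toy demonstration of attention fading: one "key" logit of 5.0 among zeros.
for n in (16, 256, 4096, 65536):
    logits = torch.zeros(n)
    logits[0] = 5.0
    p_soft = F.softmax(logits, dim=-1)[0].item()
    p_ssmax = ssmax(logits)[0].item()
    print(f"n={n:6d}  Softmax top weight={p_soft:.3f}  SSMax top weight={p_ssmax:.3f}")
```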
-----
https://arxiv.org/abs/2501.19399
📌 SSMax directly tackles Softmax's inherent flaw in long contexts. By scaling attention logits in proportion to log n, where n is the input size, it maintains attention sharpness and prevents information loss in lengthy sequences. This is a crucial architectural improvement.
📌 The elegance of SSMax lies in its minimal integration cost. A simple scalar multiplication in the attention calculation (sketched after these points) yields significant gains in long-context tasks. This plug-and-play nature is highly practical for immediate adoption.
📌 Empirically, SSMax models show tangible benefits: lower loss, better long-context generalization, and superior key information retrieval. These results strongly validate SSMax as a robust and effective softmax replacement for current LLMs.
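As a rough sketch of the "simple scalar multiplication" point above (assuming a vanilla scaled-dot-product attention implementation, not the paper's code), the only change relative to standard attention is the extra s·log(n) factor applied to the scores before Softmax; the per-head shape and placement of the learnable parameter s are assumptions here.

```python
import math

import torch
import torch.nn.functional as F

def attention_with_ssmax(q, k, v, s):
    # q, k, v: (batch, heads, seq_len, head_dim); s: (heads,) learnable scalars.
    # Identical to vanilla scaled-dot-product attention except for the one
    # extra line that multiplies the scores by s * log(n) before Softmax.
    n = k.size(-2)  # context (key) length
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = s.view(1, -1, 1, 1) * math.log(n) * scores  # <-- SSMax scaling
    return F.softmax(scores, dim=-1) @ v
```

Retrofitting an existing architecture then amounts to threading this scaling factor into each attention layer's score computation.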
----------
Key Insights 💡:
→ Experiments showed that the learned parameters in a modified Softmax formulation follow a logarithmic relationship with the input vector size.
→ This observation motivated the design of SSMax with its 'log n' scaling component (written out after these insights).
→ SSMax enables models to maintain focus on key tokens even in long contexts, unlike standard Softmax.
→ Models using SSMax from the start achieve better length generalization, but replacing Softmax with SSMax later also provides improvement.
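Writing out the 'log n' component described above as a formula (a reconstruction from the description here, not a verbatim quote of the paper, which also discusses an optional bias term):

```latex
% SSMax: standard Softmax applied to logits scaled by s * log(n),
% equivalently Softmax with the exponential base e replaced by n^s.
\mathrm{SSMax}(z)_i
  = \frac{\exp\bigl(s (\log n)\, z_i\bigr)}{\sum_{j=1}^{n} \exp\bigl(s (\log n)\, z_j\bigr)}
  = \frac{n^{\,s z_i}}{\sum_{j=1}^{n} n^{\,s z_j}},
\qquad s \text{ learnable, } n \text{ the context length.}
```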
-----
Results 📊:
→ SSMax models achieved approximately 0.008 lower training loss compared to standard Softmax models.
→ SSMax models maintained lower test loss at context sizes up to 20,000 tokens, while the standard Softmax model's loss increased sharply.
→ In the Needle-In-A-Haystack test, the SSMax model retrieved key information accurately even at 10x the training context length, where the standard Softmax model failed.
→ Attention score analysis showed that SSMax models allocate significantly higher attention scores to key tokens than Softmax models do.