"Scalable-Softmax Is Superior for Attention"
→ Softmax in Transformer attention layers causes attention fading as context size grows.
→ Attention fading reduces the model's ability to focus on key information in long contexts.
This paper introduces Scalable-Softmax (SSMax) to fix attention fading in Transformer LLMs: standard Softmax flattens the attention distribution as context length increases, which hinders focus on key information, and SSMax counteracts this flattening.
→ SSMax can be easily integrated into existing Transformer architectures with minimal code changes.
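To make the fading concrete, here is a small illustrative sketch (not the authors' code): it applies standard Softmax and an SSMax-style variant, which multiplies the logits by s·log(n) before normalizing, to one distinguished logit among n zeros. The value s = 0.43 and the toy logits are assumptions chosen purely for illustration; with standard Softmax the top weight collapses as n grows, while the SSMax-style variant keeps it close to 1.

```python
import math

import torch
import torch.nn.functional as F

def ssmax(logits: torch.Tensor, s: float = 0.43) -> torch.Tensor:
    # SSMax-style normalization: multiply logits by s * log(n) before Softmax,
    # where n is the number of entries normalized over (the context length).
    # s is a learnable parameter in the paper; 0.43 is an arbitrary placeholder here.
    n = logits.size(-1)
    return F.softmax(s * math.log(n) * logits, dim=-1)

# Toy demonstration of attention fading: one "key" logit of 5.0 among zeros.
for n in (16, 256, 4096, 65536):
    logits = torch.zeros(n)
    logits[0] = 5.0
    p_soft = F.softmax(logits, dim=-1)[0].item()
    p_ssmax = ssmax(logits)[0].item()
    print(f"n={n:6d}  Softmax top weight={p_soft:.3f}  SSMax top weight={p_ssmax:.3f}")
```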
-----
https://arxiv.org/abs/2501.19399
📌 SSMax directly tackles Softmax's inherent flaw in long contexts. By scaling attention logits in proportion to log n, where n is the input size, it maintains attention sharpness and prevents information loss in lengthy sequences. This is a crucial architectural improvement.
📌 The elegance of SSMax lies in its minimal integration cost. A simple scalar multiplication in the attention calculation (sketched after these points) yields significant gains in long-context tasks. This plug-and-play nature is highly practical for immediate adoption.
📌 Empirically, SSMax models show tangible benefits: lower loss, better long-context generalization, and superior key information retrieval. These results strongly validate SSMax as a robust and effective softmax replacement for current LLMs.
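As a rough sketch of the "simple scalar multiplication" point above (assuming a vanilla scaled-dot-product attention implementation, not the paper's code), the only change relative to standard attention is the extra s·log(n) factor applied to the scores before Softmax; the per-head shape and placement of the learnable parameter s are assumptions here.

```python
import math

import torch
import torch.nn.functional as F

def attention_with_ssmax(q, k, v, s):
    # q, k, v: (batch, heads, seq_len, head_dim); s: (heads,) learnable scalars.
    # Identical to vanilla scaled-dot-product attention except for the one
    # extra line that multiplies the scores by s * log(n) before Softmax.
    n = k.size(-2)  # context (key) length
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = s.view(1, -1, 1, 1) * math.log(n) * scores  # <-- SSMax scaling
    return F.softmax(scores, dim=-1) @ v
```

Retrofitting an existing architecture then amounts to threading this scaling factor into each attention layer's score computation.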
----------
Key Insights 💡:
→ Experiments showed that the learned parameters in a modified Softmax formulation follow a logarithmic relationship with the input vector size.
→ This observation motivated the design of SSMax with its 'log n' scaling component (written out after these insights).
→ SSMax enables models to maintain focus on key tokens even in long contexts, unlike standard Softmax.
→ Models using SSMax from the start achieve better length generalization, but replacing Softmax with SSMax later also provides improvement.
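Writing out the 'log n' component described above as a formula (a reconstruction from the description here, not a verbatim quote of the paper, which also discusses an optional bias term):

```latex
% SSMax: standard Softmax applied to logits scaled by s * log(n),
% equivalently Softmax with the exponential base e replaced by n^s.
\mathrm{SSMax}(z)_i
  = \frac{\exp\bigl(s (\log n)\, z_i\bigr)}{\sum_{j=1}^{n} \exp\bigl(s (\log n)\, z_j\bigr)}
  = \frac{n^{\,s z_i}}{\sum_{j=1}^{n} n^{\,s z_j}},
\qquad s \text{ learnable, } n \text{ the context length.}
```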
-----
Results 📊:
→ SSMax models achieved approximately 0.008 lower training loss compared to standard Softmax models.
→ SSMax models maintained lower test loss at context sizes up to 20,000 tokens, while the standard Softmax model's loss increased sharply.
→ In the Needle-In-A-Haystack test, the SSMax model retrieved key information accurately even at 10x the training context length, where the standard Softmax model failed.
→ Attention score analysis showed that SSMax models allocate significantly higher attention scores to key tokens than Softmax models do.