This new attention mechanism addresses Softmax's limitations and improves LLM scalability.
The paper introduces Length Scaled Softplus Attention (LSSA), plus a re-weighted variant (LSSAR), to improve LLM performance and length extrapolation.
-----
Paper - https://arxiv.org/abs/2501.13428
Original Problem 😞:
→ Standard Softmax attention in LLMs suffers from numerical instability and degraded performance as the inference token length grows beyond the training length.
-----
Solution in this Paper 😎:
→ The paper decomposes Softmax into a non-linear transformation (the exponential) followed by an l1-normalization, and identifies the l1-norm as the component crucial for performance.
→ It proposes replacing the exponential function with the Softplus activation function. This creates Length Scaled Softplus Attention (LSSA).
→ LSSA also incorporates a dynamic length scale factor, log N / log d (i.e., log_d N). This addresses limitations of traditional attention methods with long sequences.
→ To further refine LSSA, a re-weighting mechanism is introduced (LSSAR). It amplifies strong attention weights and suppresses weak ones, sharpening the focus on relevant tokens (a rough sketch of the full mechanism follows this list).
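The post doesn't spell out the exact formulas, so below is a minimal PyTorch sketch of how the pieces could fit together. Assumptions (not taken from the paper's code): the length scale log_d N multiplies the attention scores, the Softplus output is l1-normalized over the key axis, and re-weighting raises the normalized weights to a power p before renormalizing. Names like lssa_r, p, and eps are illustrative, not from the paper.

```python
import math
import torch
import torch.nn.functional as F

def lssa_r(q, k, v, p=15.0, eps=1e-6):
    """Rough sketch of Length Scaled Softplus Attention with re-weighting (LSSAR).

    q, k, v: tensors of shape (batch, heads, N, d).
    """
    B, H, N, d = q.shape

    # Dynamic length scaling: log_d(N) = log(N) / log(d), grows with sequence length N.
    length_scale = math.log(N) / math.log(d)
    scores = (q @ k.transpose(-2, -1)) * (length_scale / math.sqrt(d))

    # Softplus instead of exp: monotone like the exponential, but no overflow for large scores.
    weights = F.softplus(scores)

    # Causal mask: zero out future positions before normalizing.
    mask = torch.tril(torch.ones(N, N, dtype=torch.bool, device=q.device))
    weights = weights.masked_fill(~mask, 0.0)

    # l1-normalization over keys -- the component the paper identifies as crucial.
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)

    # Re-weighting: amplify strong weights, suppress weak ones, then renormalize.
    weights = weights.pow(p)
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)

    return weights @ v

# Usage: q = k = v = torch.randn(1, 8, 128, 64); out = lssa_r(q, k, v)
```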
-----
Key Insights from this Paper 🤔:
→ Non-negative attention scores are not essential for LLM performance, but normalization is.
→ Softplus activation, combined with dynamic length scaling and re-weighting, enhances performance and length extrapolation capabilities.
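To make the decomposition behind these insights concrete, here is a toy snippet: Softmax is exactly a pointwise exponential followed by l1-normalization, and the same normalization can wrap a different non-linearity such as Softplus. This is only an illustration of the idea, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Softmax decomposed: a pointwise non-linearity (exp) followed by l1-normalization.
x = torch.tensor([1.0, 2.0, 3.0])
softmax_direct = F.softmax(x, dim=-1)
softmax_decomposed = torch.exp(x) / torch.exp(x).sum()
assert torch.allclose(softmax_direct, softmax_decomposed)

# Swap the non-linearity: Softplus scores, still l1-normalized over the row.
softplus_scores = F.softplus(x)
softplus_attn = softplus_scores / softplus_scores.sum()
print(softplus_attn)  # sums to 1 like softmax, without the exponential blow-up
```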
-----
Results 📈:
→ LSSA outperforms standard Softmax attention both at the training sequence length and at longer sequences.
→ LSSAR maintains nearly constant validation loss even at 16x the training token length, demonstrating improved length extrapolation and stability.
→ At an inference length of 1024, LSSA achieves a validation loss of 3.1905 compared to Softmax's 3.1911. At 16x that length (16384), LSSAR with p=15 maintains a loss of 3.3171 compared to Softmax's 7.0183.