This new attention mechanism addresses Softmax's limitations and improves LLM scalability.
The paper introduces Length Scaled Softplus Attention (LSSA), plus a re-weighted variant (LSSAR), to improve LLM performance and length extrapolation.
-----
Paper - https://arxiv.org/abs/2501.13428
Original Problem 😞:
→ Standard Softmax attention in LLMs suffers from numerical instability and degraded performance as the inference token length grows beyond the training length.
-----
Solution in this Paper 😎:
→ The paper decomposes Softmax into a non-linear transformation (the exponential) followed by an l1-normalization, and identifies the l1-norm as the component crucial for performance.
→ It proposes replacing the exponential function with the Softplus activation function. This creates Length Scaled Softplus Attention (LSSA).
→ LSSA also incorporates a dynamic length scale factor, log N / log d (i.e., log_d N). This addresses limitations of traditional attention methods with long sequences.
→ To further refine LSSA, a re-weighting mechanism is introduced (LSSAR). It amplifies strong attention weights and suppresses weak ones, sharpening the focus on relevant tokens (a rough sketch of the full mechanism follows this list).
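The post doesn't spell out the exact formulas, so below is a minimal PyTorch sketch of how the pieces could fit together. Assumptions (not taken from the paper's code): the length scale log_d N multiplies the attention scores, the Softplus output is l1-normalized over the key axis, and re-weighting raises the normalized weights to a power p before renormalizing. Names like lssa_r, p, and eps are illustrative, not from the paper.

```python
import math
import torch
import torch.nn.functional as F

def lssa_r(q, k, v, p=15.0, eps=1e-6):
    """Rough sketch of Length Scaled Softplus Attention with re-weighting (LSSAR).

    q, k, v: tensors of shape (batch, heads, N, d).
    """
    B, H, N, d = q.shape

    # Dynamic length scaling: log_d(N) = log(N) / log(d), grows with sequence length N.
    length_scale = math.log(N) / math.log(d)
    scores = (q @ k.transpose(-2, -1)) * (length_scale / math.sqrt(d))

    # Softplus instead of exp: monotone like the exponential, but no overflow for large scores.
    weights = F.softplus(scores)

    # Causal mask: zero out future positions before normalizing.
    mask = torch.tril(torch.ones(N, N, dtype=torch.bool, device=q.device))
    weights = weights.masked_fill(~mask, 0.0)

    # l1-normalization over keys -- the component the paper identifies as crucial.
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)

    # Re-weighting: amplify strong weights, suppress weak ones, then renormalize.
    weights = weights.pow(p)
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)

    return weights @ v

# Usage: q = k = v = torch.randn(1, 8, 128, 64); out = lssa_r(q, k, v)
```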
-----
Key Insights from this Paper 🤔:
→ Non-negative attention scores are not essential for LLM performance, but normalization is.
→ Softplus activation, combined with dynamic length scaling and re-weighting, enhances performance and length extrapolation capabilities.
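To make the decomposition behind these insights concrete, here is a toy snippet: Softmax is exactly a pointwise exponential followed by l1-normalization, and the same normalization can wrap a different non-linearity such as Softplus. This is only an illustration of the idea, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Softmax decomposed: a pointwise non-linearity (exp) followed by l1-normalization.
x = torch.tensor([1.0, 2.0, 3.0])
softmax_direct = F.softmax(x, dim=-1)
softmax_decomposed = torch.exp(x) / torch.exp(x).sum()
assert torch.allclose(softmax_direct, softmax_decomposed)

# Swap the non-linearity: Softplus scores, still l1-normalized over the row.
softplus_scores = F.softplus(x)
softplus_attn = softplus_scores / softplus_scores.sum()
print(softplus_attn)  # sums to 1 like softmax, without the exponential blow-up
```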
-----
Results 📈:
→ LSSA outperforms standard Softmax attention both at the training sequence length and at longer sequences.
→ LSSAR maintains nearly constant validation loss even at 16x the training token length, demonstrating improved length extrapolation and stability.
→ At an inference length of 1024, LSSA achieves a validation loss of 3.1905 compared to Softmax's 3.1911. At 16x that length (16384), LSSAR with p=15 maintains a loss of 3.3171 compared to Softmax's 7.0183.