Selective Attention improves LLM performance across model sizes and context lengths
Matches the performance of standard transformers that have ~2X more heads and parameters in their attention modules
📚 https://arxiv.org/pdf/2410.02703
Original Problem 🔍:
Unneeded elements in the attention context degrade performance, yet standard transformers keep the entire history, leading to inefficiency.
-----
Solution in this Paper 💡:
• Introduces Selective Attention: allows tokens to reduce attention to unneeded elements
• Parameter-free change to standard attention mechanism
• Computes a soft-mask matrix S, applies constraints to it, and accumulates it into a matrix F
• Subtracts F from the attention logits before the softmax
• Reuses an existing attention head as the selection function
• Constrains S to be non-negative, with no masking of <BOS> and no self-masking (see the sketch after this list)
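A minimal PyTorch sketch of the masking step described in the bullets above. It is written under stated assumptions, not as the authors' implementation: head 0 is used as the selection function, the accumulation into F is a cumulative sum over the query axis, and the name selective_attention is illustrative.
```python
import math
import torch

def selective_attention(q, k, v, bos_index=0):
    """Causal attention with a selective-attention masking step.
    q, k, v: (batch, heads, seq, dim)."""
    b, h, n, d = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)              # (b, h, n, n)

    # Standard causal mask.
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))

    # Selection function S: reuse one existing head's logits (head 0 is an assumption).
    S = logits[:, 0].masked_fill(causal, 0.0)                    # (b, n, n), causal part only
    S = torch.relu(S)                                            # constraint: non-negative
    S[..., bos_index] = 0.0                                      # constraint: never mask <BOS>
    S = S * (1.0 - torch.eye(n, device=q.device))                # constraint: no self-masking

    # Accumulate selections over the query axis: F[i, j] = sum over k <= i of S[k, j].
    F = torch.cumsum(S, dim=-2)

    # Subtract the accumulated mask from every head's logits before the softmax.
    return torch.softmax(logits - F.unsqueeze(1), dim=-1) @ v

# Example usage: out = selective_attention(torch.randn(1, 4, 16, 32),
#                                          torch.randn(1, 4, 16, 32),
#                                          torch.randn(1, 4, 16, 32))
```
Because S is taken from logits the model already computes, the change adds no parameters, which is the sense in which the mechanism is parameter-free.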
-----
Key Insights from this Paper 💡:
• Enables context pruning, reducing memory and compute requirements during inference (see the pruning sketch after this list)
• Outperforms local attention patterns and standard transformers
• Sparsity patterns are sometimes stable across training runs, hinting at general properties of language modeling
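A hedged sketch of one way the context pruning mentioned above could look at decoding time: tokens whose accumulated selection mass F has grown large are dropped from the key/value cache. The threshold value and the helper prune_kv_cache are hypothetical illustrations, not the paper's exact pruning rule.
```python
import torch

def prune_kv_cache(k_cache, v_cache, f_current, threshold=8.0):
    """Drop cached keys/values for tokens the model has effectively masked.
    k_cache, v_cache: (batch, heads, seq, dim); f_current: (batch, seq) holds
    the accumulated selection mass F[i, :] at the current decoding step i.
    Assumes batch size 1 so all sequences share one pruning decision."""
    keep = (f_current[0] < threshold).nonzero(as_tuple=True)[0]  # indices to retain
    return k_cache[:, :, keep], v_cache[:, :, keep], keep
```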
-----
Results 📊:
• 100M-parameter models trained on C4 with context sizes of 512, 1024, and 2048 need 16X, 25X, and 47X less memory for their attention module, respectively
• Consistent improvements on the HellaSwag benchmark across model sizes
• Generalizes better than standard transformers on the Variable Assignment task
Selective Attention enables transformers to dynamically adjust context, improving efficiency and performance in language modeling tasks.