SELECTIVE ATTENTION IMPROVES TRANSFORMER

I generated this podcast with Google's Illuminate.

Selective Attention improves LLM performance across model sizes and context lengths

Transformers with Selective Attention perform equivalently to standard transformers that have ~2X more heads and parameters in their attention modules

📚 https://arxiv.org/pdf/2410.02703

Original Problem 🔍:

Unneeded elements in the attention context degrade performance. Standard transformers keep the entire history, leading to inefficiency.

-----

Solution in this Paper 💡:

• Introduces Selective Attention: allows tokens to reduce attention to unneeded elements

• Parameter-free change to standard attention mechanism

• Computes a soft-mask matrix S, constrains it, and accumulates it into F (sketched in code after this list)

• Subtracts F from attention logits before softmax

• Reuses existing attention head for selection function

• Applies constraints: non-negative, no masking of <BOS>, no self-masking
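
A minimal PyTorch sketch of the mechanism described above, assuming a (heads, N, N) logits tensor, <BOS> at position 0, and an arbitrary choice of which head supplies the selection logits. It illustrates the constraints and the subtraction of the accumulated mask F, not the paper's exact implementation.

```python
import torch

def selective_attention(logits, sel_head=0):
    """Sketch: apply Selective Attention to causal attention logits.

    logits: (H, N, N) attention logits, future positions already set to -inf.
    sel_head: existing head reused as the selection function (index is an assumption).
    """
    # Soft-mask matrix S: S[i, j] = how much token i masks token j.
    # ReLU keeps it non-negative and zeroes the -inf causal entries,
    # so tokens cannot mask the future.
    S = torch.relu(logits[sel_head])
    S[:, 0] = 0.0              # never mask the <BOS> token (assumed at position 0)
    S.fill_diagonal_(0.0)      # a token never masks itself

    # Accumulate into F: F[i, j] = total masking applied to token j by tokens <= i.
    F_mask = torch.cumsum(S, dim=0)

    # Subtract F from every head's logits before the softmax.
    return torch.softmax(logits - F_mask.unsqueeze(0), dim=-1)

# Usage: H=4 heads, N=8 tokens, random causally-masked logits.
H, N = 4, 8
logits = torch.randn(H, N, N).masked_fill(
    torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1), float("-inf"))
attn = selective_attention(logits)   # (4, 8, 8); each row still sums to 1
```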

-----

Key Insights from this Paper 💡:

• Enables context pruning, reducing memory and compute requirements during inference (a pruning sketch follows this list)

• Outperforms local attention patterns and standard transformers

• Sparsity patterns are sometimes stable across training runs, hinting at general language-modeling properties
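
A hedged sketch of the context-pruning insight: because F is a cumulative sum, a token's accumulated masking can only grow, so heavily masked tokens can be safely dropped from the KV cache during decoding. The threshold `eps`, the `budget` parameter, and the cache layout are illustrative assumptions, not values from the paper.

```python
import torch

def prune_kv_cache(keys, values, F_mask, step, budget, eps=1.0):
    """Sketch: drop heavily masked tokens from the KV cache at decode step `step`.

    keys, values: (N, d) cached keys/values for the N tokens seen so far.
    F_mask: (N, N) accumulated selection mask (see selective_attention above).
    budget / eps: hypothetical memory budget and masking threshold.
    """
    masked = F_mask[step]                           # masking applied to each token so far
    keep = torch.nonzero(masked <= eps).squeeze(-1)

    # If still over budget, retain only the least-masked tokens.
    if keep.numel() > budget:
        order = torch.argsort(masked[keep])
        keep, _ = torch.sort(keep[order[:budget]])  # restore positional order

    return keys[keep], values[keep], keep
```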

-----

Results 📊:

• 100M-parameter models trained on C4 with context sizes of 512, 1024, and 2048 need 16X, 25X, and 47X less memory for the attention module, respectively

• Consistent improvements on HellaSwag benchmark across model sizes

• Generalizes better on Variable Assignment task compared to standard transformers

Selective Attention enables transformers to dynamically adjust context, improving efficiency and performance in language modeling tasks.
