Selective Attention improves LLM performance across model sizes and context lengths
Matches the performance of standard transformers that have ~2X more heads and parameters in their attention modules
📚 https://arxiv.org/pdf/2410.02703
Original Problem 🔍:
Unneeded elements in the attention context degrade performance, yet standard transformers keep the entire history, leading to inefficiency.
-----
Solution in this Paper 💡:
• Introduces Selective Attention: allows tokens to reduce attention to unneeded elements
• Parameter-free change to standard attention mechanism
• Computes a soft-mask matrix S, applies constraints to it, and accumulates it into a matrix F
• Subtracts F from the attention logits before the softmax
• Reuses an existing attention head as the selection function
• Constrains S to be non-negative, with no masking of <BOS> and no self-masking (see the sketch after this list)
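A minimal PyTorch sketch of the masking step described in the bullets above. It is written under stated assumptions, not as the authors' implementation: head 0 is used as the selection function, the accumulation into F is a cumulative sum over the query axis, and the name selective_attention is illustrative.
```python
import math
import torch

def selective_attention(q, k, v, bos_index=0):
    """Causal attention with a selective-attention masking step.
    q, k, v: (batch, heads, seq, dim)."""
    b, h, n, d = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)              # (b, h, n, n)

    # Standard causal mask.
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))

    # Selection function S: reuse one existing head's logits (head 0 is an assumption).
    S = logits[:, 0].masked_fill(causal, 0.0)                    # (b, n, n), causal part only
    S = torch.relu(S)                                            # constraint: non-negative
    S[..., bos_index] = 0.0                                      # constraint: never mask <BOS>
    S = S * (1.0 - torch.eye(n, device=q.device))                # constraint: no self-masking

    # Accumulate selections over the query axis: F[i, j] = sum over k <= i of S[k, j].
    F = torch.cumsum(S, dim=-2)

    # Subtract the accumulated mask from every head's logits before the softmax.
    return torch.softmax(logits - F.unsqueeze(1), dim=-1) @ v

# Example usage: out = selective_attention(torch.randn(1, 4, 16, 32),
#                                          torch.randn(1, 4, 16, 32),
#                                          torch.randn(1, 4, 16, 32))
```
Because S is taken from logits the model already computes, the change adds no parameters, which is the sense in which the mechanism is parameter-free.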
-----
Key Insights from this Paper 💡:
• Enables context pruning, reducing memory and compute requirements during inference (see the pruning sketch after this list)
• Outperforms local attention patterns and standard transformers
• Sparsity patterns are sometimes stable across training runs, hinting at general properties of language modeling
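A hedged sketch of one way the context pruning mentioned above could look at decoding time: tokens whose accumulated selection mass F has grown large are dropped from the key/value cache. The threshold value and the helper prune_kv_cache are hypothetical illustrations, not the paper's exact pruning rule.
```python
import torch

def prune_kv_cache(k_cache, v_cache, f_current, threshold=8.0):
    """Drop cached keys/values for tokens the model has effectively masked.
    k_cache, v_cache: (batch, heads, seq, dim); f_current: (batch, seq) holds
    the accumulated selection mass F[i, :] at the current decoding step i.
    Assumes batch size 1 so all sequences share one pruning decision."""
    keep = (f_current[0] < threshold).nonzero(as_tuple=True)[0]  # indices to retain
    return k_cache[:, :, keep], v_cache[:, :, keep], keep
```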
-----
Results 📊:
• 100M-parameter models trained on C4 with context sizes of 512, 1024, and 2048 need 16X, 25X, and 47X less memory for their attention module, respectively
• Consistent improvements on the HellaSwag benchmark across model sizes
• Generalizes better than standard transformers on the Variable Assignment task
Selective Attention enables transformers to dynamically adjust context, improving efficiency and performance in language modeling tasks.