Smart attention that knows when to zoom in and when to zoom out.
It's like teaching AI to use a spotlight instead of floodlights.
This paper introduces Selective Self-Attention (SSA), which enhances transformer models with a temperature scaling mechanism that dynamically controls attention sparsity and relevance based on query embeddings and token positions.
https://arxiv.org/abs/2411.12892
-----
🤔 Original Problem:
→ Standard transformer attention treats all queries uniformly, making it difficult to control contextual sparsity and relevance for different types of tokens.
→ Current models struggle with attention dilution in longer sequences and have trouble suppressing irrelevant tokens.
-----
🔧 Solution in this Paper:
→ SSA introduces a temperature scaling mechanism that adapts to both query embeddings and token positions (see the sketch after this list).
→ The method applies different temperatures to query and value embeddings through a learnable function.
→ Position-aware scaling helps mitigate attention dilution in longer sequences using logarithmic scaling.
→ A weight-sharing strategy reduces the parameter overhead to less than 0.5% while retaining the benefits.
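A minimal PyTorch sketch of this idea, not the authors' implementation: the hypothetical `SSAAttention` module below multiplies each query's attention logits by a learnable, query-dependent temperature combined with a logarithmic position term. The `temp_proj` head, the exact functional form, and the omission of the value-side temperature are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class SSAAttention(nn.Module):
    """Sketch of Selective Self-Attention: causal scaled dot-product
    attention with a query-dependent, position-aware temperature."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # One small shared projection produces a per-token, per-head
        # temperature; the exact parameterization is an assumption here.
        self.temp_proj = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Query-dependent temperature, kept positive via softplus.
        tau_q = nn.functional.softplus(self.temp_proj(x))            # (B, T, H)
        # Position-aware term that grows logarithmically with position,
        # intended to counter attention dilution in long sequences (assumed form).
        pos = torch.arange(1, T + 1, device=x.device, dtype=x.dtype)
        tau = tau_q * torch.log1p(pos).unsqueeze(0).unsqueeze(-1)    # (B, T, H)
        tau = tau.transpose(1, 2).unsqueeze(-1)                      # (B, H, T, 1)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)  # (B, H, T, T)
        causal = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        scores = scores.masked_fill(~causal, float("-inf"))
        attn = torch.softmax(tau * scores, dim=-1)                   # temperature-scaled softmax

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```

Scaling the logits of row i by tau[i] is equivalent to scaling query i itself, so each token can decide how sharply it attends without changing which keys it considers most similar.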
-----
💡 Key Insights:
→ Temperature scaling helps decouple semantic similarity from contextual sparsity (see the numerical example after this list)
→ Position-aware scaling effectively counters attention dilution
→ Value temperature scaling enhances the model's ability to suppress irrelevant tokens
→ SSA improves training efficiency, achieving similar performance with 1.45x fewer steps
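A quick numerical illustration of that first point, not taken from the paper: rescaling the same similarity scores only changes how concentrated the softmax distribution is, while the ranking of tokens stays fixed.

```python
import torch

# Fixed similarity scores between one query and four keys.
scores = torch.tensor([2.0, 1.5, 1.0, 0.2])

# A larger multiplier concentrates mass on the top key (sparser attention),
# a smaller one spreads it out, but the order of the keys never changes.
for tau in (0.5, 1.0, 3.0):
    attn = torch.softmax(tau * scores, dim=-1)
    print(f"tau={tau}:", [round(p, 3) for p in attn.tolist()])
```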
-----
📊 Results:
→ Consistent improvements across GPT-2, Pythia, Llama, and Llama3 models
→ Improved passkey retrieval accuracy from 56.9% to 74.4%
→ Reduced parameter overhead to <0.5% through weight sharing
→ Improved perplexity scores across multiple benchmarks