Quite a breakthrough paper on the Transformer architecture from @Microsoft. 👏
"DIFFERENTIAL TRANSFORMER" ✨
Differential attention computes scores as the difference between two separate softmax attention maps. The subtraction cancels attention noise and pushes the model toward sparse attention patterns.
📚 https://arxiv.org/pdf/2410.05258
Net effect:
• Sharper retrieval and lower hallucination rates. 🏆
• Outperforms standard Transformers while using 35-40% fewer parameters or training tokens
• 10-20% accuracy gain in many-shot in-context learning across datasets
• 7-11% reduction in hallucination for summarization and question answering
• Maintains performance under 6-bit quantization, where the standard Transformer degrades significantly
Original Problem 🔍:
Transformers tend to over-allocate attention to irrelevant context, making it hard to accurately retrieve key information.
-----
Solution in this Paper 💡:
• Introduces DIFF Transformer with differential attention mechanism
• Calculates attention scores as difference between two separate softmax attention maps
• Subtraction cancels noise, promoting emergence of sparse attention patterns
• Amplifies attention to relevant context while reducing attention to irrelevant parts
• Uses GroupNorm to normalize each attention head independently
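The core mechanism above can be sketched in a few lines. This is a minimal single-head NumPy sketch, not the paper's implementation: it uses a fixed scalar λ (the paper learns λ via a reparameterization) and omits causal masking, multi-head splitting, and the per-head GroupNorm step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam):
    """Differential attention (sketch):
    (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) @ V
    q1/k1 and q2/k2 are two groups of query/key projections, shape (seq, d).
    lam is a plain float here; the paper learns it per layer.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    # Subtracting the second map cancels common-mode attention noise,
    # leaving a sparser effective attention pattern
    return (a1 - lam * a2) @ v

seq, d = 4, 8
rng = np.random.default_rng(0)
q1, k1, q2, k2, v = (rng.standard_normal((seq, d)) for _ in range(5))
out = diff_attention(q1, k1, q2, k2, v, lam=0.8)
```

Since each softmax row sums to 1, every row of the differential map sums to 1 − λ; the paper's per-head normalization then rescales the output.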
-----
Key Insights from this Paper 💡:
• DIFF Transformer outperforms Transformer in scaling model size and training tokens
• Requires only ~65% of model size or training tokens to match Transformer performance
• Excels in long-context modeling, key information retrieval, and in-context learning
• Mitigates hallucination in question answering and text summarization
• Reduces outliers in model activations, enabling better quantization