
"More Expressive Attention with Negative Weights"

The podcast on this paper was generated with Google's Illuminate.

When ignoring information is just as important as paying attention to it.

Cog Attention: sometimes the best attention is negative attention.

Negative attention weights enable transformers to selectively delete irrelevant information while preserving useful context.

With Cog Attention, Transformers break free from the softmax constraint: they can now say "no" to irrelevant information.

https://arxiv.org/abs/2411.07176

🎯 Original Problem:

Traditional transformer attention allows only non-negative weights through softmax, which limits expressiveness and over-squashes information from earlier tokens into later positions.

-----

🔧 Solution in this Paper:

→ Introduces Cog Attention, which enables negative attention weights through a novel normalization scheme built on a SignExp function and absolute-value summation (see the sketch after this list)

→ Shifts token deletion/copying from the static OV matrix to dynamic query-key products, allowing more flexible token processing

→ Maintains numerical stability by subtracting the maximum absolute score and normalizing by the sum of absolute values

→ Preserves softmax attention in the first and last layers for better convergence
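
Below is a minimal NumPy sketch of the weighting scheme these bullets describe: SignExp keeps the sign of each scaled query-key score while exponentiating its magnitude, stability comes from subtracting the row-wise maximum absolute score, and normalization divides by the sum of absolute values, so individual weights can be negative while each row keeps unit L1 norm. This is an interpretation of the summary above, not the authors' reference code; the function name cog_attention_weights and the exact placement of the max-subtraction are assumptions.

import numpy as np

def cog_attention_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_queries, n_keys) scaled query-key dot products (QK^T / sqrt(d))."""
    # Numerical stability: subtract the per-row maximum absolute score.
    max_abs = np.max(np.abs(scores), axis=-1, keepdims=True)
    # SignExp: keep the sign of each score, exponentiate its magnitude.
    signed_exp = np.sign(scores) * np.exp(np.abs(scores) - max_abs)
    # Normalize by the sum of absolute values; individual weights can be
    # negative, but each row has unit L1 norm.
    return signed_exp / np.sum(np.abs(signed_exp), axis=-1, keepdims=True)

# Usage: one query attending over three keys.
scores = np.array([[2.0, -1.5, 0.3]])
weights = cog_attention_weights(scores)
print(weights)                       # approx. [[ 0.56 -0.34  0.10]]
print(np.abs(weights).sum(axis=-1))  # [1.] -- unit L1 norm per row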

-----

💡 Key Insights:

→ Negative weights allow simultaneous deletion, copying, or retention of tokens within a single attention head (see the worked example after this list)

→ Reduces over-squashing by limiting information paths from earlier to later tokens

→ More robust against representational collapse than standard softmax attention

→ Achieves better performance without additional parameters
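
A tiny worked example with hypothetical numbers makes the deletion point concrete: the head output is still a weighted sum of value vectors, so a negative weight actively subtracts a token's contribution rather than merely shrinking it toward zero.

import numpy as np

# Hypothetical weights for one query over three tokens, e.g. as produced by
# the SignExp sketch above.
weights = np.array([[0.56, -0.34, 0.10]])

V = np.array([[1.0, 0.0],   # value vector of token 1 (retained)
              [0.0, 1.0],   # value vector of token 2 (deleted via negative weight)
              [1.0, 1.0]])  # value vector of token 3 (retained)

output = weights @ V
print(output)  # [[ 0.66 -0.24]] -- token 2's value is subtracted from the output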

-----

📊 Results:

→ Outperformed the standard Transformer on 7 of 8 language tasks, with 46.16% vs. 45.24% average accuracy

→ Better (lower) FID scores on image generation: CIFAR-10 (3.27 vs. 3.39) and MS-COCO (5.85 vs. 5.99)

→ Maintained the same convergence rate as the vanilla Transformer
