What if self-attention isn’t the be-all and end-all? 🤯
To address information loss in transformers, this paper proposes an interesting alternative: replacing self-attention with masked convolutions.
📚 https://arxiv.org/pdf/2409.01482
Original Problem 🔍:
Attention mechanisms in language models discard most input information, potentially limiting their effectiveness for tasks requiring detailed input preservation.
-----
Key Insights from this Paper 💡:
• Transformers represent their inputs with poor accuracy in deeper layers (see the inversion sketch after this list)
• Masked mixers built on convolutions retain more accurate input representations
• Masked mixers learn causal language modeling more efficiently than early transformer implementations
• Transformer-masked mixer hybrids are the most efficient for language modeling
• Masked mixer embeddings outperform transformer embeddings on retrieval tasks
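
One way claims like "input representation accuracy" can be probed is gradient-based inversion: optimize a candidate input until its hidden representation matches the real one, then check how much of the original comes back. A minimal sketch of that general idea (the paper's exact procedure may differ; `invert_hidden` is a hypothetical helper, not the paper's code):

```python
import torch

def invert_hidden(layer_fn, target_hidden, input_shape, steps=500, lr=0.05):
    """Hypothetical helper: recover an input whose representation under
    layer_fn matches target_hidden via gradient descent on the input.
    Poor recovery at deeper layers would indicate information loss."""
    x = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (layer_fn(x) - target_hidden).pow(2).mean()
        loss.backward()
        opt.step()
    return x.detach()
```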
-----
Solution in this Paper 🛠️:
• Replace self-attention with masked 1D convolutions over the sequence dimension
• Apply triangular masking to the convolution weights so each token only mixes with earlier positions, preserving causality (see the sketch after this list)
• Introduce a bidirectional (unmasked) mixer architecture for retrieval modeling
• Train retrieval models on fixed embeddings taken from pretrained generative models
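
Here's a minimal PyTorch sketch of the core mechanism as I read it: token mixing via a kernel-size-1 conv across sequence positions, with a lower-triangular mask zeroing out any weight that would let a token see the future. The class name `MaskedMixerLayer` and all dimensions are my own illustration, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMixerLayer(nn.Module):
    """Illustrative token-mixing layer: a 1D conv across sequence
    positions with a causal (lower-triangular) weight mask.
    Names and sizes are assumptions, not the paper's implementation."""
    def __init__(self, seq_len: int):
        super().__init__()
        # kernel_size=1 conv over channels == sequence positions:
        # weight[i, j] mixes token j into output position i.
        self.conv = nn.Conv1d(seq_len, seq_len, kernel_size=1)
        # Lower-triangular mask: position i may only see j <= i.
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -- seq_len acts as the channel dim
        masked_weight = self.conv.weight * self.mask.unsqueeze(-1)
        return F.conv1d(x, masked_weight, self.conv.bias)

x = torch.randn(2, 128, 512)               # (batch, seq_len, d_model)
out = MaskedMixerLayer(seq_len=128)(x)
print(out.shape)                            # torch.Size([2, 128, 512])
```

Dropping the mask (keeping the full weight matrix) would give the bidirectional variant mentioned above for retrieval.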
-----
Results 📊:
• Masked mixers: 1.82 validation loss vs 1.91 for transformers after 12 hours of training
• Transformer-masked mixer hybrid: 1.59 validation loss
• 128-context retrieval:
- Masked mixer embeddings: 0.87 cross-entropy loss, 85.8% top-1 accuracy
- Transformer embeddings: 2.28 cross-entropy loss, 43.2% top-1 accuracy
-----
Are you into AI and LLMs❓ Join me and 30.8K others on Twitter to stay on the bleeding edge every day.
𝕏/🐦 https://x.com/rohanpaul_ai