What if self-attention isn't the be-all and end-all? 🤯
Motivated by information loss in transformers, this paper proposes an interesting alternative: replacing self-attention with masked convolutions.
📌 https://arxiv.org/pdf/2409.01482
Original Problem 🔍:
Attention mechanisms in language models discard most input information, potentially limiting their effectiveness for tasks requiring detailed input preservation.
-----
Key Insights from this Paper 💡:
• Transformers exhibit poor input representation accuracy in deeper layers
• Masked mixers with convolutions retain more accurate input representations
• Masked mixers learn causal language modeling more efficiently than early transformer implementations
• Transformer-masked mixer hybrids are the most efficient for language modeling
• Masked mixer embeddings outperform transformer embeddings in retrieval tasks
-----
Solution in this Paper 🛠️:
• Replace self-attention with masked 1D convolutions (a minimal sketch follows this list)
• Apply lower-triangular masking to the convolution weights so each position mixes only with earlier positions, preserving causality for language modeling
• Introduce a bidirectional mixer architecture for retrieval modeling
• Train retrieval models on fixed embeddings from pretrained generative models (see the second sketch below)
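Here is a minimal PyTorch sketch of that masked-convolution idea, assuming an MLP-mixer-style block where token mixing is a causally masked linear map over the sequence dimension (equivalent to a masked 1D convolution). The module names, dimensions, and block layout are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMixerBlock(nn.Module):
    """Illustrative sketch, not the paper's code: token mixing via a
    lower-triangular (causal) linear map over the sequence dimension,
    followed by a per-token feed-forward layer."""

    def __init__(self, seq_len: int, d_model: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One learned weight per (output position, input position) pair.
        self.token_mix = nn.Linear(seq_len, seq_len, bias=False)
        # Lower-triangular mask: position i may only read positions j <= i.
        self.register_buffer("causal_mask",
                             torch.tril(torch.ones(seq_len, seq_len)))
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm1(x).transpose(1, 2)             # (batch, d_model, seq_len)
        w = self.token_mix.weight * self.causal_mask  # zero out future positions
        x = x + F.linear(h, w).transpose(1, 2)        # causal mix + residual
        return x + self.ff(self.norm2(x))             # per-token feed-forward

# Example: a batch of 4 sequences, length 128, width 256 (sizes are made up).
block = MaskedMixerBlock(seq_len=128, d_model=256, d_ff=1024)
print(block(torch.randn(4, 128, 256)).shape)  # torch.Size([4, 128, 256])
```

Because the mixing weight for position i is zero at every j > i, each token's output depends only on earlier tokens, which is exactly what causal language modeling requires.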
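And a hypothetical sketch of the retrieval setup: a small head trained with cross-entropy to pick the matching passage from frozen embeddings produced by a pretrained generative model. `RetrievalHead`, the projection shape, and the candidate count are assumptions for illustration, not the paper's bidirectional mixer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHead(nn.Module):
    """Hypothetical sketch of the training setup, not the paper's model:
    score frozen candidate embeddings against a frozen query embedding."""

    def __init__(self, d_embed: int, d_hidden: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_embed, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_embed))

    def forward(self, query_emb, candidate_embs):
        # query_emb: (batch, d_embed); candidate_embs: (batch, n_cands, d_embed)
        q = self.proj(query_emb)
        return torch.einsum("bd,bnd->bn", q, candidate_embs)  # one logit per candidate

# Embeddings come from the frozen generative model; index 0 is the true match.
head = RetrievalHead(d_embed=256, d_hidden=512)
query = torch.randn(8, 256)       # frozen query embeddings (illustrative)
cands = torch.randn(8, 32, 256)   # 32 frozen candidate embeddings per query
loss = F.cross_entropy(head(query, cands), torch.zeros(8, dtype=torch.long))
```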
-----
Results 📊:
• Masked mixers: 1.82 validation loss vs. 1.91 for transformers (12-hour training budget)
• Transformer-mixer hybrid: 1.59 validation loss
• 128-context retrieval:
  - Masked mixer embeddings: 0.87 cross-entropy loss, 85.8% top-1 accuracy
  - Transformer embeddings: 2.28 cross-entropy loss, 43.2% top-1 accuracy
-----
Are you into AI and LLMs? Join me on Twitter with 30.8K others to stay on the bleeding edge every day.
𝕏/🐦 https://x.com/rohanpaul_ai