Evolution-trained neural network that cleans up a transformer's memory while boosting its performance.
Neural Attention Memory Models (NAMMs) introduce a learned network that improves transformer performance while reducing memory usage by intelligently managing the Key-Value cache.
-----
https://arxiv.org/abs/2410.13166
🤔 Original Problem:
LLMs face rising compute costs and memory demands as contexts grow longer, and existing approaches that simply drop parts of the context degrade performance.
-----
🔧 Solution in this Paper:
→ NAMMs use evolution to optimize a neural network that decides which tokens to keep in transformer memory
→ The system analyzes attention patterns to determine token importance across different layers
→ NAMMs extract features from spectrograms of each token's attention values, making them universally applicable across transformer models
→ A backward attention memory (BAM) architecture enables efficient information sharing between tokens before eviction decisions are made (see the sketch after this list)
→ Training happens incrementally on small sets of problems using CMA-ES optimization (a minimal training loop is sketched later, after the Key Insights)
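
Below is a minimal sketch of the token-scoring pipeline described in the bullets above, assuming PyTorch. The feature sizes, layer widths, and helper names (`namm_scores`, `feature_net`, `bam`, `score_head`) are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of a NAMM-style token scorer: spectrogram features from
# each token's attention history, a backward-masked attention layer (BAM),
# and a per-token score that decides what stays in the KV cache.
import torch
import torch.nn as nn

N_FFT, HIDDEN = 32, 16                  # assumed spectrogram / feature sizes
n_freq = N_FFT // 2 + 1

feature_net = nn.Linear(n_freq, HIDDEN)
# Backward attention: each (older) token attends only to newer tokens,
# so eviction decisions can compare a token against what came after it.
bam = nn.MultiheadAttention(HIDDEN, num_heads=2, batch_first=True)
score_head = nn.Linear(HIDDEN, 1)

def namm_scores(attn_history: torch.Tensor) -> torch.Tensor:
    """attn_history: (n_tokens, n_queries) attention each cached token
    received from recent queries. Returns one scalar score per token."""
    # 1) Spectrogram of each token's attention time-series (STFT magnitude).
    spec = torch.stft(attn_history, n_fft=N_FFT, hop_length=N_FFT // 2,
                      window=torch.hann_window(N_FFT),
                      return_complex=True).abs()       # (n_tokens, n_freq, n_frames)
    # 2) Collapse the time axis (a plain mean keeps the sketch short).
    feats = feature_net(spec.mean(dim=-1))             # (n_tokens, HIDDEN)
    # 3) Backward attention memory (BAM): mask out all *earlier* tokens.
    n = feats.size(0)
    backward_mask = torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-1)
    mixed, _ = bam(feats[None], feats[None], feats[None], attn_mask=backward_mask)
    # 4) One score per token; a negative score means evict from the KV cache.
    return score_head(mixed[0]).squeeze(-1)

# Usage: keep only the tokens the scorer rates non-negatively.
attn_history = torch.rand(128, 256)                    # 128 cached tokens
keep = namm_scores(attn_history) >= 0                  # boolean keep-mask
# kv_cache = kv_cache[:, keep]                         # prune keys/values
```

Since the scores are produced per layer, different layers can retain different subsets of tokens, which is why token importance varying across layers (noted below) matters.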
-----
💡 Key Insights:
→ Memory management can improve model performance, not just efficiency
→ Token importance varies across transformer layers
→ Universal feature extraction enables zero-shot transfer across architectures
→ Evolution can effectively optimize non-differentiable memory operations
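
Because cache eviction is a discrete, non-differentiable operation, the scorer's parameters are evolved rather than backpropagated. Here is a hedged sketch of such a loop using the `cma` package (pycma); `evaluate_with_namm` is a hypothetical stand-in for running the frozen base model on a small batch of long-context tasks with the candidate NAMM pruning its KV cache:

```python
import numpy as np
import cma  # pip install cma

def evaluate_with_namm(params: np.ndarray) -> float:
    """Hypothetical: load `params` into the NAMM, run the frozen base model
    on a small incremental set of tasks with the NAMM pruning the KV cache,
    and return the mean task score (higher is better)."""
    return float(np.random.rand())   # placeholder for the real evaluation

n_params = 512                                         # assumed NAMM parameter count
es = cma.CMAEvolutionStrategy(n_params * [0.0], 0.5)   # init mean 0, step size 0.5
for _ in range(100):                                   # evolution generations
    candidates = es.ask()                              # sample a population
    # CMA-ES minimizes, so negate the non-differentiable task score.
    fitnesses = [-evaluate_with_namm(np.asarray(c)) for c in candidates]
    es.tell(candidates, fitnesses)                     # update the search distribution
best_params = es.result.xbest                          # best NAMM parameters found
```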
-----
📊 Results:
→ 11% performance improvement on LongBench while using 75% less memory
→ Zero-shot transfer to vision tasks with 1% gain and 28% memory reduction
→ 9% improvement in reinforcement learning tasks with 19% memory savings