MemLong: Memory-Augmented Retrieval for Long Text Modeling
MemLong: Your LLM's external hard drive for handling massive documents.
MemLong stores past context in a non-trainable memory bank, letting LLMs handle 80k tokens on a single GPU 💡
External memory retrieval helps MemLong process 20x more context. ✨
Solution in this Paper 🧠:
• Combines a non-differentiable retriever-memory (ret-mem) module with a partially trainable decoder-only LLM
• Introduces a fine-grained, controllable retrieval attention mechanism
• Uses an external retriever to fetch historical information
• Stores past contexts in a non-trainable memory bank
• Retrieves chunk-level key-value (K-V) pairs for the current input (see the sketch after this list)
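A minimal sketch of the ret-mem idea, with illustrative class and method names rather than the authors' code: each past chunk's K-V pairs are cached without gradients alongside a retrieval embedding, and a new input fetches the top-k most similar chunks.

```python
import torch

class MemoryBank:
    """Hypothetical container for MemLong-style chunk-level memory."""

    def __init__(self, top_k: int = 4):
        self.keys, self.values, self.embeds = [], [], []  # one entry per chunk
        self.top_k = top_k

    def store(self, k: torch.Tensor, v: torch.Tensor, embed: torch.Tensor):
        """Cache a chunk's K-V pairs and its retrieval embedding, gradient-free."""
        self.keys.append(k.detach())       # detached: the bank is non-differentiable
        self.values.append(v.detach())
        self.embeds.append(embed.detach())

    def retrieve(self, query_embed: torch.Tensor):
        """Return the K-V pairs of the chunks most similar to the current input."""
        if not self.embeds:
            return None
        sims = torch.stack([
            torch.cosine_similarity(query_embed, e, dim=-1) for e in self.embeds
        ])
        top = sims.topk(min(self.top_k, len(self.embeds))).indices.tolist()
        k = torch.cat([self.keys[i] for i in top], dim=0)
        v = torch.cat([self.values[i] for i in top], dim=0)
        return k, v
```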
🔢 The main components of MemLong include:
• A ret-mem component for memory and retrieval
• A retrieval causal attention module for integrating local and memory information (illustrated after this list)
• A non-trainable memory bank for storing past contexts and knowledge
• An external retriever for fetching historical information
• Lower layers that remain frozen during training
• Upper layers that are fine-tuned to calibrate retrieval preferences
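One way to picture the retrieval causal attention module, as a hedged sketch with illustrative names: retrieved memory K-V pairs are fully visible to every query position, since they are strictly past, while the local context keeps the usual causal mask.

```python
import math
import torch
import torch.nn.functional as F

def retrieval_causal_attention(q, k_local, v_local, k_mem, v_mem):
    """q, k_local, v_local: (T, d); k_mem, v_mem: (M, d). Single head, no batch."""
    T, d = q.shape
    M = k_mem.size(0)
    k = torch.cat([k_mem, k_local], dim=0)   # retrieved memory first, then local
    v = torch.cat([v_mem, v_local], dim=0)
    scores = q @ k.T / math.sqrt(d)          # (T, M + T)
    # Memory tokens are all in the past, so they are fully visible;
    # local tokens keep the standard causal (lower-triangular) mask.
    visible = torch.cat(
        [torch.ones(T, M, dtype=torch.bool),
         torch.ones(T, T, dtype=torch.bool).tril()], dim=1)
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v     # (T, d)
```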
Key Insights from this Paper 💡:
• Extends context length from 4k to 80k tokens on a single 3090 GPU
• Avoids distribution shift in cached information, since the cached representations come from frozen layers
• Requires fine-tuning of upper layers only, reducing computational cost
• Stores a single layer's K-V pairs, allowing significant context extension with minimal memory overhead (see the estimate after this list)
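A back-of-envelope estimate of that saving, assuming hypothetical OpenLLaMA-3B-like dimensions in fp16 (these numbers are illustrative, not taken from the paper):

```python
# Hypothetical OpenLLaMA-3B-like dimensions, fp16 (2 bytes per value).
layers, heads, head_dim, dtype_bytes = 26, 32, 100, 2
tokens = 80_000
per_layer = tokens * heads * head_dim * 2 * dtype_bytes    # factor 2: K and V
print(f"one layer's K-V cache:  {per_layer / 2**30:.2f} GiB")           # ~0.95 GiB
print(f"all layers' K-V cache:  {layers * per_layer / 2**30:.2f} GiB")  # ~24.80 GiB
```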
Results 📊:
• Extends context length from 4k up to 80k tokens on a single 3090 GPU
• Outperforms state-of-the-art LLMs on multiple long-context language modeling benchmarks
• Achieves up to a 10.2 percentage-point improvement over OpenLLaMA on retrieval-augmented in-context learning tasks
🚀 Key advantages of MemLong include:
• Distributional Consistency: Keeps the distribution of cached information consistent, unlike previous memory-augmented models that suffered distribution shift.
• Training Efficiency: Only the upper layers need fine-tuning, greatly reducing computational cost (see the sketch after this list).
• Extensive Context Window: Extends the context window up to 80k tokens on a single 3090 GPU.
• Improved Performance: Outperforms comparable models on long-context language modeling tasks.
• Efficient Memory Usage: Stores only a single layer's K-V pairs, allowing significant context extension with minimal memory overhead.
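A minimal sketch of that training split, assuming a standard decoder layer stack and a hypothetical split index (the actual split point is a configuration choice from the paper, not this snippet):

```python
import torch.nn as nn

def freeze_lower_layers(layers: nn.ModuleList, split: int) -> None:
    """Freeze layers[:split]; leave layers[split:] trainable."""
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i >= split

# e.g. freeze_lower_layers(model.layers, split=13)  # hypothetical split index
```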