MemLong: Memory-Augmented Retrieval for Long Text Modeling
MemLong: Your LLM's external hard drive for handling massive documents.
MemLong stores past context in a non-trainable memory bank, letting LLMs handle 80k tokens on a single GPU 💡
External memory retrieval helps MemLong process 20x more context. ✨
Solution in this Paper 🧠:
• Combines a non-differentiable retriever-memory (ret-mem) module with a partially trainable decoder-only LLM
• Introduces a fine-grained, controllable retrieval attention mechanism
• Uses an external retriever to fetch historical information
• Stores past contexts in a non-trainable memory bank
• Retrieves chunk-level key-value (K-V) pairs for the current input (see the sketch after this list)
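A minimal sketch of the ret-mem idea, with illustrative class and method names rather than the authors' code: each past chunk's K-V pairs are cached without gradients alongside a retrieval embedding, and a new input fetches the top-k most similar chunks.

```python
import torch

class MemoryBank:
    """Hypothetical container for MemLong-style chunk-level memory."""

    def __init__(self, top_k: int = 4):
        self.keys, self.values, self.embeds = [], [], []  # one entry per chunk
        self.top_k = top_k

    def store(self, k: torch.Tensor, v: torch.Tensor, embed: torch.Tensor):
        """Cache a chunk's K-V pairs and its retrieval embedding, gradient-free."""
        self.keys.append(k.detach())       # detached: the bank is non-differentiable
        self.values.append(v.detach())
        self.embeds.append(embed.detach())

    def retrieve(self, query_embed: torch.Tensor):
        """Return the K-V pairs of the chunks most similar to the current input."""
        if not self.embeds:
            return None
        sims = torch.stack([
            torch.cosine_similarity(query_embed, e, dim=-1) for e in self.embeds
        ])
        top = sims.topk(min(self.top_k, len(self.embeds))).indices.tolist()
        k = torch.cat([self.keys[i] for i in top], dim=0)
        v = torch.cat([self.values[i] for i in top], dim=0)
        return k, v
```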
🔢 The main components of MemLong include:
• A ret-mem component for memory and retrieval
• A retrieval causal attention module for integrating local and memory information (illustrated after this list)
• A non-trainable memory bank for storing past contexts and knowledge
• An external retriever for fetching historical information
• Lower layers that remain frozen during training
• Upper layers that are fine-tuned to calibrate retrieval preferences
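One way to picture the retrieval causal attention module, as a hedged sketch with illustrative names: retrieved memory K-V pairs are fully visible to every query position, since they are strictly past, while the local context keeps the usual causal mask.

```python
import math
import torch
import torch.nn.functional as F

def retrieval_causal_attention(q, k_local, v_local, k_mem, v_mem):
    """q, k_local, v_local: (T, d); k_mem, v_mem: (M, d). Single head, no batch."""
    T, d = q.shape
    M = k_mem.size(0)
    k = torch.cat([k_mem, k_local], dim=0)   # retrieved memory first, then local
    v = torch.cat([v_mem, v_local], dim=0)
    scores = q @ k.T / math.sqrt(d)          # (T, M + T)
    # Memory tokens are all in the past, so they are fully visible;
    # local tokens keep the standard causal (lower-triangular) mask.
    visible = torch.cat(
        [torch.ones(T, M, dtype=torch.bool),
         torch.ones(T, T, dtype=torch.bool).tril()], dim=1)
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v     # (T, d)
```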
Key Insights from this Paper 💡:
• Extends context length from 4k to 80k tokens on a single 3090 GPU
• Avoids distribution shift in cached information, since the cached representations come from frozen layers
• Requires fine-tuning of upper layers only, reducing computational cost
• Stores a single layer's K-V pairs, allowing significant context extension with minimal memory overhead (see the estimate after this list)
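A back-of-envelope estimate of that saving, assuming hypothetical OpenLLaMA-3B-like dimensions in fp16 (these numbers are illustrative, not taken from the paper):

```python
# Hypothetical OpenLLaMA-3B-like dimensions, fp16 (2 bytes per value).
layers, heads, head_dim, dtype_bytes = 26, 32, 100, 2
tokens = 80_000
per_layer = tokens * heads * head_dim * 2 * dtype_bytes    # factor 2: K and V
print(f"one layer's K-V cache:  {per_layer / 2**30:.2f} GiB")           # ~0.95 GiB
print(f"all layers' K-V cache:  {layers * per_layer / 2**30:.2f} GiB")  # ~24.80 GiB
```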
Results 📊:
• Extends context length from 4k up to 80k tokens on a single 3090 GPU
• Outperforms state-of-the-art LLMs on multiple long-context language modeling benchmarks
• Achieves up to a 10.2 percentage-point improvement over OpenLLaMA on retrieval-augmented in-context learning tasks
🚀 Key advantages of MemLong include:
• Distributional Consistency: Keeps the distribution of cached information consistent, unlike previous memory-augmented models that suffered distribution shift.
• Training Efficiency: Only the upper layers need fine-tuning, greatly reducing computational cost (see the sketch after this list).
• Extensive Context Window: Extends the context window up to 80k tokens on a single 3090 GPU.
• Improved Performance: Outperforms comparable models on long-context language modeling tasks.
• Efficient Memory Usage: Stores only a single layer's K-V pairs, allowing significant context extension with minimal memory overhead.
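A minimal sketch of that training split, assuming a standard decoder layer stack and a hypothetical split index (the actual split point is a configuration choice from the paper, not this snippet):

```python
import torch.nn as nn

def freeze_lower_layers(layers: nn.ModuleList, split: int) -> None:
    """Freeze layers[:split]; leave layers[split:] trainable."""
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i >= split

# e.g. freeze_lower_layers(model.layers, split=13)  # hypothetical split index
```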