ACtivation REfilling: A two-layer cache system that makes LLMs handle massive documents without breaking a sweat
ACRE (ACtivation REfilling) introduces a bi-layer KV cache that lets LLMs process much longer contexts efficiently without sacrificing answer quality.
https://arxiv.org/abs/2412.12486
Original Problem 🤔:
→ Current LLMs struggle with long contexts due to context window limitations and computational burden from key-value (KV) activations
→ Existing methods either lose semantic richness when compressing information or focus too narrowly on local details, missing the global picture
-----
Solution in this Paper 🔧:
→ ACRE constructs a bi-layer KV Cache with Layer-1 (L1) capturing global information compactly and Layer-2 (L2) providing detailed local context
→ For an input query, ACRE first computes attention over the compact L1 cache
→ Based on attention scores, it dynamically refills L1 with relevant L2 entries
→ Uses selective attention to cut computation: tokens attend fully to recent tokens (both L1 and L2) but only to the compact L1 entries of distant tokens
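The bi-layer cache and query-guided refilling can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: mean-pooling stands in for ACRE's learned L1 compaction, a dot product stands in for real attention logits, and all function names are hypothetical.

```python
import numpy as np

def build_bilayer_cache(kv, chunk=4):
    """Split per-token KV entries (L2) into chunks and keep one compact
    entry per chunk (L1). Mean-pooling is an illustrative proxy for
    ACRE's learned compaction; kv is an (n_tokens, d) array."""
    n, d = kv.shape
    pad = (-n) % chunk
    padded = np.vstack([kv, np.zeros((pad, d))]) if pad else kv
    l1 = padded.reshape(-1, chunk, d).mean(axis=1)          # compact global layer
    l2_spans = [kv[i * chunk:(i + 1) * chunk] for i in range(len(l1))]
    return l1, l2_spans

def query_guided_refill(query, l1, l2_spans, top_k=2):
    """Score L1 entries against the query, then swap the top-k scoring
    entries for their detailed L2 spans; the rest stay compact."""
    scores = l1 @ query / np.sqrt(len(query))               # proxy attention logits
    keep = set(np.argsort(scores)[-top_k:])
    refilled = [l2_spans[i] if i in keep else l1[i:i + 1] for i in range(len(l1))]
    return np.vstack(refilled)

# Toy usage: 8 tokens with 2-dim KV entries, refill only the best chunk.
kv = np.arange(16, dtype=float).reshape(8, 2)
l1, l2_spans = build_bilayer_cache(kv, chunk=4)
ctx = query_guided_refill(np.ones(2), l1, l2_spans, top_k=1)
```

The refilled context mixes granularities: query-relevant regions contribute full token-level detail while the rest of the document contributes only its compact summaries, which is what keeps memory bounded as context length grows.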
-----
Key Insights 💡:
→ Information needs vary dynamically from local details to global perspective based on query complexity
→ A bi-layer approach can balance global understanding with local precision
→ Query-guided refilling enables adaptive context selection
-----
Results 📊:
→ Outperforms baseline models across 12 information-seeking datasets
→ Processes contexts over 512K tokens with 40% less GPU memory than standard approaches
→ Reduces latency by 45% compared to existing methods while maintaining answer quality