Memory-efficient KV cache framework that knows when to compress and when to preserve.
SCOPE optimizes the KV cache separately for the prefill and decoding phases, maintaining full-cache-level performance at a 35% compression rate on long-context generation tasks.
https://arxiv.org/abs/2412.13649
🤖 Original Problem:
→ KV cache becomes a major memory bottleneck in LLMs during long-context generation, especially for tasks requiring both long inputs and long outputs
→ Current methods either retain the entire decoding cache or discard essential prefill information
-----
🔧 Solution in this Paper:
→ SCOPE separates KV cache optimization into prefill and decoding phases
→ During prefill, it preserves essential context information without excessive compression
→ For decoding, it introduces three strategies: Slide, Adaptive, and Discontinuous
→ Slide strategy selects essential heavy hitters through a sliding-window mechanism (see the sketch after this list)
→ Adaptive strategy dynamically adjusts the cache budget based on generation progress
→ Discontinuous strategy optimizes memory transfer by reducing the frequency of heavy-hitter updates
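
To make the decoding-phase idea concrete, here is a minimal sketch of a Slide-style eviction step: keep a fixed budget of heavy hitters (ranked by accumulated attention mass) plus a sliding window of the most recent tokens. The function names, tensor layout, and budget values (`heavy_budget`, `window_size`) are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of decoding-phase KV cache eviction in the spirit of SCOPE's
# Slide strategy: retain a fixed budget of heavy hitters (ranked by accumulated
# attention) plus the most recent tokens in a sliding window.
# Shapes, names, and budgets are illustrative assumptions, not the paper's code.
import torch


def select_kv_indices(
    cum_attn: torch.Tensor,   # (seq_len,) accumulated attention mass per cached token
    heavy_budget: int,        # number of heavy-hitter slots to retain
    window_size: int,         # number of most recent tokens always kept
) -> torch.Tensor:
    """Return sorted indices of the cache entries to keep."""
    seq_len = cum_attn.shape[0]
    cutoff = max(seq_len - window_size, 0)
    # The most recent `window_size` tokens are always preserved (the sliding window).
    recent = torch.arange(cutoff, seq_len)
    # Among older tokens, keep the ones with the highest accumulated attention.
    older_scores = cum_attn[:cutoff]
    k = min(heavy_budget, older_scores.shape[0])
    heavy = torch.topk(older_scores, k).indices if k > 0 else torch.empty(0, dtype=torch.long)
    keep = torch.cat([heavy, recent])
    return torch.sort(keep).values


def compress_kv_cache(keys, values, cum_attn, heavy_budget=64, window_size=32):
    """Gather the retained entries from the decoding-phase K/V tensors.

    keys, values: (seq_len, num_heads, head_dim) -- illustrative layout.
    """
    keep = select_kv_indices(cum_attn, heavy_budget, window_size)
    return keys[keep], values[keep], cum_attn[keep]


if __name__ == "__main__":
    seq_len, num_heads, head_dim = 512, 8, 64
    keys = torch.randn(seq_len, num_heads, head_dim)
    values = torch.randn(seq_len, num_heads, head_dim)
    cum_attn = torch.rand(seq_len)  # stand-in for accumulated attention scores
    k, v, a = compress_kv_cache(keys, values, cum_attn)
    print(k.shape)  # torch.Size([96, 8, 64]) -> 64 heavy hitters + 32 recent tokens
```

In a real decoding loop, such a selection step could run every few steps rather than at every step, which is the kind of reduced update frequency the Discontinuous strategy targets.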
-----
💡 Key Insights:
→ Excessive compression during prefill phase significantly impairs reasoning capabilities
→ Heavy hitters identified in the prefill phase deviate as decoding progresses in long-text generation tasks
→ Separate optimization of prefill and decoding phases is crucial for performance
-----
📊 Results:
→ Achieves performance comparable to the full KV cache at only a 35% compression rate
→ Outperforms baselines on LongGenBench benchmark across multiple tasks
→ Successfully integrates with existing prefill-only compression methods