"ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.00299
Large Language Models face challenges in long-context inference due to the high GPU memory demands of the key-value (KV) cache. Existing KV cache compression methods often measure token importance in isolation, which can neglect token dependencies and discard semantic information.
This paper introduces ChunkKV to address this issue. ChunkKV compresses the KV cache by grouping tokens into semantic chunks and preserving the most informative ones, while also employing layer-wise index reuse to enhance efficiency.
-----
📌 ChunkKV addresses semantic loss in KV cache compression by operating at the chunk level. Because whole chunks are kept or dropped, subject-predicate-object relationships are retained, unlike in token-isolated methods, leading to better performance on long-context tasks.
📌 Layer-wise index reuse in ChunkKV significantly cuts computation. By reusing chunk indices across layers, it exploits cross-layer redundancy, accelerating inference without a substantial drop in accuracy.
📌 ChunkKV's chunk-based approach retains semantic information more effectively than token-based methods. It compresses the context while preserving units of meaning, which is validated by stronger results across diverse benchmarks.
----------
Methods Explored in this Paper 🔧:
→ ChunkKV groups tokens into semantically related chunks as the basic unit for KV cache compression. This method aims to preserve semantic information by either keeping or discarding entire chunks.
→ ChunkKV calculates chunk importance based on the sum of attention scores within an observation window. It then selects the top-k most important chunks to retain in the KV cache.
→ To reduce computational overhead, ChunkKV introduces layer-wise index reuse. This technique reuses the selected chunk indices across multiple transformer layers, leveraging the observed similarity in important chunk indices between layers.
→ The ChunkKV algorithm first calculates attention scores for an observation window. It then divides the KV cache into chunks and computes an aggregate attention score for each chunk.
→ Based on these scores, ChunkKV selects the top-k chunks and retains their corresponding KV entries, compressing the cache while prioritizing semantic units. A minimal sketch of this selection step follows below.
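To make the selection step concrete, here is a minimal PyTorch sketch of chunk scoring and top-k retention, assuming the observation-window attention weights are already available. The function name `select_chunks`, the tensor layout, and details such as summing scores across heads and window queries are illustrative assumptions rather than the authors' implementation.

```python
import torch

def select_chunks(attn_obs: torch.Tensor, chunk_size: int, keep_ratio: float):
    """
    attn_obs: [num_heads, obs_window, seq_len] attention weights from the last
              `obs_window` query positions over the full prefix.
    Returns sorted indices of the KV positions to retain (whole chunks kept or dropped).
    """
    seq_len = attn_obs.shape[-1]
    # Aggregate attention received by each prefix token across heads and window queries.
    token_scores = attn_obs.sum(dim=(0, 1))                        # [seq_len]

    # Pad so the sequence splits evenly into chunks, then score each chunk.
    num_chunks = (seq_len + chunk_size - 1) // chunk_size
    padded = torch.zeros(num_chunks * chunk_size)
    padded[:seq_len] = token_scores
    chunk_scores = padded.view(num_chunks, chunk_size).sum(dim=1)  # [num_chunks]

    # Keep the top-k chunks by aggregate score.
    k = max(1, int(num_chunks * keep_ratio))
    top_chunks = torch.topk(chunk_scores, k).indices

    # Expand the kept chunk indices back to token positions and drop the padding.
    kept = (top_chunks.unsqueeze(1) * chunk_size + torch.arange(chunk_size)).flatten()
    kept = kept[kept < seq_len]
    return torch.sort(kept).values

# Example: 4 heads, 8 observation queries, a 1024-token prefix, 10-token chunks, keep 30%.
attn = torch.rand(4, 8, 1024).softmax(dim=-1)
kept_positions = select_chunks(attn, chunk_size=10, keep_ratio=0.3)
print(kept_positions.numel())  # roughly 30% of the prefix, retained in whole chunks
```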
-----
Key Insights 💡:
→ Token-level KV cache compression can inadvertently discard essential semantic information by treating tokens in isolation.
→ Compressing KV cache at the chunk level, instead of individual tokens, allows for better preservation of semantic context and dependencies.
→ The indices of important KV cache chunks are similar across different transformer layers. This similarity enables efficient layer-wise index reuse without significant performance degradation, as sketched below.
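Building on the `select_chunks` sketch above, the snippet below illustrates how layer-wise index reuse could be wired in. The `reuse_group` parameter (how many consecutive layers share one set of chunk indices) and the per-layer tensor layout are hypothetical; only the idea of recomputing indices once per group of layers comes from the paper.

```python
import torch

def compress_with_index_reuse(keys, values, attn_per_layer,
                              chunk_size=10, keep_ratio=0.3, reuse_group=2):
    """
    keys / values: per-layer tensors of shape [num_heads, seq_len, head_dim].
    attn_per_layer: per-layer observation-window attention, as used by select_chunks.
    Chunk indices are recomputed only once every `reuse_group` layers and reused
    for the layers in between, saving the per-layer scoring cost.
    """
    compressed, kept = [], None
    for layer_idx, (k, v, attn) in enumerate(zip(keys, values, attn_per_layer)):
        if layer_idx % reuse_group == 0:
            kept = select_chunks(attn, chunk_size, keep_ratio)  # recompute for this group
        compressed.append((k[:, kept, :], v[:, kept, :]))       # reuse the same indices
    return compressed
```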
-----
Results 📊:
→ Achieves up to a 10% performance improvement over existing KV cache compression methods on long-context benchmarks.
→ Outperforms other methods on GSM8K, LongBench, and Needle-In-A-Haystack benchmarks, demonstrating effectiveness in in-context learning and long-context tasks.
→ Layer-wise index reuse in ChunkKV reduces latency by up to 20.7% and improves throughput by up to 26.5%.