ShadowKV compresses the LLM KV cache by keeping low-rank keys on GPU and offloading values to CPU
📚 https://arxiv.org/abs/2410.21465
Original Problem 🎯:
Long-context LLM inference faces throughput bottlenecks due to the expanding Key-Value (KV) cache memory footprint. Existing solutions either compromise accuracy or fail to reduce GPU memory usage effectively.
-----
Solution in this Paper 🔧:
→ ShadowKV stores the low-rank pre-RoPE key cache on GPU while offloading the value cache to CPU
→ Selects the KV pairs needed at each decoding step with a chunk-level approximation, using a 1.56% sparse budget
→ Keeps a small set of outlier chunks (0.3%) as a static cache on GPU
→ Employs multiple CUDA streams to overlap key cache reconstruction with value cache fetching
→ Implements a cache-aware optimization reducing computation and data movement by 60% (a simplified sketch of the storage and selection steps follows this list)
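
A minimal sketch of that layout for a single attention head, assuming a CUDA device and a sequence length divisible by the chunk size. The function names (prefill_cache, sparse_attend) and the CHUNK/RANK/budget values are illustrative placeholders, not the paper's code or defaults:

```python
import torch

# Illustrative constants; the paper tunes chunk size, rank, and sparse budget.
CHUNK = 8            # tokens per chunk
RANK = 32            # low-rank budget for the pre-RoPE keys (placeholder)
_copy_stream = torch.cuda.Stream()   # side stream for CPU -> GPU value copies


def prefill_cache(k_pre_rope, k_post_rope, v):
    """Prefill-time setup: factorize the pre-RoPE keys, build per-chunk landmarks
    from the post-RoPE keys, and offload the value cache to pinned CPU memory.
    All inputs are [seq_len, head_dim] GPU tensors; seq_len % CHUNK == 0 assumed."""
    U, S, Vh = torch.linalg.svd(k_pre_rope.float(), full_matrices=False)
    A = (U[:, :RANK] * S[:RANK]).to(k_pre_rope.dtype)   # [seq_len, RANK], stays on GPU
    B = Vh[:RANK].to(k_pre_rope.dtype)                   # [RANK, head_dim], stays on GPU

    # Chunk landmarks: the mean post-RoPE key of each chunk (works because adjacent
    # keys are highly similar). Outlier chunks that deviate from their mean would be
    # kept as a small static GPU cache in the paper; omitted here for brevity.
    landmarks = k_post_rope.reshape(-1, CHUNK, k_post_rope.shape[-1]).mean(dim=1)

    v_cpu = v.detach().cpu().pin_memory()                # value cache offloaded to CPU
    return A, B, landmarks, v_cpu


def sparse_attend(q, A, B, landmarks, v_cpu, budget_chunks=16):
    """Decode-time step for one query vector q [head_dim]: pick the highest-scoring
    chunks via the landmarks, fetch their values from CPU on a side stream, and
    reconstruct their keys from the low-rank factors on the default stream."""
    scores = landmarks @ q
    top = scores.topk(min(budget_chunks, scores.numel())).indices
    token_ids = (top[:, None] * CHUNK +
                 torch.arange(CHUNK, device=q.device)).flatten()

    # CPU -> GPU value fetch on a separate CUDA stream (a real implementation would
    # gather into a pinned staging buffer so the copy is truly asynchronous).
    with torch.cuda.stream(_copy_stream):
        v_sel = v_cpu[token_ids.cpu()].to(q.device, non_blocking=True)

    # Overlapped with the copy: rebuild the selected keys on the fly.
    # (RoPE would be re-applied to the reconstructed keys here; omitted.)
    k_sel = A[token_ids] @ B

    torch.cuda.current_stream().wait_stream(_copy_stream)
    attn = torch.softmax((k_sel @ q) / k_sel.shape[-1] ** 0.5, dim=0)
    return attn @ v_sel
```

Values are offloaded wholesale rather than factorized because, per the paper's observation, only the pre-RoPE keys are strongly low-rank; the CPU copy then only has to ship the few selected chunks back per decoding step.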
-----
Key Insights 💡:
→ The pre-RoPE key cache is exceptionally low-rank compared to the other KV components
→ Adjacent tokens in the post-RoPE key cache show high cosine similarity, so a chunk can be summarized by its mean (both observations are probed in the sketch after this list)
→ Only a small fraction of chunks (0.3%) are outliers that need special handling
→ The KV cache exhibits strong temporal locality across decoding steps, enabling efficient caching
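
The first two observations are easy to probe on a dumped key cache. The sketch below (function names and the default rank are illustrative, not from the paper) measures the spectral energy kept by a low-rank cut of the pre-RoPE keys and the cosine similarity between neighboring post-RoPE keys:

```python
import torch

def spectral_energy(k_pre_rope: torch.Tensor, rank: int = 32) -> float:
    """Fraction of squared singular-value energy captured by the top-`rank`
    components of a [seq_len, dim] pre-RoPE key cache; values near 1.0 mean
    a low-rank factorization loses little information."""
    s = torch.linalg.svdvals(k_pre_rope.float())
    return (s[:rank].square().sum() / s.square().sum()).item()

def adjacent_cosine(k_post_rope: torch.Tensor) -> float:
    """Mean cosine similarity between consecutive tokens' post-RoPE keys;
    values near 1.0 justify summarizing a chunk by its mean landmark."""
    sims = torch.nn.functional.cosine_similarity(
        k_post_rope[:-1].float(), k_post_rope[1:].float(), dim=-1
    )
    return sims.mean().item()
```

If the paper's observations hold, both numbers come out close to 1 on long-context inputs, which is what licenses the rank-truncated key storage and the chunk-mean landmarks above.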
-----
Results 📊:
→ Supports 6x larger batch sizes across various models
→ Boosts generation throughput by up to 3.04x on an A100 GPU
→ Achieves 7.2 TB/s equivalent bandwidth
→ Maintains accuracy with just 1.56% sparse KV cache budget
→ Successfully tested on models with context lengths up to 1M tokens