"ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"

The podcast on this paper is generated with Google's Illuminate.

ShadowKV shrinks the GPU memory footprint of long-context LLM inference by keeping a low-rank key cache on the GPU and offloading the value cache to the CPU

📚 https://arxiv.org/abs/2410.21465

Original Problem 🎯:

Long-context LLM inference faces throughput bottlenecks due to the expanding Key-Value (KV) cache memory footprint. Existing solutions either compromise accuracy or fail to reduce GPU memory usage effectively.

-----

Solution in this Paper 🔧:

→ ShadowKV stores a low-rank pre-RoPE key cache on the GPU while offloading the full value cache to the CPU (see the cache-construction sketch after this list)

→ Uses a chunk-level approximation strategy with a 1.56% sparse KV budget to select the chunks attended to at each decoding step

→ Keeps a small set of outlier chunks (0.3%) as a static cache on the GPU

→ Employs CUDA multi-streams to overlap key cache reconstruction with value cache fetching (see the decoding sketch after this list)

→ Implements a cache-aware optimization that reduces computation and data movement by 60%
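
Below is a minimal PyTorch sketch of the cache construction described in the bullets above. It is illustrative rather than the paper's implementation: it assumes a single KV head and a CUDA device, and the function name `build_shadow_cache`, the rank, and the chunk size are placeholder choices; ShadowKV's actual ranks, chunk sizes, and tensor layouts differ.

```python
import torch

def build_shadow_cache(pre_rope_keys,   # [seq_len, head_dim], on GPU
                       post_rope_keys,  # [seq_len, head_dim], on GPU
                       values,          # [seq_len, head_dim], on GPU
                       rank=64, chunk_size=8):
    # 1) Truncated SVD of the pre-RoPE key cache. Only the two thin factors
    #    (seq_len x rank and rank x head_dim) stay on the GPU.
    U, s, Vh = torch.linalg.svd(pre_rope_keys.float(), full_matrices=False)
    key_a = U[:, :rank] * s[:rank]          # [seq_len, rank]
    key_b = Vh[:rank, :]                    # [rank, head_dim]

    # 2) The full value cache is offloaded to pinned CPU memory so it can be
    #    fetched asynchronously during decoding.
    values_cpu = values.contiguous().cpu().pin_memory()

    # 3) Chunk landmarks: the mean post-RoPE key of each chunk. Because
    #    adjacent post-RoPE keys are highly similar, the mean stands in for
    #    the whole chunk when approximating attention scores.
    n_chunks = post_rope_keys.shape[0] // chunk_size
    landmarks = (post_rope_keys[: n_chunks * chunk_size]
                 .reshape(n_chunks, chunk_size, -1).mean(dim=1))   # [n_chunks, head_dim]

    # (Not shown: chunks whose tokens deviate strongly from their mean are
    #  treated as outliers and kept exactly on the GPU as a small static cache.)
    return {"key_a": key_a, "key_b": key_b,
            "values_cpu": values_cpu, "landmarks": landmarks}
```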

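A companion sketch of one decoding step, covering the chunk selection and the multi-stream overlap. Again this is a sketch under assumptions, not ShadowKV's kernels: `apply_rope` is an assumed helper that re-applies rotary embeddings at the given positions, chunk scoring is a plain dot product against the landmarks, and the paper's fused gather/copy kernels are replaced by stock PyTorch ops.

```python
import torch

fetch_stream = torch.cuda.Stream()   # second stream for CPU -> GPU value copies

def sparse_decode_step(query, cache, apply_rope, budget_chunks=64, chunk_size=8):
    head_dim = cache["key_b"].shape[1]

    # 1) Approximate chunk importance with the landmarks and keep only a small
    #    budget of chunks (the paper reports a ~1.56% sparse KV budget suffices).
    scores = cache["landmarks"] @ query                        # [n_chunks]
    top = scores.topk(budget_chunks).indices                   # [budget_chunks]
    token_idx = (top[:, None] * chunk_size
                 + torch.arange(chunk_size, device=query.device)).reshape(-1)

    # 2a) Default stream: rebuild the selected pre-RoPE keys from the low-rank
    #     factors, then re-apply RoPE for their positions.
    keys = cache["key_a"][token_idx] @ cache["key_b"]          # [budget*chunk, head_dim]
    keys = apply_rope(keys, token_idx)

    # 2b) Fetch stream, overlapping with 2a: gather the selected values on the
    #     CPU into a pinned staging buffer and copy them to the GPU.
    with torch.cuda.stream(fetch_stream):
        staging = torch.empty(token_idx.numel(), head_dim, pin_memory=True,
                              dtype=cache["values_cpu"].dtype)
        torch.index_select(cache["values_cpu"], 0, token_idx.cpu(), out=staging)
        vals = staging.to(query.device, non_blocking=True)

    # 3) Wait for the value copy, then attend over the sparse KV set.
    torch.cuda.current_stream().wait_stream(fetch_stream)
    attn = torch.softmax(keys @ query / head_dim ** 0.5, dim=-1)
    return attn @ vals.to(attn.dtype)
```
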
-----

Key Insights 💡:

→ Pre-RoPE keys are exceptionally low-rank compared to the other KV cache components (a small diagnostic sketch follows this list)

→ Adjacent tokens in the post-RoPE key cache show high cosine similarity

→ Only a small fraction of chunks (0.3%) are outliers that need special handling

→ The KV cache exhibits strong temporal locality across decoding steps, enabling efficient caching
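
The sketch below shows how the first two observations could be checked empirically on a layer's key cache; the `inspect_key_cache` helper, the 95% energy threshold, and the reported statistics are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def inspect_key_cache(pre_rope_keys, post_rope_keys, energy=0.95):
    """Two quick diagnostics behind ShadowKV's design choices.
    Both inputs are [seq_len, head_dim] key matrices from one attention head."""

    # (a) How many singular values are needed to retain `energy` of the squared
    #     spectrum? Pre-RoPE keys need far fewer than post-RoPE keys.
    def effective_rank(mat):
        s = torch.linalg.svdvals(mat.float())
        cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
        return int((cum < energy).sum().item()) + 1

    # (b) Cosine similarity between each post-RoPE key and its neighbor; high
    #     values justify chunk-mean landmarks, low values flag outlier chunks.
    sim = F.cosine_similarity(post_rope_keys[:-1], post_rope_keys[1:], dim=-1)

    return {
        "pre_rope_rank": effective_rank(pre_rope_keys),
        "post_rope_rank": effective_rank(post_rope_keys),
        "adjacent_cos_sim_mean": sim.mean().item(),
        "adjacent_cos_sim_p05": sim.quantile(0.05).item(),
    }
```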

-----

Results 📊:

→ Supports up to 6x larger batch sizes across various models

→ Boosts throughput by up to 3.04x on an A100 GPU

→ Achieves an equivalent bandwidth of 7.2 TB/s

→ Maintains accuracy with just a 1.56% sparse KV cache budget

→ Successfully tested on models with context lengths up to 1M tokens
