"ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"

The podcast on this paper is generated with Google's Illuminate.

ShadowKV shrinks the GPU memory footprint of long-context LLM inference by keeping a low-rank key cache on the GPU and offloading the value cache to the CPU

📚 https://arxiv.org/abs/2410.21465

Original Problem 🎯:

Long-context LLM inference faces throughput bottlenecks due to the expanding Key-Value (KV) cache memory footprint. Existing solutions either compromise accuracy or fail to reduce GPU memory usage effectively.

-----

Solution in this Paper 🔧:

→ ShadowKV stores a low-rank pre-RoPE key cache on the GPU while offloading the full value cache to the CPU (see the cache-construction sketch after this list)

→ Uses a chunk-level approximation strategy with a 1.56% sparse KV budget to select the chunks attended to at each decoding step

→ Keeps a small set of outlier chunks (0.3%) as a static cache on the GPU

→ Employs CUDA multi-streams to overlap key cache reconstruction with value cache fetching (see the decoding sketch after this list)

→ Implements a cache-aware optimization that reduces computation and data movement by 60%
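
Below is a minimal PyTorch sketch of the cache construction described in the bullets above. It is illustrative rather than the paper's implementation: it assumes a single KV head and a CUDA device, and the function name `build_shadow_cache`, the rank, and the chunk size are placeholder choices; ShadowKV's actual ranks, chunk sizes, and tensor layouts differ.

```python
import torch

def build_shadow_cache(pre_rope_keys,   # [seq_len, head_dim], on GPU
                       post_rope_keys,  # [seq_len, head_dim], on GPU
                       values,          # [seq_len, head_dim], on GPU
                       rank=64, chunk_size=8):
    # 1) Truncated SVD of the pre-RoPE key cache. Only the two thin factors
    #    (seq_len x rank and rank x head_dim) stay on the GPU.
    U, s, Vh = torch.linalg.svd(pre_rope_keys.float(), full_matrices=False)
    key_a = U[:, :rank] * s[:rank]          # [seq_len, rank]
    key_b = Vh[:rank, :]                    # [rank, head_dim]

    # 2) The full value cache is offloaded to pinned CPU memory so it can be
    #    fetched asynchronously during decoding.
    values_cpu = values.contiguous().cpu().pin_memory()

    # 3) Chunk landmarks: the mean post-RoPE key of each chunk. Because
    #    adjacent post-RoPE keys are highly similar, the mean stands in for
    #    the whole chunk when approximating attention scores.
    n_chunks = post_rope_keys.shape[0] // chunk_size
    landmarks = (post_rope_keys[: n_chunks * chunk_size]
                 .reshape(n_chunks, chunk_size, -1).mean(dim=1))   # [n_chunks, head_dim]

    # (Not shown: chunks whose tokens deviate strongly from their mean are
    #  treated as outliers and kept exactly on the GPU as a small static cache.)
    return {"key_a": key_a, "key_b": key_b,
            "values_cpu": values_cpu, "landmarks": landmarks}
```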

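A companion sketch of one decoding step, covering the chunk selection and the multi-stream overlap. Again this is a sketch under assumptions, not ShadowKV's kernels: `apply_rope` is an assumed helper that re-applies rotary embeddings at the given positions, chunk scoring is a plain dot product against the landmarks, and the paper's fused gather/copy kernels are replaced by stock PyTorch ops.

```python
import torch

fetch_stream = torch.cuda.Stream()   # second stream for CPU -> GPU value copies

def sparse_decode_step(query, cache, apply_rope, budget_chunks=64, chunk_size=8):
    head_dim = cache["key_b"].shape[1]

    # 1) Approximate chunk importance with the landmarks and keep only a small
    #    budget of chunks (the paper reports a ~1.56% sparse KV budget suffices).
    scores = cache["landmarks"] @ query                        # [n_chunks]
    top = scores.topk(budget_chunks).indices                   # [budget_chunks]
    token_idx = (top[:, None] * chunk_size
                 + torch.arange(chunk_size, device=query.device)).reshape(-1)

    # 2a) Default stream: rebuild the selected pre-RoPE keys from the low-rank
    #     factors, then re-apply RoPE for their positions.
    keys = cache["key_a"][token_idx] @ cache["key_b"]          # [budget*chunk, head_dim]
    keys = apply_rope(keys, token_idx)

    # 2b) Fetch stream, overlapping with 2a: gather the selected values on the
    #     CPU into a pinned staging buffer and copy them to the GPU.
    with torch.cuda.stream(fetch_stream):
        staging = torch.empty(token_idx.numel(), head_dim, pin_memory=True,
                              dtype=cache["values_cpu"].dtype)
        torch.index_select(cache["values_cpu"], 0, token_idx.cpu(), out=staging)
        vals = staging.to(query.device, non_blocking=True)

    # 3) Wait for the value copy, then attend over the sparse KV set.
    torch.cuda.current_stream().wait_stream(fetch_stream)
    attn = torch.softmax(keys @ query / head_dim ** 0.5, dim=-1)
    return attn @ vals.to(attn.dtype)
```
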
-----

Key Insights 💡:

→ Pre-RoPE keys are exceptionally low-rank compared to the other KV cache components (a small diagnostic sketch follows this list)

→ Adjacent tokens in the post-RoPE key cache show high cosine similarity

→ Only a small fraction of chunks (0.3%) are outliers that need special handling

→ The KV cache exhibits strong temporal locality across decoding steps, enabling efficient caching
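
The sketch below shows how the first two observations could be checked empirically on a layer's key cache; the `inspect_key_cache` helper, the 95% energy threshold, and the reported statistics are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def inspect_key_cache(pre_rope_keys, post_rope_keys, energy=0.95):
    """Two quick diagnostics behind ShadowKV's design choices.
    Both inputs are [seq_len, head_dim] key matrices from one attention head."""

    # (a) How many singular values are needed to retain `energy` of the squared
    #     spectrum? Pre-RoPE keys need far fewer than post-RoPE keys.
    def effective_rank(mat):
        s = torch.linalg.svdvals(mat.float())
        cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
        return int((cum < energy).sum().item()) + 1

    # (b) Cosine similarity between each post-RoPE key and its neighbor; high
    #     values justify chunk-mean landmarks, low values flag outlier chunks.
    sim = F.cosine_similarity(post_rope_keys[:-1], post_rope_keys[1:], dim=-1)

    return {
        "pre_rope_rank": effective_rank(pre_rope_keys),
        "post_rope_rank": effective_rank(post_rope_keys),
        "adjacent_cos_sim_mean": sim.mean().item(),
        "adjacent_cos_sim_p05": sim.quantile(0.05).item(),
    }
```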

-----

Results 📊:

→ Supports up to 6x larger batch sizes across various models

→ Boosts throughput by up to 3.04x on an A100 GPU

→ Achieves an equivalent bandwidth of 7.2 TB/s

→ Maintains accuracy with just a 1.56% sparse KV cache budget

→ Successfully tested on models with context lengths up to 1M tokens
