"ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression"

The podcast on this paper is generated with Google's Illuminate.

ClusterKV groups similar tokens together to make LLMs run faster with less memory.

ClusterKV introduces semantic clustering for efficient KV cache compression in LLMs, enabling better token recall while reducing memory and computational costs during inference with long contexts.

-----

https://arxiv.org/abs/2412.03213

🔍 Original Problem:

→ LLMs face significant efficiency challenges as context lengths grow: the KV cache size increases linearly with sequence length, driving up memory cost and latency (see the back-of-the-envelope estimate after this list).

→ Existing KV cache compression methods either permanently evict tokens or recall them at page-level granularity, degrading model accuracy.
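
To make the linear growth concrete, here is a back-of-the-envelope estimate of KV cache memory for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16). The model dimensions are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size for a hypothetical 7B-class model.
# All model dimensions below are illustrative assumptions, not figures from the paper.
layers = 32          # transformer layers
kv_heads = 32        # key/value heads (no grouped-query attention assumed)
head_dim = 128       # dimension per attention head
bytes_per_elem = 2   # fp16 storage

# Both keys and values are cached, hence the factor of 2.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {bytes_per_token / 2**20:.2f} MiB")

for ctx in (4_096, 32_768, 131_072):
    print(f"context {ctx:>7}: {ctx * bytes_per_token / 2**30:.1f} GiB")  # grows linearly with context
```

Under these assumptions, a 32k context already consumes roughly 16 GiB of KV cache for a single sequence, which is the kind of memory pressure ClusterKV targets.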

-----

🛠️ Solution in this Paper:

→ ClusterKV implements token recall at semantic cluster granularity instead of fixed pages.

→ It groups tokens with similar semantic features using K-means clustering in the key vector space (a minimal sketch follows after this list).

→ The system optimizes clustering through GPU-based parallel processing and efficient kernel implementations.

→ It maintains a cluster-granularity cache to reduce CPU-GPU data transfers.
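
The sketch below illustrates the clustering step under simplifying assumptions: key vectors for a single head are grouped with scikit-learn's K-means on CPU, whereas the paper uses optimized GPU kernels; the cluster count and the normalization choice are likewise assumptions, not the paper's settings.

```python
# A minimal sketch of grouping cached key vectors by semantic similarity with
# K-means, in the spirit of ClusterKV. The cluster count, the normalization,
# and the use of scikit-learn on CPU are illustrative assumptions; the paper
# performs clustering with optimized GPU kernels.
import numpy as np
from sklearn.cluster import KMeans

seq_len, head_dim = 4096, 128
keys = np.random.randn(seq_len, head_dim).astype(np.float32)  # cached key vectors for one head

# Normalize so that Euclidean K-means roughly groups by angular (cosine) similarity.
keys_norm = keys / np.linalg.norm(keys, axis=1, keepdims=True)

n_clusters = 64
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(keys_norm)

labels = km.labels_              # cluster id assigned to each cached token
centroids = km.cluster_centers_  # one representative vector per cluster
print(labels.shape, centroids.shape)  # (4096,), (64, 128)
```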

-----

💡 Key Insights:

→ Tokens that are close in semantic space exhibit similar attention weights.

→ Semantic clustering enables more precise token selection than position-based paging (see the recall sketch after this list).

→ Cluster-based caching significantly reduces memory overhead.
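
A sketch of what recall at cluster granularity could look like: each cluster is scored by the current query's similarity to its centroid, and whole clusters are recalled in order of relevance until a token budget is filled. The variable names and the greedy budget policy are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal sketch of cluster-granularity recall: score every cluster by the
# query's similarity to its centroid, then recall whole clusters in order of
# relevance until the KV cache budget is filled. The greedy policy and all
# names here are illustrative assumptions.
import numpy as np

head_dim, n_clusters, seq_len, budget = 128, 64, 4096, 1024

query = np.random.randn(head_dim).astype(np.float32)            # current decoding query
centroids = np.random.randn(n_clusters, head_dim).astype(np.float32)
labels = np.random.randint(0, n_clusters, size=seq_len)         # cluster id per cached token

scores = centroids @ query                  # one relevance score per cluster
selected = []
for c in np.argsort(-scores):               # clusters from most to least relevant
    members = np.flatnonzero(labels == c)   # all tokens in this cluster are recalled together
    if len(selected) + len(members) > budget:
        break
    selected.extend(members.tolist())

print(f"recalled {len(selected)} of {seq_len} cached tokens within a budget of {budget}")
```

Because whole clusters are moved together, recall happens in a few large chunks rather than many token-sized transfers, which is what makes the cluster-granularity cache effective at cutting CPU-GPU traffic.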

-----

📊 Results:

→ Supports a 32k context length using a KV cache budget of only 1k-2k tokens.

→ Delivers a 2x speedup in latency and a 2.5x improvement in decoding throughput.

→ Maintains model accuracy with negligible loss across a range of tasks.
