Cut KV cache memory by 75% while keeping 97% of performance through smarter compression.
Fewer bits per token means more tokens in memory.
The paper introduces a novel KV cache compression method that balances token count and numerical precision, achieving better performance while using less memory.
-----
https://arxiv.org/abs/2412.12706
Original Problem 🎯:
→ As LLMs handle longer contexts, KV cache memory becomes a major inference bottleneck, requiring about 20GB for a 100k-token context (a rough sizing sketch follows this list)
→ Current compression methods focus on either reducing the number of stored tokens (pruning) or lowering their numerical precision (quantization), but not both together
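To see where that memory goes, here is a back-of-the-envelope sizing sketch. The layer/head/dim numbers are illustrative for a 7B-class model with grouped-query attention, not taken from the paper; the exact ~20GB figure depends on the model architecture.

```python
# Rough KV cache sizing: two tensors (K and V) per layer, per token.
# Config values are illustrative, not from the paper.
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V entries
    return n_tokens * values_per_token * bits / 8            # bits -> bytes

fp16_gb = kv_cache_bytes(100_000, bits=16) / 1e9
int4_gb = kv_cache_bytes(100_000, bits=4) / 1e9
print(f"fp16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB")  # int4 is 4x smaller
```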
-----
Solution in this Paper 🔧:
→ The paper proposes "quantized pruning" that combines token pruning with precision reduction
→ Instead of storing fewer tokens at high precision (16-bit), it stores more tokens at lower precision (4-bit)
→ It leverages existing KV pruning methods to select important tokens and quantization techniques to reduce their precision
→ The approach maintains performance while significantly reducing memory usage (a minimal sketch of the idea follows this list)
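A minimal sketch of how the two steps compose. The importance score (here, attention-based top-k) and the per-token 4-bit min-max quantizer are illustrative stand-ins; the paper may use different pruning criteria and quantizers.

```python
import torch

def quantized_pruning(keys, values, attn_scores, keep_ratio=0.5, bits=4):
    """Keep the highest-scoring tokens, then store their K/V at low precision.

    keys, values: [num_tokens, head_dim]; attn_scores: [num_tokens].
    Scoring rule and quantizer are illustrative, not the paper's exact choices.
    """
    # 1) Token pruning: keep the top fraction of tokens by importance score.
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep_idx = attn_scores.topk(k).indices.sort().values  # preserve token order

    # 2) Quantization: per-token asymmetric min-max to 2**bits levels.
    def quantize(x):
        levels = 2 ** bits - 1
        lo = x.min(dim=-1, keepdim=True).values
        hi = x.max(dim=-1, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / levels
        q = ((x - lo) / scale).round().clamp(0, levels).to(torch.uint8)
        return q, scale, lo  # scale/zero-point kept for dequantization

    return quantize(keys[keep_idx]), quantize(values[keep_idx]), keep_idx
```

Under the same memory budget, halving the kept tokens while quantizing to 4-bit still leaves room for roughly 2x more tokens than a 16-bit cache could hold.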
-----
Key Insights 🔍:
→ At an equal memory budget, 4-bit precision with 4x as many tokens outperforms 16-bit precision with 1x tokens (see the budget arithmetic after this list)
→ Lower precision storage works particularly well for retrieval-heavy tasks
→ The approach remains stable across different model scales and architectures
→ Middle transformer layers benefit more from having more tokens than from higher precision
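The equal-budget trade-off behind the first insight is simple arithmetic; the sketch below ignores the small overhead of quantization scales and zero-points.

```python
# Same bit budget per stored value: N tokens at 16-bit == 4N tokens at 4-bit
# (ignoring the small overhead of quantization scales and zero-points).
n_tokens, bits_full, bits_quant = 1_000, 16, 4
budget = n_tokens * bits_full            # bits spent storing N tokens at fp16
tokens_at_4bit = budget // bits_quant    # tokens that fit at 4-bit
assert tokens_at_4bit == 4 * n_tokens
```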
-----
Results 📊:
→ Achieves 82.2% accuracy on the RULER-8k task using 4-bit precision (vs 67.5% with 16-bit)
→ Maintains 97% of performance while reducing KV cache size by 75%
→ Shows consistent improvements across Llama-3 and Mistral-7B models