Table of Contents
How to calculate size of KV cache
KV Cache Size in Transformer Models
Calculating KV Cache Size: General Formula and Scaling
GPT-4 KV Cache in Extended-Context Generation
Llama 3 Architecture and KV Cache Efficiency
DeepSeek Models and Multi-Head Latent Attention (MLA)
Recent Optimizations and Research (2024–2025)
KV Cache Size in Transformer Models
Transformer-based large language models (LLMs) rely on a key-value (KV) cache during autoregressive generation. This cache stores the keys and values computed at each attention layer for all past tokens, enabling the model to attend to prior context without recomputation. However, the KV cache can consume enormous memory for long sequences. We begin by outlining the general formula for KV cache size, then examine specifics for GPT-4, Llama 3, and DeepSeek – including recent optimizations that emerged in 2024–2025 to make KV storage more efficient.
Calculating KV Cache Size: General Formula and Scaling
In a standard Transformer with multi-head attention (MHA), each new token produces a set of key vectors and value vectors at each layer. These must be stored for use by future tokens. The total KV cache memory grows linearly with sequence length L, number of layers N, and model hidden dimension d_model. Formally, for each token we cache two vectors (a key and a value) per layer; since each key or value vector has dimensionality d_model (all attention head outputs concatenated), the number of elements cached per token, per layer, is:

\[ \text{elements per token per layer} = 2 \times N_{\text{heads}} \times d_{\text{head}} = 2 \times d_{\text{model}}. \]

The total cache size is this quantity multiplied by the number of layers N, the sequence length L, and the number of bytes per element (e.g., 2 bytes for FP16 storage).
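To make the arithmetic concrete, the short Python helper below multiplies out these terms; the function name and the example configuration (32 layers, d_model = 4096, an 8K-token context) are illustrative choices rather than figures from any particular model card, and FP16 storage (2 bytes per element) is assumed.

```python
def kv_cache_bytes(n_layers: int, seq_len: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    """Total KV cache size: 2 (K and V) x layers x tokens x d_model x precision."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, d_model = 4096, full multi-head
# attention, FP16 keys/values, 8K tokens of cached context.
size = kv_cache_bytes(n_layers=32, seq_len=8192, d_model=4096)
print(f"{size / 2**30:.1f} GiB")   # -> 4.0 GiB
```

At roughly 4 GiB for a single 8K-token sequence, it is easy to see why batching many long requests quickly exhausts GPU memory.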
GPT-4 KV Cache in Extended-Context Generation
Memory Offloading: Another consideration for GPT-4’s 128K context is leveraging external memory. Recent research shows that offloading KV tensors to CPU (host) memory can alleviate GPU memory pressure, at some cost to latency. This is a strategy to enable extreme context lengths: for example, the InfiniteHiP system prunes irrelevant tokens and offloads KV to host memory, managing up to 3 million tokens on a single 48 GB GPU. While GPT-4’s implementation isn’t public, it likely uses a combination of these strategies – reducing the KV stored (via architectural choices like MQA/GQA) and possibly intelligent cache management – to handle its long context. The importance of such methods is underscored by recent work: even with weight compression, at 32K–128K sequences the KV cache becomes the main memory bottleneck for LLM inference (KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization). In summary, GPT-4’s extended context demanded innovations to keep KV cache size feasible, including sharing keys/values across heads and advanced caching policies, allowing it to process up to 128K tokens of history without running out of memory.
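To ground the offloading idea, here is a minimal sketch, with hypothetical shapes, of keeping the cache in pinned host RAM and streaming one layer’s K/V to the GPU at a time; it is not GPT-4’s (unpublished) implementation, nor the InfiniteHiP system, which additionally prunes tokens.

```python
import torch

# Minimal sketch of KV offloading: the full cache lives in pinned (page-locked)
# host RAM, and only the layer currently running attention is copied to the GPU.
# Shapes are hypothetical; a CUDA-capable machine is required, and the host
# buffers below occupy roughly 4 GiB of pinned RAM.

n_layers, n_kv_heads, head_dim, seq_len = 32, 8, 128, 32_768

kv_cpu = [
    torch.empty(2, seq_len, n_kv_heads, head_dim,   # [K/V, tokens, kv_heads, dim]
                dtype=torch.float16, pin_memory=True)
    for _ in range(n_layers)
]

def fetch_layer_kv(layer_idx: int):
    """Stream one layer's K/V to the GPU just before that layer's attention step."""
    # The GPU copy is transient: use it for this layer's attention, then let it
    # be freed, so device memory only ever holds a single layer's keys/values.
    kv = kv_cpu[layer_idx].to("cuda", non_blocking=True)
    return kv[0], kv[1]   # keys, values
```

The price is extra host-to-device traffic on every decoding step, which is exactly the latency cost that offloading approaches trade against GPU memory.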
Llama 3 Architecture and KV Cache Efficiency
Precision and Quantization: Llama 3 and its variants also benefit from lower-precision storage for KV. Storing keys/values in 16-bit floating point is standard, but recent research pushes this to 8-bit or even 4-bit. For example, experiments on Llama 3 show that 3-bit KV cache quantization can be achieved with <0.1 perplexity degradation. This yields enormous memory gains – roughly a 5× increase in effective context length on the same hardware. In fact, a 2024 method called KVQuant demonstrated that using 3-bit compressed KV caches allows LLaMA/Llama 3 models to run with 4.8× longer contexts without additional perplexity penalty. By compressing KV to even 2-bit, they could serve a LLaMA 7B model with 1 million tokens on a single 80 GB GPU, and up to 10 million tokens on an 8-GPU server. These results highlight how quantization can multiply the effective context window. While Llama 3 itself may not ship with 3-bit KV by default, it is designed to be compatible with such optimizations – evidenced by community releases of Llama 3 70B using FP8 (8-bit) KV caches for memory savings. In summary, Llama 3’s architecture (using GQA) and the ongoing research around it (quantized KV caching) reflect a strong focus on KV efficiency, allowing it to double context length with controlled memory growth.
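To illustrate why lower-precision KV storage helps, the sketch below applies symmetric per-channel 8-bit quantization to a cached key tensor; the function names and shapes are ours, and this toy version omits the outlier handling and pre-RoPE key quantization that make schemes like KVQuant viable at 3-bit and below.

```python
import torch

# Toy sketch of symmetric per-channel 8-bit quantization for cached keys/values.
# Illustrates the general idea only; real schemes (e.g., KVQuant) add outlier
# handling and quantize keys before rotary position embeddings are applied.

def quantize_kv(x: torch.Tensor):
    """Quantize a [tokens, d] FP16 tensor to int8 with one scale per channel."""
    scale = (x.float().abs().amax(dim=0) / 127.0).clamp_min(1e-6)   # [d] scales
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).to(torch.float16)

keys = torch.randn(4096, 128, dtype=torch.float16)    # 4K cached key vectors
q, scale = quantize_kv(keys)
print(keys.numel() * keys.element_size())             # 1,048,576 bytes in FP16
print(q.numel() * q.element_size())                   # 524,288 bytes in int8
print((dequantize_kv(q, scale) - keys).abs().max())   # small reconstruction error
```

Halving (int8) or quartering (int4) the bytes-per-element term in the size formula above directly multiplies the context length that fits in the same memory budget.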
DeepSeek Models and Multi-Head Latent Attention (MLA)
Recent Optimizations and Research (2024–2025)
Beyond the specific models above, there is a flurry of research in 2024 and 2025 addressing KV cache efficiency for Transformers:
Multi-Query and Grouped-Query Attention: As noted, MQA/GQA are now widely adopted in LLMs (PaLM, Llama 2/3, etc.) to reduce the number of KV heads. This approach strikes a balance between expressiveness and memory use, and has become standard practice for large models. For instance, Llama 2 70B used 8 KV heads (GQA) (Is the Command R KV cache unusually large? : r/LocalLLaMA), and many new models follow suit; a sizing example appears after this list.
Cross-Layer Attention (CLA): Introduced in 2024, CLA pushes KV sharing further by reusing keys/values between layers. In a Transformer with CLA, certain layers compute fresh K/V, while others simply reuse the KV cache from a previous layer, effectively halving the number of unique KV sets that must be stored; a toy sketch of this pattern follows the list. Brandon et al. (2024) showed that combining CLA with MQA can reduce KV cache size by another 2× on top of MQA’s gains, with only minimal loss in perplexity. CLA offers an additional memory–accuracy tradeoff knob: one can share KV between every two layers (2× reduction) or among larger groups of layers for more extreme compression. This concept is being explored to allow longer sequences and batch sizes than otherwise possible on fixed memory budgets.
KV Cache Compression and Pruning: Several strategies drop or compress less useful parts of the cache on the fly. SnapKV (Li et al., 2024) learns which past token positions are most “attended-to” and retains only those in the cache for each head, discarding others (SnapKV: LLM Knows What You are Looking for Before Generation). By dynamically selecting important tokens (using an observation window to identify attention patterns), SnapKV achieved an ~8.2× memory reduction and 3.6× speedup at 16K context, with negligible impact on answer quality. It can even scale to ~380K-token contexts on a single 80 GB GPU (H100/A100) with minor accuracy loss. Another approach, H2O (Zhang et al., 2023), uses a heavy-hitter oracle to evict KV entries that the model is unlikely to revisit. These methods treat the KV cache like an LRU cache or a memory store, keeping it trimmed to the most relevant content to save space. They do not require retraining the model and can be applied during inference (often as a wrapper around the decoding process); a simplified eviction sketch appears after this list.
Low Precision Activations: As discussed, quantizing the KV cache is a very active area. Techniques like Per-Channel and Non-uniform Quantization (Hooper et al., 2024; Shridhar et al., 2024) tailor the quantization to the distribution of key/value activations (KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization) . By handling outliers separately and even applying quantization before applying rotary position encodings (to make values easier to compress) , these methods have pushed KV precision down to 3-bit or 2-bit while preserving model accuracy . The payoff is enormous memory savings, as evidenced by KVQuant enabling million-token contexts on hardware that previously maxed out at a few thousand . This line of work suggests that future LLM deployments might routinely use 8-bit or 4-bit KV caches to serve longer prompts without additional cost.
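To quantify the MQA/GQA item above: with grouped KV heads, the per-token term in the earlier formula shrinks from N_heads × d_head to N_kv_heads × d_head, because only the KV heads are cached. The numbers below use a Llama-2-70B-like shape (80 layers, 64 query heads of dimension 128, 8 KV heads, a 4K-token context) purely for illustration.

```python
def kv_cache_gib(n_layers: int, seq_len: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB when only n_kv_heads key/value heads are stored."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem / 2**30

# Llama-2-70B-like shape: 80 layers, head_dim 128, FP16, 4K-token context.
full_mha = kv_cache_gib(n_layers=80, seq_len=4096, n_kv_heads=64, head_dim=128)
gqa_8    = kv_cache_gib(n_layers=80, seq_len=4096, n_kv_heads=8,  head_dim=128)
print(f"MHA: {full_mha:.1f} GiB vs. GQA with 8 KV heads: {gqa_8:.2f} GiB")  # 8x smaller
```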
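The cross-layer sharing pattern from the CLA item can be sketched in a few lines; this is a toy illustration of the idea (even layers produce and cache K/V, odd layers reuse the entry from the layer below), not the trained architecture from Brandon et al.

```python
import torch
import torch.nn as nn

# Toy cross-layer KV sharing: even layers project fresh K/V and cache them;
# odd layers reuse the cache entry written by the layer just below, so only
# half as many KV sets are ever stored. Dimensions are illustrative.

d_model, n_layers = 512, 8
kv_proj = nn.ModuleList(
    [nn.Linear(d_model, 2 * d_model) if i % 2 == 0 else nn.Identity()
     for i in range(n_layers)]
)

def layer_kv(hidden: torch.Tensor, layer_idx: int, cache: list):
    """Return (K, V) for a layer, computing and caching them only on even layers."""
    if layer_idx % 2 == 0:                        # "producer" layer
        k, v = kv_proj[layer_idx](hidden).chunk(2, dim=-1)
        cache.append((k, v))                      # only these are stored
    return cache[-1]                              # odd layers reuse the latest K/V

cache = []
hidden = torch.randn(1, 16, d_model)              # [batch, tokens, d_model]
for i in range(n_layers):
    k, v = layer_kv(hidden, i, cache)
print(len(cache), "KV sets cached for", n_layers, "layers")   # 4 sets for 8 layers
```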
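Finally, the eviction idea behind SnapKV and H2O can be approximated by scoring each cached position with the attention mass that recent queries place on it and keeping only the top-scoring positions plus a recent window. The sketch below is a simplified single-head version with made-up budgets; the published methods add per-head selection, pooling, and careful handling of the observation window.

```python
import torch

# Simplified attention-based KV eviction: keep the positions that recent queries
# attend to most ("heavy hitters"), always retain a recent window, drop the rest.

def evict_kv(keys, values, attn_weights, budget=256, keep_recent=32):
    """keys/values: [tokens, d]; attn_weights: [recent_queries, tokens]."""
    scores = attn_weights.sum(dim=0)              # accumulated attention mass
    scores[-keep_recent:] = float("inf")          # never evict the recent window
    keep = torch.topk(scores, k=min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep], keep

keys, values = torch.randn(4096, 128), torch.randn(4096, 128)
attn = torch.softmax(torch.randn(64, 4096), dim=-1)   # recent queries' attention
k2, v2, kept = evict_kv(keys, values, attn)
print(keys.shape[0], "->", k2.shape[0], "cached tokens")   # 4096 -> 256
```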
References: The analysis above cites key papers from 2024–2025 that substantiate these points, including arXiv preprints on Cross-Layer Attention (Brandon et al., 2024), KV cache quantization (KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization), the DeepSeek technical reports (DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model), and cache compression techniques such as SnapKV (Li et al., 2024).