Table of Contents
KV Caching in LLM Inference: A Comprehensive Review
Theoretical Advancements in KV Caching
Practical Implementation in Modern LLMs
Industry Applications of KV Caching
KV Caching vs Other Optimization Techniques
KV Caching in PyTorch, TensorFlow, and JAX
Performance Benchmarks and Model Comparisons
Key-Value (KV) caching is a technique used in large language model (LLM) inference to store the key and value tensors from previous decoding steps. By reusing these stored tensors for each new token’s attention computation, KV caching avoids redundant calculations and significantly accelerates autoregressive generation. This review covers recent theoretical advancements in KV caching (2024–2025), practical integration strategies in model architectures, real-world enterprise use cases, comparisons with alternative optimizations, framework-specific implementations (PyTorch, TensorFlow, JAX), and performance benchmarks from cutting-edge LLMs like Mistral, DeepSeek, and OpenAI’s latest models.
Theoretical Advancements in KV Caching
Avoiding Recomputing Attention: In a transformer decoder, generating each new token involves computing self-attention against all prior tokens. KV caching mitigates this by storing past keys and values so that each iteration only computes attention for the latest token’s queries. This yields a dramatic speedup: after the first token (which computes full attention for the prompt), subsequent tokens reuse cached KV pairs and only add one new key and value each time. The result is a roughly constant per-token latency after the first token, greatly improving throughput for long sequences.
Challenges with Cache Growth: A drawback is that the KV cache grows linearly with sequence length, consuming substantial memory. For each generated token, the cache must store a key and value vector per transformer layer and head. For example, LLaMA-2 13B uses ~1 MB of cache per output token. Over a 4K-token context, this is ~4 GB per sequence – a sizable fraction of the model’s own memory footprint, and across a batch of concurrent sequences the aggregate cache can rival or exceed the model weights – and it grows further for bigger models or longer outputs. This growth leads to memory bottlenecks and increased memory-bandwidth pressure during attention, especially in long-context or long-response tasks.
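For a sense of where these numbers come from, here is a quick back-of-the-envelope calculation (a sketch assuming LLaMA-2 13B’s published shape of 40 layers, 40 attention heads, and head dimension 128, stored in fp16):
# Rough KV cache footprint per token for LLaMA-2 13B in fp16
n_layers, n_heads, head_dim, bytes_per_value = 40, 40, 128, 2
per_token_bytes = 2 * n_layers * n_heads * head_dim * bytes_per_value  # 2 = one key + one value
print(per_token_bytes / 1e6)          # ~0.82 MB per generated token
print(4096 * per_token_bytes / 1e9)   # ~3.4 GB for a 4K-token sequence
The exact totals depend on precision, batch size, and implementation overhead, but the linear growth with sequence length is the key point.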
Recent Research (2024–2025): A surge of research aims to compress or limit KV caches without sacrificing model performance:
Constant-Size Caches: MorphKV (2025) introduces an adaptive method to maintain a fixed-size KV cache by selectively retaining the most relevant key/value pairs. Instead of dropping old tokens arbitrarily, MorphKV uses attention patterns to iteratively refine which past tokens to keep, preserving long-range dependencies with minimal accuracy loss. This yields >50% memory savings over prior methods while even improving long-form accuracy in benchmarks.
Cache Compression: MiniCache (2024) compresses the KV cache across layers by merging adjacent layers’ states. It observes high similarity in deep layers’ KV tensors, so it “disentangles” each state into magnitude and direction, then interpolates directions between layers to reduce depth redundancy. MiniCache achieved up to 5× compression (e.g. 41% less memory with near-lossless performance) and ~5× throughput gain on LLaMA-2 using 4-bit compressed caches.
Selective Retention: SnapKV (2024) takes a fine-tuning-free approach by selecting only the “important” past token positions for each attention head. It finds that each head mainly attends to a subset of prompt features, identifiable via an observation window. SnapKV clusters and keeps those crucial KV entries, discarding others (a simplified sketch of this selective-retention idea appears after this list). This yields up to 3.6× faster generation and 8.2× lower memory use on 16K-token inputs, with negligible accuracy drop. Impressively, SnapKV enabled processing a 380K-token context on a single 80GB GPU (Qwen-7B model) with only minor quality loss.
Quantization of KV: AQUA-KV (2024) (“Cache Me If You Must”) dynamically quantizes KV tensors to shrink memory while maintaining accuracy. By adaptively allocating precision based on content, AQUA-KV achieved higher compression rates on LLaMA 3.x models compared to static quantization, supporting extremely long contexts (aiming for 10M token inference). Other works like KVQuant pursue 4-bit or mixed-precision KV caching to reach context lengths that were previously infeasible.
Sparse or Retrievable KV: Beyond compression, another direction is to avoid storing all KV by using sparse attention patterns. RetrievalAttention (2024) proposes offloading past keys/values to CPU and using approximate nearest-neighbor search to retrieve only the most relevant ones for each new token. It addresses distribution mismatches with an attention-aware retrieval algorithm, achieving near-full accuracy while accessing just 1–3% of cached data. This drastically cuts memory and still allows e.g. serving 128K-token contexts on a single 24 GB GPU (8B model) with only ~0.188s per token.
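To make the selective-retention idea concrete, here is a deliberately simplified eviction sketch. It is not the actual MorphKV or SnapKV algorithm, just the generic “keep the most-attended positions” pattern, assuming caches of shape (batch, seq_len, dim) and a precomputed per-position importance score:
import torch

def evict_kv(cache_k, cache_v, scores, budget):
    # scores: [seq_len] – e.g. the average attention mass each cached position
    # received over a recent observation window (a simplified importance signal)
    if cache_k.shape[1] <= budget:
        return cache_k, cache_v
    keep = scores.topk(budget).indices.sort().values  # most important positions, kept in order
    return cache_k[:, keep], cache_v[:, keep]
Real systems differ in how the scores are computed, whether eviction is per head, and how recent tokens are protected, but the budgeted-cache principle is the same.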
In summary, the theoretical advances aim to balance memory and compute: either by making the cache footprint smaller (through compression, quantization, or fixed windows) or by structuring the attention to not need every past token (via selective or sparse retrieval of KV). These innovations are crucial as LLMs scale to longer contexts and are expected to play a key role in next-generation models.
Practical Implementation in Modern LLMs
Modern transformer-based LLM architectures have built-in support for KV caching to accelerate inference. In practice, KV caching is enabled by default in most decoder models. For instance, the Hugging Face Transformers library automatically caches past_key_values when use_cache=True, so each new decoding step concatenates the latest key/value with the previous cache instead of recomputing it. Below is simplified PyTorch-style pseudocode illustrating this mechanism:
# Simplified KV cache usage in PyTorch
if cache["key"] is None:
    # First token: initialize the cache with the prompt's keys/values
    cache["key"] = new_key
    cache["value"] = new_value
else:
    # Append the new token's key/value to the cache along the sequence dimension
    cache["key"] = torch.cat([cache["key"], new_key], dim=1)
    cache["value"] = torch.cat([cache["value"], new_value], dim=1)

# Use the cached keys/values for attention in the next forward pass
output = self_attention(query=current_query,
                        keys=cache["key"], values=cache["value"])
Using such a cache means the self-attention for token n only computes Attention(Q_n, K_{1:n}, V_{1:n}), where K_{1:n} and V_{1:n} include all past keys and values. This yields identical results to recomputing from scratch, but is much faster for long text.
Memory Management Strategies: Because naive caching can bloat memory, LLMs incorporate strategies to manage cache size:
Sliding Window Cache: Some models (e.g. Mistral-7B) limit cache length by evicting old tokens beyond a window. Mistral uses a sliding window of W tokens (4096 in some configs): only the most recent W tokens’ KV pairs are kept, and older ones are dropped as new tokens arrive. Despite this, the model can still indirectly leverage older context through the stacking of transformer layers (each layer’s state can carry information forward). Sliding Window Attention thus supports long contexts (e.g. 16K) at a fixed memory cost instead of linear growth; a minimal rolling-cache sketch appears after this list.
Grouped-Query Attention (GQA): Reduces the number of KV vectors by using fewer “effective” heads. For example, LLaMA-2 70B employs GQA to share key/value projections across groups of query heads, cutting memory usage with minimal quality loss. GQA and similar architectural changes shrink the KV cache per token by reducing the number of key/value heads or the head dimension.
Paged Attention: The vLLM library introduced PagedAttention, which treats GPU memory like virtual memory pages for KV storage (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). Instead of one big contiguous cache per sequence (which leads to fragmentation and unused space when sequences vary in length), PagedAttention breaks the KV cache into fixed-size blocks (“pages”) that can be dynamically allocated and reused. This innovation virtually eliminated memory waste (reducing KV memory fragmentation from ~70% to <4%) and allowed serving many requests efficiently. In practice, vLLM with PagedAttention achieved up to 24× higher throughput than naive HuggingFace inference by packing sequences more efficiently and maximizing GPU utilization.
Distributed and Offloaded KV: In multi-GPU setups, caches can be sharded by layers or sequence to fit longer contexts. System-level optimizations include offloading parts of the cache to CPU or even disk when GPU memory is tight, or sharing a prefix cache across requests with identical prompts to avoid duplicate storage. Frameworks like NVIDIA’s TensorRT-LLM offer an API to reuse KV cache pages across requests with the same prefix, dramatically reducing time to first token for repetitive prompts.
Static vs. Dynamic Cache: Typically, frameworks allocate a dynamic cache that grows as needed (up to the max context). Some scenarios use a static cache – preallocating the maximum size – which avoids reallocation overhead and can be reused across sequences. A static cache is useful if one knows the input sizes in advance or for batch processing fixed-length prompts. Offloaded static caches can even keep long-context states on CPU as a fallback to prevent OOM errors.
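To illustrate the sliding-window strategy referenced above, here is a minimal rolling-cache sketch. It simply slices tensors for clarity; production implementations such as Mistral’s rolling buffer overwrite fixed slots in place instead:
import torch

WINDOW = 4096  # e.g. a Mistral-style sliding window

def update_sliding_cache(cache_k, cache_v, new_k, new_v, window=WINDOW):
    # Append the new token's K/V, then keep only the most recent `window` positions
    cache_k = torch.cat([cache_k, new_k], dim=1)[:, -window:]
    cache_v = torch.cat([cache_v, new_v], dim=1)[:, -window:]
    return cache_k, cache_v
This bounds cache memory at window × per-token size regardless of how long the generation runs.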
In summary, practical LLM implementations combine caching with smart memory management. By default, KV caching ensures that each new token only incurs O(n) attention cost (where n is sequence length) instead of O(n²), greatly improving speed for long sequences. At the same time, mechanisms like windowed caches, shared heads, and paged memory prevent the cache from outgrowing hardware limits.
Industry Applications of KV Caching
KV caching has become crucial in enterprise AI deployments of LLMs, where inference efficiency directly impacts usability and cost:
Interactive Chatbots and Assistants: Multi-turn conversation with an LLM (such as customer service bots or personal assistants) relies on caching to maintain context. Each user message appends to a growing dialog history that the model must attend to. KV caching allows the model to remember all prior turns without recomputing them on each response, enabling real-time interactions with low latency per turn. For example, ChatGPT-style systems use KV caches to carry the conversation state from question to question in GPU memory, so responses stay fast even as history grows.
Streaming and Real-Time Applications: In use cases like code auto-completion, IDE assistants, or live transcription, the model generates tokens incrementally as new input arrives (e.g. each keystroke). KV caching here is vital to achieve low-latency updates. By reusing the previously computed context, the model can instantly extend the output without a full re-run. Microsoft’s GitHub Copilot and similar coding assistants rely on this to respond to each character typed, essentially treating the evolving file as a continuous context where caching past context enables sub-second suggestions.
Long-Document Processing: Enterprise applications often involve summarizing or querying long documents (contracts, logs, research papers). KV caching enables long context inference, where an LLM can handle inputs that are thousands of tokens long. Systems like Claude 2 (100K context) and GPT-4 (32K) manage such inputs by caching internal states instead of recomputing huge prefixes repeatedly. This makes it feasible to get answers from long texts in reasonable time. As context windows continue to expand (OpenAI’s GPT-4, Anthropic’s models, etc.), robust KV caching (often combined with retrieval or compression) is what allows these models to stay tractable in production.
Multi-user LLM Serving: In cloud and enterprise servers, a single GPU often serves many users’ requests concurrently. KV caching helps reuse shared prompt prefixes across requests. For example, if multiple users use the same system or instruction prompt (common in deployed chatbots), the server can compute that prefix once, cache the KV, and reuse it for all users – only computing user-specific continuations. This significantly cuts down duplicate work and improves throughput. NVIDIA’s Triton and TensorRT-LLM provide libraries to manage such KV cache sharing safely across inference requests; a minimal prefix-reuse sketch appears after this list.
Cost Reduction and Efficiency: Companies are actively investing in KV cache optimization to cut inference costs. For instance, Snowflake’s SwiftKV (2024) is an enterprise-focused solution that compresses and distills the model to reduce prefill (prompt processing) computation by skipping half the layers, leveraging KV caching to retain performance. SwiftKV demonstrated up to 2× better throughput and latency with minimal accuracy loss by reducing redundant work on cached tokens. Such optimizations can translate to ~75% lower serving costs in real-world LLM workloads, making deployments more economically viable.
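The prefix-sharing idea above can be sketched with Hugging Face Transformers: compute the KV cache for the shared system prompt once, then hand each request its own copy so the shared cache is never mutated. This is a simplified sketch (exact cache classes and generate() behavior vary across Transformers versions), not how production servers implement it:
import copy
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()

system_prompt = "You are a helpful assistant. "
prefix_ids = tokenizer(system_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    shared_cache = model(prefix_ids, use_cache=True).past_key_values  # computed once

def answer(user_text, max_new_tokens=30):
    ids = tokenizer(system_prompt + user_text, return_tensors="pt").input_ids
    past = copy.deepcopy(shared_cache)  # per-request copy of the shared prefix cache
    return model.generate(ids, past_key_values=past,
                          max_new_tokens=max_new_tokens, use_cache=True)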
In essence, any scenario that involves sequential or repeated LLM queries benefits from KV caching. It’s a standard feature in production-grade LLM serving frameworks (like Hugging Face Text Generation Inference, vLLM, and NVIDIA TensorRT) to ensure fast and efficient responses. Without caching, many interactive AI applications would be too slow or too expensive to be practical.
KV Caching vs Other Optimization Techniques
KV caching is one of several techniques to speed up LLM inference. It often works in tandem with other optimizations, but there are important distinctions:
KV Caching vs. Speculative Decoding: Speculative decoding accelerates inference by leveraging a smaller “draft” model to generate multiple tokens in parallel, and then having the large model validate them in one go. For example, a lightweight model might predict the next 5 tokens, which the big model then approves or corrects in a single batch step – amortizing the cost. This can provide 2–4× throughput gains if the draft is accurate. Difference: KV caching speeds up each token’s computation by reusing history, whereas speculative decoding reduces the number of sequential steps needed by guessing ahead. They are complementary – one can use KV caching and speculative decoding together (indeed, the large model still uses its KV cache during speculative verification). The trade-off is that speculative decoding requires maintaining two models and slightly more computation upfront, whereas KV caching is a straightforward reuse of computation. Notably, speculative decoding doesn’t solve the memory growth issue of long contexts, whereas KV cache management techniques do. In practice, speculative decoding is complex to implement but can dramatically increase token throughput, as shown by NVIDIA’s TensorRT-LLM achieving up to 3.6× speedups with it. A toy draft-and-verify sketch appears after these comparisons.
KV Caching vs. FlashAttention: FlashAttention (and its successors FlashAttention-2/3) is an optimized attention kernel that reduces memory reads/writes and maximizes GPU parallelism for the attention calculation. FlashAttention rearranges the computation of softmax(QK^T)V to use tiling and on-chip memory, greatly speeding up both training and inference attention without approximation. Difference: FlashAttention targets the low-level efficiency of computing attention given Q, K, V – it does not change the fact that you have to attend over n tokens, but it makes each attention step faster and more memory-efficient. KV caching, on the other hand, reduces the number of operations by not recomputing K, V for past tokens at all. In effect, FlashAttention and KV caching solve orthogonal bottlenecks: KV caching skips redundant work, while FlashAttention makes the remaining work (computing attention for the current token over n keys) faster and more memory-efficient (Flash-Decoding for long-context inference | PyTorch). In long-context scenarios, attention still scales linearly with n even with caching; FlashAttention helps mitigate the large constant factors involved. For instance, PyTorch’s Flash-Decoding (2023) extends FlashAttention ideas to the decoding setting, yielding up to 8× faster generation for very long sequences by efficiently loading keys/values in parallel and reducing memory bandwidth usage. In summary, KV caching and FlashAttention are often used together in modern LLMs: KV caching ensures we only compute each token’s key/value once, and FlashAttention ensures that using those keys/values in attention is as fast as possible.
KV Caching vs. Other Techniques: There are other inference optimizations like quantization (reducing precision to speed up math and memory), pruning or distillation (making the model smaller or simpler), and batching multiple requests or tokens together. These can all stack with KV caching. The closest alternative to KV caching would be recomputing everything (which is simply much slower for long outputs) or using architectural changes like recurrent memory or state forwarding. Some research models replace the standard KV cache with a learned summary of the past sequence (like an RNN-style state or compressed memory) – but these are not yet mainstream for general LLMs. As it stands, KV caching is the de facto standard for auto-regressive decoding in Transformers, whereas methods like speculative decoding or specialized kernels are optional enhancements for further speed gains. In practice, a production LLM stack might use all of the above: KV caching + FlashAttention kernel + quantized weights + possibly speculative decoding for maximum throughput.
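As a toy illustration of the speculative idea (greedy draft-and-verify only, without the rejection-sampling correction or the KV cache reuse that real implementations add), one speculative step might look like this, assuming distilgpt2 as the draft model and gpt2 as the target:
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
draft = transformers.AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def speculative_step(ids, k=5):
    n = ids.shape[1]
    # 1) Draft model proposes k tokens greedily
    proposal = draft.generate(ids, max_new_tokens=k, do_sample=False,
                              pad_token_id=tokenizer.eos_token_id)
    drafted = proposal[:, n:]
    # 2) Target model scores every proposed token in a single forward pass
    logits = target(proposal).logits
    target_choice = logits[:, n - 1:-1, :].argmax(-1)  # target's greedy pick at each drafted slot
    # 3) Accept the longest agreeing prefix, then take one extra token from the target
    agree = (target_choice == drafted)[0].long()
    num_accepted = int(agree.cumprod(0).sum())
    bonus = logits[:, n - 1 + num_accepted, :].argmax(-1, keepdim=True)
    return torch.cat([ids, drafted[:, :num_accepted], bonus], dim=1)

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
print(tokenizer.decode(speculative_step(ids)[0]))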
KV Caching in PyTorch, TensorFlow, and JAX
PyTorch (Hugging Face Transformers): PyTorch models typically implement KV caching by accepting past_key_values (or past) in the model’s forward pass. For example, the generate() method in Hugging Face will pass the cache to the model each step and concatenate new KVs internally. As shown in the pseudocode earlier, implementing a basic cache is straightforward with tensor concatenation. PyTorch’s default transformer modules (e.g. nn.TransformerDecoder) do not automatically cache; caching is handled at the model implementation level or by using libraries. The Hugging Face implementation uses a tuple of tensors for each layer’s past K and V, and updates them each iteration. Enabling use_cache=True returns these tensors from the model output, which are then fed back in for the next token. This makes it easy for developers – KV caching “just works” by default for GPT-2, GPT-3, LLaMA, and similar models. For custom PyTorch models, developers either manually manage caches as in the pseudocode, or leverage libraries like Hugging Face Transformers.
TensorFlow / Keras: In TensorFlow, one can achieve KV caching by maintaining state between calls to the model. For instance, a Keras model can be written to accept a past_kv input and return an updated past_kv in its output. Although TensorFlow doesn’t have a single standard API for an autoregressive cache like Hugging Face’s, the concept is similar – you carry a Python dictionary or list of tensor states across iterations. Google’s TensorFlow Lite and MediaPipe recently introduced an on-device LLM Inference API which explicitly supports caching to enable running large models on-device. They include new ops and pipeline support for caching so that even mobile/edge deployments can reuse KV states rather than recompute them. An example is given in their docs, where converting a model to TFLite with LLM support will handle stateful inference (the developer simply calls the model step by step, and the state is implicitly cached under the hood). Essentially, while PyTorch (via HF) has caching built into the model code, in TensorFlow you often handle it at the application level (looping over generation steps and feeding the state back each time).
JAX / Flax: JAX is slightly tricky for KV caching because of its functional, JIT-compiled nature. You typically need to pre-allocate a fixed-size cache array (for the max context) and update it immutably at each step (or use a mutable state if using Flax’s Linen modules) (How to do autoregressive decoding in JAX/Flax? #920 - GitHub). Flax’s SelfAttention modules can be configured for causal masking and one can carry the cache in the model’s state (e.g., as a Flax Variable). The challenge is that JAX requires array sizes to be known at compile time for efficient execution, so dynamic sequence growth is non-trivial. Some community solutions create a custom scan or roll the cache array and use slicing to simulate appending. Projects like whisper-jax manually implement KV caching by initializing a cache of shape (max_len, ..., dim) and slicing into it for each chunk of input. Despite the extra work, JAX models can achieve fast inference with caching if designed carefully. The benefit of JAX is that one can still JIT-compile the generation loop with a static cache size for maximum speed. Overall, PyTorch and TensorFlow provide more out-of-the-box convenience for KV caching, whereas JAX demands more manual handling or advanced planning due to its static shape constraints (KV Caching - JITx).
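Assuming illustrative shapes, a minimal sketch of this preallocated-cache pattern in JAX could look as follows, with the cache held as a fixed-size array and updated functionally via jax.lax.dynamic_update_slice_in_dim (Flax’s attention modules manage a similar cache internally when run in decode mode):
import jax
import jax.numpy as jnp

MAX_LEN, N_HEADS, HEAD_DIM = 1024, 8, 64  # illustrative sizes, not tied to any real model

def init_cache():
    # Preallocate a fixed-size cache so all shapes are static under jit
    shape = (MAX_LEN, N_HEADS, HEAD_DIM)
    return {"k": jnp.zeros(shape), "v": jnp.zeros(shape), "pos": jnp.array(0)}

@jax.jit
def update_cache(cache, new_k, new_v):
    # Functionally write the current token's K/V into the next free slot
    pos = cache["pos"]
    k = jax.lax.dynamic_update_slice_in_dim(cache["k"], new_k[None], pos, axis=0)
    v = jax.lax.dynamic_update_slice_in_dim(cache["v"], new_v[None], pos, axis=0)
    return {"k": k, "v": v, "pos": pos + 1}

# Attention over the cache must then mask out positions >= cache["pos"],
# since the unused slots are zero-filled placeholders.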
Example – Hugging Face Transformers (PyTorch): Using the Transformers library, KV caching is automatically used. For example:
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors='pt')
# Generate text with KV caching enabled by default
outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
Under the hood, the model’s forward will receive past_key_values on subsequent calls and concatenate the cache as described. If use_cache=False, the model would recompute attention from scratch each time – resulting in significantly lower generation speed for long sequences. In TensorFlow, a similar routine can be done using a loop with a tf.function-compiled model, carrying a state. For JAX/Flax, one might use a for loop under jax.jit with a preallocated past_k and past_v, updating them at each iteration.
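Returning to PyTorch, the following manual greedy decoding loop makes the cache handoff explicit; it is a simplified sketch of what generate() does internally, using GPT-2:
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        # After the first step, only the newest token is fed; the cache covers the rest
        step_input = ids if past is None else ids[:, -1:]
        out = model(input_ids=step_input, past_key_values=past, use_cache=True)
        past = out.past_key_values  # updated KV cache returned by the model
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0]))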
Performance Benchmarks and Model Comparisons
Optimizing KV caching yields tangible performance improvements in modern LLMs. We highlight some benchmark results and capabilities of recent models:
Mistral 7B (2023): Mistral 7B is notable for its efficiency relative to size. It significantly outperforms LLaMA-2 13B on most benchmarks, despite having about half the parameters. One factor is its optimized attention mechanism – Mistral uses sliding-window caching (with a 4K window) to support long contexts without quadratic slow-down or memory blow-up. This clever use of a rolling KV cache means it can handle 16K context length while keeping the cache size bounded, giving it an edge in both speed and memory usage for long prompts. Some reports showed 70% faster inference throughput for Mistral 7B compared to LLaMA-2 when running on the same hardware (Just benchmarked LLama 2 and Mistral with all the popular inference engines across all precisions : r/LocalLLaMA). In other words, Mistral achieves higher tokens-per-second generation, likely due to its architecture and caching strategies that reduce overhead.
DeepSeek LLM (2024): DeepSeek is a family of newer open-source LLMs (e.g., DeepSeek 67B, DeepSeek-R1, DeepSeek-V3) that emphasize high performance and long context handling. The DeepSeek LLM 67B is reported to surpass LLaMA-2 70B on various benchmarks, excelling in coding and math tasks. The latest DeepSeek-V3 version even claims to outperform a 405B-parameter Llama 3.1 model and GPT-4 on key benchmarks. These gains come from extensive training optimization, but also from efficient inference design – DeepSeek models support contexts up to 128K tokens. Achieving 128K context requires sophisticated KV cache management (likely compression or segmentation) because a naive cache for 128K tokens would be enormous. While details are not fully public, such context lengths suggest techniques similar to those in research (e.g., selective caching, retrieval-based attention) are integrated. In practice, DeepSeek’s optimized inference yields strong throughput: for instance, third-party tests show DeepSeek 67B can deliver competitive token-per-second rates relative to smaller models, thanks to these optimizations in caching and data types (it uses FP8 quantization for efficiency) (DeepSeek LLM Chat (67B) · AI Models · LobeChat). This combination of high quality and efficient caching makes DeepSeek attractive for enterprises seeking GPT-4 level performance with faster, cost-effective inference.
OpenAI GPT Models (GPT-3.5 & GPT-4, 2023): OpenAI’s production models also heavily rely on KV caching. GPT-3.5-turbo introduced a 16K context option (up from 4K), and GPT-4 offers 8K and 32K context versions. Using these larger context windows greatly increases the KV cache size (linear with context length), which in turn slows down generation compared to shorter-context models (GPT-3.5 vs. GPT-4: Biggest differences to consider | TechTarget). Indeed, GPT-4’s inference is notably slower than GPT-3.5’s, partly because GPT-4 is larger and also because the 32K window means much more data to attend to. OpenAI likely employs advanced optimizations under the hood: they have not published specifics, but it’s reasonable to assume things like FlashAttention kernels (for faster attention computation) and distributed caching across multiple GPUs for the 32K model. Even so, users observed that GPT-4-32k throughput is lower; for instance, one external analysis measured GPT-4’s output rate at ~37 tokens/sec in a setting, whereas GPT-3.5 could be much faster. This highlights that KV cache cost is a limiting factor at very large contexts – reading/writing those huge caches each token is expensive. It underscores the need for the aforementioned research (like compression and constant-size caches) to make 100K+ contexts truly practical. Nonetheless, without KV caching these large-window models would be virtually impossible to run – caching is what allows GPT-4 to handle long conversations or documents by not recomputing the entire 32K token history for each new token.
vLLM and Other Systems: In terms of serving throughput, the combination of efficient KV caching and batching can dramatically improve performance. The vLLM system (with PagedAttention) demonstrated 14× to 24× higher throughput than naive implementations when serving LLaMA-7B and 13B models (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). Much of this gain comes from maximizing GPU memory usage and avoiding redundant KV overhead per request. Similarly, Meta’s own optimizations in their production systems (as discussed in papers and engineering blogs) use prefix caching (where static system prompts are pre-cached) and efficient scheduling to achieve high throughput for chat applications. These real-world benchmarks show that KV caching is not just a theoretical speed-up, but a necessity for scaling LLMs to millions of users. For example, with proper caching and batching, an A100 GPU can sustain dozens of tokens per second for a 13B model; without caching, it would drop to a small fraction of that.
In conclusion, KV caching is a cornerstone of high-performance LLM inference. The latest models and frameworks all incorporate caching to avoid wasted computation, and recent innovations are pushing the envelope to support longer contexts and higher throughput. By combining KV caching with other techniques like speculative decoding and FlashAttention, the community has achieved substantial speedups (3–8× in many cases) (Flash-Decoding for long-context inference | PyTorch). Enterprise benchmarks (Mistral vs LLaMA, DeepSeek vs larger models, etc.) consistently show that models or systems optimizing KV cache usage can deliver superior performance. As research continues, we expect KV caching strategies to become even more advanced, enabling efficient inference even as LLMs scale to 100B+ parameters and context windows of hundreds of thousands of tokens. The ongoing challenge is to retain the benefits of remembering the past without incurring an untenable memory or latency cost – a balance that KV caching techniques strive to master.
Sources:
Li et al. “A Survey on Large Language Model Acceleration based on KV Cache Management.” arXiv:2412.19442, 2024.
Mallis, O. “Techniques for KV Cache Optimization in LLMs.” Blog, 2023.
Ghadia et al. “Dialogue Without Limits: Constant-Sized KV Caches (MorphKV).” arXiv, 2025.
Liu et al. “MiniCache: KV Cache Compression in Depth.” arXiv, 2024.
Li et al. “SnapKV: LLM Knows What You Are Looking for.” arXiv, 2024.
Dao et al. “Flash-Decoding for long-context inference.” PyTorch Blog, 2023 (Flash-Decoding for long-context inference | PyTorch).
Putterman et al. “TensorRT-LLM Speculative Decoding.” NVIDIA Tech Blog, 2024.
Hugging Face Transformers Documentation – KV Cache Usage.
Snowflake AI – “SwiftKV: Compute Reduction for LLMs.” (Blog/ArXiv), 2024.
Mistral AI – Mistral 7B Model Card. 2023.
DeepSeek AI – DeepSeek-V3 Announcement. 2024.
TechTarget – “GPT-3.5 vs GPT-4 (Performance)”. 2023 (GPT-3.5 vs. GPT-4: Biggest differences to consider | TechTarget).
Kwon et al. “PagedAttention: Memory-Efficient LLM Serving (vLLM).” SOSP 2023 (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog).