Table of Contents
Introduction
Hardware Optimizations
Quantization
Pruning and Sparsity
Parallelism and Batching
Software-Level Enhancements
Caching Mechanisms
Efficient Chunking & Long-Context Strategies
Retrieval-Augmented Generation (RAG)
Speculative Decoding
Structured Analysis of Techniques
Citations
Introduction
Optimizing the inference throughput of large language models (LLMs) is critical as these models grow in size and are deployed in latency-sensitive applications. Modern LLMs with billions of parameters demand massive computation, leading to high inference costs and latency (Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing). Scaling context lengths to hundreds of thousands of tokens further amplifies computation and memory overhead (Compute Or Load KV Cache? Why not Both?). Ensuring efficient online inference has thus become a key challenge for widespread adoption of LLMs. This review surveys recent (2024–2025) techniques that boost LLM inference throughput, covering hardware-centric optimizations and software-level enhancements.
Hardware Optimizations
Quantization:
Reducing the numerical precision of model weights (and sometimes activations) is a widely used approach to speed up LLM inference. Low-bit weight quantization (e.g. 8-bit, 4-bit) can shrink memory usage and multiply-accumulate cost significantly with minimal impact on accuracy (I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models). For instance, a 4-bit weight and 4-bit activation configuration was shown to incur negligible accuracy loss on LLMs. Mixed-precision schemes often keep activations at higher precision (e.g. 8-16 bit) since activations are harder to quantize without loss due to outlier values (MixPE: Quantization and Hardware Co-design for Efficient LLM Inference). Specialized inference kernels and hardware support are being developed to realize the full speedups of quantization. Integer-only inference pipelines eliminate runtime de-quantization overhead by performing all operations (even softmax and normalization) in integer math. Such methods enable efficient deployment of fully-quantized models on common hardware, closing the gap between low-bit compression and real-world throughput gains.
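As a concrete illustration (not taken from any of the cited papers), the minimal sketch below implements symmetric per-output-channel 8-bit weight quantization in plain NumPy. It is a stand-in for what quantization libraries do under the hood, not the I-LLM or MixPE pipeline itself; real deployments also quantize activations and rely on dedicated integer kernels to realize the speedup.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    w: float32 weights of shape (out_features, in_features).
    Returns int8 weights plus one float32 scale per output channel.
    """
    # Pick each channel's scale so its largest-magnitude weight maps to 127.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale.astype(np.float32)

def dequantize(w_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights (for checking the rounding error)."""
    return w_q.astype(np.float32) * scale

# Example: quantize a toy 4x8 layer and measure the worst-case rounding error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_q, scale = quantize_weights_int8(w)
print("max abs error:", np.abs(w - dequantize(w_q, scale)).max())
```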
Pruning and Sparsity:
Another model-centric optimization is pruning redundant weights or structures to reduce model size and computation. Structured pruning (removing entire channels, neurons, or attention heads) is particularly attractive as it yields speedups on regular hardware by skipping whole operations (Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing). However, aggressive pruning of LLMs can severely hurt quality if not carefully done. Recent research focuses on smarter pruning criteria and dynamic sparsification. For example, Probe Pruning performs a brief “pilot run” of a few layers on a small subset of tokens to identify critical weights for each input batch, then prunes the rest adaptively. This dynamic, input-aware approach achieved substantially better trade-offs – at 40% of weights pruned, it saw 2.56× less performance degradation per unit speedup compared to prior static methods. Other advances include improved importance metrics (e.g. block-wise influence propagation) that more accurately target removable parameters, yielding pruned models with higher accuracy at the same sparsity (LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation). In practice, structured sparsity can directly lower inference latency without specialized hardware, whereas unstructured sparsity (arbitrary weight pruning) typically requires sparse computation libraries or accelerator support to see speed gains. Combining quantization with pruning can further compound efficiency improvements.
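To make the idea of structured pruning concrete, the toy sketch below drops whole output channels of a weight matrix based on their L2 norm. The norm-based score is a deliberately naive assumption for illustration; the cited methods use far richer signals (input-aware probing in Probe Pruning, block-wise influence propagation in LLM-BIP).

```python
import numpy as np

def prune_output_channels(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Structured pruning sketch: remove whole output channels with the
    smallest L2 norm, yielding a smaller dense matrix that runs fast on
    ordinary hardware (no sparse kernels needed).

    w: weight matrix of shape (out_features, in_features).
    sparsity: fraction of output channels to drop, e.g. 0.4.
    """
    norms = np.linalg.norm(w, axis=1)                   # naive per-channel importance
    n_keep = max(1, int(round(len(norms) * (1.0 - sparsity))))
    keep = np.sort(np.argsort(norms)[-n_keep:])         # indices of surviving channels
    return w[keep]

# Example: drop 40% of the output channels of a toy layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(10, 16)).astype(np.float32)
print("before:", w.shape, "after:", prune_output_channels(w, 0.4).shape)
```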
Parallelism and Batching:
Leveraging parallel hardware at both model and request levels can dramatically increase throughput. Large models are often served with tensor/model parallelism, partitioning the network across multiple GPUs to utilize aggregate memory and compute. Pipeline parallelism (pipelining the layers across devices) and optimized scheduling can keep all GPUs busy and minimize idle times. For example, the Sarathi-Serve scheduler coordinates chunked model execution across GPUs such that new requests can “piggyback” on ongoing generations without stalls (Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve). In one study, this approach yielded 2.6× higher throughput for a 7B model on a single GPU (and up to 5.6× when scaling to a 180B model across multiple GPUs) compared to a baseline system. At the request level, batching multiple inference queries together amortizes overheads and is especially effective during the autoregressive decode phase, where each iteration would otherwise utilize only a fraction of the GPU. Larger batch sizes improve device utilization but can increase per-request latency, creating a throughput–latency trade-off. Solutions like Sarathi mitigate this by splitting long prompts into chunks and interleaving them with batched decode steps, achieving high throughput with minimal added latency. In addition, optimized attention kernels (e.g. fused or block-sparse implementations) exploit GPU parallelism to accelerate the O(n²) attention operation for long sequences. Overall, parallel processing – from low-level GPU kernels to distributed inference across nodes – is a cornerstone of high-throughput LLM serving.
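The following toy scheduler illustrates the chunked-prefill idea in the spirit of Sarathi-Serve: each iteration first serves the ongoing decode steps, then fills the remaining token budget with slices of pending prompts so that long prefills never stall decoding. The function names, chunk size, and token budget are illustrative assumptions, not the actual Sarathi-Serve implementation.

```python
from collections import deque

def chunked_prefill_schedule(prompts, decode_ids, chunk_size=512, token_budget=2048):
    """Toy chunked-prefill scheduler: each iteration serves every active decode
    stream (one token each) first, then packs slices of pending prompts into the
    remaining token budget, spreading long prefills over iterations instead of
    stalling decodes behind them.

    prompts: list of (request_id, prompt_length) awaiting prefill.
    decode_ids: ids of requests currently in the decode phase.
    Returns the list of per-iteration batches for inspection.
    """
    pending = deque((rid, 0, length) for rid, length in prompts)  # (id, done, total)
    batches = []
    while pending:
        batch = [("decode", rid, 1) for rid in decode_ids]        # decodes cost 1 token each
        used = len(batch)
        while pending and used < token_budget:
            rid, done, total = pending.popleft()
            take = min(chunk_size, total - done, token_budget - used)
            batch.append(("prefill", rid, take))
            used += take
            if done + take < total:                               # prompt not finished yet
                pending.appendleft((rid, done + take, total))
        batches.append(batch)
    return batches

# Example: two long prompts interleaved with three active decode streams.
for step in chunked_prefill_schedule([("A", 3000), ("B", 1200)], ["x", "y", "z"]):
    print(step)
```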
Software-Level Enhancements
Caching Mechanisms:
Caching and reusing computation can drastically cut redundant work in LLM inference. In autoregressive generation, models cache key-value (KV) pairs from past tokens so that each new token only computes attention against recent additions rather than the full history (HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading). This decoder KV cache is standard for reducing per-token costs. Beyond that, prefix caching can reuse computation across requests that share the same initial tokens (e.g. system or prompt prefixes, or repeated documents) (Compute Or Load KV Cache? Why not Both?). Industry deployments report that prefix KV caching can cut inference costs by over 50% by avoiding recomputation of common prompt prefixes. The challenge is managing these caches efficiently across memory hierarchies. Techniques like HeadInfer reduce memory pressure by offloading less-used attention heads’ cache to CPU, enabling extreme context lengths without GPU overflow. In one case, HeadInfer shrank the KV cache memory of an 8B model at 1M context from 128 GB to 1 GB (a 128× reduction) with minimal overhead. Meanwhile, the Cake system tackles long-context scenarios by overlapping KV cache computation and I/O – it dynamically decides whether to recompute a cache segment or load it from disk, and does both in parallel when possible. This hybrid scheduling cut time-to-first-token by ~2.6× on average in long-context inference. Effective caching and offloading thus trade extra memory/storage for significant speedups, especially in multi-turn conversations and retrieval-oriented workloads where repeated content is common.
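A minimal sketch of prefix KV caching is shown below, assuming a cache keyed by a hash of the token prefix; the class and method names are hypothetical. Production systems (e.g. paged or radix-tree caches) manage this at block granularity with eviction policies and GPU/CPU tiering, which this illustration omits.

```python
import hashlib

class PrefixKVCache:
    """Minimal prefix KV cache: store KV tensors computed for a prompt prefix
    under a hash of its token ids, so later requests sharing the same system
    prompt or document header skip recomputing that prefix."""

    def __init__(self):
        self._store = {}  # prefix hash -> cached KV object

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    def insert(self, token_ids, kv):
        self._store[self._key(token_ids)] = kv

    def lookup(self, token_ids):
        """Return (kv, n) for the longest cached prefix of token_ids,
        or (None, 0) if nothing matches."""
        for end in range(len(token_ids), 0, -1):
            kv = self._store.get(self._key(token_ids[:end]))
            if kv is not None:
                return kv, end
        return None, 0

# Usage: cache the KV of a shared system prompt, then reuse it for a new request.
cache = PrefixKVCache()
system_prompt = [101, 7592, 2088]                    # hypothetical token ids
cache.insert(system_prompt, kv="kv-tensors-for-system-prompt")
kv, covered = cache.lookup(system_prompt + [2023, 2003])
print(f"reused KV for {covered} prefix tokens")      # -> 3
```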
Efficient Chunking & Long-Context Strategies:
When faced with very long inputs or contexts, splitting or selectively processing the content can improve efficiency. Instead of feeding an entire long document to an LLM at once (which is slow and quadratic in cost), chunking strategies break the input into smaller segments that can be processed sequentially or in parallel. Systems may then combine the per-chunk outputs (through concatenation, voting, or an additional integrating step). A key optimization is to avoid wasting computation on irrelevant parts of the context. For example, InfiniteHiP uses a hierarchical token pruning algorithm to dynamically discard tokens that are not pertinent to the current query as the sequence grows (Extending Language Model Context Up to 3 Million Tokens on a Single GPU). By doing so, it processes up to 3 million tokens on a single 48 GB GPU without losing relevant context, achieving nearly 19× faster attention computation on a 1M-token sequence. Other models use fixed sparse attention patterns (e.g. windowed or dilated attention) to curb complexity, as seen in Mistral, which adopted sliding window attention to boost throughput while preserving performance comparable to larger fully-attentive models (GitHub - AIoT-MLSys-Lab/Efficient-LLMs-Survey: [TMLR 2024] Efficient Large Language Models: A Survey). The general principle is to limit the portion of the context that is fully attended at each step, whether via retrieval of top-k relevant chunks, chunk-by-chunk processing with intermediate summaries, or sparse attention masks. These approaches significantly alleviate the memory and compute burden of long contexts, albeit with added system complexity to manage chunking or retrieval pipelines.
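As a small illustration of the fixed sparse-attention idea, the snippet below builds a causal sliding-window mask of the kind used in Mistral-style attention. It shows only the mask construction, not an optimized kernel, and the window size is an arbitrary assumption.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask for causal sliding-window attention: query position i may
    attend only to key positions j with i - window < j <= i, so per-token cost
    stays proportional to the window size rather than the full sequence."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

# Example: with a window of 3, position 5 attends to positions 3, 4, and 5 only.
print(sliding_window_mask(seq_len=8, window=3).astype(int))
```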
Retrieval-Augmented Generation (RAG):
RAG integrates an external knowledge retriever with the LLM, fetching relevant text snippets from a database (e.g. a vector index) to ground the model’s output. By supplying factual context on-the-fly, RAG allows using smaller base models or shorter prompts without sacrificing performance on knowledge-intensive queries (TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval). This can indirectly improve throughput – the model doesn’t need to internalize or process as much information in its context window if relevant facts are pulled as needed. The extra retrieval step, however, adds latency and can bottleneck the pipeline, especially with large corpora. Recent systems optimize this by overlapping retrieval with generation. TeleRAG, for instance, introduces lookahead retrieval that prefetches the next set of documents from CPU memory to GPU while the LLM is still generating the current answer. By carefully overlapping data transfer and compute, TeleRAG cuts end-to-end RAG latency by up to 1.72× versus standard sequential RAG pipelines. Other frameworks simplify the retrieval stage (e.g. lightweight retrievers, caching frequent queries) to minimize overhead. In some scenarios, if the domain allows, an extensive cache of past answers (cache-augmented generation) can even substitute for live retrieval (Don't Do RAG: When Cache-Augmented Generation is All You Need ...). Overall, retrieval augmentation can increase system complexity but, when optimized, enables LLMs to handle knowledge-rich tasks more efficiently than using a massive static model alone.
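The sketch below shows one simple way to overlap retrieval with generation: while the model answers the current query, a background thread prefetches documents for the next one. This is a simplified, cross-request variant of the idea for illustration only; TeleRAG’s lookahead retrieval overlaps data movement with the decoding of the same request, and the retrieve/generate callables here are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def rag_with_prefetch(queries, retrieve, generate):
    """Overlap retrieval with generation: while the model generates the answer
    for query i, a background worker is already fetching documents for query
    i + 1, so retrieval latency hides behind generation time.

    retrieve(query) -> list of context passages (slow, I/O-bound)
    generate(query, passages) -> answer string (GPU-bound in a real system)
    """
    answers = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(retrieve, queries[0])              # kick off first retrieval
        for i, query in enumerate(queries):
            passages = future.result()                          # waits only if prefetch lags
            if i + 1 < len(queries):
                future = pool.submit(retrieve, queries[i + 1])  # prefetch the next query
            answers.append(generate(query, passages))           # runs while prefetch proceeds
    return answers

# Toy usage with stand-in retrieval and generation functions.
fake_retrieve = lambda q: [f"doc about {q}"]
fake_generate = lambda q, docs: f"{q}: answered using {docs[0]}"
print(rag_with_prefetch(["quantization", "pruning"], fake_retrieve, fake_generate))
```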
Speculative Decoding:
A more recent technique to speed up autoregressive generation is to have a secondary “draft” model predict multiple tokens ahead, which the full LLM then quickly verifies – effectively parallelizing the decoding process. This speculative decoding (SD) strategy lets the smaller proxy model generate tentative outputs that are much faster to produce, and only falls back to the large model when corrections are needed (Optimizing Speculative Decoding for Serving Large Language Models Using Goodput). Critically, the final output remains identical to the original LLM’s (no quality loss) since the large model ultimately approves every token. SD can yield substantial speedups, but its efficiency in practice depends on the draft model’s accuracy and system load. If the draft predicts many wrong tokens, the large model wastes time rejecting them; similarly, under heavy batching, naive speculative decoding can even increase latency due to wasted work. To address this, an approach called SmartSpec dynamically adjusts how far ahead to speculate based on real-time conditions like model accuracy and server load. By choosing the optimal speculation length per request, SmartSpec realized up to 3.2× lower latency than normal decoding across various model sizes and workloads. Ongoing research is refining speculative algorithms (e.g. tree-based decoding, multi-sample drafting (EMS-SD: Efficient Multi-sample Speculative Decoding ...)) to further increase the “goodput” – useful tokens generated per unit time – while keeping the overhead of failed speculations low. Speculative decoding exemplifies a clever software-only method to exploit extra compute (a smaller model’s output) for a net gain in throughput.
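A greedy draft-and-verify loop is sketched below to show why the output matches the target model exactly. For clarity the target is queried one position at a time, whereas a real implementation verifies all draft tokens in a single batched forward pass, and sampling-based variants use an accept/reject rule rather than exact matching; draft_next and target_next are hypothetical stand-ins for the two models.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=32):
    """Greedy draft-and-verify sketch of speculative decoding.

    draft_next(tokens) -> next token proposed by the small draft model
    target_next(tokens) -> next token the large target model would emit
    The draft proposes k tokens; the target keeps the longest matching prefix
    and contributes one token of its own (a correction on mismatch, a bonus
    token on full acceptance), so the result equals plain greedy decoding
    with the target model alone.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = []
        for _ in range(k):                                   # cheap draft proposals
            draft.append(draft_next(tokens + draft))
        accepted = []
        for t in draft:                                      # verify in order
            expected = target_next(tokens + accepted)
            if t == expected:
                accepted.append(t)
            else:
                accepted.append(expected)                    # correction replaces the miss
                break
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token, all accepted
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new]

# Toy usage: both "models" just count upward, so every draft token is accepted.
count_up = lambda toks: toks[-1] + 1
print(speculative_decode(count_up, count_up, prompt=[0], k=4, max_new=8))
```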
Structured Analysis of Techniques
The above methods often tackle different bottlenecks, and they can be complementary. For instance, a deployment might use quantization and pruning to shrink the model, batch requests together for throughput, cache repeated prompts, and even employ speculative decoding – all at once. The effectiveness of each method varies with context and constraints: quantization and low-level optimizations offer universal speedups (at slight cost to precision), whereas caching or RAG yield big wins only in specific usage patterns (repeated content or knowledge queries). Hardware vs. software trade-offs: Some gains come from making the model itself lighter (quantization/pruning/distillation) versus improving how we run the model (efficient scheduling, better memory management). Model compression provides permanent latency reduction but may require retraining or careful calibration; in contrast, caching and parallelism demand engineering work but preserve model fidelity. Latency vs. throughput: Many optimizations trade a bit of added latency or memory for higher overall throughput. Large batch scheduling, for example, maximizes tokens processed per second but can hurt prompt responsiveness if not managed – hence solutions like chunked-prefill scheduling to balance both (Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve). Similarly, using retrieval or speculative decoding introduces extra steps in the pipeline, which must be overlapped or tuned to ensure they truly accelerate the end-to-end serving time (TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval). Implementation complexity: Techniques like 8-bit quantization or basic batching are already supported in mainstream frameworks (NVIDIA’s Transformer Engine, DeepSpeed, etc.), making them relatively easy to adopt. More complex methods – dynamic sparsity, custom caching hierarchies, or speculative decoding – may require bespoke system integration or maintaining additional models/indices. Nonetheless, recent studies demonstrate that these optimizations can yield order-of-magnitude improvements in feasible context length and throughput (Extending Language Model Context Up to 3 Million Tokens on a Single GPU) and multi-fold throughput gains on real workloads. The choice of techniques thus depends on the specific deployment scenario: model size, available hardware, query patterns, and tolerance for engineering complexity. In practice, a combination of hardware-efficient model tweaks and smart software strategies is employed to push LLM inference towards higher throughput, enabling cost-effective and responsive AI services at scale.
Citations
Hu et al., “I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models,” arXiv, 2024.
Lin et al., “MixPE: Quantization and Hardware Co-design for Efficient LLM Inference,” arXiv, 2024.
Qi et al., “Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing,” arXiv, 2025.
Wu et al., “LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation,” arXiv, 2024.
Agrawal et al., “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve,” arXiv, 2024.
Zhou et al., “HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading,” arXiv, 2025.
Zheng et al., “Cake: Compute Or Load KV Cache? Why Not Both?,” arXiv, 2024.
Lee et al., “Extending Language Model Context Up to 3 Million Tokens on a Single GPU (InfiniteHiP),” arXiv, 2025.
Lin et al., “TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval,” arXiv, 2025.
Liu et al., “Optimizing Speculative Decoding for Serving Large Language Models Using Goodput (SmartSpec),” arXiv, 2024.