Table of Contents
Introduction
Majority Voting and Self-Consistency
Speculative Decoding
Early Exit Strategies
Dynamic Routing (Mixture-of-Experts)
Deployment Considerations: Resource-Constrained vs Enterprise
Summary of Trade-offs
Conclusion
Introduction
Large Language Models (LLMs) have achieved remarkable capabilities, but deploying them in practice presents significant challenges in inference cost and latency. Moreover, ensuring the accuracy and consistency of outputs (avoiding mistakes or hallucinations) often requires clever inference-time techniques beyond the standard greedy or sampling-based methods. In response, researchers and engineers in 2024–2025 have developed a variety of test-time (inference-time) strategies to either improve LLM output quality or accelerate the generation process (and sometimes both).
This report provides a comprehensive review of four key inference-time techniques: (1) majority voting methods (exemplified by self-consistency), which run an LLM multiple times and pick the most common answer for improved accuracy; (2) speculative decoding, which uses a smaller "draft" model to pre-generate tokens that the large model then verifies, greatly speeding up generation; (3) early exit strategies, which allow an LLM to stop its forward pass early or skip certain computations once sufficient confidence is reached, reducing computation; and (4) dynamic routing (such as sparse mixture-of-experts architectures), where only a subset of model weights is activated for each input, thereby cutting down per-token computation. We will examine how each technique impacts inference cost and latency versus accuracy, and compare their behavior across different model scales – from open models like LLaMA, Mistral, or the recent DeepSeek, to proprietary systems like GPT-4 and Anthropic's Claude. We also highlight which methods are more suitable for resource-constrained environments (e.g. running on a single GPU or mobile device) versus large-scale deployments in enterprise data centers. Both academic findings (e.g. arXiv papers from 2024–2025) and insights from official frameworks (PyTorch, TensorFlow, Hugging Face, NVIDIA, etc.) are cited to provide an up-to-date perspective.
Majority Voting and Self-Consistency
Majority voting in the context of LLMs refers to generating multiple outputs from the model and then selecting the final answer by consensus – essentially an ensemble at inference time. A prominent example is the self-consistency decoding strategy, which was introduced to improve multi-step reasoning tasks. Instead of relying on a single chain-of-thought, self-consistency samples multiple reasoning paths from the model, extracts the final answer from each path, and then chooses the most frequent answer as the final prediction. In other words, if a question is answered by the model several times (with different random seeds or slight prompt variations), whichever answer appears most often is taken to be the most reliable. This simple voting mechanism has been shown to significantly boost accuracy on benchmarks requiring reasoning.
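As a concrete illustration, here is a minimal sketch of self-consistency voting. It assumes a hypothetical `sample_answer(prompt)` helper that queries the model once with sampling enabled and returns the extracted final answer string; that helper and the default of 10 samples are illustrative, not part of any specific paper's implementation.

```python
from collections import Counter

def self_consistency(prompt: str, sample_answer, n_samples: int = 10) -> str:
    """Sample several reasoning paths and return the most frequent final answer.

    `sample_answer` is a user-supplied callable (hypothetical here) that runs the
    LLM once with temperature > 0 and returns the extracted final answer string.
    """
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    # Majority vote: the answer appearing most often across samples wins.
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```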
However, the benefits of majority voting come at the cost of increased inference computation. By design, one must run the LLM multiple times to gather a sample of outputs. If we sample 5 reasoning paths, we have roughly 5× the compute cost and, if done sequentially, up to 5× the latency (parallelizing these runs is possible but requires proportional hardware resources). This limits its practicality in low-latency or compute-constrained settings. Another limitation is that majority voting only applies when outputs can be compared for equality. Tasks with a single correct answer (e.g. math problems, trivia questions) are well suited – here one can match strings or final answers across samples to see which answer occurs most often. Additionally, vanilla self-consistency ignores the model's confidence; it treats every sampled answer equally, which can be suboptimal. Recent research in 2024 has explored refinements like "mirror-consistency" and weighted voting to address this. Mirror-consistency, for instance, has the model reflect on why a minority answer disagrees with the majority and can adjust the final decision accordingly, leading to better calibrated confidence (Mirror-Consistency: Harnessing Inconsistency in Majority Voting). Other work proposes using the model's own probability estimates to weight votes (a method called self-certainty): instead of a simple majority, each response's vote is weighted by how confident the model is in it.
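To illustrate the weighted-voting idea, the sketch below weights each sampled answer by the model's mean token log-probability instead of counting every vote equally. This is a simplified stand-in for confidence weighting, not the exact self-certainty formulation; `sample_answer_with_logprob` is a hypothetical helper returning an answer string and the mean log-probability of the sample that produced it.

```python
import math
from collections import defaultdict

def weighted_vote(prompt: str, sample_answer_with_logprob, n_samples: int = 10) -> str:
    """Confidence-weighted voting: each answer's vote counts in proportion to the
    (exponentiated) mean token log-probability of the sample that produced it."""
    scores = defaultdict(float)
    for _ in range(n_samples):
        answer, mean_logprob = sample_answer_with_logprob(prompt)
        scores[answer] += math.exp(mean_logprob)  # higher-confidence samples weigh more
    return max(scores, key=scores.get)
```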
Accuracy vs. cost: Majority voting generally improves accuracy – often dramatically for reasoning tasks – at a multiplicative cost in inference time. For a model like GPT-3 or LLaMA-65B, self-consistency with 5–10 samples can yield significantly higher question-answering or problem-solving accuracy. But the latency will scale roughly linearly with the number of samples if using a single processor. This is viable for offline analysis or high-stakes queries where quality is paramount, but not ideal for real-time interactive use. Notably, larger models tend to benefit less from majority voting than smaller models. An empirical survey in late 2024 found that "less powerful models like LLaMA-3-13B and Mistral (8×22B) gain more consistently from these ensemble methods, while more advanced models like GPT-4 or Claude 3.5 show minimal gains" on the evaluated benchmarks. Intuitively, a very large model might already get the answer correct on the first try, so re-sampling adds little new information (and might just yield the same answer repeatedly). Smaller models have more uncertainty and varied outputs, so voting can average out their mistakes. This suggests majority voting is particularly useful for boosting mid-sized open models to higher reliability, partially closing the gap to giant proprietary models. Indeed, open models in the 7–13B range have seen notable improvements on benchmarks via self-consistency, whereas GPT-4's accuracy on those benchmarks cannot be much improved by simply sampling multiple times (it is often consistently correct or consistently wrong with the same mistake).
In terms of applicability, majority voting is model-agnostic (it treats the LLM as a black box that can be queried multiple times), which means it can be applied to closed APIs and open models alike. For example, a user with access to GPT-4's API could call it 5 times and vote on the answers (some researchers have done this to increase reliability on e.g. math questions). The downside is the cost – for proprietary models, API charges grow almost linearly with the number of calls. Open-weight models allow more flexibility: one could run 5 copies in parallel on a multi-GPU server to cut wall-clock latency, something not possible with closed models unless the provider offers an ensemble service. Finally, it's worth noting that majority voting is only one way to use multiple outputs – in some settings, taking the most likely (highest-probability) output among N samples can outperform majority voting when answers aren't word-for-word identical. But that requires reading model probabilities or using another method to evaluate outputs. The appeal of majority voting is its simplicity and robustness: it requires no additional training or models, just repeated inference. This makes it a popular baseline for inference-time improvement. In summary, **majority voting can significantly enhance LLM accuracy on discrete tasks, but it offers diminishing returns on very advanced models** (Scalable Best-of-N Selection for Large Language Models via Self-Certainty).
Speculative Decoding
Speculative decoding is a leading technique for speeding up LLM inference without changing the model's final output. The core idea is to generate multiple tokens in one go by "guessing" with a smaller model, then confirm those guesses with the large model, rather than having the large model generate tokens one by one. In a typical setup, we have a large target model (the expensive LLM we ultimately need output from, e.g. GPT-4 or LLaMA-70B) and a smaller draft model (much faster, e.g. a 2B or 7B parameter model). The procedure works in two phases: first, the small draft model quickly proposes a block of the next N tokens (for instance, it might predict 3–5 tokens ahead) (How Speculative Can Speculative Decoding Be?). Next, the large model is fed these N tokens in a single forward pass to verify which of them match what the large model itself would produce. If all draft tokens were correct, the large model has essentially approved those N tokens and we have saved N−1 expensive steps. If a draft token is rejected, the large model's own prediction replaces it at that position and drafting resumes from there. In either case, the large model's forward pass yields the same result as if it had generated the tokens itself. By repeating this process iteratively (draft N new tokens, verify with one big-model pass, rinse and repeat), we can advance the text generation much faster than the normal token-by-token process (OpenAI's Predicted Outputs For Faster LLM Responses).
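A minimal sketch of this draft-and-verify loop under greedy decoding is shown below. `draft_propose` and `target_verify` are hypothetical helpers standing in for forward passes of the draft and target models; real implementations (and the sampling-based variant) add rejection sampling and KV-cache management.

```python
def speculative_generate(prefix, draft_propose, target_verify,
                         block_size=4, max_new_tokens=64):
    """Greedy speculative decoding sketch (hypothetical helper signatures).

    draft_propose(tokens, k)      -> k tokens guessed by the small draft model.
    target_verify(tokens, draft)  -> len(draft) + 1 greedy predictions from the
        large model: its choice after `tokens`, after `tokens + draft[:1]`, ...,
        after `tokens + draft`, all obtained in a single forward pass.
    """
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        draft = draft_propose(tokens, block_size)      # cheap guesses
        verified = target_verify(tokens, draft)        # one large-model pass

        # Accept drafted tokens as long as they match what the large model would emit.
        n_accepted = 0
        while n_accepted < len(draft) and draft[n_accepted] == verified[n_accepted]:
            n_accepted += 1
        tokens.extend(draft[:n_accepted])
        # Append the large model's own token at the first mismatch (or one bonus
        # token if everything matched), so the output equals plain greedy decoding.
        tokens.append(verified[n_accepted])
    return tokens
```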
Crucially, speculative decoding is designed to be quality-preserving: the large model ultimately "approves" every token, so the final generated text is exactly what the large model would have produced (Speculative Decoding for 2x Faster Whisper Inference). We are not approximating the final output – we're just doing extra speculative work upfront to reduce the large model's workload. As one technical blog put it, speculative decoding works by paying a small additional computation cost to guess a few tokens, and in return gives a significant boost in throughput (TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x | NVIDIA Technical Blog). Empirical results in 2024 bear this out: no loss in accuracy or text quality is observed compared to standard decoding, as long as the verification step is used (i.e. the large model corrects any mistakes). For this reason, speculative decoding has been described as a "drop-in" improvement – one can wrap an existing LLM with a speculative decoding mechanism and get faster responses with identical outputs (Speculative Decoding for 2x Faster Whisper Inference).
The degree of acceleration from speculative decoding depends on choosing a good draft model and block size. If the draft model is too small (hence very fast but not very accurate), it will make frequent mistakes that the large model must correct one token at a time, eroding the speed gains. If the draft model is too large/slow, or if we generate too many tokens per block, we lose the speed advantage and risk many incorrect guesses. Studies in 2024 have shown that an effective balance is to use a draft model about 10–20× smaller than the target model, and to generate in blocks of 3–5 tokens (How Speculative Can Speculative Decoding Be?). For instance, to accelerate a 70B LLaMA, one might use a 3–7B parameter draft model and draft, say, 4 tokens per iteration. This draft model will be significantly faster (often >3×) per token than the 70B, yet still large enough to predict the "easy" tokens correctly (Speculative Decoding for 2x Faster Whisper Inference). Indeed, it is observed that in typical text, 70–80% of tokens are relatively easy or predictable (common words, syntactic tokens) and a smaller model can handle those, while the remaining 20–30% (critical or rare tokens) will be caught and corrected by the large model during verification. This skew allows speculative decoding to achieve big gains: essentially the large model is only really working for that hard 20–30% of tokens, coasting through the easy parts. The only overhead is that the small model also did some work (which is comparatively cheap). In throughput terms, the small model's computation plus the occasional large-model verification still ends up much less than running the large model for every token. It's not unusual to see a 2–3× reduction in latency (TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x | NVIDIA Technical Blog). One caveat is memory: one needs to have both the large and small model available. In an enterprise server scenario this is fine (the overhead of a 2B model next to a 70B is negligible), but in a tight memory environment, loading an extra model might be a concern (we address this in the deployment section). There is also complexity in implementation – coordinating two models – but framework support has rapidly grown.
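In practice one rarely has to implement this loop by hand. For example, recent versions of Hugging Face Transformers expose speculative (assisted) generation through the `assistant_model` argument of `generate()`. The sketch below pairs an illustrative large checkpoint with a small draft from the same family; the model names and hardware assumptions are for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints: target and draft should share a tokenizer/vocabulary.
target_name = "meta-llama/Llama-2-70b-hf"   # large target model (assumed available)
draft_name = "meta-llama/Llama-2-7b-hf"     # much smaller draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(target.device)
# assistant_model enables assisted/speculative generation: the draft proposes
# tokens and the target verifies them, so the output matches normal decoding.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```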
Early Exit Strategies
While speculative decoding speeds up generation by pre-fetching tokens, early exit strategies aim to speed up inference by reducing the amount of computation per token. The idea is straightforward: not every input or every token may need the full depth of a large transformer network to yield a confident prediction. If the model is fairly sure of the next token halfway through its layers, one could "exit" at that intermediate layer and output the token, skipping the remaining layers and saving time. Early exits have long been studied in deep networks (e.g. early-exit classifiers in CNNs for easy vs hard images), but only recently have researchers begun applying this to LLMs' generative decoding. A naive approach to early exiting in an LLM is to attach the final language-model head (the softmax over the vocabulary) not just at the last layer, but at multiple intermediate layers. During inference, after each layer, one could check whether the intermediate distribution is confident enough (for example, has low entropy or a high probability for one token); if yes, emit the token immediately; if not, continue to the next layer. In practice, making this work for LLMs is challenging – standard pre-trained models are not trained to produce meaningful token distributions until the very last layer. Simply cutting off early would greatly degrade accuracy, as lower layers haven't fully encoded the needed information.
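To make the idea concrete, here is a rough sketch of a confidence-based exit check for a single decoding step. It assumes a decoder whose per-layer blocks can be called on the current token's hidden state and whose final `lm_head` and normalization can be applied to intermediate states; as noted above, an off-the-shelf model will not give reliable intermediate predictions without early-exit training, so this is illustrative only.

```python
import torch

@torch.no_grad()
def early_exit_next_token(layers, final_norm, lm_head, hidden, entropy_threshold=2.0):
    """Run decoder layers one by one and exit once the intermediate prediction
    looks confident enough (low entropy).

    `layers`, `final_norm`, `lm_head` and `hidden` (the current token's hidden
    state, shape (1, 1, d_model)) are assumed to come from an early-exit-trained
    decoder; the entropy threshold is a tunable hyperparameter.
    """
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)                        # one (simplified) transformer block
        logits = lm_head(final_norm(hidden))          # readout at this depth
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:        # confident enough: stop early
            return probs.argmax(dim=-1), depth        # token id, layers actually used
    return probs.argmax(dim=-1), len(layers)          # fell through to full depth
```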
To enable reliable early exits, researchers in 2024 devised training techniques that teach the model to produce useful outputs at intermediate layers. One notable work is LayerSkip (2024), which introduced a training recipe combining stochastic layer dropout and an early-exit loss to condition the model to "behave" like a smaller model when it exits early. The result is a faster model: LayerSkip achieved speedups between roughly 1.3× and 2.2× on different tasks. These are promising results, indicating that dynamic depth halting can trim inference cost substantially without sacrificing the benefits of deep computation when it's truly needed.
There are a couple of flavors of early exit for LLMs: token-level versus sequence-level. Token-level means deciding, for each generated token, how many layers to use. Sequence-level could mean that for certain easy inputs or prompts, one uses a shallow sub-network globally. Most research has focused on token-level adaptive depth (since even a complex input may have some tokens that are easy to predict – e.g. in a predictable run of words – and some that are hard). The LayerSkip method above essentially does token-level halting: for each new token, it uses as many layers as necessary until a stopping criterion is met (or it hits the last layer). Another recent approach, EE-LLM (ICML 2024) by Chen et al., trained multi-exit LLMs at scale (EE-LLM | Proceedings of the 41st International Conference on ...). The focus was on ensuring that early exits don't create a bottleneck in distributed training/inference. Their system allows the model to choose an exit layer dynamically for each token during inference. Empirically, a well-tuned early-exiting LLM will use fewer layers for easy tokens (saving time) and full depth for the hard cases – a form of adaptive compute. This is somewhat analogous to how humans might not need to reason deeply for obvious predictions but must think longer for tricky ones.
Inference impact: Early exit directly reduces the number of FLOPs per token. If on average a token exits after 60% of the layers, that is about a 1.66× throughput improvement (roughly 40% of the computation saved). Unlike speculative decoding, early exiting does entail a slight risk to accuracy if the exits are too aggressive or not well trained – essentially it approximates the full model with a truncated one.
In practical deployment, early-exit techniques are still at the experimental stage. They require either custom-trained models (like LayerSkip or EE-LLM variants) or at least some calibration of confidence thresholds for existing models (which is non-trivial – as mentioned, a vanilla model's intermediate confidences aren't reliable without retraining). We are beginning to see toolkits to support this: for example, the LayerSkip authors released their code (as a PyTorch-based library) for others to experiment with multi-exit inference (facebookresearch/LayerSkip - GitHub). As these techniques mature, resource-constrained environments could benefit greatly – imagine a local LLM on a phone that dynamically scales its compute to save battery when possible. In high-end deployments, early exits could be combined with other methods: e.g. one could use an early-exit criterion to decide to stop a large model early and then perhaps use a tiny model to finish the sequence (another form of two-stage acceleration). Overall, early exiting can yield roughly 1.5–2× faster inference with negligible accuracy loss (LayerSkip: faster LLM Inference with Early Exit and Self-speculative decoding | by SACHIN KUMAR | Medium), but it requires specially prepared models or careful design. It is a natural complement to other speed-up techniques and reflects a growing trend toward making neural networks' execution more adaptive at runtime.
Dynamic Routing (Mixture-of-Experts)
Dynamic routing refers to methods where the model's architecture can route different inputs or tokens through different subsets of parameters, instead of using the entire network for every inference. The most prominent example in LLMs is the Mixture-of-Experts (MoE) architecture. In an MoE, each layer doesn't consist of one monolithic feed-forward block, but rather is split into multiple expert blocks (say 16 or 32 expert feed-forward networks per layer). A learned gating network then decides, for each token or input, which expert(s) will be active (Applying Mixture of Experts in LLM Architectures | NVIDIA Technical Blog).
Mixture-of-Experts for LLMs was pioneered in earlier years (e.g. Google's Switch Transformer in 2021), but 2024 saw renewed interest and practical deployments. GPT-4 is widely rumored (though not confirmed) to use an MoE design, which could explain how it attains very high performance: it might consist of several expert networks activated conditionally, rather than a single 180B+ dense network. On the open-source front, Mistral's Mixtral is a clear example of a successful MoE LLM available to the community. Dynamic routing in these models happens at each token: the gating network (usually a small learned linear layer) looks at the token's hidden state and chooses, say, the top-2 experts to process it. The token's representation is sent to those experts (which are just feed-forward subnetworks), each expert produces an output, and the outputs are combined. Because each token may go to different experts, the workload is spread across many experts for a batch of tokens. This parallelism is great for throughput – but it also introduces complexity in implementation. If not managed well, some experts might receive far more tokens than others (creating a bottleneck where one computation unit is overloaded while others sit idle). Additionally, if the experts reside on different devices (for a very large model), routing tokens between devices incurs communication costs. These are challenges unique to MoE inference, and 2024 saw significant engineering advances to address them.
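The routing logic itself is compact. Below is a simplified top-k gated MoE feed-forward layer in PyTorch, in the spirit of this design (a linear gate, softmax over the selected experts, weighted sum of their outputs); real implementations add load-balancing losses and much more careful batching across devices, and the dimensions here are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Simplified sparse MoE block: each token is processed by its top-k experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.gate(x)                    # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```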
Inference-time behavior: A well-implemented MoE essentially means that at each layer you do k times the work of a single expert (where k = 1 or 2) instead of running the entire layer. If you have E experts in total, the model's capacity (parameters) is E times that of a single expert, but you only use k/E of them for a given token. In Mixtral 8×7B's case, E = 8 and k = 2, so each token activates only 25% of the expert parameters at that layer. This yields large compute savings when E is big. The trade-off is that to utilize those parameters well, you must be in a regime where different tokens in a batch go to different experts. MoE is especially powerful in large-batch or multi-user scenarios, typical of enterprise deployment: many tokens being processed in parallel keep all experts busy and maximize throughput. If you only generate one token at a time (single-user, sequential generation), you might not fully realize MoE's speed benefits, because at each step only k experts do work and the rest are idle (though wall-clock time may still be similar to that of a smaller model if everything runs on one device). In any case, MoE's big advantage is **scaling model size without scaling per-token compute** (Applying Mixture of Experts in LLM Architectures | NVIDIA Technical Blog). From an accuracy standpoint, MoEs have demonstrated the ability to match or exceed dense models that are much larger than the per-token compute would indicate. As mentioned, Mixtral (46.7B total parameters, about 12.9B active per token) matches or outperforms much larger dense models such as LLaMA-2 70B on many benchmarks (Mixtral of experts | Mistral AI). Google's earlier MoE models (Switch, GLaM) showed that, for instance, a model using roughly 1/4 of the compute of a dense 1T-parameter model could achieve comparable perplexity. In essence, you are leveraging sparsity to get more bang for the buck.
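In rough terms, the per-token cost of an MoE layer can be summarized as follows, treating attention and other dense components as shared and plugging in the Mixtral figures quoted above as an example.

```latex
% Per-token active parameters in a sparse MoE (shared components + selected experts).
% The right-hand figures are the Mixtral 8x7B numbers cited above.
\[
P_{\text{active}} \;\approx\; P_{\text{shared}} \;+\; \frac{k}{E}\,P_{\text{experts}},
\qquad
\text{Mixtral: } E = 8,\; k = 2,\; P_{\text{total}} = 46.7\text{B}
\;\Rightarrow\; P_{\text{active}} \approx 12.9\text{B}.
\]
```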
Open vs proprietary and use cases: Open models like LLaMA and Mistral have begun to adopt MoE to remain competitive in performance with much larger closed models. In a resource-constrained environment, MoE is less applicable because you usually cannot afford to load a model that has, say, 4× the number of parameters of your baseline. For example, if one can barely run a 7B model on a device, one cannot switch to a 28B MoE (with 7B-sized experts) – even though at runtime it might only use 7B worth of compute, the memory/storage overhead is quadrupled (unless some experts are off-loaded, which is complex). Thus, MoEs tend to shine in large-scale enterprise deployments where memory and hardware can be scaled out (multiple GPUs, high-speed interconnects, etc.). The dynamic routing ensures the model "right-sizes" the computation – effectively, easy inputs might only activate some experts and hard inputs activate others, but no single input ever activates everything at once.
It's worth noting that dynamic routing in a broad sense could also include techniques like routing queries to different specialized models (mixture-of-tasks or mixture-of-prompts), or retrieving information from external tools instead of relying on parametric knowledge. However, within the scope of this review the main focus is internal MoE routing, as it has been a major theme in 2024's literature. **In summary, MoE-based dynamic routing allows LLMs to achieve disproportionately high accuracy for a given inference compute budget by using only a fraction of the model's parameters per token** (Mixtral of experts | Mistral AI). The trade-offs are increased system complexity and memory usage, meaning this approach is primarily used in high-end deployments. When deployed correctly, MoEs can yield enterprise-scale models that are both faster and more accurate than their dense counterparts – a win-win that justifies the added complexity (Mixtral of experts | Mistral AI). We are likely to see more hybrid architectures (perhaps combining MoE with early exits and other tricks) in pursuit of even greater efficiency.
Deployment Considerations: Resource-Constrained vs Enterprise
The applicability of these test-time techniques varies greatly between a resource-constrained environment (like a personal device or single-GPU setup) and a large-scale enterprise deployment (cluster of GPUs or cloud service). We highlight the differences and best use cases for each scenario:
• Edge or Resource-Constrained Environments: On a small device or in a setting with limited compute, latency and memory are at a premium, and one typically can't afford to multiply the inference workload. Techniques that require multiple forward passes or multiple models (like majority voting or speculative decoding) are less attractive here unless the model is very small to begin with. For instance, doing majority voting with a 7B model on a laptop will linearly increase runtime – often not feasible if the user needs a quick answer. Similarly, speculative decoding on a single GPU might be limited by memory if you have to load a second model; however, if the draft model is sufficiently small, it can still be worthwhile. In fact, speculative decoding is one of the more promising methods even for local setups: e.g. one could use a 1B draft to accelerate a 7B model – this might yield, say, ~1.5× speedup, which can be meaningful for long generations. Early exit strategies could be very useful on-device, but currently require a model that was trained for that purpose (there are not yet off-the-shelf multi-exit LLMs widely available). If such models become available, a phone or browser running an LLM could automatically save compute on easy portions, improving efficiency. Dynamic MoE routing, on the other hand, is usually not suitable for small environments – an MoE model might have tens of billions of parameters (most of which are inactive at any given moment, but they still need to be stored). Unless the device can hold the full model (which might be 4× or 8× larger than an equivalent dense model), MoE won't be usable. In summary, resource-constrained deployments tend to prefer methods that don't multiply model size. Quantization and distillation are more common for these cases (orthogonal techniques that reduce model size), whereas majority voting or ensembles are rarely used on edge devices. Speculative decoding stands out as a technique that provides real speed benefits for long texts, with minimal memory overhead (the draft model can be quite small) – we may see optimized libraries enabling speculative decoding even in local AI frameworks, given its advantages. Early exits could similarly become an automatic feature – e.g. a future library might allow a model to skip some internal computation dynamically if a configured confidence threshold is met. All of these would help bring down latency on lower-end hardware.
• Enterprise and Large-Scale Deployments: In a data center or cloud setting, one often has abundant compute and can parallelize tasks, but the goals are to minimize cost (GPU hours) and maximize throughput and reliability for many users. Here, majority voting and other ensemble methods become more viable if the use case demands the highest accuracy. An enterprise could run, for example, an LLM 5 times in parallel on 5 GPUs to get an answer with self-consistency – if that answer is mission-critical (say, a medical or legal query), the extra cost is justified by the higher confidence in correctness. However, in many production systems a deterministic or single-pass approach is preferred for latency; thus majority voting is more often used in evaluation or offline processing rather than on every single user query. Speculative decoding is extremely relevant in enterprise deployment – in fact, companies like OpenAI have likely integrated it server-side to speed up ChatGPT's responses (OpenAI's own blog hints that they continuously improve GPT-4's latency, and speculative techniques are a known method).