Introduction and Beam Search Limitations
Beam search has long been the go-to decoding method for many language generation tasks. It explores multiple candidate sequences in parallel and selects the highest-probability sequence overall, which works well for tasks with a single correct answer, like translation (Text generation strategies). However, for open-ended generation (chatbot replies, story writing, etc.), beam search often produces degenerate outputs – text that is unnaturally repetitive or bland (A Contrastive Framework for Neural Text Generation). Research has noted that maximizing likelihood (as beam search does) can lead to dull, repetitive phrases, a phenomenon sometimes called the "beam search curse" or neural text degeneration. In fact, even advanced Large Language Models (LLMs) still suffer from this repetition problem under greedy or beam decoding (A Thorough Examination of Decoding Methods in the Era of LLMs). Beam search outputs on open-ended tasks tend to have lower diversity scores (e.g. MAUVE) and include more redundant content compared to more stochastic methods. These limitations motivate decoding strategies beyond beam search that inject randomness or other heuristics to improve creativity and coherence.
In this report, we explore a range of modern decoding techniques used in 2024–2025 for large-scale language models. We focus on strategies such as random sampling with temperature, top-k sampling, nucleus (top-p) sampling, typical decoding, contrastive search, Mirostat, and speculative decoding. We discuss how each method works, practical implementation details (with an emphasis on Hugging Face Transformers and PyTorch usage), and how they affect generation quality for chatbots, writing assistants, and other creative text generation. We also compare when and why to use each method, highlighting edge cases they address and empirical trade-offs.
Random Sampling and Temperature
The most basic alternative to beam search is multinomial sampling, sometimes called ancestral sampling. Instead of always picking the highest-probability token, the model randomly samples the next token according to the predicted probability distribution (Text generation strategies). This stochastic approach can drastically reduce repetitive loops because even if the top token is the same every time, there's always a chance to pick a different continuation. In practice, to use random sampling in Hugging Face's Transformers, you set do_sample=True when calling model.generate(...) (with num_beams=1 to avoid beam search). This gives every token with non-zero probability a chance to be selected at each step.
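For instance, here is a minimal sketch of multinomial sampling with Transformers (the gpt2 checkpoint is just a stand-in; any causal LM and its tokenizer work the same way):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; any causal language model works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
# do_sample=True enables multinomial (ancestral) sampling;
# num_beams=1 ensures beam search is not used.
output_ids = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,
    max_new_tokens=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```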
Temperature is a crucial parameter that controls the randomness or "creativity" of sampling (Decoding Strategies in Large Language Models). Technically, temperature T rescales the model's logits (predicted token scores) before sampling: a lower T < 1 sharpens the probability distribution (making the model more confident in a few top choices), while a higher T > 1 flattens the distribution (making it more likely to pick a less probable token). Intuitively, low temperature = more deterministic, and high temperature = more random. For example, at T = 0 the model would always pick the highest-probability token (greedy decoding), whereas at T = 1.0 we sample exactly from the model's learned distribution, and at T = 1.5 or 2.0 the output becomes quite unpredictable. In code, you can simply pass a temperature value to the generate function (e.g. temperature=0.7) to modulate this randomness.
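To see concretely how temperature reshapes a distribution, here is a small, self-contained PyTorch sketch (the logit values are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

# Made-up logits for five candidate tokens.
logits = torch.tensor([4.0, 3.0, 2.0, 1.0, 0.0])

for T in (0.5, 1.0, 1.5):
    # Temperature divides the logits before the softmax.
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")
# Lower T concentrates mass on the top token; higher T flattens the distribution.
```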
Setting an appropriate temperature is important. A higher temperature encourages exploration and can produce more creative or unexpected phrases, which is great for creative writing or brainstorming. But if set too high, the text may become incoherent or off-topic. Conversely, a lower temperature keeps the output focused and relevant (useful for factual answers or coding), but if it's too low (approaching 0), the model may fall back into deterministic habits and even repetition. Many chatbots and writing assistants in 2024 use a moderate temperature (like 0.7 or 0.8) to strike a balance between creativity and coherence. Empirically, aligning temperature to the task is key: one study found that for closed-ended tasks, a low temperature (close to greedy) often yields best accuracy, while for open-ended tasks a higher temperature improves diversity (A Thorough Examination of Decoding Methods in the Era of LLMs).
Top-k Sampling
A popular refinement of pure sampling is Top-k sampling. This method limits randomness by truncating the probability distribution to the top k most probable tokens and then sampling only from those (Decoding Strategies in Large Language Models). In other words, at each step the model considers the k highest-probability next words and ignores all others. The next token is then sampled from this top-k set in proportion to the (renormalized) probabilities of its members. The parameter k is usually a small number like 50 or 100, ensuring the model doesn't pick extremely unlikely tokens that could derail the output. To use top-k in practice, one enables sampling (do_sample=True) and sets top_k=k in the generation config (for example, top_k=50 to allow 50 candidates) while often keeping temperature at a moderate value.
Why use top-k? It provides a simple way to boost diversity while capping the risk of nonsense. By always allowing k options, the model can still say something unexpected, but any token ranked lower than k in probability is completely filtered out. This ensures that the random choice is at least somewhat plausible according to the model. Top-k was found to reduce unusual or out-of-context completions that full sampling might occasionally produce. It's useful in applications where you want some variability but not at the cost of intelligibility – for example, a dialogue system where the response should feel natural but not always the same.
However, choosing k is a trade-off. A small k (e.g. 5 or 10) makes the output safer and more coherent but can become repetitive or "stuck" in a loop, as the model keeps cycling through the same few high-probability words (MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY | OpenReview). In fact, research shows that with low k, the model's perplexity tends to decrease over time, leading to the so-called "boredom trap" where the text becomes trivial and repetitive. On the other hand, a very large k (say 100 or more) approaches pure sampling; perplexity can increase uncontrollably with length, risking incoherent rambling (the "confusion trap"). So k must be tuned for the model and task. Many practitioners found values like 40 or 50 to be good defaults for GPT-style models: these allow a fair range of word choice without letting in the really bizarre options.
Implementation note: In Hugging Face Transformers, top-k sampling is supported by simply passing the top_k parameter. For example: model.generate(prompt_ids, do_sample=True, top_k=50, temperature=0.8, max_new_tokens=100) would generate with top-50 sampling. PyTorch's lower-level API doesn't have a one-liner for this, but one can obtain the model's logits, zero out all but the largest k logits, and then sample – the Transformers library does essentially that under the hood.
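For the curious, here is a minimal sketch of that manual top-k step in plain PyTorch (the sample_top_k helper and the toy logits are illustrative, not library code):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """Sample one token id from the top-k of a 1-D logits tensor."""
    logits = logits / temperature
    topk_values, topk_indices = torch.topk(logits, k)
    # Mask everything outside the top k with -inf, which zeroes its probability.
    filtered = torch.full_like(logits, float("-inf"))
    filtered[topk_indices] = topk_values
    probs = torch.softmax(filtered, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

# Example with made-up logits over a tiny vocabulary.
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_top_k(logits, k=3, temperature=0.8))
```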
Use cases & tips (Top-k):
Use for: Story or dialogue generation where you want diverse wording but still avoid completely unlikely words. Top-k adds controlled spice to the output.
Edge cases: If you notice outputs looping or repeating excessively, your k might be too low (the model falling into a narrow high-probability groove). Conversely, if outputs start making no sense or veering off-topic, k might be too high.
Pros: Simple to understand and implement; effectively prunes the low-probability tail that might contain misspellings or irrelevant tokens. Ensures a minimum quality floor for word choices.
Cons: The fixed cutoff might exclude contextually appropriate words just because their probability was slightly below the top-k threshold. It doesn't adapt to uncertainty – e.g., if the model is very unsure (broad distribution), you still only allow k words, potentially cutting off the true best continuation if it happened to be ranked k+1. Often combined with nucleus sampling (top-p) for a more adaptive approach.
Nucleus Sampling (Top-p)
Nucleus sampling, or top-p sampling, is an alternative truncation strategy that is adaptive to the distribution's shape (Decoding Strategies in Large Language Models). Instead of a fixed number of tokens, nucleus sampling chooses a probability cutoff p (a cumulative probability threshold). The model sorts all tokens by probability at the current step and takes the smallest set of tokens whose total probability mass is at least p. This set is the "nucleus" of likely options. The next token is then randomly sampled from this nucleus. In effect, top-p includes more tokens when the probability distribution is flat (many plausible options), and fewer tokens when the model is confident in a few options.
For example, if p = 0.9, the model will include tokens until their summed probability ≥ 0.9. In a highly predictable context (say the model strongly expects one particular next word), the nucleus might only contain a couple of tokens (those few choices together already sum to 0.9). In a more uncertain context, the nucleus will be larger, potentially dozens of tokens, until that 90% mass is covered. This dynamic cutoff means the number of candidates varies by step: some steps behave like top-5, others like top-50, etc., depending on how confident or uncertain the model is. Nucleus sampling thus avoids the sharp truncation of top-k and typically produces more flexible and diverse outputs.
Nucleus sampling became popular after Holtzman et al. (2020) introduced it as a fix for the text degeneration issues of beam and top-k sampling. It tends to preserve coherence better than pure random sampling (since it never picks from the extreme tail of the distribution) while still preventing the model from getting locked into one high-probability word sequence (A Contrastive Framework for Neural Text Generation). Indeed, the SimCTG study noted that nucleus sampling reduces repetitive degeneration of text but can sometimes introduce semantic inconsistencies – the generated text might diverge from the prompt's intent or contain mild contradictions, since the model occasionally pursues a less likely path. This is the price of added creativity: the model might go off-script.
In libraries: using nucleus sampling in Transformers is as easy as setting top_p=0.9 (or another probability) along with do_sample=True. It can be combined with temperature as well (e.g. top_p=0.9, temperature=0.8 is a common setting for coherent yet interesting outputs).
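To make the cumulative-mass idea concrete, here is a simplified sketch of the nucleus filtering step in plain PyTorch (the sample_top_p helper and toy logits are illustrative; the library's internal implementation differs in details):

```python
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample one token id from the smallest set of tokens whose mass >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens until the cumulative mass reaches p (always keep at least one).
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus_probs = sorted_probs[:cutoff]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()  # renormalize inside the nucleus
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return int(sorted_indices[choice].item())

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])  # made-up logits
print(sample_top_p(logits, p=0.9))
```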
Use cases & tips (Top-p):
Use for: Chatbots and writing assistants where you want the reply to usually be sensible but occasionally original. Nucleus sampling is almost a default in many open-ended generation setups due to its balanced performance (Decoding Strategies in Large Language Models). It shines in creative tasks like storytelling or dialog, and is also exposed in GPT-style APIs (OpenAI's API offers a top_p parameter alongside temperature).
Edge cases: If p is set too low (too restrictive), you may still see repetitive outputs – the nucleus might always consist of the same few words, similar to a low-k scenario. If p is too high (e.g. 1.0 means no truncation at all), you revert to unconstrained sampling and may get incoherence. Common values are p = 0.8 to 0.95. The original paper recommended around 0.9 as a good compromise.
Pros: Adaptive diversity – the method adjusts to context uncertainty. This typically yields more interesting and fluent text than a fixed top-k cutoff. It often outperforms top-k in human evaluations of open-ended tasks, producing outputs that are both diverse and on-topic.
Cons: The randomness can still cause occasional contradictions or topic drift, as noted in research (A Contrastive Framework for Neural Text Generation). It also introduces a new hyperparameter (p) that might need tuning per task. When the model's predicted distribution is miscalibrated, nucleus sampling might include a token that should have been excluded or vice versa, potentially affecting factual accuracy in applications like long-form QA.
In summary, top-k vs. top-p: top-k selects a fixed number of options, while top-p selects a variable number until a probability mass is reached. Each has unique strengths: top-k ensures a minimum quality for each token choice, whereas top-p ensures a minimum confidence mass is retained. In modern LLM libraries, you can even combine them (some implementations allow both – effectively the intersection: take top-k of the distribution, then apply top-p on that set).
Typical Decoding (Locally Typical Sampling)
Typical decoding is a more recent strategy (circa 2022–2023) that takes a different perspective: rather than truncating by probability rank or mass, it selects tokens whose surprise (information content) is closest to what the model normally expects (Locally Typical Sampling - ACL Anthology). The idea comes from information theory and the notion of a “typical set.” In a nutshell, language has an intrinsic entropy in each context – roughly the average surprise value of tokens in that context. Typical decoding tries to choose tokens that are neither too surprising nor too predictable compared to this expected entropy.
In practice, at each generation step the algorithm computes the entropy H of the model's current probability distribution (i.e. the expected surprise in bits). It then identifies the set of candidate tokens whose individual negative log-probabilities (−log P(x)) are close to H. This forms a locally typical set of tokens for that context. Tokens whose surprise is far below H (much too predictable) or far above H (much too unpredictable) are excluded. Finally, the next token is sampled from this typical set (often also enforcing that the set covers a certain cumulative probability mass, analogous to top-p, to avoid it being too narrow). Essentially, typical decoding dynamically filters out tokens that are too obvious or too odd for the model's current state.
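A simplified sketch of that filtering step in PyTorch may help; the sample_typical helper below is illustrative, with the typical_p mass threshold playing the role discussed further down:

```python
import torch

def sample_typical(logits: torch.Tensor, typical_p: float = 0.3) -> int:
    """Sample one token whose surprise is close to the distribution's entropy."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()        # expected surprise H
    surprise = -log_probs                       # per-token surprise -log P(x)
    deviation = (surprise - entropy).abs()      # distance from the "typical" surprise
    order = torch.argsort(deviation)            # most typical tokens first
    cumulative = torch.cumsum(probs[order], dim=-1)
    cutoff = int((cumulative < typical_p).sum().item()) + 1  # keep enough probability mass
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return int(kept[choice].item())

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])  # made-up logits
print(sample_typical(logits, typical_p=0.3))
```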
The motivation, as Meister et al. (2023) put it, is that humans tend to choose words with an information content close to the average information content of the context (Locally Typical Sampling - ACL Anthology). That is, we subconsciously aim to be efficient (not say something overly redundant) but also avoid being confusingly random. Their proposed locally typical sampling algorithm enforces this criterion. Notably, their evaluations showed that typical decoding can match the quality of nucleus and top-k sampling in summarization and story generation, while consistently reducing degenerate repetitions. In other words, it often produces fluent, coherent text with fewer occurrences of the model getting stuck in a loop.
From an implementation standpoint, typical decoding introduced a new parameter (often called typical_p or typical mass in libraries). This parameter (between 0 and 1) controls how much of the distribution's probability mass to include around the entropy peak. For instance, typical_p=0.2 might mean the algorithm narrows the candidate set to the tokens closest to the entropy until 20% of the probability mass is covered. This is somewhat analogous to top-p but applied to the surprise-ranked list rather than the probability-ranked list. Hugging Face Transformers began supporting typical decoding in the past couple of years – you can enable it via typical_p in GenerationConfig or the generate function (when do_sample=True). It's sometimes called locally typical sampling in documentation (Understanding Large Language Model Parameters · Voxta Documentation).
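A hedged usage sketch (the model and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The old lighthouse keeper", return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=True,   # typical decoding is applied on top of sampling
    typical_p=0.5,    # "typical mass"; lower values filter more aggressively
    max_new_tokens=100,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```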
Use cases & tips (Typical decoding):
Use for: Scenarios where you want to minimize repetition and blandness without manual tuning. Typical decoding is good for long-form generation (e.g. story or essay generation) because it adapts at each step to maintain a consistent “interestingness” level. Users have found it helpful to avoid the extreme of safe but dull outputs on one side and random tangents on the other.
Edge cases: If the model's predictions are very peaked (low entropy) or very flat (high entropy), typical decoding could, in extreme cases, yield an empty candidate set or a very small one. In practice a fallback to greedy decoding or a parameter ensuring a minimum number of tokens is often used. The typical_p value also matters: a very low typical_p (say 0.1) will force the model to pick almost always mid-surprise tokens, which might reduce creativity; a very high typical_p (0.9+) makes it behave closer to normal sampling. Many libraries default it to around 0.2 to 0.5.
Pros: Addresses both repetition and irrelevance. By design it avoids tokens that are too predictable (which cause dull repetition) and too unpredictable (which cause incoherence). This often yields high-quality, flowing text that feels more human-like in balance. Studies showed it reduces repetitive output compared to top-k or top-p alone (Locally Typical Sampling - ACL Anthology). It's also a one-stop method – you don't necessarily need separate top-k or top-p if you use typical decoding (in fact, some implementations ignore top-k/p when typical is on).
Cons: It’s a bit less intuitive to tune. The concept of “typical mass” is abstract, and the optimal setting may vary. Also, typical decoding by itself doesn’t explicitly handle all issues – for example, it doesn’t directly consider semantic coherence beyond that inherent in the probabilities. So it could still pick a token that, while of typical surprise, might introduce a subtle topic shift or factual error. In practice, it’s often combined with other constraints (like repetition penalties or stop-word bans) when used in chatbots. Another consideration is that typical decoding is relatively new – not all frameworks had it until recently, but as of 2024, it’s available in popular LLM libraries and even in engine backends like text-generation-inference and llama.cpp.
Contrastive Search (Contrastive Decoding)
While the above sampling methods inject randomness, contrastive search is a deterministic decoding strategy that aims to get the best of both worlds: the coherence of greedy decoding and the diversity of sampling (A Contrastive Framework for Neural Text Generation). It was proposed in 2022 by Su et al. as part of a framework called SimCTG. Contrastive search works by simultaneously considering the model's confidence and a penalty for repeating what's already been generated. At each step, instead of picking the single highest-probability token, it looks at the top-k candidates (say top-10) and chooses the token that maximizes:
score(x) = (1 − α) · P(x | context) − α · degeneration_penalty(x, context)
Here, the second term is a degeneration penalty measuring how similar token x would be to the existing context (A Contrastive Framework for Neural Text Generation). In practice, this is often implemented as the maximum cosine similarity between the candidate token's embedding and any of the context's token embeddings. If a candidate would produce a token embedding nearly identical to something already in the recent context, it gets a big penalty. The hyperparameter α controls how strongly we penalize this repetition or redundancy. When α = 0, this method reduces to greedy search, and when α is large, the algorithm will favor tokens that introduce new semantic content (at the risk of slightly lower immediate probability).
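The scoring rule can be sketched as follows; the tensors here are placeholders, and real implementations take the candidate and context representations from the model's hidden states:

```python
import torch
import torch.nn.functional as F

def contrastive_pick(cand_probs: torch.Tensor,      # (k,) model probability of each candidate
                     cand_hidden: torch.Tensor,     # (k, d) representation of each candidate
                     context_hidden: torch.Tensor,  # (t, d) representations of the context so far
                     alpha: float = 0.6) -> int:
    """Return the index of the candidate with the best confidence/novelty trade-off."""
    # Cosine similarity of every candidate against every context token: shape (k, t).
    sims = F.cosine_similarity(cand_hidden.unsqueeze(1), context_hidden.unsqueeze(0), dim=-1)
    degeneration_penalty = sims.max(dim=1).values   # worst-case similarity per candidate
    scores = (1 - alpha) * cand_probs - alpha * degeneration_penalty
    return int(scores.argmax().item())

# Toy example with made-up values: 3 candidates, hidden size 4, context length 5.
cand_probs = torch.tensor([0.5, 0.3, 0.2])
cand_hidden = torch.randn(3, 4)
context_hidden = torch.randn(5, 4)
print(contrastive_pick(cand_probs, cand_hidden, context_hidden, alpha=0.6))
```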
Hugging Face Transformers supports contrastive search by allowing you to set penalty_alpha (this corresponds to α) and top_k. For example, model.generate(input_ids, penalty_alpha=0.6, top_k=4, ...) will trigger contrastive decoding (Text generation strategies). The model will each time consider the top-4 tokens and pick the one that achieves the best trade-off between model probability and embedding dissimilarity to the context. The result is that the output avoids bland common continuations and avoids repeating phrases, yet remains logically connected to the prior text.
One of the headline results for contrastive search was that it produced longer, more coherent, and less repetitive outputs compared to nucleus sampling in open-ended generation benchmarks. For instance, it was shown to generate non-repetitive yet coherent paragraphs where nucleus sampling might drift off-topic or repeat content. Essentially, contrastive decoding explicitly fights the degeneration problem (repetition) by leveraging the model's own representation space. By ensuring each new token adds some "novelty" (is not just parroting what was already said), it keeps the text moving forward interestingly. This can also help maintain better alignment with the prompt's semantics, as reported in reasoning and knowledge-intensive tasks (Collections - Hugging Face) – the method prevents the model from taking a high-probability but contextually redundant turn.
Use cases & tips (Contrastive search):
Use for: Dialog systems and writing assistants requiring high coherence over long outputs. Contrastive search is great when you want a single, high-quality completion rather than many varied attempts. It’s been noted to improve factuality and reasoning continuity in LLM outputs (Collections - Hugging Face). For example, when generating a detailed answer or story, contrastive search will make sure the narrative doesn’t keep circling the same points. Some recent research even builds on it (e.g. Decoding by Contrasting Layers (DoLa) to reduce hallucinations by comparing internal layers of the model) (Text generation strategies - Hugging Face), underscoring the value of the approach for maintaining consistency.
Edge cases: Because it is deterministic (like beam search), if you run contrastive search twice with the same input and parameters, you get the exact same output. That can be a downside if you want multiple alternative continuations – you'd have to adjust α or add randomness manually to get variety. Also, contrastive search requires access to token embeddings or hidden states for similarity computations, which standard model APIs provide (Transformers does this internally). It's slightly more computationally heavy per step than pure sampling, but for modern models this overhead is usually negligible compared to the forward pass.
Pros: Highly coherent, low repetition. It tackles the primary weakness of greedy/beam (repetition) without resorting to pure randomness. It is parameter-efficient in the sense that the authors found one set of α and k often works well across many outputs, reducing the need for extensive hyperparameter search (A Thorough Examination of Decoding Methods in the Era of LLMs). It's also interpretable: you can literally see which token was chosen and which were avoided due to high similarity with the context.
Cons: It may sacrifice some creativity. Since the method always chooses from the top-k probable tokens (A Contrastive Framework for Neural Text Generation), truly surprising or outside-the-box continuations (which might be outside the top-k altogether) will never be chosen. Also, if not calibrated, it could lead to slightly stilted language – e.g., always trying to say something new can sometimes make the text overly formal or continuously introduce new concepts without dwelling, which might not always be the desired style. Another empirical finding is that, while contrastive search reduces degeneration, some studies found pure stochastic methods can still achieve higher diversity scores on very open-ended tasks. In fact, one analysis noted that contrastive and other deterministic methods were slightly inferior to sampling in overall quality metrics for story generation, despite eliminating obvious repetition. This suggests a trade-off: if maximum creativity is the goal, a bit of randomness might still be needed.
In summary, contrastive decoding is a powerful strategy for one-pass generation of a high-quality result. It's being adopted in 2024-era LLM systems for tasks like long-form answers, where you don't want to sample many outputs and pick one – you want the model to just generate a good answer in one go with minimal post-filtering. Tools like Hugging Face make it easy to use: just pick a reasonable penalty_alpha (around 0.5–0.7 was suggested in the original paper) and a top_k (like 4 or 10), and let the model do the rest.
Mirostat: Perplexity-Controlled Sampling
Mirostat is an advanced decoding algorithm that introduces an adaptive, feedback-driven approach to maintain a target level of surprise (perplexity) in the generated text (Understanding Large Language Model Parameters · Voxta Documentation). Proposed by Basu et al. (ICLR 2021) and refined through 2024, Mirostat addresses a key observation: as text generation progresses, methods like top-k or top-p can cause the output's perplexity (uncertainty) to systematically drop or rise, leading to the boredom or confusion traps we discussed (MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY | OpenReview). Instead of fixing k or p, Mirostat dynamically adjusts the effective sampling cutoff at each time step to keep the output's entropy near a target value.
In simpler terms, Mirostat acts like a thermostat for text perplexity (hence the name). It has a target entropy (call it τ) which corresponds to a desired level of "creativity," and it maintains an adaptive surprise threshold that it updates after each token based on the error between the chosen token's surprise and that target. Concretely, one version of the algorithm does this:
Choose the smallest top-k such that the entropy of the top-k portion is about 2τ (this finds a k that roughly matches the target perplexity).
Sample a token from those top-k. Compute the surprise of that token (−log P_model).
If that surprise was higher than the target (text becoming too random), lower the threshold a bit; if it was lower (text too predictable), raise it a bit.
Proceed to the next token with the adjusted threshold.
This feedback loop ensures that the generated text doesn't spiral into low-perplexity repetitiveness or high-perplexity incoherence. Mirostat essentially "steers" the generation to maintain a consistent quality (Understanding Large Language Model Parameters · Voxta Documentation). The authors demonstrated that with a proper target perplexity, one can eliminate virtually all sentence-level repetition (once the target τ is set high enough, the model almost never falls into loops). At the same time, it avoids the confusion trap by not letting perplexity gradually blow up.
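A minimal sketch of this feedback loop, loosely following the simpler Mirostat 2 variant (which caps per-token surprise directly); all names and values here are illustrative:

```python
import math
import torch

def mirostat_v2_step(logits: torch.Tensor, mu: float, tau: float = 5.0, eta: float = 0.1):
    """One Mirostat-2-style step: sample under a surprise cap, then update the cap."""
    log_probs = torch.log_softmax(logits, dim=-1)
    surprise = -log_probs / math.log(2.0)        # per-token surprise in bits
    allowed = surprise <= mu                     # keep tokens under the current cap
    if not allowed.any():                        # fallback: keep the least surprising token
        allowed = surprise == surprise.min()
    probs = torch.softmax(logits.masked_fill(~allowed, float("-inf")), dim=-1)
    token = int(torch.multinomial(probs, num_samples=1).item())
    error = surprise[token].item() - tau         # observed surprise vs. target tau
    mu = mu - eta * error                        # feedback update of the cap
    return token, mu

# Toy usage with made-up logits; the cap is commonly initialized to 2 * tau.
mu = 2 * 5.0
token, mu = mirostat_v2_step(torch.randn(100), mu)
print(token, mu)
```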
In practical use by 2025, Mirostat has been integrated into some LLM libraries and engines. For instance, the popular local LLM engine llama.cpp supports Mirostat, and Hugging Face's text-generation-inference server (used for production model deployment) also offers Mirostat sampling options. Typically, there are two modes (often referred to as Mirostat 1 and 2, which are slight algorithm variants) and a couple of parameters: mirostat_tau (the target entropy) and mirostat_eta (the learning rate for adjustment). Default values might be τ = 5 and η = 0.1, but these can be tuned. A mirostat_mode flag usually toggles it on. When Mirostat is enabled, it overrides other sampling settings (top-k, top-p, etc.) to solely control the process.
Use cases & tips (Mirostat):
Use for: Situations where you want reliable, hands-off generation quality over potentially long texts. Mirostat is particularly appealing if you don’t want to manually search for the best top-p or temperature – you set a target “interestingness” level and the algorithm self-corrects. This is useful in interactive storytelling AI, lengthy conversational agents, or any deployment where the generation may go on for many sentences and you want to avoid drift. It’s also helpful for new models where good decoding parameters are unknown; Mirostat can adapt on the fly.
Edge cases: You still need to choose a good target τ (perplexity). If set too low, the model will keep adjusting to make the text very predictable (dulling it down); if too high, it will wander into weird territory. The good news is that humans prefer text whose perplexity is neither too low nor too high, and Mirostat's authors note that beyond a certain τ threshold, repetitions drop off. In practice, one might try a few values (e.g. τ = 5, 6, 7) to see which produces the desired level of creativity. Another consideration: because Mirostat constantly changes the distribution cutoff, small changes to the prompt or parameters can send the feedback trajectory down a different path, making outputs harder to reproduce across runs and settings.
Pros: Dynamic control and consistency. It balances coherence vs. diversity automatically (Understanding Large Language Model Parameters · Voxta Documentation). Users have observed that Mirostat can produce high-quality outputs without needing to babysit parameters, avoiding both repetitive rants and off-the-rails rambling in long conversations. It is also notable for tying the control directly to the model's perplexity, a more global statistic, rather than to local rank or probability thresholds. This means it provides a more direct handle on output "quality" as perceived by perplexity.
Cons: It's a bit more complex and not as widely tested as top-k/p. Some implementations initially had quirks or needed debugging (since it's iterative). Also, because it relies on a feedback loop, it could in theory react poorly to sudden changes in style or topic (though the adjustments are generally small, η = 0.1, so it won't oscillate wildly). Computationally, it's only slightly heavier than top-p (just a bit of extra math each step). Another con: at present, mainstream Transformer APIs (like the basic generate in Hugging Face) might not expose Mirostat by default – one might need to use specialized interfaces or community forks. That said, given its promise, usage in 2024–2025 has grown, and tools like KoboldAI, Ooba's text-gen UI, and others catering to creative writing often incorporate Mirostat as an option.
In summary, Mirostat is like having an auto-pilot for decoding, continuously nudging the generation to stay at the desired "interestingness" level. This yields outputs that remain coherent and diverse over long passages without constant parameter tweaking by the user. It's a promising addition to the 2024 arsenal of decoding techniques, especially for applications that generate extended texts or where trial-and-error tuning of top-p/top-k isn't feasible.
Speculative Decoding for Faster Generation
All the strategies discussed so far focus on what token to generate next. Speculative decoding, by contrast, is about generating tokens more efficiently – it’s a decoding acceleration technique. With the ever-growing size of LLMs, speeding up inference has become crucial in 2024–2025 deployments. Speculative decoding is a clever method to leverage a smaller “assistant” model to help a larger model generate text faster, without changing the final outcome (Faster Assisted Generation with Dynamic Speculation).
Here’s how it works: we have two models – a fast draft model (small) and a slower target model (large, high-quality). The process proceeds in iterations:
Stage 1 (Draft): The small model quickly generates a batch of, say, N tokens autoregressively (one after the other) as a draft continuation.
Stage 2 (Verify): The large model then takes the same starting context and evaluates those N draft tokens in parallel (in one forward pass) to see if it would likely generate them. If the large model "agrees" with the draft (meaning those tokens are all among its plausible predictions), we accept them and move on. If there's a discrepancy at some token, we roll back to that point and let the large model generate the next token itself (falling back to normal generation for that step).
By doing this, we essentially skip many single-token steps of the large model. The large model is used to check a chunk of tokens instead of generate them one by one, which can yield big speedups. Importantly, if the draft tokens are accepted, the large model’s output will exactly match what it would have produced had we generated sequentially – thus preserving the final text’s quality and distribution (it’s like a form of clever rejection sampling that doesn’t alter the probability of any sequence) (Faster Assisted Generation with Dynamic Speculation).
Recent implementations, such as in Hugging Face Transformers 4.45.0, have made speculative decoding easy to use: you can simply pass an assistant_model to generate() and the library handles the two-stage process. The "speculation lookahead" (how many tokens the draft model generates per iteration) can be fixed or dynamic. In fact, research by Leviathan et al. and an Intel Labs team introduced dynamic speculative decoding, where the number of draft tokens is adjusted on the fly based on the recent success (acceptance) rate. If all draft tokens are consistently accepted, the system dares to draft more tokens at once to gain even more speed; if many drafts get rejected, it drafts fewer tokens to avoid waste.
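A hedged sketch of assisted generation in Transformers (the OPT model pairing is just an illustration; in practice the assistant should be a much smaller model that shares the target model's tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative pairing: a larger target model and a small draft model from the same family.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target_model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")
draft_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
# Passing assistant_model triggers the draft-and-verify loop internally.
output_ids = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=100,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```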
The payoff is significant: speculative decoding can accelerate text generation by 2-3× in throughput in practical settings (Faster Assisted Generation with Dynamic Speculation). Hugging Face reported up to 2.7× speedups on certain tasks using a 6.7B parameter model with a small assistant, and even made this the default for their "assisted generation" pipelines in the latest Transformers releases. NVIDIA's inference libraries and others have similarly adopted speculative techniques for latency gains. All this is achieved without degrading output quality or requiring model retraining – the large model's accuracy is preserved, since it ultimately vets every token.
Use cases & tips (Speculative decoding):
Use for: Improving latency and throughput in serving large models. For example, real-time chatbot services or applications like code autocompletion (where speed matters) benefit greatly. If you have a powerful 30B model that’s a bit slow, pairing it with a 6B draft model can yield answers almost as good, but much faster.
Edge cases: You need a reasonably well-aligned smaller model. If the draft model’s style or domain is very different, many tokens will be rejected and you gain little. Common practice is to use a smaller version of the large model (e.g., distilled or pruned model) as the assistant so that its token choices have high chance of acceptance. There is also some memory overhead (both models loaded) and implementation complexity, but frameworks handle a lot of it now. If maximum quality is needed and you don’t care about speed (e.g., offline batch generation for a one-time task), speculative decoding isn’t necessary – it’s purely a speed trade-off.
Pros: Dramatic speedups with minimal engineering effort. It’s one of the few ways to get near-linear scaling of generation speed without waiting for new hardware. And it’s mathematically elegant in that it doesn’t alter the probability distribution of outputs (so you don’t have to worry about it introducing biases or errors).
Cons: It introduces more system complexity – two models, scheduling the draft and verify steps, etc. In early implementations, tuning the number of draft tokens (lookahead) was tricky, but dynamic schemes largely solve this (Faster Assisted Generation with Dynamic Speculation). Also, in worst-case scenarios (if the draft model is too poor), you could even lose time due to overhead, but typically one picks an assistant that is about 5×–10× faster per token than the main model, so even a moderate acceptance rate yields gains.
Overall, speculative decoding doesn’t change how you craft the content of the output, but it revolutionizes how fast you can get it. It’s increasingly common in 2025-era LLM APIs and will likely become standard for any deployment where latency is a concern.
Conclusion and When to Use Each Strategy
As large language models power an expanding array of applications, decoding strategies beyond plain beam search have become essential tools for tailoring generation quality. Each method we’ve discussed has its niche, and often they can be combined or sequentially applied. Here’s a quick recap and comparison to guide when to use which:
Beam Search (baseline): Great for tasks with a well-defined correct output (translation, math solutions) because it maximizes likelihood. But for open-ended generation, it can produce high-probability yet generic or repetitive text (A Contrastive Framework for Neural Text Generation). Use it for deterministic, high-recall requirements, but avoid it for creative tasks – beam search can be a "degeneration magnet" in storytelling or chats (A Thorough Examination of Decoding Methods in the Era of LLMs).
Temperature + Sampling: The fundamental way to inject randomness. Use temperature as a knob for creativity vs. coherence (Decoding Strategies in Large Language Models). Nearly all other strategies (top-k, top-p, etc.) assume do_sample=True and thus work in tandem with a temperature. As a rule of thumb, start with T ≈ 1.0 for free-form creative writing, and lower it (0.7, 0.5, even 0.1) as you need the model to be more focused or deterministic (for factual Q&A or coding).
Top-k Sampling: Use when you want to guarantee a floor on token quality by never allowing extremely low-probability tokens. It's easy to understand and works well for shorter responses where a fixed diversity level is acceptable. Typical setting: k between 20 and 100. Watch out for too low a k causing loops, and too high a k making it moot. Often combined with top-p in practice (many libraries let you use both, truncating by both criteria).
Nucleus Sampling (Top-p): A robust choice for most open-ended generation if you want one method. It adapts to the model's confidence each step (Decoding Strategies in Large Language Models), usually yielding more natural, less repetitive text than top-k alone. Commonly used in chatbots and creative tools for its balance of coherence and surprise. Typical p values of 0.85–0.95 cover most needs; lower if you notice repetitions, higher if you need more daring outputs.
Typical Decoding: If repetition is a major concern or you find top-p still sometimes veers off, typical decoding can help keep the text "on track" information-theoretically. It's a good default for story generation and long compositions where you want to avoid dullness without risking nonsense (Locally Typical Sampling - ACL Anthology). Because it's newer, you might experiment with typical decoding when you're not satisfied with top-p/k results – you may find it produces a more stable yet interesting narrative. Ensure your generation framework supports typical_p and set it modestly (around 0.2–0.5 to start).
Contrastive Search: When you need the best single-shot output quality – e.g., an AI assistant's answer that should be coherent, on-topic, and not repetitive over many sentences – contrastive decoding is a top contender (Text generation strategies). It removes the luck of sampling and replaces it with a principled selection that avoids saying the same thing twice. Use it for long-form answers, analytical or factual generations, or whenever you can't afford a rambling answer. Just remember it's deterministic, so you won't get variation without changing parameters. Typical usage: set penalty_alpha (around 0.5) and a top_k (5–10). It tends to require larger models with good embeddings; very small models might not benefit as much from the embedding similarity check.
Mirostat: Ideal for adaptive, long-haul generation. If you plan to generate, say, a multi-paragraph story or have a chatbot that converses indefinitely, Mirostat can keep the model from drifting into bad habits over time. It's a great "set and forget" method: choose a target entropy that gives the style you like, and let it manage the rest. Because it's somewhat complex, it's more commonly found in research or enthusiast setups than in plug-and-play APIs, but its benefits are becoming widely recognized. If you're an advanced user running models locally or with custom code, try Mirostat when other sampling methods either get boring or crazy over long outputs. It literally aims to avoid both those failure modes by design.
Speculative Decoding: Use this when speed is a priority and you have the resources to run a second model. It's orthogonal to the above methods – in fact, you can use speculative decoding with any of the other strategies guiding the large model's decisions (greedy, sampling, etc.). The quality doesn't change; just the latency improves (Faster Assisted Generation with Dynamic Speculation). If you're serving a production chatbot or an assistant where response time matters, and you can afford to load a smaller draft model, definitely consider speculative decoding. It's already the default in some inference pipelines, showing how practical it is. The only scenario where you wouldn't bother is offline generation, or when your model is small enough to be fast anyway.
Finally, it's worth noting that the optimal decoding strategy can be task-dependent (A Thorough Examination of Decoding Methods in the Era of LLMs). Studies in the LLM era have found that no single method wins in all cases – for instance, extremely open-ended creative writing might actually score better with some stochasticity (and multiple trials combined with voting, etc.), whereas a technical Q&A might demand a more deterministic approach. Alignment of the model (RLHF-tuned vs. base model) also plays a role in how sensitive the generation is to decoding choices. Thus, professionals often mix and match techniques: e.g., use a bit of beam search for short factual prompts, use nucleus or typical sampling for imaginative prompts, maybe even run two decoders and pick the better output (a form of self-consistency decoding, which is beyond our scope here).
In 2024 and 2025, libraries like Hugging Face Transformers have matured to offer all these methods at your fingertips, letting you experiment with decoding as an important part of prompt engineering. By understanding and leveraging decoding strategies beyond beam search, AI practitioners can significantly improve the quality of chatbot responses, the creativity of writing assistants, and the overall reliability of LLM-generated text – tailoring the output to be as factual, fluent, or imaginative as the use-case demands. Each decoding method is a tool in the toolbox, and the best results often come from knowing which tool to apply when, given the model and the task at hand.