Table of Contents
Introduction and Background
Four Main Approaches to Improving LLM Reasoning
Inference-Time Compute Scaling Methods
s1: Simple Test-Time Scaling - Budget Forcing with Wait Tokens
Test-Time Preference Optimization (TPO)
Thoughts Are All Over the Place - Mitigating Underthinking
Trading Inference-Time Compute for Adversarial Robustness
Chain-of-Associated-Thoughts (CoAT)
Step Back to Leap Forward - Self-Backtracking
Scaling Up Test-Time Compute with Latent Reasoning
Can a 1B LLM Surpass a 405B LLM - Compute-Optimal Scaling
Inference-Time Computations for Reasoning and Planning - Benchmark & Insights
Inner Thinking Transformer (ITT) - Dynamic Depth Allocation
Test-Time Scaling for Code Generation (S*)
Chain-of-Draft (CoD)
Industry Applications and Framework Support
Trade-offs, Cost Considerations, and Emerging Trends
Introduction and Background
Large language models (LLMs) have made great strides in complex reasoning tasks by generating and evaluating intermediate steps – an ability often called “reasoning” or “slow thinking.” Unlike basic Q&A models that directly output an answer, reasoning-optimized LLMs break a problem into sub-steps or “thoughts” (sometimes explicitly shown as a chain of reasoning) before finalizing an answer (The State of LLM Reasoning Models). Recent research has focused on improving LLM reasoning capabilities, and in general there are two broad strategies: (1) increasing training compute (e.g. special training/fine-tuning to instill reasoning skills) or (2) increasing inference-time compute (allowing the model to do more work at inference to solve a query) . The latter, known as inference-time scaling or test-time scaling, is analogous to giving the model more “time to think” when answering a question . This review concentrates on recent advances in inference-time compute scaling techniques for reasoning, especially those emerging after the release of DeepSeek R1 in January 2025 . We will first outline the main categories of methods for improving reasoning in LLMs, then dive into detailed developments in inference-time scaling, followed by industry applications, code examples, and analysis of trade-offs.
Four Main Approaches to Improving LLM Reasoning
Current methods to enhance reasoning in LLMs can be grouped into four main approaches (often used in combination):
Inference-Time Compute Scaling – Techniques that improve reasoning without changing model weights, by using more computation during inference (e.g. generating multiple solutions or multi-step reasoning per query). These methods trade extra compute for better answers and can, in principle, be applied to any pretrained model (The State of LLM Reasoning Models) . This category includes strategies like chain-of-thought prompting, self-consistency (majority voting), tree search, iterative refinement, etc., which effectively let the model “think longer” during generation.
Reinforcement Learning (RL) – Training-based approaches where the model learns better reasoning via RL, using reward signals from problem-solving tasks (math, code, etc.). RL can encourage strategic thinking and self-correction abilities, as seen in OpenAI’s o1 model (which used RL to achieve advanced reasoning) . Pure RL approaches can yield powerful reasoners but are challenging due to high compute cost and potential issues like reward hacking or instability .
Hybrid RL + Supervised Fine-Tuning (SFT) – A combination of supervised training and reinforcement learning. Typically, the model is first supervised-finetuned on high-quality reasoning data (e.g. human-written solutions or chain-of-thoughts), then further refined with RL to target specific reasoning behaviors (The State of LLM Reasoning Models). This hybrid can stabilize training (leveraging SFT to provide a strong base) while still using RL to push the model’s reasoning performance beyond what supervised data alone can achieve .
Supervised Fine-Tuning and Distillation – Approaches that rely on supervised learning, sometimes augmented by knowledge distillation. Here an LLM is finetuned on curated reasoning datasets, which may be generated by a stronger model (making it a form of distillation). For example, a large model’s chain-of-thought outputs can serve as training data to teach a smaller model to reason. This improves the smaller model’s reasoning by imitating the larger model’s thought process. (This differs from classic distillation in that often only final answers or explanations are used, not full logits.) Such methods yield models that inherently produce step-by-step solutions, though the inference-time compute they require scales with the length of those solutions (since longer answers mean more tokens).
All four approaches above aim to produce LLM “Reasoners” that can tackle multi-step problems like math word questions, coding challenges, logic puzzles, etc., by generating intermediate reasoning steps. Notably, approaches 2–4 (RL, RL+SFT, SFT/Distillation) result in models that by design output longer explanations or chains-of-thought, so they implicitly use more inference compute (longer outputs cost more) (The State of LLM Reasoning Models). However, our focus here is on methods that explicitly control or increase inference-time computation beyond just having a longer response . In the next section, we explore the latest inference-time compute scaling methods in detail, organized by specific techniques and papers.
Inference-Time Compute Scaling Methods
Inference-time scaling methods aim to boost reasoning by allocating more computation during the model’s response generation. Intuitively, this is like allowing an AI to use extra “brain power” on demand, much as a person might take more time or scratch paper to solve a hard problem. Techniques range from simple adjustments in decoding to complex multi-step search procedures. Below we review recent advancements (mostly from 2024–2025) in this area, including theoretical innovations and how they are implemented.
1. s1: Simple Test-Time Scaling - Budget Forcing with Wait Tokens
One notable work is s1: Simple test-time scaling (Muennighoff et al., 2025) ( s1: Simple test-time scaling), which sought the simplest possible method to replicate the powerful reasoning seen in OpenAI’s o1 model. The technique they introduce is budget forcing, implemented via a special “Wait” token in the model’s outputs . The idea is straightforward: when the model is about to conclude an answer, it instead appends a “Wait...” prompt to itself, prompting additional reasoning before finalizing the answer. By inserting one or more “Wait” tokens, the model is forced to lengthen its reasoning process or, conversely, the generation can be forcefully stopped early to simulate a constrained “time budget” . This method acts like a knob to control how much the model thinks during inference. Importantly, the authors found that appending “Wait” often makes the model double-check and correct its reasoning, leading to higher accuracy . They created a small high-quality dataset (s1K) of 1,000 reasoning traces and supervised-finetuned a 32B-parameter model on it to respond to “Wait” appropriately . The resulting model s1-32B, equipped with budget forcing, achieved remarkable results: it outperformed OpenAI’s o1-preview by up to 27% on challenging math benchmarks (MATH and AIME24) . Moreover, by increasing the number of “Wait” tokens (i.e. scaling up inference steps), s1-32B’s performance could be extrapolated even beyond its finetuned capability – e.g. raising accuracy on AIME24 from 50% to 57% by allowing extra “thinking” time . In essence, s1 demonstrated that even a relatively small custom dataset and a simple token-based control can yield a reasoning boost rivaling far larger models, just by smartly allocating inference compute.
Implementation strategy: The “Wait” token approach can be implemented by modifying the decoding loop. For example, one can monitor the generated tokens and, if an end-of-answer is detected too early, inject a special token like <WAIT> and continue generation. Below is a simplified code sketch illustrating the concepts of parallel and sequential test-time scaling, assuming a scoring function score_answer (representing a reward model or verification step):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your_reasoning_llm"  # placeholder: substitute a real model checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Question: [some complex problem]? Solve step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

## Parallel inference-time scaling: generate N candidate answers (increased compute via multiple samples)
N = 5
outputs = model.generate(**inputs, do_sample=True, num_return_sequences=N, max_new_tokens=256)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

## Placeholder scoring function: higher is better (could be a learned reward model)
def score_answer(ans):
    return 0.0

best_answer = max(candidates, key=score_answer)

## Sequential inference-time scaling: if the model tries to end early, append "Wait" and continue
context = prompt
response = ""
for step in range(5):  # allow up to 5 "Wait" extensions
    step_inputs = tokenizer(context, return_tensors="pt").to(model.device)
    output = model.generate(**step_inputs, max_new_tokens=50)
    # decode only the newly generated tokens, not the prompt
    new_text = tokenizer.decode(output[0][step_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if "Final Answer:" in new_text and step < 4:
        # The model indicates an end-of-thought: append "Wait" and continue reasoning
        context = context + new_text + " Wait."
        continue
    response = context[len(prompt):] + new_text  # accumulated reasoning plus final chunk
    break

print("Best answer (parallel):", best_answer)
print("Extended reasoning answer (sequential):", response)
In practice, frameworks like Hugging Face Transformers (built on PyTorch) make it easy to generate multiple outputs (num_return_sequences=N) and to manipulate prompts for iterative refinement as shown. The score_answer function could be a separate reward model evaluating each candidate (Reasoning in Granite 3.2 using inference scaling - IBM Research).
2. Test-Time Preference Optimization (TPO)
While most inference scaling methods focus on accuracy, Test-Time Preference Optimization (TPO) (Li et al., 2025) targets alignment: guiding a model’s outputs to better match human preferences at inference time, without any weight updates ( Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback). TPO is an iterative refinement framework: the model generates an initial answer, then a preference model or heuristic provides textual feedback (like a critique or suggestion) on that answer, and the model revises its output accordingly . Crucially, instead of using numerical reward signals or requiring RL training, TPO translates reward model outputs into natural language feedback (e.g. “The response should be more detailed on X, and avoid using Y language”) which the original LLM can understand and act on . By iterating this process (generate → get feedback → regenerate), the LLM “aligns” its response on the fly to the desired style, safety, or other preferences . Empirical evaluations showed that after only a few rounds of TPO, an initially unaligned model (Llama-3.1-70B-SFT) surpassed the performance of its aligned counterpart (Llama-3.1-70B-Instruct) on preference tests . In other words, TPO can take a vanilla model and make it perform like an instruction-tuned model during inference, simply by using feedback loops. It was also found to scale efficiently with the “search width and depth” – meaning more feedback iterations or exploring multiple drafts can further improve outcomes with manageable cost . TPO represents a novel use of inference-time compute for on-the-fly alignment, showing that even without additional training, an LLM can be steered toward preferable outputs through iterative self-correction.
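To make the loop concrete, here is a minimal sketch of a TPO-style refinement cycle. It assumes two placeholder callables, generate and critique, standing in for the LLM call and for a preference/reward model whose judgment has been translated into textual feedback; the paper’s actual prompts and reward models are not reproduced here.

def test_time_preference_optimization(question, generate, critique, max_rounds=3):
    """Sketch of generate -> textual feedback -> revise, with no weight updates."""
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        feedback = critique(answer)          # e.g. "Be more concise and cite sources."
        if feedback is None:                 # the critique judges the answer acceptable
            break
        revision_prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            f"Feedback: {feedback}\n"
            "Rewrite the answer, addressing the feedback:"
        )
        answer = generate(revision_prompt)   # revise on the fly at inference time
    return answer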
3. Thoughts Are All Over the Place - Mitigating Underthinking
A January 2025 study by Wang et al. observed a shortcoming in advanced reasoning models like OpenAI’s o1: a tendency to rapidly jump between different solution paths without fully pursuing any – a phenomenon the authors term “underthinking” ( Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs). Despite o1’s impressive multi-step reasoning, it often didn’t dig deep enough on promising paths, leading to shallow or incorrect answers on tough problems . In Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, they systematically analyze this behavior and introduce a remedy: a decoding strategy with a Thought Switching Penalty, abbreviated TIP . TIP works by detecting when the model’s output is switching to a new line of thought (for example, abandoning a calculation midway to try a different approach) and slightly penalizing such switches in the model’s token probabilities (The State of LLM Reasoning Models). By reducing the likelihood of abruptly changing course, the model is encouraged to stick with a reasoning thread longer and explore it thoroughly before considering alternatives . This simple modification, implemented at inference, led to notable accuracy gains on challenging math datasets . Impressively, TIP required no model retraining or fine-tuning – it is a pure decoding-time intervention. The researchers reported that adding a thought-switch penalty improved correctness across multiple benchmarks, indicating the model was indeed delving deeper into problems and overcoming the “underthinking” issue . In sum, this work identifies that even state-of-the-art reasoners can suffer from superficial reasoning, and that careful control of inference (in this case, biasing the decoding process to favor continued thoughts) can yield more coherent and successful problem solving .
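The paper’s exact penalty schedule is not reproduced here, but the flavor of the idea can be sketched as a decoding-time logits processor that down-weights tokens which typically open a new line of thought. The trigger words, penalty value, and class below are illustrative assumptions, not the authors’ implementation:

from transformers import LogitsProcessor, LogitsProcessorList

class ThoughtSwitchPenalty(LogitsProcessor):
    """Illustrative decoding-time penalty discouraging abrupt switches to a new
    line of thought, in the spirit of TIP (trigger phrases here are assumptions)."""
    def __init__(self, tokenizer, penalty=1.5, triggers=("Alternatively", "Instead")):
        # keep the first token id of each trigger phrase; real systems may track full phrases
        self.trigger_ids = {tokenizer(t, add_special_tokens=False).input_ids[0] for t in triggers}
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        # subtract a fixed penalty from the logits of "switch" tokens
        for tid in self.trigger_ids:
            scores[:, tid] -= self.penalty
        return scores

# usage (assuming `model`, `tokenizer`, `inputs` as in the earlier example):
# outputs = model.generate(**inputs, logits_processor=LogitsProcessorList([ThoughtSwitchPenalty(tokenizer)]))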
4. Trading Inference-Time Compute for Adversarial Robustness
Inference-time reasoning not only improves accuracy, but it can also bolster robustness. An OpenAI research (Zaremba et al., 2025) asked: if an LLM “thinks longer,” does it become harder to trick with adversarial prompts? Their findings suggest yes – scaling up inference-time compute leads to improved resilience against adversarial attacks in many cases ( Trading Inference-Time Compute for Adversarial Robustness). They experimented with reasoning LLMs under various prompt-based attacks and observed that as the models were allowed more reasoning steps (for instance, using chain-of-thought prompting or iterative self-reflection), the success rate of attacks dropped, often approaching zero on many attack types . Notably, this was achieved without any adversarial training or fine-tuning – purely by leveraging the model’s existing reasoning ability and giving it more internal deliberation time . In practical terms, an attack that might derail a quick answer could be thwarted when the model takes multiple steps to verify or justify its answer, effectively catching inconsistencies or malicious twists. There were important exceptions: certain attack strategies (like ones exploiting the model’s policy choices or attempting to trick it into “thinking less” or getting stuck on irrelevant details, dubbed “Nerd Sniping”) could still succeed . Thus, inference scaling isn’t a silver bullet for all adversarial inputs. But overall, the research provides “initial evidence that reasoning models such as o1 become more robust to adversarial attacks as they think for longer.” (Trading inference-time compute for adversarial robustness | OpenAI) In other words, more computation per query can act as a defense mechanism. This insight is influencing safety strategies – rather than solely relying on fine-tuned filters, simply enabling a model’s multi-step reasoning (when a query is suspected to be adversarial or tricky) might make it inherently safer .
5. Chain-of-Associated-Thoughts (CoAT)
Most chain-of-thought methods have the model generate a single linear sequence of reasoning steps. The Chain-of-Associated-Thoughts (CoAT) framework (Pan et al., 2025) instead marries classical search algorithms with the LLM’s generative prowess ( CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning). CoAT introduces an associative memory that the LLM can read from and write to during reasoning, combined with a Monte Carlo Tree Search (MCTS) procedure to explore multiple reasoning paths . Think of it as the model building a search tree of possible “trains of thought,” while continually updating a shared memory of important facts or partial results it has discovered (The State of LLM Reasoning Models) . The associative memory serves as a dynamic knowledge base – as the model considers one path, it can store intermediate insights (“clues”) that might be useful if it backtracks and tries an alternate path, mimicking how humans associate ideas when thinking. MCTS then guides the exploration, balancing depth (following a path deeply) versus breadth (trying different approaches), using the memory to avoid repeating mistakes or forgetting earlier clues . In experiments across various tasks, CoAT significantly improved accuracy, coherence, and diversity of solutions compared to standard single-chain reasoning . By expanding the search space of possible thoughts and allowing the model to dynamically incorporate new information, CoAT achieved more comprehensive reasoning without additional training on that specific process . This showcases how integrating search-based planning algorithms at inference can push LLMs closer to human-like problem solving – recalling relevant knowledge, revisiting earlier steps, and exploring alternatives – all within one coherent framework.
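A heavily simplified sketch of the idea follows: a search over partial reasoning paths that shares an associative memory of intermediate insights across branches. The callables propose_thoughts and evaluate are placeholders for LLM calls, and the code is a best-first approximation rather than the paper’s full MCTS:

import heapq

def coat_style_search(question, propose_thoughts, evaluate, max_expansions=20, beam=3):
    """Best-first search over reasoning steps with a shared associative memory.
    propose_thoughts(question, path, memory) asks the LLM for candidate next steps;
    evaluate(question, path) scores a partial reasoning path (both are placeholders)."""
    memory = []                      # associative memory: clues kept across branches
    frontier = [(0.0, [])]           # (negative score, reasoning path so far)
    best_path, best_score = [], float("-inf")
    for _ in range(max_expansions):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        for thought in propose_thoughts(question, path, memory)[:beam]:
            new_path = path + [thought]
            memory.append(thought)   # store the intermediate insight for other branches
            score = evaluate(question, new_path)
            if score > best_score:
                best_score, best_path = score, new_path
            heapq.heappush(frontier, (-score, new_path))
    return best_path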
6. Step Back to Leap Forward - Self-Backtracking
Inspired by how humans solve problems by occasionally backtracking (going back to reconsider earlier steps when current approach fails), Self-Backtracking methods enable an LLM to undo or revise parts of its reasoning autonomously. In Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of LMs (Yang et al., 2025), researchers implemented a system where the model can mark a point in its reasoning to “step back” to later . During training, the model learned to insert a special token (e.g. ⟂ or the word “backtrack”) when it sensed a reasoning dead-end, and how to resume from that point with an alternate attempt . At inference, a tree-search procedure utilizes this: the model can generate a reasoning path, and if it outputs a backtrack token, the search branches off from the last known good point and tries a different reasoning route . Notably, this approach does not rely on external reward models for evaluating each step (unlike many search-based methods that need a value or reward model to guide them) . The result is a built-in search capability: the LLM effectively learns when and where to abandon a line of thought and explore alternatives. Empirical results were striking – the self-backtracking approach improved reasoning accuracy significantly, in one case noting a >40% performance gain over a baseline that only followed the single best path found by supervised fine-tuning (Paper page - Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models). In essence, giving the model a “self-corrective rewind button” made it much more effective at solving complex tasks, as it could recover from mistakes and try a different way, all during inference. This method does require special training (to teach the model the backtracking token usage), but the heavy lifting of exploring alternatives happens at inference via compute-intensive search. It’s a compelling example of trading more inference compute for higher reliability, without needing an outside judge model.
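The following sketch assumes a model that has been trained to emit a literal backtrack marker (the string used here is a stand-in for the paper’s special token) and a placeholder generate_step function that produces the next reasoning step; it illustrates only the inference-time retry logic:

BACKTRACK = "[BACKTRACK]"   # assumed marker; the paper trains its own special token

def solve_with_backtracking(question, generate_step, max_steps=20, max_retries=3):
    """Extend the reasoning step by step; when the model emits the backtrack marker,
    discard the last step and retry an alternative from the previous good point."""
    path, retries = [], 0
    for _ in range(max_steps):
        step = generate_step(question, path)          # next reasoning step from the LLM
        if BACKTRACK in step:
            if path and retries < max_retries:
                path.pop()                            # step back to the last good point
                retries += 1
                continue
            break                                     # retry budget exhausted, accept path
        path.append(step)
        retries = 0
        if step.strip().startswith("Final Answer"):   # assumed end-of-solution convention
            break
    return path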
7. Scaling Up Test-Time Compute with Latent Reasoning
Most inference scaling methods make the model generate more tokens (longer explanations, multiple answers, etc.). Geiping et al. (2025) propose an alternative: increase computation without increasing output length, by doing more work in the model’s latent space (The State of LLM Reasoning Models). Their approach, Latent Recurrent Depth, introduces a special block within the transformer that can be iterated multiple times internally for a given input . In other words, instead of stacking more transformer layers (which would be training-time scaling), they allow the model to re-use a block of layers repeatedly at inference to deepen its computation on a token representation ( Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach). This effectively turns the model into a recurrent network that can “think” for an arbitrary number of steps per token by looping through the same parameters. Unlike typical chain-of-thought, this latent reasoning doesn’t produce an explicit step-by-step text that humans can read – it’s all internal to the model’s hidden state. The authors note this has some advantages: it requires no special training data (the model is trained normally, aside from architecture changes) and can work even with small context windows, since the iterative reasoning isn’t stored as additional tokens . It can also, in principle, capture types of reasoning not easily expressed in natural language (since the latent state isn’t constrained to words) . They built a 3.5B parameter “Deep Reasoning LM” with this recurrent depth feature and found that by increasing the number of latent iterations at test time, the model’s performance on reasoning benchmarks improved – in some cases dramatically – corresponding to what one would expect from a much larger (e.g. 50B) model in standard setup . Essentially, a smaller model given enough internal compute could rival a bigger model’s reasoning ability . The drawback noted is a lack of interpretability – because the reasoning steps aren’t output as text, we can’t see how it’s solving the problem, which is one benefit of explicit chain-of-thought methods . Nonetheless, this work shows a promising direction: architectural innovation for dynamic-depth transformers, where the model allocates more layers/iterations to hard tokens and fewer to easy ones, achieving better accuracy without always having to output lengthy explanations.
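As a toy illustration of the idea (not the authors’ 3.5B architecture), the snippet below reuses one shared transformer block for a variable number of internal iterations, so more test-time compute can be spent without emitting extra tokens:

import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """Toy latent recurrent depth: one shared block applied n_iters times."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, hidden, n_iters=4):
        for _ in range(n_iters):          # more iterations = more test-time compute
            hidden = self.block(hidden)   # same weights reused on every pass
        return hidden

block = RecurrentDepthBlock()
x = torch.randn(1, 16, 512)               # (batch, sequence, hidden)
shallow = block(x, n_iters=1)
deep = block(x, n_iters=8)                # "thinks" longer in latent space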
8. Can a 1B LLM Surpass a 405B LLM - Compute-Optimal Scaling
A provocative question posed by Liu et al. (2025) was whether a tiny model, armed with the right inference strategy, could beat a giant model that doesn’t use such strategies. Their paper, Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, demonstrates that in some cases the answer is yes ( Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling). They examine how different factors – the policy model (the main LLM generating answers), the process reward model (PRM) used to evaluate or choose among outputs, and the difficulty of the problem – all influence the optimal way to spend a fixed inference compute budget . Through extensive experiments on math benchmarks (MATH-500 and AIME24), they found that with a compute-optimal test-time scaling (TTS) strategy, extremely small models can indeed outperform much larger ones . For example, a carefully orchestrated test-time routine enabled a 1B parameter model to exceed the performance of a 405B model (GPT-4 sized) on a math test . They also showed a 0.5B model beating a fine-tuned GPT-4o, a 3B model surpassing a 405B, and a 7B model even outdoing DeepSeek-R1 – all while using similar or less total compute than those larger models spent generating one answer . How is this possible? The smaller models were paired with efficient search and evaluation procedures at inference: for instance, generating many candidate solutions and using a strong PRM to pick the right one, or dynamically adjusting how many solutions to sample based on problem complexity. Larger models, if they only generate a single answer, can miss the correct solution or make careless errors that a thorough search by a small model could catch. The takeaway is that inference-time algorithms can be as important as model size. By smartly allocating a compute budget – say, deciding whether to do 1 run with a 405B model vs. 100 runs with a 1B model and a voting mechanism – one might achieve better results with the latter in some domains . This research provides a framework to decide how to trade model size for inference computation optimally. It underscores a theme: the era of purely judging models by parameter count is over; we must also consider how they use compute at runtime.
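A minimal sketch of such a compute-optimal routine follows, assuming placeholder callables for the small policy model, the process reward model, and a difficulty estimator (the paper’s actual budget-allocation rules are more sophisticated):

def compute_optimal_answer(question, small_llm, prm_score, estimate_difficulty,
                           budget_easy=4, budget_hard=64):
    """small_llm(prompt) samples one step-by-step solution; prm_score(question, sol)
    is a process reward model score; estimate_difficulty(question) returns a value
    in [0, 1]. All three callables are placeholders for components in the paper."""
    difficulty = estimate_difficulty(question)
    n_samples = budget_easy if difficulty < 0.5 else budget_hard   # spend compute where needed
    candidates = [small_llm(f"{question}\nSolve step by step.") for _ in range(n_samples)]
    return max(candidates, key=lambda sol: prm_score(question, sol))  # PRM picks the winner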
9. Inference-Time Computations for Reasoning and Planning - Benchmark & Insights
Given the flurry of inference-time reasoning methods, Parashar et al. (2025) introduced Sys2Bench, a benchmark to systematically evaluate them ( Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights). In Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, they assess a variety of techniques (chain-of-thought prompting, tree-of-thought search, self-consistency voting, etc.) across eleven diverse tasks covering arithmetic, logic, commonsense reasoning, algorithmic puzzles, and high-level planning . This study provides a broad view of how different methods stack up and, importantly, the trade-offs between compute cost and performance . One key finding is that simply throwing more inference computation at a problem does not guarantee a win across the board . No single technique dominated all tasks – for instance, a tree-search might excel at math proofs but underperform a simpler chain-of-thought on commonsense questions, whereas self-consistency might help for logic puzzles but not for planning tasks . In other words, the effectiveness of inference scaling is context-dependent. They also highlight diminishing returns in some cases: certain tasks saturate in performance after a moderate amount of inference effort, suggesting that beyond a point extra steps are wasted compute . This benchmark serves as a reality check and a guide for practitioners. It encourages focusing on adaptive inference, where the approach is tuned to the task at hand (e.g., use a heavy search only for tasks known to be very hard, otherwise use a cheaper method). The authors conclude that scaling inference-time compute is a powerful tool but not a silver bullet – it should be applied judiciously and often in combination with other improvements . Their work also facilitates future research by providing a common yardstick to measure new inference-time reasoning methods against a variety of challenges.
10. Inner Thinking Transformer (ITT) - Dynamic Depth Allocation
A novel architectural approach to inference scaling is the Inner Thinking Transformer (ITT) by Chen et al. (2025). ITT modifies the standard Transformer architecture to allow dynamic depth per token at inference ( Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking). The motivation is that not every part of the input requires equal “thinking” – some tokens (like numbers in a math problem, or tricky logic phrases) are more challenging and should receive more processing, while others (simple words or known facts) need less. ITT achieves this through three mechanisms : (1) Adaptive Token Routing – tokens deemed “difficult” (detected via signs like large attention or gradient spikes in intermediate layers) are routed through additional layers multiple times, effectively giving them extra compute ; (2) Residual Thinking Connections – analogous to doing several mental passes, the model can refine a token’s representation iteratively by looping it through the same layer and adding the updates; and (3) Thinking Step Encoding – a way to mark which iteration of processing a token is in, so the model can differentiate a token’s first-pass representation from a later refined representation . In practice, ITT allows the model to focus compute where it’s most needed during inference, without expanding the model’s size. In experiments with relatively small models (162M to 466M parameters), ITT was able to reach near the performance of a standard Transformer almost 3× its size, and did so with significantly less training data . For example, a 162M-parameter ITT model achieved 96.5% of the performance of a 466M normal transformer on a suite of reasoning tasks, while using 43% less training data . It also outperformed naive “transformer with loops” baselines on 11 different benchmarks . These results imply that fine-grained, token-level inference scaling (as opposed to whole-sequence scaling) is highly effective – the model essentially learns to spend its “thinking budget” exactly where needed. From a theoretical standpoint, this touches on ideas of conditional computation and algorithmic depth: easy parts of the input get shallow processing, hard parts get deep processing, all within one model. For implementation, such dynamic routing can be done in frameworks like PyTorch by controlling layer execution per token (though it’s non-trivial and often requires custom CUDA kernels for efficiency). ITT’s success opens a path to more compute-efficient reasoning models, where we get the benefits of huge model depth but only use it sparingly when required.
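A toy sketch of the token-level routing idea (not the paper’s learned router or step encoding): every token gets one pass through a layer, and only tokens flagged as difficult are refined with additional passes of the same layer:

import torch
import torch.nn as nn

def adaptive_token_depth(hidden, layer, difficulty_scores, threshold=0.5, extra_passes=2):
    """One base pass for all tokens, then extra passes of the same layer are applied
    only where difficulty_scores exceed the threshold (both are assumed inputs)."""
    hidden = layer(hidden)                                        # base pass for every token
    hard_mask = (difficulty_scores > threshold).unsqueeze(-1)     # (batch, seq, 1)
    for _ in range(extra_passes):
        refined = layer(hidden)
        hidden = torch.where(hard_mask, refined, hidden)          # only hard tokens keep refining
    return hidden

layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
hidden = torch.randn(1, 10, 256)
difficulty = torch.rand(1, 10)            # stand-in for per-token difficulty estimates
out = adaptive_token_depth(hidden, layer, difficulty)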
11. Test-Time Scaling for Code Generation (S*)
Reasoning in code generation often means writing code, running it, and debugging – a process that naturally fits iterative refinement. S* (pronounced “S star”) is a test-time compute scaling framework specifically for code generation tasks (Li et al., 2025). It combines parallel and sequential inference scaling: first, the model generates multiple candidate programs in parallel, then it enters a loop of executing those programs on test cases and having the model fix any errors (sequential refinement) (The State of LLM Reasoning Models). Essentially, S* turns an LLM into a coding competitor that writes some code, tests it, debugs it, and potentially compares solutions. Concretely, S* works in two stages: (1) Generation & Debugging: The model produces, say, 5 different solutions for a given coding problem. Each solution is run against a set of unit tests (included in the prompt or provided as examples). If a solution fails a test (errors or wrong output), the error trace and results are fed back into the model (appended to the prompt) to prompt a correction, generating a new improved version of that solution. This can loop until the solution passes all tests or a time limit is reached. (2) Selection: If multiple solutions pass the public tests, the model then needs to pick the best one to output. Rather than choosing at random, S* uses an adaptive input generation approach: it asks the model to come up with an additional test case that would distinguish between two candidate solutions (i.e., find an input where they might behave differently). It then runs both solutions on that new input and sees if one fails. This is akin to adversarial testing between the solutions. Through a pairwise tournament of candidates with model-generated test cases, S* identifies the most correct solution (or determines they’re equivalent and picks one). This clever selection mechanism reduces the chance of choosing a wrong solution that just happened to pass the limited tests. The results with S* are impressive: it consistently improved code generation accuracy for models of various sizes (S*: Test Time Scaling for Code Generation). For instance, using S*, a 3B parameter code model was able to outperform OpenAI’s GPT-4o-mini on a coding benchmark. It also enabled models that are not specifically trained for reasoning to outperform those that are – e.g., GPT-4o-mini (not a dedicated reasoning model) with S* surpassed o1-preview (a reasoning-tuned model) by 3.7% on the LiveCodeBench challenge. Furthermore, applying S* to one of the strongest reasoning models (DeepSeek-R1-Distill-Qwen-32B) pushed its score to 85.7% on that benchmark, nearly reaching the level of OpenAI’s top code model (o1-high reasoning effort, at 88.5%). These gains underline how tools + inference-time computation can raise the ceiling of performance, even in domains where LLMs are already strong. S* essentially integrates a testing loop into the generation process, highlighting a practical industry use-case: AI coding assistants that not only write code but test and verify it in one go.
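The generation-and-debugging stage can be sketched as follows, with placeholder callables for the code-generating LLM and for a (sandboxed) test runner; the pairwise selection stage is only indicated in a comment:

def s_star_style_codegen(problem, tests, llm_generate, run_code, n_candidates=5, max_fix_rounds=3):
    """Sample candidate programs, iteratively repair them using execution feedback,
    then keep those that pass the given tests. llm_generate(prompt) returns source
    code; run_code(code, tests) returns (passed, error_trace). Both are placeholders;
    real systems must sandbox any code execution."""
    survivors = []
    for _ in range(n_candidates):
        code = llm_generate(f"Write a program for:\n{problem}")
        for _ in range(max_fix_rounds):                   # sequential debugging loop
            passed, trace = run_code(code, tests)
            if passed:
                survivors.append(code)
                break
            code = llm_generate(                          # feed the error trace back to the model
                f"Problem:\n{problem}\n\nBuggy program:\n{code}\n\n"
                f"Execution error / failing tests:\n{trace}\n\nFixed program:"
            )
    return survivors   # stage 2 (pairwise selection with model-generated inputs) would pick among these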
12. Chain-of-Draft (CoD)
While many methods above make LLMs do more (generate more steps, more candidates, etc.), Chain-of-Draft (CoD) takes a different angle: do the same (or more) with less output. Proposed by Xu et al. (2025) and inspired by human note-taking, CoD has the model generate minimalistic intermediate steps instead of verbose ones ( Chain of Draft: Thinking Faster by Writing Less). Traditional chain-of-thought prompting often encourages the model to spell out every detail (“think step by step…”), which, while effective, is very token-intensive. Humans, on the other hand, often jot quick drafts or outlines of reasoning – just enough to not lose the train of thought – before solving a problem. CoD mimics this by prompting the LLM to produce concise “draft thoughts” that capture the essential reasoning, then arrive at the final answer . For example, instead of a 100-token detailed explanation, the model might write a 10-token summary of the key idea, then jump to the answer. The striking result: Chain-of-Draft matched or even surpassed Chain-of-Thought in accuracy while using only ~7.6% of the tokens . That is a 92% reduction in solution length for equal or better performance, across various reasoning tasks . This has huge practical implications – it means far less latency and cost per query (since API costs scale with token count), making “slow thinking” economically viable. Essentially CoD finds a sweet spot between zero reasoning and fully verbose reasoning: the model still does multi-step reasoning, but it internalizes or abbreviates most of it, outputting just a terse representation of the process. The challenge is ensuring the model doesn’t omit critical details that affect the answer. The authors addressed this through prompt engineering and possibly some finetuning so that the model’s drafts remain informative enough. CoD can be seen as an efficiency-oriented inference-time technique, trading verbosity for conciseness. In a way, it “compresses” the chain-of-thought. The fact it can maintain accuracy suggests the extra words in a normal chain-of-thought aren’t always necessary – the model can keep track of details internally. For deployment, a CoD approach could be toggled as a “fast reasoning mode” that yields cheaper but still accurate results, an attractive option for industry applications where cost is a factor (Less is more: How 'chain of draft' could cut AI costs by 90% while ...).
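The difference is mostly in the prompt. The snippet below contrasts an illustrative chain-of-thought instruction with a chain-of-draft one; the wording is an assumption rather than the paper’s exact prompt:

question = "A shop sells pens at $3 each. How much do 7 pens cost after a $2 discount?"

cot_prompt = (
    f"{question}\n"
    "Think step by step and explain your reasoning in full detail before answering."
)

# Chain-of-Draft: ask for terse drafts instead of verbose steps (illustrative wording).
cod_prompt = (
    f"{question}\n"
    "Think step by step, but write only a minimal draft of at most five words per step. "
    "Then give the final answer after '####'."
)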
Industry Applications and Framework Support
The rapid progress in inference-time reasoning techniques has already made its way into industry and large-scale deployments. AI providers are keen to offer reasoning-as-a-feature in their models, often giving users control over how much inference compute to use (“fast mode” vs. “deep reasoning mode”). For example, Anthropic’s latest Claude and other commercial models introduced tunable reasoning modes – Claude 3.7 Sonnet and Grok 3 now have a “thinking mode” toggle that, when enabled, engages more thorough inference-time reasoning for better answers (The State of LLM Reasoning Models). If the user doesn’t need elaborate reasoning (and wants a quick response), they can disable it, saving costs. OpenAI’s approach was to offer separate models – the o1 reasoning model alongside standard GPT-4-class models – though future releases aim to unify this. Even IBM’s Granite series, an enterprise LLM, added an explicit “reasoning” toggle in version 3.2, which internally activates an inference-scaling pipeline. This trend, dubbed “thinking on demand,” shows that reasoning is becoming an optional service that can be turned on when needed.
Several industry case studies highlight the benefits. IBM Research reported that by applying inference scaling techniques (specifically a combination of an LLM, a process reward model, and a search algorithm), their 8B-parameter Granite-3.2 model saw “upwards of 20 point” jumps on code and math reasoning benchmarks (Reasoning in Granite 3.2 using inference scaling - IBM Research) . This boost allowed Granite-3.2 (8B) to exceed the performance of larger proprietary models like GPT-4o-0513 and Claude-3.5 on those tasks . Essentially, IBM leveraged a Tree-of-Thought style search guided by a reward model (what they call a PRM) to enhance Granite’s reasoning. They describe that “you can enable reasoning using inference scaling by combining three ingredients: an LLM, a PRM, and a search algorithm to explore possible reasoning paths” – which is exactly the kind of setup many of the research papers above use. IBM’s integration of this into a product suggests that even smaller models can be turned into powerful reasoners with the right inference-time recipe, saving the need to train gigantic models from scratch.
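That three-ingredient recipe can be sketched as a step-level beam search guided by a PRM; propose_steps and prm_score below are placeholders for the generator and the reward model, and the details of IBM’s actual search are not reproduced:

def prm_guided_beam_search(question, propose_steps, prm_score, beam_width=4, max_depth=8):
    """At each depth, the LLM proposes continuations of each partial solution and a
    process reward model keeps only the most promising beams."""
    beams = [[]]                                          # partial reasoning paths
    for _ in range(max_depth):
        expanded = [path + [step] for path in beams for step in propose_steps(question, path)]
        if not expanded:
            break
        expanded.sort(key=lambda path: prm_score(question, path), reverse=True)
        beams = expanded[:beam_width]                     # prune to the best beams
    return beams[0]                                       # highest-scoring reasoning path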
On the engineering side, mainstream AI frameworks have begun supporting these advanced inference workflows. PyTorch and TensorFlow (often via high-level libraries like Hugging Face Transformers) provide features to facilitate multi-step generation. For instance, Hugging Face’s generate API allows beam search, sampling multiple outputs, and temperature control, which are building blocks of methods like self-consistency and tree search. Developers can also utilize callbacks or custom decoding loops to implement iterative refinement (as we illustrated with pseudocode earlier). PyTorch’s dynamic computation graph is particularly handy for methods like ITT or latent loops, where the model’s forward pass can include conditional logic (e.g., routing certain tokens through layers multiple times). On the Google side, TensorFlow seq2seq models can also be coerced into multi-round generation with tf.while_loop constructs, albeit with less flexibility than PyTorch. Industry toolkits are emerging: for example, NVIDIA NeMo and Triton Inference Server allow deployment of models with controlled decoding strategies and even include plugins for beam search and ensemble voting over multiple outputs. OpenAI’s own inference API, while not exposing internals, likely uses such techniques under the hood for their “instruct” vs “reasoning” models.
Hardware providers are optimizing for inference-time scaling as well. NVIDIA’s blog on DeepSeek-R1 highlights that enabling real-time chain-of-thought for a 671B parameter MoE model required massive throughput – and their upcoming Blackwell GPU architecture is explicitly tuned for this, offering up to 20 petaflops of FP4 compute and large NVLink domains to handle extensive token-parallel and expert-parallel inference (DeepSeek-R1 Now Live With NVIDIA NIM | NVIDIA Blog). This indicates that hardware and software advances are going hand-in-hand: as researchers push more complex inference computations, industry is responding with systems to support them in production.
To summarize, these reasoning-optimized inference techniques are not just academic curiosities – they are being adopted in real-world AI systems. From an engineering perspective, one must weigh the latency and cost (which we discuss next), but the payoff is improved model capability without waiting for a new model training run. As a result, many AI products in 2025 allow users to dial up inference effort when they need higher quality reasoning, effectively offering “compute-as-currency” to buy better answers on demand.
Trade-offs, Cost Considerations, and Emerging Trends
Every silver lining has a cloud: inference-time scaling, for all its benefits, comes with significant cost and complexity trade-offs. The most immediate cost is computational. Using these methods means more FLOPs per query – generating 100 samples or a 1,000-token reasoning chain can be orders of magnitude slower and more expensive than a single 1-shot answer. For companies deploying LLMs at scale, this raises infrastructure costs. It’s no coincidence that OpenAI’s o1 (reasoning model) was more expensive to use than a standard model, or that not every user query is run with maximum reasoning. Some tasks don’t need it – a simple factual question would waste cycles if we let the model “ponder” unnecessarily. A key emerging best practice is adaptive reasoning: use inference scaling selectively. Systems can be designed to detect query difficulty and only invoke heavy reasoning when likely beneficial ( Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights). Another approach is multi-tiered service: e.g., a chatbot might first try a fast shallow answer, and only if that fails or the user insists, escalate to a more intensive reasoning mode.
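A minimal sketch of such a tiered policy, with placeholder callables for the cheap model, the reasoning pipeline, and a confidence estimate:

def answer_with_escalation(query, fast_llm, reasoning_llm, confidence, threshold=0.8):
    """Try the cheap model first; escalate to the expensive reasoning pipeline only
    when a confidence estimate on the draft answer is low."""
    draft = fast_llm(query)
    if confidence(query, draft) >= threshold:
        return draft                                    # the cheap answer is good enough
    return reasoning_llm(query)                         # spend the extra inference compute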
The compute/latency vs. accuracy trade-off can also be mitigated by methods like Chain-of-Draft (CoD) which focus on efficiency, or by distilling the benefits of inference-time reasoning into faster models. Interestingly, some works have looked at distilling inference-time behaviors: e.g., train a model to produce the final answer directly that matches the accuracy of a model using chain-of-thought with voting. This crosses the boundary between train-time and test-time improvements – effectively using inference-time reasoning as a teacher to create a more efficient student model.
From a cost standpoint, we should note “budget forcing” as a concept extends beyond the s1 paper. Many providers are exploring giving users explicit control over the “reasoning budget” – akin to a slider for how many thoughts or how long the model should think. If a user is willing to pay more or wait longer for a highly reliable answer (say for a complex medical or legal question), they can choose a higher budget. If they just need a quick guess, they use a lower budget. This user-driven trade-off is likely to become standard in AI services (somewhat like image rendering quality vs. speed settings).
Another trade-off is complexity and reliability. More moving parts (like combining an LLM with a separate reward model and a search algorithm) means more things that can go wrong – e.g., the search might get stuck in a loop, or the reward model might be misaligned with true correctness, leading the system astray. Ensuring robust performance across all these new pipelines is an active engineering challenge. The benchmark by Parashar et al. ( Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights) highlighted that each method has scenarios where it fails; an ideal system might dynamically choose between methods or combine them. We see early signs of this: some research combines multiple techniques (e.g., using “Wait” tokens and self-consistency voting together).
Emerging trends include the aforementioned thinking on demand, where reasoning is optional. In the long term, we may not distinguish “reasoning LLMs” as a separate category – much like how instruction-tuning became ubiquitous, the expectation is that all strong LLMs will have a reasoning ability and flexibly use it (The State of LLM Reasoning Models). OpenAI’s CEO hinted that future models might automatically adapt their inference compute internally, rather than requiring users to pick a reasoning versus non-reasoning model . This points toward a future of dynamic inference: models that internally decide, token by token or question by question, how much thought to put in. Techniques like ITT and latent depth are steps in this direction, giving models a built-in way to allocate resources.
Another trend is integration with external tools and knowledge bases during inference (beyond the scope of this review). Some methods allow the model to call external calculators, search engines, or databases as part of its reasoning. This can be seen as another form of inference-time augmentation, orthogonal to the ones discussed, but often complementary (e.g., a model might do a chain-of-thought, realize it needs a factual lookup, call an API, then continue reasoning).
In terms of theoretical advancements, the field is maturing in understanding the scaling laws of inference akin to scaling laws of model size. The “1B vs 405B” study showed how performance scales with more compute at test-time in a non-linear way ( Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling). We’re likely to see more formal analysis of where the sweet spots are – how many samples or steps are worth it given a task’s entropy or difficulty. There’s also growing interest in profile-guided inference: profiling a model’s behavior (like where it’s uncertain or what types of mistakes it makes) to decide an inference strategy. For example, if a model is very unsure between two answers, one might invoke a deeper chain-of-thought or a comparison step to resolve that uncertainty (somewhat like S* does with generating extra tests (The State of LLM Reasoning Models)).
In summary, inference-time compute scaling is a powerful lever now firmly in the practitioner’s toolbox. It enables smaller models and new models to punch above their weight by using clever algorithms at runtime. The trade-off is increased compute cost and system complexity, but techniques like Chain-of-Draft and dynamic depth are showing ways to keep those costs in check. Industry adoption confirms that the benefits often outweigh the costs, especially as hardware and software continue to optimize for these patterns. As research and practice continue to inform each other, we can expect reasoning-optimized LLMs to become more efficient, more autonomous in deciding how to reason, and ultimately standard in AI systems – fulfilling the promise that giving an AI more “time to think” makes it smarter and safer, just as it often does for humans.
Sources:
Muennighoff et al., “s1: Simple test-time scaling.” arXiv preprint (2025) – Introduces “Wait” token budget forcing ( s1: Simple test-time scaling).
Li et al., “Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback.” arXiv (2025) – Proposes TPO for inference alignment ( Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback).
Wang et al., “Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs.” arXiv (2025) – Identifies underthinking and introduces TIP penalty ( Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs).
Zaremba et al., “Trading Inference-Time Compute for Adversarial Robustness.” arXiv / OpenAI (2025) – Finds more inference steps improve robustness ( Trading Inference-Time Compute for Adversarial Robustness).
Pan et al., “CoAT: Chain-of-Associated-Thoughts Framework for Enhancing LLM Reasoning.” arXiv (2025) – Combines MCTS with associative memory ( CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning).
Yang et al., “Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of LMs.” arXiv (2025) – Self-backtracking strategy with ~40% performance gain (Paper page - Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models).
Geiping et al., “Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.” arXiv (2025) – Uses latent loop to improve a 3.5B model to 50B-equivalent performance ( Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach).
Liu et al., “Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling.” arXiv (2025) – Small models beat large ones with optimal TTS ( Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling).
Parashar et al., “Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights.” arXiv (2025) – Benchmarks trade-offs; no one method wins all ( Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights).
Chen et al., “Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking.” arXiv (2025) – Dynamic depth per token (ITT) nearly matches a model 3× size ( Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking).
Li et al., “S*: Test Time Scaling for Code Generation.” arXiv (2025) – Iterative code generation + testing; 3B model with S* beats GPT-4o-mini (S*: Test Time Scaling for Code Generation).
Xu et al., “Chain of Draft: Thinking Faster by Writing Less.” arXiv (2025) – Achieves CoT-level accuracy with ~7.6% tokens (92% fewer) ( Chain of Draft: Thinking Faster by Writing Less).
Sebastian Raschka, “The State of LLM Reasoning Models (Part 1: Inference-Time Scaling)” (2025) – Overview of these methods and industry trends (The State of LLM Reasoning Models) .
IBM Research Blog, “Reasoning in Granite 3.2 using inference scaling” (2025) – Reports 20+ point boosts via inference scaling, 8B model exceeding GPT-4o (Reasoning in Granite 3.2 using inference scaling - IBM Research) .
NVIDIA Blog, “DeepSeek-R1 – a Perfect Example of Test-Time Scaling” (2025) – Describes deploying a 671B MoE with high inference compute, and hardware (Blackwell) optimized for test-time scaling (DeepSeek-R1 Now Live With NVIDIA NIM | NVIDIA Blog).