Table of Contents
Few-Shot Prompting
Chain-of-Thought (CoT) Prompting
Self-Consistency Prompting
Chain-of-Thought Variations and Extensions
Tree-of-Thought (ToT) Prompting
Self-Consistency in Trees and Multi-Agent Reasoning
Graph-Based Prompting Strategies
Chunking for Long Documents and Its Impact on Reasoning
Comparative Effectiveness on Reasoning Benchmarks
Key Takeaways
References and Latest Research
Large Language Models (LLMs) can tackle complex reasoning tasks when guided by effective prompting strategies. Recent research (2024–2025) on arXiv has proposed various prompt engineering techniques to improve logical, mathematical, and multi-step reasoning. Below, we review key methods – few-shot prompting, chain-of-thought (CoT), tree-of-thought (ToT), self-consistency, graph-based prompts, and document chunking – providing a technical summary of each and comparing their effectiveness on benchmarks.
Few-Shot Prompting
Few-shot prompting supplies an LLM with a handful of exemplars (input-output pairs) in the prompt to demonstrate the task. This in-context learning allows the model to infer the pattern and improves performance on tasks it wasn’t explicitly trained for (Plaat et al., 2024). For reasoning tasks, carefully chosen examples (especially ones that include step-by-step solutions) can prime the model to produce logical reasoning in its answers. Prompts that contain a few examples are “few-shot” prompts, whereas those with only an instruction are zero-shot. Few-shot examples essentially show the model how to reason through a problem.
Technical breakdown: Constructing an effective few-shot prompt involves selecting exemplars that are similar in style or structure to the target query. Research has shown that matching the reasoning style of the examples to the model’s preferences can help: Aligned CoT prompting (arXiv:2311.13538, 2024) aligns the style of few-shot rationales with the model’s native generation style to boost reasoning performance. There are also methods to automate exemplar generation and selection. AutoReason (Sevinc & Gumus, 2024) proposes automatically generating step-by-step rationale examples for a given query by decomposing the query into sub-questions and solving those. This removes the need for hand-crafted exemplars and adapts the prompt to each query, improving multi-step reasoning especially for smaller or less capable LLMs. Overall, few-shot prompting capitalizes on LLMs’ emergent ability to learn from context, and it often substantially outperforms zero-shot prompting on reasoning benchmarks.
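To make this concrete, here is a minimal sketch of assembling a few-shot reasoning prompt in Python. The exemplars and the `build_few_shot_prompt` helper are illustrative assumptions, not any specific paper’s implementation; the resulting string would be passed to whatever completion call you use:

```python
# Minimal sketch of a few-shot reasoning prompt. Each exemplar includes a
# worked, step-by-step solution so the model imitates that style.
EXEMPLARS = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5."),
    ("A train travels 60 km per hour. How far does it go in 3 hours?",
     "It covers 60 km each hour, so 60 * 3 = 180 km. The answer is 180."),
]

def build_few_shot_prompt(question: str) -> str:
    """Concatenate worked exemplars before the target question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in EXEMPLARS]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

print(build_few_shot_prompt("Sara reads 12 pages a day. How many pages in 4 days?"))
```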
Effectiveness: Few-shot prompting was a breakthrough that enabled complex reasoning in LLMs, achieving “breakthrough performance” on many tasks compared to zero-shot (Plaat et al., 2024). By providing exemplars, models can more than double their accuracy on math word problems and logic puzzles versus answering directly. Recent work also shows that automated prompt optimization can yield further gains; for example, the automatically generated rationales in AutoReason improved accuracy on the StrategyQA and HotpotQA question-answering tasks relative to standard few-shot prompts. In summary, a well-crafted few-shot prompt is a simple yet powerful way to elicit reasoning, and ongoing research is making the process of finding good examples more systematic.
Chain-of-Thought (CoT) Prompting
Chain-of-Thought prompting instructs the LLM to produce a step-by-step explanation or reasoning chain before giving the final answer. By explicitly generating intermediate reasoning steps in natural language, the model “shows its work,” which often leads to better correctness on tasks involving arithmetic, logic, or multi-hop reasoning. A classic example is adding an instruction like “Let’s think step by step” or providing a worked-out example solution, which cues the model to break the problem into smaller steps. CoT prompting effectively moves the model from fast “System 1” responses to more deliberate “System 2” reasoning.
Technical breakdown: In practice, CoT can be used in few-shot mode (providing exemplars with reasoning) or even zero-shot. Kojima et al. (2022) famously found that appending “Let’s think step by step.” to a question prompts certain LLMs to do multi-step reasoning without any examples (Plaat et al., 2024). The CoT approach was pioneering – it showed that even large models not explicitly trained for reasoning can follow a chain-of-thought prompt and achieve much better accuracy on complex questions. For example, the original CoT paper reported a “surprising jump in performance” on arithmetic and commonsense reasoning tasks simply from eliciting multi-step solutions. Technically, CoT works by guiding the model through each inference needed: the prompt or examples demonstrate a reasoning process, and the model then continues that process for the new query. Because the model’s output includes the reasoning steps, one can also inspect or verify the logic (though it may still make errors in those steps).
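As a rough sketch, the two-stage zero-shot CoT recipe (first elicit the chain, then extract the answer) looks like the following; `llm` stands in for a generic text-completion callable and is an assumption, not a real library API:

```python
# Sketch of zero-shot chain-of-thought (after Kojima et al., 2022).
# `llm` is an assumed text-completion callable: prompt in, text out.
def zero_shot_cot(question: str, llm):
    # Stage 1: the trigger phrase elicits a step-by-step reasoning chain.
    chain = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: ask for the final answer conditioned on the generated chain.
    answer = llm(
        f"Q: {question}\nA: Let's think step by step. {chain}\n"
        f"Therefore, the final answer is:"
    )
    return chain, answer
```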
Effectiveness: Chain-of-thought prompting has proven to dramatically improve accuracy on many benchmarks. On grade-school math problems, for instance, CoT prompting turned out to be a “breakthrough” – models that got only trivial accuracy when forced to answer directly could reach decent success with CoT. Empirical results across benchmarks like GSM8K (math word problems), ASDiv, and others showed large gains. In one survey, CoT prompt solutions achieved 71.3% on a math word-problem dataset that the model struggled with under direct prompting. CoT has since become a baseline technique for eliciting reasoning in LLMs, upon which many other methods build. Its strength lies in enabling multi-step deduction within a single forward pass of the model.
Self-Consistency Prompting
Self-consistency is an augmentation to chain-of-thought prompting that addresses the variability of a single reasoning path (Plaat et al., 2024). The idea is to sample multiple distinct reasoning chains for the same question (e.g. by using random decoding or slightly varied prompts) and then let the most consistent answer across those chains be the final output. In essence, instead of trusting one chain-of-thought, the model generates an ensemble of reasoning paths and answers, and we pick the answer that appears most frequently or is agreed upon by the majority of reasoning samples.
Technical breakdown: Self-consistency was introduced by Wang et al. (2022) as a “straightforward ensemble approach” to improve CoT. Implementing it involves prompting the LLM to produce, say, 5–10 independent chains of thought (using a higher sampling temperature to induce diversity). Each chain ends in an answer, and a simple vote is taken: the answer that appears most often is chosen as the model’s answer. The assumption is that incorrect reasoning paths will often diverge to different wrong answers, whereas correct reasoning (if the model can find it) is likely to lead to the same correct answer each time. This method turns the stochastic nature of LLM outputs to our advantage. Notably, self-consistency does not require any additional training – it is an inference-time prompting technique that trades more computation for higher accuracy.
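The voting loop itself is simple. In this sketch, `llm` is an assumed completion callable that accepts a sampling `temperature`, and `extract_answer` is a hypothetical parser that pulls the final answer out of a chain:

```python
from collections import Counter

# Self-consistency sketch (after Wang et al., 2022): sample several diverse
# chains of thought at nonzero temperature, then majority-vote the answers.
def self_consistent_answer(question, llm, extract_answer, n_samples=10):
    answers = []
    for _ in range(n_samples):
        # A higher temperature induces diverse reasoning paths.
        chain = llm(f"Q: {question}\nA: Let's think step by step.",
                    temperature=0.7)
        answers.append(extract_answer(chain))  # e.g. text after "The answer is"
    # Wrong paths tend to scatter across different answers, while correct
    # ones tend to converge, so the most common answer wins the vote.
    return Counter(answers).most_common(1)[0][0]
```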
Effectiveness: Self-consistency has been shown to significantly boost reasoning accuracy on top of CoT prompting. Empirical evaluations found it can improve CoT performance by 10–20 percentage points on arithmetic, commonsense, and symbolic reasoning tasks (Plaat et al., 2024). For example, if a chain-of-thought prompt alone got 50% accuracy on a set of math problems, self-consistency (an ensemble of 20 reasoning samples) might raise that to around 60–70% by filtering out inconsistent answers. The approach has become a common baseline in recent papers – many new prompting strategies are combined with self-consistency to further ensure robust results. The success of self-consistency highlights that LLMs may produce faulty reasoning in one pass, but given multiple tries they often “cluster” around the correct reasoning, which we can exploit.
Chain-of-Thought Variations and Extensions
Building on the linear CoT idea, researchers have proposed structured prompting approaches like Tree-of-Thought (ToT) prompting and other search-based strategies. These aim to let the model explore multiple reasoning paths in a tree structure instead of a single chain, enabling backtracking and the consideration of alternative solutions. We group Tree-of-Thought and related multi-path techniques here.
Tree-of-Thought (ToT) Prompting
Tree-of-Thought prompting (Yao et al., 2023) generalizes CoT by allowing the model to branch out into different possible steps at intermediate points, forming a tree of reasoning possibilities (Plaat et al., 2024). Rather than committing to one narrative of reasoning, the model can explore various approaches or partial solutions, and a search algorithm (DFS, BFS, or beam search) is used to find a promising path to the correct answer. This approach is inspired by how one might manually try different ideas to solve a puzzle.
Technical breakdown: In a ToT prompt, at certain junctures the model is prompted to generate multiple continuations (thoughts) instead of one. For example, after reading a question, the model might propose two or three different next steps or hypotheses. Each of those is then expanded in subsequent prompts, and so on, creating a search tree of thoughts. An external controller evaluates the branches – possibly using the LLM itself as a judge of partial progress – and decides which branch(es) to explore further. The process continues until a solution is found or a search depth limit is reached. Key components include a state evaluator (to judge whether a partial solution seems promising or a branch should stop) and a strategy to backtrack if a line of reasoning hits a dead end. Essentially, ToT algorithms turn one-shot reasoning into an iterative decision-making procedure, where the LLM at each step can reconsider and try alternatives.
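The controller logic can be sketched as a beam search over partial thought sequences. Here `propose`, `score`, and `is_solution` are assumed LLM-backed helpers (the proposer, the state evaluator, and the goal test), so this is a schematic of the search loop rather than the original implementation:

```python
# Tree-of-Thought sketch with beam search. propose() asks the model for a
# few candidate next thoughts, score() rates a partial path's promise
# (the state evaluator), is_solution() checks for a complete answer.
def tree_of_thought(question, propose, score, is_solution,
                    beam_width=3, max_depth=4):
    frontier = [[]]  # each state is the list of thoughts taken so far
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            for thought in propose(question, state):  # branch out
                candidates.append(state + [thought])
        if not candidates:
            break
        # Keep only the most promising branches; pruning a branch here is
        # the search-level equivalent of backtracking from a dead end.
        frontier = sorted(candidates, key=lambda s: score(question, s),
                          reverse=True)[:beam_width]
        for state in frontier:
            if is_solution(question, state):
                return state
    return None  # no solution found within the depth limit
```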
Effectiveness: Tree-of-Thought prompting has shown the ability to solve problems that stump linear CoT. By enabling backtracking and exploration, ToT achieved much higher success rates on certain puzzle-like tasks. For instance, on a crossword puzzle benchmark, standard chain-of-thought had under 16% success, while a Tree-of-Thought strategy reached a 60% success rate, solving nearly 4× as many puzzles. This dramatic improvement comes because CoT (and straightforward few-shot prompting) cannot recover from an incorrect assumption made midway, whereas ToT can try a different path if one line of reasoning fails. Recent papers have continued to refine ToT. Forest-of-Thought (FoT), proposed in late 2024, extends this idea by exploring multiple trees in parallel and using consensus among them – effectively a “forest” of reasoning – along with dynamic self-correction (Bi et al., 2024). FoT uses sparse activation to focus on relevant paths and a consensus mechanism to decide the answer, yielding further accuracy gains on complex logical problems. In summary, ToT-based prompting strategies enable systematic exploration of the solution space, which is especially useful for tasks like planning, puzzles, and games that benefit from trial-and-error reasoning.
Self-Consistency in Trees and Multi-Agent Reasoning
Combining self-consistency with tree search is a next step in 2024 research. One example is the multi-agent Tree-of-Thought (validator) approach by Haji et al. (2024), which assigns multiple “reasoner” agents to explore reasoning trees in parallel and then uses a separate validator agent to verify the correctness of each reasoning path. Only validated reasoning paths are allowed to vote on the final answer. This addresses the issue that ToT may generate many branches, some logically flawed – the validator prunes those. In experiments on the GSM8K math word problem benchmark, multi-agent ToT with validation improved accuracy by 5.6% on average across four different LLMs compared to standard Tree-of-Thought prompting. This shows that injecting a form of self-consistency or agreement check into tree-based reasoning further boosts reliability. In general, the trend is to use more compute and structure at inference time (e.g. multiple agents, multiple reasoning paths) to push LLM reasoning closer to correctness.
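The division of labor can be pictured as below. `run_reasoner` and `validate` are hypothetical LLM-backed helpers, so this is only a sketch of the reasoner/validator pattern, not the paper’s implementation:

```python
from collections import Counter

# Sketch of multi-agent ToT with a validator (after Haji et al., 2024):
# several reasoner agents each explore a reasoning tree and return one
# (path, answer) pair; a validator agent rejects logically flawed paths,
# and only the surviving answers vote on the final result.
def validated_tot_answer(question, run_reasoner, validate, n_agents=4):
    valid_answers = []
    for _ in range(n_agents):
        path, answer = run_reasoner(question)  # one ToT exploration
        if validate(question, path):           # prune flawed reasoning
            valid_answers.append(answer)
    if not valid_answers:
        return None  # every path was rejected; caller might retry
    return Counter(valid_answers).most_common(1)[0][0]
```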
Graph-Based Prompting Strategies
Graph-based prompting techniques integrate structured knowledge (graphs or networks of information) into the LLM’s reasoning process. One prominent example is Graph Chain-of-Thought (Graph-CoT) by Jin et al. (2024), which augments the CoT approach with an external knowledge graph. The intuition is that for knowledge-intensive reasoning (e.g. academic question answering, where concepts are linked via a citation graph or other relations), having the model walk a graph can improve factual accuracy and multi-hop reasoning.
Technical breakdown: Graph-CoT operates in an iterative loop between the LLM and a graph database. Each iteration consists of: (1) LLM reasoning – the model generates the next step in natural language, which often includes a query or reference to graph nodes; (2) LLM-graph interaction – the model’s request is translated into concrete graph operations (for instance, looking up a connected node or an attribute in the knowledge graph); (3) graph execution – those operations run on the graph, and the retrieved information is fed back into the LLM’s context for the next round. In this way, the prompt is dynamically updated with knowledge from the graph as the chain-of-thought unfolds. The model is encouraged to use the graph context (e.g. “According to the graph, X is connected to Y, so…”) in its reasoning steps. This strategy effectively combines symbolic structured reasoning with the LLM’s natural-language reasoning.
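A schematic of that loop is below. The `Step` record, the `plan_step` wrapper around the LLM, and `graph.lookup` are illustrative names assumed here; the paper defines its own graph-access functions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    thought: str                        # the model's reasoning text
    query: Optional[str] = None         # graph lookup it wants, if any
    final_answer: Optional[str] = None  # set once the model is done

# Graph-CoT-style loop (after Jin et al., 2024): alternate between LLM
# reasoning and retrieving linked facts from an external graph.
def graph_cot(question, plan_step, graph, max_iters=5):
    context = ""
    for _ in range(max_iters):
        step = plan_step(question, context)  # (1) LLM reasoning step
        if step.final_answer is not None:
            return step.final_answer
        facts = graph.lookup(step.query)     # (2)-(3) run the graph query
        # Feed retrieved facts back so the next step can build on them.
        context += f"\nThought: {step.thought}\nGraph says: {facts}"
    return None  # iteration budget exhausted without a final answer
```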
Effectiveness: Incorporating graph-based reasoning has been shown to reduce hallucinations and improve accuracy on tasks that require piecing together information across connected sources (Jin et al., 2024). Jin et al. introduced a benchmark called GRBench (Graph Reasoning Benchmark) with questions answerable via graph knowledge, and found that Graph-CoT outperformed standard text-only CoT baselines on all tested domains. By iteratively consulting a graph, the LLM can maintain better consistency on multi-hop queries (for example, academic questions where you might need to follow citations or a family tree of concepts). Graph-based prompting is an emerging area, and beyond Graph-CoT there are related ideas like converting text reasoning into an internal graph of facts. The key takeaway is that explicit graph structures can guide LLM reasoning in cases where relational or networked knowledge is crucial, complementing the pure text chain-of-thought.
Chunking for Long Documents and Its Impact on Reasoning
When dealing with very long texts or document collections (e.g. during document digitization or question-answering over long reports), prompt engineers use chunking methods – splitting the text into smaller pieces – to fit within LLM context windows. How this chunking is done can greatly affect the model’s reasoning ability on the overall document. If done naively (e.g. chopping every N tokens arbitrarily), the model might lose important cross-chunk context, leading to reasoning failures. Recent research in 2024 has examined chunking strategies that preserve logical and structural context, improving reasoning over long inputs.
Technical breakdown: A typical pipeline for long-document QA is Retrieval-Augmented Generation (RAG): the document is broken into chunks, a retrieval step finds relevant chunks for a query, and the LLM reasons over those. The question is how to chunk optimally. One study by Yepes et al. (2024) proposes chunking based on document structure elements (headings, sections, tables, etc.) rather than uniform blocks. By leveraging a document-understanding model to segment a financial report into logical components, they retain more coherent context in each chunk. This leads to more accurate answers because each chunk is a semantically complete piece of information. They showed that such element-type-based chunking yielded better QA performance than plain paragraph- or token-based chunks. Another consideration is chunk overlap and order: sometimes overlapping chunks or hierarchical summarization (summarize each chunk, then summarize the summaries) is used so the model can reason about relationships between parts. In essence, chunking is a form of prompt engineering for long inputs – it decides which pieces of the text the model sees, and in what form.
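As a toy illustration of structure-aware chunking (the heading-based splitting rule below is an assumption for illustration; Yepes et al. use a document-understanding model to detect element types rather than a regex):

```python
import re

# Structure-aware chunking sketch: split before headings so each chunk is
# a coherent section, then pack whole sections into size-bounded chunks
# instead of cutting blindly every N tokens.
def chunk_by_sections(document: str, max_chars: int = 4000) -> list[str]:
    sections = re.split(r"\n(?=#+ )", document)  # split before "#"-style headings
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)  # flush before mixing unrelated sections
            current = ""
        current += section + "\n"
    if current:
        chunks.append(current)
    return chunks
```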
Effectiveness: Proper chunking can dramatically improve an LLM’s ability to reason over large documents. In the financial-domain experiment, structuring chunks by content type led to significantly higher accuracy in retrieval and QA compared to naive equal-sized chunks. The model was better at maintaining factual consistency and combining information spread across the document. Generally, chunking strategies that align with logical document sections help the LLM preserve context, whereas poor chunking can cause the model to miss connections (for example, a cause in one chunk and an effect in another that are never considered together). For complex multi-hop questions on long texts, the model may need to see two or more chunks at once – motivating methods like sliding windows or iterative prompts that gather information from different chunks. In summary, chunking is crucial for extending LLM reasoning to long inputs: intelligent chunking minimizes context loss and thereby improves the quality of multi-step reasoning across documents.
Comparative Effectiveness on Reasoning Benchmarks
Each prompting method above has strengths in different scenarios, and researchers have benchmarked them on various tasks (math word problems, logic puzzles, commonsense QA, etc.) to quantify their gains:
Few-shot vs. Zero-shot: Providing a few exemplars nearly always improves reasoning performance over zero-shot. Large models like GPT-3 showed emergent in-context learning abilities, where a few demonstrations yield “breakthrough performance” on translation, QA, and reasoning tasks (Plaat et al., 2024). However, cleverly chosen prompts can sometimes narrow the gap – e.g. the zero-shot CoT instruction “think step by step” enabled models to rival few-shot results on some benchmarks. In practice, few-shot CoT is the standard for challenging reasoning tasks.
Chain-of-Thought Prompting: CoT has been a game-changer for tasks like arithmetic and logical questions. For example, Wei et al. (2022) reported that CoT prompting boosted a model’s accuracy on GSM8K math problems from near chance to 40%+, and follow-up tuning pushed it higher. CoT’s impact is evident in benchmarks: one survey notes it turned a model’s accuracy on several math word problem datasets from 10–30% (pre-CoT) to as high as 70–90% with CoT. It is particularly effective for multi-step calculation or inference, ensuring each sub-problem is solved.
Self-Consistency: Self-consistency further enhances CoT. By voting across multiple reasoning paths, it improves accuracy by 10–20% across a range of tasks (Plaat et al., 2024). This has been observed on arithmetic reasoning and commonsense QA: the final answer agreed upon by an ensemble of chains is more likely correct than any single chain’s answer. Self-consistency has become a de facto addition when maximum accuracy is needed, albeit at the cost of more API calls or compute. Many 2024 studies include self-consistency as a baseline – for instance, Progressive Hint Prompting (Zheng et al., 2023) combined CoT with self-consistency to reach a state-of-the-art 91% on simple math word problems using GPT-4.
Tree-of-Thought and Variants: ToT shines on creative or search-intensive problems. In evaluations on puzzles (like Sudoku, crosswords, and reasoning games), ToT methods substantially outperform linear CoT. As noted, a Tree-of-Thought approach achieved 60% success on a crossword task vs. 16% for CoT. Similarly, other studies found ToT-based prompts solve significantly more instances of problems that require trying different possibilities (e.g. step-by-step planning tasks). Advanced versions like Forest-of-Thought indicate that scaling up the search (multiple trees with consensus) yields even higher precision on logical benchmarks (Bi et al., 2024). That said, for straightforward question-answer tasks, ToT might not always be necessary – its benefits are largest when systematic trial and error is needed.
Graph-Based Reasoning: Graph-augmented prompting is specialized for knowledge-heavy tasks. On the GRBench suite, Graph-CoT outperformed text-only reasoning by a consistent margin (Jin et al., 2024). This included domains like academic paper networks and medical knowledge graphs, where querying the graph for relevant nodes helped answer questions more accurately. In traditional QA benchmarks (e.g. open-domain questions), graph-based methods can help when the question requires connecting distant pieces of knowledge. They are a form of tool-use prompting: the LLM uses the graph as an external memory. Real-world tasks like medical diagnosis or legal reasoning, which have underlying knowledge structures, could benefit from these strategies.
Chunking for Long Contexts: For tasks involving long documents (e.g. answering questions from a 100-page report or a book), chunking strategies determine success. Studies in 2024 showed that without smart chunking, LLM answers may be incomplete or incorrect due to missing context. By contrast, chunking by document sections and feeding the model relevant pieces yields more faithful reasoning (Yepes et al., 2024). In practical benchmarks like narrative QA or multi-document summarization, approaches that chunk and then use hierarchical CoT (reasoning on chunks, then merging) have outperformed one-shot reading of as much text as fits in the context window. Essentially, chunking plus iterative prompting enables scaling reasoning to long texts that would otherwise exceed the model’s capacity.
Key Takeaways
Different methods complement each other: For example, chain-of-thought can be combined with self-consistency (to verify answers) and with tree search (to explore alternatives). Many state-of-the-art prompting frameworks in 2024 actually use a hybrid of these techniques.
Empirical gains are task-dependent: On straightforward math word problems, CoT + self-consistency might be enough to reach high accuracy (Plaat et al., 2024). On tricky logic puzzles, Tree-of-Thought or other search-based prompts make a big difference. On tasks needing factual knowledge, graph-based or chunk-based prompts help inject the needed information (Jin et al., 2024).
Benchmarks show significant improvements: As an example, a program-aided CoT with beam search (Xie et al., 2024) beat standard few-shot prompting by 6–9% on GSM8K, AQuA, and StrategyQA, illustrating that prompt strategies continue to close the gap to perfect accuracy. Multi-agent validation in ToT gave another ~5% boost on GSM8K over vanilla ToT (Haji et al., 2024).
Reasoning quality vs. compute trade-off: Techniques like self-consistency and ToT use multiple model calls or longer generation, trading compute for better reasoning. Real-world deployments must balance this, but for important use-cases (e.g. medical advice, code generation), the improved correctness can be worth the extra cost.
References and Latest Research
Each of the methods discussed has a body of recent literature behind it. Summarizing the latest insights:
Few-shot prompting: In-context learning is enabled by large models and has been surveyed by Plaat et al. (2024). New methods like AutoReason generate few-shot exemplars automatically to adapt reasoning to each query (Sevinc & Gumus, 2024).
Chain-of-Thought: Introduced by Wei et al. (2022) and expanded by many works, CoT remains fundamental. Kojima et al.’s zero-shot CoT finding made CoT even more accessible. CoT’s ability to induce “System 2” thinking is well documented.
Self-Consistency: Proposed by Wang et al. (2022) and now frequently used. It typically yields 10–20% boosts on reasoning accuracy and is a common baseline for comparison in 2024 papers.
Tree-of-Thought: Yao et al. (2023/24) demonstrated the power of tree search in prompting. Follow-ups like multi-agent ToT with validators (Haji et al., 2024) and Forest-of-Thought (Bi et al., 2024) show ongoing improvements in this direction.
Graph-based prompting: Jin et al. (2024) introduced Graph-CoT, showing how to use graphs for LLM reasoning. This opened a line of research into grounded reasoning on structured knowledge.
Chunking for long documents: Yepes et al. (2024) studied chunking in RAG pipelines, finding that structurally informed chunking greatly improved answer quality on long-text QA. Long-context reasoning is also tested in benchmarks like BABILong and others addressing LLM context limits (NeurIPS 2024).
In conclusion, prompt engineering in 2024–2025 has equipped LLMs with better “thinking” strategies. Few-shot and CoT prompting give models a basic reasoning ability; self-consistency and tree-of-thought add reliability and depth; graph-based and chunking methods extend reasoning to knowledge-intensive and long-text scenarios. Empirical results across numerous benchmarks consistently show these methods enhance the logical and multi-step reasoning capabilities of LLMs, bringing us closer to general problem-solving AI (Haji et al., 2024).