Browse all previously published AI Tutorials here.
Table of Contents
🤔 Why Do LLMs Hallucinate?
💡 Prompt Engineering to Curb Hallucinations
🎛️ Calibration and Uncertainty Estimation
🔗 Retrieval-Augmented Generation (RAG) Advances
🔎 Post-hoc Verification Pipelines
🔧 Fine-Tuning and Instruction Tuning Approaches
🗃️ Data Curation and Training Strategies
⚠️ High-Stakes Domains: Why Hallucination Is Critical
🤔 Why Do LLMs Hallucinate?
Hallucination refers to an LLM generating content that is plausible-sounding but incorrect or ungrounded in reality. Modern LLMs are trained to predict the most likely next token given context, without any built-in requirement to be truthful or factual (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). This fundamental design means that if the most statistically likely continuation of a prompt is factually wrong, the model will still produce it. In other words, LLMs lack an internal fact-checker – they rely purely on patterns learned from data, not a connection to ground truth.
Several interrelated architectural and training factors cause hallucinations in 2024-era LLMs:
Imperfect Training Data: Models ingest terabytes of text that inevitably contain errors, biases, and contradictions. If the data on a topic is incomplete or inconsistent, the model may fill gaps with invented content (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). Biases in data can also skew outputs (e.g. a model might confidently state a false “fact” if it saw it phrased authoritatively during training). Hallucination is essentially the model “making up” an answer when its knowledge is uncertain or incomplete (LLM Hallucination Risks and Prevention).
Next-Token Objective (No Truth Criterion): The training objective optimizes for fluency and coherence, not factual accuracy. As a result, an LLM will often choose an answer that sounds right over one that is right. It has no innate ability to distinguish truth from falsehood. A perfectly grammatical, confident-sounding sentence can be completely fabricated because the model’s loss function never explicitly penalized factual errors during pretraining.
Rare or One-off Facts: If a fact or name appears only once (or very rarely) in training data, the model’s grasp of it is tenuous. Recent theoretical work (2024) showed there is an inherent lower bound on hallucination rates for such “arbitrary” facts, even in an ideally trained model (Calibrated Language Models Must Hallucinate). In other words, a well-calibrated language model must sometimes misremember or hallucinate facts that were seen only once. This happens because the model cannot assign 100% probability to a fact it has minimal exposure to, so occasionally it will generate an incorrect variant. The probability of hallucinating roughly correlates with the fraction of facts in training that were singleton occurrences. This implies that no matter how advanced the architecture, ultra-rare knowledge is never perfectly retained without special handling. (On the flip side, the same analysis suggests there is no fundamental reason for models to hallucinate extremely common or systematic facts like simple arithmetic – those errors can potentially be fixed with better training or architectures.)
Long Contexts and Distractions: LLMs with very long prompts or many context documents can get “distracted.” Irrelevant or extraneous information in the context may confuse the generation process, leading to outputs that stray off-topic or away from the source (Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations). Similarly, if a prompt is ambiguous or underspecified, the model might improvise details to fill in the blanks. Complex multi-step reasoning tasks are especially prone to hallucination – errors can compound over a chain of reasoning, leading to a confidently stated wrong conclusion.
Source vs. Response Divergence: In tasks like summarization or QA with a given document, extrinsic hallucinations occur when the model outputs information that cannot be verified by the source material, even if it sounds plausible (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). For example, the model might inject a detail not present in the source text. This often stems from the model’s general world knowledge bleeding into a context-specific answer. If its parametric memory suggests something related, it may include it even though the instruction was to stick to the provided document.
Overfitting and Style Imitation: Paradoxically, if a model is overfit or heavily fine-tuned on specific formats, it might hallucinate by mimicking patterns too rigidly. For instance, if it learned a style of answering from training data that included making up citations or facts (to sound “scholarly”), it might continue that pattern inappropriately. Overfitting to training data can also mean the model fails to generalize to novel queries and instead regurgitates irrelevant info from training, which appears as a hallucination in the new context (LLM Hallucination Risks and Prevention).
No Uncertainty Reporting: Base LLMs don’t know when to say “I don’t know.” They will attempt an answer for almost any query. This is partly a design choice – early GPT-style models were trained to always continue the text. It’s also due to reinforcement learning from human feedback (RLHF) which often penalized responses like “I’m not sure.” Thus, models tend to answer even when their knowledge is shaky, often resulting in confident-sounding fabrications. In short, poor calibration of confidence (discussed more later) means the model’s internal probability might only be 40% on an answer, but it will state it with 100% confidence unless guided otherwise.
Given these causes, many researchers argue that some level of hallucination is inevitable. A theoretical analysis from the National University of Singapore in 2024 contends that no computable LLM can know all facts or perfectly avoid false outputs, since an LLM cannot learn every function mapping inputs to correct outputs (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). Rather than a bug, hallucination may be an inherent trade-off of using statistical learners. That said, not all hallucinations are equal – straightforward factual errors (e.g. wrong dates or names) might be greatly reduced with better training, whereas open-ended “creative” prompts will always invite some imaginative completion. The current consensus is that we can mitigate and manage hallucinations to a very low rate in critical applications, even if we cannot 100% eliminate them.
💡 Prompt Engineering to Curb Hallucinations
One immediate line of defense against hallucinations is prompt engineering – crafting the input and instructions to steer the model away from making things up. Over the past two years, practitioners have developed a toolkit of prompt-based methods that significantly reduce hallucination without requiring any model retraining. These techniques leverage the fact that LLMs do respond to how you ask. Key prompt engineering strategies include:
Encourage Abstention (Allow “I don’t know”): By default, models often feel compelled to answer. Explicitly telling the model it’s okay to say it doesn’t know will yield more honest outputs. For example, a system or user prompt might state: “If you are unsure of an answer or lack sufficient information, respond with ‘I don’t have enough information to answer that.’” This simple instruction can drastically cut down on fabricated answers, as the model is no longer forced to conjure an answer from thin air (Reduce hallucinations - Anthropic). Anthropic’s Claude documentation in 2024 emphasizes this as a basic guardrail – giving the model permission to admit uncertainty often leads it to gracefully refuse to answer rather than hallucinate. Many fine-tuned chat models (e.g. OpenAI’s latest) have been trained to do this in unsafe or unknown scenarios, but an explicit instruction reinforces the behavior for factual queries.
Require Evidence or Quotes: Another prompt pattern is to demand that the model show its work by quoting sources. For instance: “Include supporting quotes from the provided text for each of your claims, and if you cannot find a supporting quote, state that the information is not present.” By making the model literally pull exact phrases from a reference text, we anchor its output to the source. This approach was highlighted by Anthropic: first ask the model to extract the most relevant sentences from the source document, then have it base its answer strictly on those extracts. This prevents the model from wandering off-topic or introducing outside “world knowledge” that might be incorrect. Essentially, the model becomes an assembler of known pieces rather than a free-form author. Studies in late 2024 showed this dramatically improves factuality in tasks like document analysis or QA: the model’s answer stays within the bounds of the retrieved material. The trade-off is that the style may become more literal and less flowing, but for high-stakes applications, accuracy trumps elegance.
Verification and Self-Checking Prompts: One can prompt the model to verify each of its statements after drafting an answer. A template for this might be: “First, answer the question. Then, for each sentence in your answer, check the source text for confirmation. If a sentence is not supported, revise or flag it.” This effectively makes the model perform a second-pass audit of its own output (Reduce hallucinations - Anthropic). Anthropic calls this “verify with citations”, where Claude is instructed to find a source for each claim it made and remove any claim that lacks one. This post-processing step in the prompt can catch hallucinations immediately – the model may realize “Oops, I stated X but I actually can’t find it in the text” and then fix or omit that part. Prompting the model to explicitly enumerate its reasoning or chain-of-thought before the final answer is another variant: if it explains how it got an answer step-by-step, inconsistencies or leaps of logic (often a sign of hallucination) can be identified and corrected more easily. This chain-of-thought verification approach uses the model’s own reasoning capabilities to sanity-check the result.
Constrained Answer Formats: By structuring the output format tightly, we leave the model less room to hallucinate. For example, in a database query task, we might say: “List the exact entries from the provided table that answer the query, and nothing else.” If the model is instructed to output JSON with specific fields, or to choose from a given list, it’s less likely to inject unsupported content. A recent study (Rumiantsau et al. 2024) on data analytics QA found that using Structured Output Generation (like fixed schemas or lists) was more effective at reducing hallucinations than even fine-tuning (Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics). Essentially, formatting acts as a guardrail: the model can’t stray too far if the required answer is, say, a number from a table or a quote from text.
External Knowledge Restriction: If you supply reference documents (e.g. an article or a knowledge base excerpt) as part of the prompt, it helps to explicitly tell the model to use ONLY that information. For instance: “Use only the information in the following passage to answer; do not use any outside knowledge.” This reminder can suppress the model’s impulse to rely on potentially outdated or incorrect parametric memory (Reduce hallucinations - Anthropic). The model will treat the prompt-provided text as the sole ground truth. However, note that if the prompt documents don’t actually contain the answer, the model might then correctly respond that it cannot find the answer (rather than hallucinating it) – which is usually the safer outcome.
Role-playing and Perspective: Sometimes giving the model a specific role or persona in the prompt can mitigate hallucinations. For example: “You are a meticulous financial auditor. If information is missing, you state that it’s unavailable rather than guessing.” Defining the assistant’s persona as cautious, evidence-driven, and averse to mistakes can influence the style of its answers. This is a softer form of steering compared to hard constraints, but it does have an effect in making the output more conservative with facts.
Re-asking and Majority Voting: A clever prompt-based trick is to run the same query multiple times (with slight rephrasing or using the model’s randomness) and then compare the answers – this is often called best-of-N or self-consistency. If an answer is factual and well-grounded, multiple independent runs of the model should converge on the same result. But if each run diverges, that indicates uncertainty (and likely hallucination in some of the runs) (Reduce hallucinations - Anthropic). One can programmatically or manually take a majority vote or intersection of the answers, reducing the chance of accepting a one-off hallucinated response. Some implementations have the model itself do this: e.g. “Provide three possible answers with your reasoning, then decide which answer is most consistent among them.” This uses the model’s own ensemble to filter out inconsistent hallucinations (see the sketch after this list).
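Below is a minimal sketch of the majority-voting idea, assuming an OpenAI-style chat-completions client (the `openai` Python package). The model name, temperature, and naive string normalization are illustrative choices, not prescriptions.

```python
from collections import Counter

from openai import OpenAI  # any chat-completion client works; this one is illustrative

client = OpenAI()

def self_consistent_answer(question: str, n: int = 5, model: str = "gpt-4o-mini") -> str:
    """Sample the same question n times and keep the majority answer, if any."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            temperature=0.8,  # some randomness so the samples vary independently
            messages=[
                {"role": "system", "content": "Answer concisely. Say 'unknown' if unsure."},
                {"role": "user", "content": question},
            ],
        )
        answers.append(resp.choices[0].message.content.strip().lower())

    best, count = Counter(answers).most_common(1)[0]
    # No clear majority -> treat the answer as unreliable rather than pick one.
    return best if count > n // 2 else "unknown (samples disagreed)"
```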
It’s important to stress that prompt engineering cannot fully eliminate hallucinations, but it can drastically lower their frequency. By 2025, these methods are considered best practices when deploying LLMs in production. They are relatively low-cost (no retraining needed) and model-agnostic. In critical applications, developers often layer multiple prompt tactics: e.g. instruct the model to say “I don’t know” when unsure, require evidence citations, and have a final verification pass – all within the prompt. Empirical results in medical Q&A have shown that such inference-time techniques like chain-of-thought prompting and search-augmented prompting significantly drop hallucination rates, though not to zero (Medical Hallucination in Foundation Models and Their Impact on Healthcare | medRxiv). Prompt design is thus an essential first line of defense while deeper model-centric solutions continue to evolve.
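As a concrete illustration of layering these tactics, here is a hedged example of a system prompt that combines permission to abstain, quote-backed claims, and a final self-check; the exact wording is an assumption and should be tuned per application.

```python
# One way to layer the tactics above into a single system prompt (wording illustrative).
GROUNDED_SYSTEM_PROMPT = """\
You answer strictly from the reference documents provided in this conversation.
Rules:
1. If the documents do not contain the answer, reply exactly:
   "I don't have enough information to answer that."
2. Support every factual claim with a short quote from the documents, cited as [doc-id].
3. Before finalizing, re-check your answer sentence by sentence and delete any
   sentence you cannot back with a quote.
"""
```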
🎛️ Calibration and Uncertainty Estimation
A recurring theme in hallucination research is calibration: does the model know when it might be wrong? An ideally calibrated model would only answer when it’s sufficiently confident in the factuality of its response, and would abstain or express uncertainty otherwise. Many hallucinations can be viewed as a calibration failure – the model gives a high-confidence answer to a query where it actually had low information.
Why are current LLMs poorly calibrated? One reason is that during fine-tuning (especially with RLHF), models are trained to be decisive and user-friendly, which inadvertently penalizes expressions of doubt. They learn to prefer a fluent answer over an admission of ignorance. In fact, research has found that reinforcement fine-tuning can degrade a model’s calibration even compared to its base pre-trained version (Extrinsic Hallucinations in LLMs | Lil'Log). The model might internally assign a moderate probability to an answer, yet verbalize it without hedging. This means the user can’t tell the model’s true uncertainty – a dangerous situation in high-stakes use.
To address this, 2024 saw progress in uncertainty quantification for LLMs. A notable work by Gal et al. (Nature, June 2024) introduced an entropy-based method to detect hallucinations by measuring semantic uncertainty (Detecting hallucinations in large language models using semantic entropy | Nature). The idea is to sample the model repeatedly and estimate the entropy at the level of meaning rather than surface text. If a model generates many paraphrased answers with divergent facts for the same question, it indicates high uncertainty and likely hallucination. Their method doesn’t need task-specific training – it statistically flags when a prompt is likely to induce a confabulation, i.e. an arbitrary, unsupported generation. In experiments, this approach generalized across datasets and could warn users when an answer should be taken with caution. Essentially, instead of just using the model’s final answer, they use the distribution of possible answers as a tell for unreliability.
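A simplified sketch of the semantic-entropy idea follows: sample several answers, cluster them by meaning, and compute the entropy over the clusters. The `same_meaning` callback stands in for the bidirectional-entailment check used in the paper; the exact-match comparison in the example is purely illustrative.

```python
import math
from collections import Counter
from typing import Callable, List

def semantic_entropy(samples: List[str],
                     same_meaning: Callable[[str, str], bool]) -> float:
    """Entropy over meaning-clusters of sampled answers (higher = less reliable)."""
    clusters: List[List[str]] = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):   # add to an existing meaning-cluster
                c.append(s)
                break
        else:
            clusters.append([s])        # start a new cluster
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Example: five samples for the same question fall into three clusters -> high entropy.
samples = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
h = semantic_entropy(samples, lambda a, b: a.strip().lower() == b.strip().lower())
print(f"semantic entropy = {h:.2f}")  # flag the answer if this exceeds a threshold
```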
Another development is calibrating models to refuse or defer if uncertain. OpenAI and others have not published technical details, but it’s evident that newer models (GPT-4, Claude 3, etc.) are more willing to say they cannot answer. Anthropic reported that Claude 3 (2024) was explicitly evaluated on a wide set of challenging factual questions and showed a significant reduction in incorrect answers (hallucinations) compared to Claude 2.1, in part due to the model choosing uncertainty or refusal when appropriate (Introducing the next generation of Claude \ Anthropic). By labeling some queries in training where the best answer was “I don’t know,” they taught the model to recognize those situations. Mistral AI similarly emphasized that their Mistral Large 2 model was fine-tuned to be more cautious and acknowledge when it doesn’t have enough information, rather than fabricating an answer (Large Enough | Mistral AI). This kind of tuning directly improves calibration: the model’s behavior (saying “cannot answer”) aligns with its lack of knowledge.
There is also interest in quantitative calibration metrics. For instance, evaluating the Expected Calibration Error (ECE) on truth-based tasks – do the probabilities the model assigns to answers match the actual correctness frequency? Some research (e.g. on legal QA in 2024) estimated calibration errors and found models often overestimate their accuracy on domain-specific questions (arXiv:2401.01301 [cs.CL], 2 Jan 2024). Techniques like temperature scaling (common in classification tasks) can be applied to LLMs in a limited way: e.g. adjusting the “certainty” of the model’s wording based on entropy. One practical approach is to use a calibration prompt: ask the model how confident it is or to provide a likelihood along with the answer. In a controlled eval (Kadavath et al. 2022, noted in a 2024 blog summary), GPT-style models could output a probability with their answer and, when prompted properly, those probabilities correlated reasonably well with actual correctness (Extrinsic Hallucinations in LLMs | Lil'Log). However, RLHF had made the raw model probabilities less useful, requiring careful re-calibration or use of a separate model.
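For reference, a minimal ECE computation over a set of (stated confidence, correctness) pairs might look like the following; the binning scheme and toy data are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: average |accuracy - mean confidence| per confidence bin, weighted by bin size.
    `confidences` are the model's stated probabilities (e.g. elicited by asking it
    to append a 0-1 confidence to each answer); `correct` marks whether each answer was right."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight the gap by the fraction of samples in the bin
    return ece

# Toy example: stated confidences vs. whether the answers were actually correct.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [1, 0, 0, 1]))
```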
Beyond self-calibration, a complementary approach is an external “hallucination detector”. This could be a separate lightweight model or heuristic that analyzes the main model’s output for signs of hallucination. For example, one might train a classifier on known truthful vs false outputs (if such data can be generated) or use consistency checks (e.g. does the answer contradict known facts?). The semantic entropy method mentioned earlier is one example of a detector that doesn’t rely on ground-truth comparison but on the model’s output distribution (Detecting hallucinations in large language models using semantic entropy | Nature). Another simple technique: run a second instance of the model with a prompt like “Is the following answer supported by facts? [insert answer]” – essentially asking the model to critique itself. This is not foolproof, but sometimes the model can identify its own hallucination upon reflection (especially if the second prompt encourages critical analysis rather than helpfulness). Microsoft’s CoNLI framework (late 2023) uses natural language inference (NLI) as a verification step: after the model answers using some source, an NLI model checks whether the answer is entailed by the source; if not, the answer is flagged and revised (Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations). This kind of plug-and-play verification achieved state-of-the-art detection of ungrounded statements and could rewrite the offending parts to improve groundedness. Such pipelines effectively calibrate the final output by excising low-confidence (unsupported) pieces.
In summary, calibration is about aligning the model’s confidence with reality and ensuring it behaves appropriately when uncertain. By 2025, the best LLM deployments combine internal calibration (via training and prompting the model to express uncertainty) with external calibration (via detectors or guardrails that catch low-confidence situations). The result is that cutting-edge models are far less likely to blurt out unchecked falsehoods. As evidence: Google’s latest Gemini 2.0 model reports a <2% hallucination rate on the Vectara benchmark, which indicates highly calibrated, grounded generation (Google Research 2024: Breakthroughs for impact at every scale). This is achieved through many of the techniques above (and heavy use of retrieval, as we discuss next). Ultimately, improving calibration builds user trust – the model will hesitate when it should, and users can start to tell apart when it’s on solid ground versus when it’s guessing.
🔗 Retrieval-Augmented Generation (RAG) Advances
One of the most active areas of development in 2024 has been Retrieval-Augmented Generation (RAG). RAG marries an LLM with an external knowledge source: before generating an answer, the system retrieves relevant documents or data and presents them to the model as additional context. The intuition is simple: if the model has the actual facts in front of it, it doesn’t need to hallucinate! Instead of solely relying on parametric memory (which might be stale or incomplete), the model can draw from a fresh, non-parametric memory (a database, the web, etc.). RAG has been around for a few years, but recent research has greatly refined its effectiveness and scope in combating hallucinations.
Key benefits of RAG: it provides grounding. The model’s output can be directly tied to source text, reducing the chance of arbitrary invention. Hallucinations are often replaced by correct excerpts or summaries of retrieved content. This tackles especially those queries about specific, up-to-date, or niche information that the base model wasn’t trained on. For example, a base LLM might not know details of a 2023 tax law, but a RAG system could fetch the relevant law text and have the LLM incorporate it, avoiding a hallucinated answer.
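A bare-bones sketch of this retrieve-then-generate loop is shown below, assuming a sentence-transformers embedder and an OpenAI-style chat client; the model names and the “answer only from the passages” instruction are illustrative assumptions.

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
client = OpenAI()

def rag_answer(question: str, documents: list[str], k: int = 3) -> str:
    """Retrieve the k most similar documents and answer strictly from them."""
    doc_vecs = embedder.encode(documents, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]   # cosine similarity (unit-norm vectors)
    context = "\n\n".join(f"[{i}] {documents[i]}" for i in top)

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the passages below. If they do not "
                        "contain the answer, say you cannot find it.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```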
In 2024, major AI labs deployed RAG in high-profile systems. Google’s Gemini AI, for instance, includes features like “Related Sources” and “Double-check” in its interface, which are powered by retrieval (Google Research 2024: Breakthroughs for impact at every scale). When asked a fact-seeking question, Gemini can fetch documents and even show the user source links supporting its answer. This design explicitly aims to reduce hallucinations by training models to rely on sources. In fact, Google Research noted they trained models on summarization tasks with an approach that forces close adherence to the provided sources. Another approach combined structured data (like knowledge graphs) with language models to improve grounding, ensuring that factual queries could be answered with database-like precision.
One notable initiative is DataGemma (Google, Sep 2024), described as the first open models specifically designed to combat hallucination via real-world data (DataGemma: AI open models connecting LLMs to Google’s Data Commons). DataGemma connects LLMs to Google’s Data Commons (a vast repository of factual statistical data). It employs Retrieval-Interleaved Generation (RIG): when the model is prompted with a question involving a statistic or factual claim, it automatically triggers a retrieval of that data from the Data Commons and interleaves it into the answer. For example, if asked “What is the population of X in 2022?”, the system will fetch the exact number from an authoritative source during generation instead of relying on the model’s memory (which might be wrong or outdated). By doing so, DataGemma effectively eliminates hallucination for statistical questions – the answers come directly from verified data. They also integrate standard RAG (retrieve relevant context passages) using Gemini’s long context window to provide background info before the model generates its answer. The result was more comprehensive outputs with footnoted facts, and a clear reduction in made-up content, as reported by Google. This demonstrates how coupling LLMs with domain-specific databases (e.g. medical knowledge graphs, financial filings, etc.) can greatly enhance factual accuracy.
However, RAG is not a silver bullet; it introduces new challenges: garbage in, garbage out. If the retrieval step brings back irrelevant or low-quality documents, the model might still hallucinate or incorporate incorrect info. A 2024 paper titled “Corrective RAG” explicitly addressed this: it noted that while RAG is effective, it “relies heavily on the relevance of retrieved documents” – if retrieval fails, the model can hallucinate using whatever was retrieved or fall back to its parametric knowledge (Corrective Retrieval Augmented Generation). To mitigate this, the authors proposed a pipeline that first evaluates the retrieved documents before generation. This refinement uses a lightweight retrieval evaluator that assigns a confidence score to the retrieved set; if confidence is low, the system can decide to perform a broader web search or take alternative actions. In addition, they introduced a decompose-then-recompose algorithm: essentially breaking the query into sub-queries, retrieving pieces, and then filtering and recombining the info. This helps focus the model on the truly relevant facts and filter out distracting content. Such an approach makes RAG more robust – the model is less likely to be led astray by an off-topic passage among the retrieved context. Experiments showed this improved factual performance significantly on both short-form QA and long-form generation.
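A much-simplified sketch of the retrieval-evaluation idea (not the paper’s actual pipeline) could use the LLM itself as a crude grader; the prompt, threshold, and model name below are illustrative assumptions, and a low score is the signal to broaden the search or decline to answer.

```python
from openai import OpenAI

client = OpenAI()

def retrieval_confidence(question: str, passages: list[str]) -> float:
    """Crude stand-in for a retrieval evaluator: score how well the retrieved
    passages cover the question (0 = useless, 1 = sufficient)."""
    joined = "\n\n".join(passages)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": (
                "On a scale of 0 to 1, how well do these passages contain the "
                "information needed to answer the question?\n"
                f"Question: {question}\nPassages:\n{joined}\n"
                "Reply with a single number only."
            ),
        }],
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable grade -> assume the retrieval is weak

# Usage: if retrieval_confidence(q, docs) < 0.5, trigger a broader search
# (or abstain) instead of generating from weak context.
```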
We also see RAG blending into user-facing features. Many enterprise applications now use an LLM with a vector database of company-specific knowledge: the user’s question is converted to an embedding, relevant internal documents are fetched, and the model is prompted with those docs to answer. This has proven extremely useful for preventing hallucinations in domains where incorrect answers are unacceptable (e.g. an LLM-powered assistant for physicians that always retrieves the latest medical guidelines for reference). The retrieved text essentially anchors the model. A study on medical QA (2024) found that Search-Augmented Generation (having the model do a web search as part of answering) markedly reduced hallucination rates in clinical questions (Medical Hallucination in Foundation Models and Their Impact on Healthcare | medRxiv). It allows the model to correct or double-check its own uncertain knowledge before finalizing an answer.
In summary, RAG has become a cornerstone of hallucination mitigation. By expanding the context window with relevant, authoritative information, it turns the open-ended text generation problem into a grounded question-answering problem. The past two years saw improvements like dynamic retrieval (searching when needed), large context handling (feeding models 100k+ tokens of reference material), and mixed-modal retrieval (e.g. pulling both text and data tables) – all aimed at giving the model the best chance to stay factual. The payoff is clear in benchmarks: grounded models like Google’s Gemini 2 scored 83.6% correctness on the FACTS factuality benchmark, far higher than earlier models, precisely due to these grounding techniques (Google Research 2024: Breakthroughs for impact at every scale). As we continue, it’s likely every high-end LLM system will have a retrieval component, especially for domains where up-to-date information is critical. Pure end-to-end generative models (without retrieval) are simply too prone to making stuff up when they hit the boundaries of their training knowledge.
🔎 Post-hoc Verification Pipelines
Even with good prompting, calibration, and retrieval, it’s wise not to entirely trust a single pass generation from an LLM in critical contexts. Post-hoc verification refers to any process that checks or corrects the model’s output after it has been generated (or as a final step in generation). Instead of preventing hallucinations upfront, these methods catch and fix them just before the answer reaches the end user.
One class of verification methods uses a second model or system to fact-check the first model’s output. For example, after an LLM produces an answer, one can feed that answer (and maybe the original question) into a fact-checker. This could be a simpler classifier trained to detect factual errors, or another LLM prompt that says “verify each statement above and label it true/false”. Microsoft researchers proposed a Chain of Natural Language Inference (CoNLI) framework along these lines (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM) (Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations). In their approach, the LLM’s answer is post-edited using a chain of NLI checks. Essentially, for each sentence in the answer, they check whether the sentence is entailed by the provided source texts. If not, that sentence is flagged as a hallucination. They then have the model rewrite those parts (or remove them) to better align with the sources. This hierarchical NLI-based verifier caught many unsupported claims that would slip through a single-pass system, and the rewriting step could correct the answer without human intervention. Importantly, this was done without fine-tuning the main LLM – it’s a plug-and-play wrapper, meaning one can improve an off-the-shelf model’s trustworthiness by adding this verification stage.
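A minimal sketch of the sentence-level entailment check (not the CoNLI system itself) might look like the following, assuming an off-the-shelf MNLI model from Hugging Face; the model name and threshold are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # any MNLI-trained checkpoint works similarly
tok = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def unsupported_sentences(source: str, answer_sentences: list[str],
                          threshold: float = 0.5) -> list[str]:
    """Return answer sentences that the source text does not entail."""
    entail_id = {v.lower(): k for k, v in nli.config.id2label.items()}["entailment"]
    flagged = []
    for sent in answer_sentences:
        # Premise = source document, hypothesis = the sentence being checked.
        inputs = tok(source, sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli(**inputs).logits.softmax(dim=-1)[0]
        if probs[entail_id].item() < threshold:
            flagged.append(sent)   # candidate hallucination: revise or remove it
    return flagged
```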
Another approach is self-consistency verification: have the same model verify its answer via a different prompt. We mentioned earlier the idea of prompting the model to double-check itself with citations (which is a form of inline verification). Post-hoc, one can do: “Here is my answer. Is each claim in it fully supported by known facts? If not, correct it.” This meta-prompt often works surprisingly well because the model, when put in a critical evaluator role, might catch mistakes it made in the responder role. It’s as if the model is used as its own adversary. The Anthropic guide suggests iterative refinement: feed Claude its own output and ask it to expand on or verify specific claims (Reduce hallucinations - Anthropic). This can uncover inconsistencies – for instance, the model might realize a previous statement doesn’t actually follow and revise it upon further reflection.
There’s also tool-assisted verification. LLMs can be integrated with external tools (via APIs or function calling). A classic example: after the model answers a question, automatically run a web search for the main facts it asserted. If the search results contradict the answer (or no evidence is found), that’s a red flag. Some systems will actually do this in real-time – e.g., an agent that uses the LLM to answer, then uses the LLM to formulate a search query about that answer, reads the search results, and if a discrepancy is found, either corrects the answer or at least attaches a warning. This kind of closed-loop verification is a bit complex but has been demonstrated in research prototypes. Essentially, the LLM is used in multiple roles (answerer and verifier) with external information fetched in between. We can think of it as an automated “fact-checking journalist” pipeline: write the story, then research every claim in the story.
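A hedged sketch of such a closed loop follows, again assuming an OpenAI-style client; the `search_fn` argument stands in for whatever search backend is available (it simply returns text snippets), and the prompts are illustrative.

```python
from typing import Callable, List

from openai import OpenAI

client = OpenAI()

def verify_with_search(question: str, draft_answer: str,
                       search_fn: Callable[[str], List[str]]) -> str:
    """Fact-check a draft answer against fresh search results and revise it if needed."""
    # Step 1: let the model write a query targeting its own claim.
    query = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user",
                   "content": f"Write one short web search query to fact-check this claim: {draft_answer}"}],
    ).choices[0].message.content.strip()

    # Step 2: fetch evidence and let the model reconcile the draft with it.
    snippets = "\n".join(search_fn(query))
    revised = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Question: {question}\nDraft answer: {draft_answer}\n"
                               f"Search results:\n{snippets}\n\n"
                               "If the search results contradict the draft, rewrite it to match "
                               "the evidence; otherwise return the draft unchanged.")}],
    ).choices[0].message.content
    return revised
```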
For specific domains, specialized verifiers shine. For instance, in programming, you can compile or run generated code to see if it works – that immediately flags hallucinations in code outputs. In math, you can have the model’s answer checked by a math solver. In question-answering, some benchmarks use regex or knowledge-base constraints to validate answers (e.g., if the question asks for a date, ensure the answer is a date and matches known timelines).
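For the code case, the check can be as simple as executing the generated program and seeing whether it runs at all – a necessary (though not sufficient) signal; this sketch assumes the program is safe to execute in an isolated environment.

```python
import subprocess
import sys
import tempfile

def code_runs(generated_code: str, timeout: int = 10) -> bool:
    """Cheap verifier for code outputs: does the generated program at least execute?
    Only run this inside a sandboxed/isolated environment."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # hung or looping code counts as a failure
    return result.returncode == 0
```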
A noteworthy point from late-2024 discussions: hallucinations can be seen as a form of adversarial example. Some researchers argue we should treat a model’s tendency to hallucinate the way we treat adversarial robustness in vision models (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). This means actively stress-testing the model’s outputs and developing adversarial training or filtering to eliminate the most egregious failure modes. Under this view, a post-hoc verifier can be thought of as an “adversarial filter” catching the model’s mistakes.
One concrete implementation in the retail domain was the Knowledge Injection (KI) framework by Yext (2024). They created prompts for a chatbot responding to customer reviews that inject contextual data about specific store locations to keep answers grounded. Essentially, before the model responds to a customer query like “Is this item in stock at store X?”, the system fetches that store’s inventory or info and puts it into the prompt. Then, after the model drafts an answer, it can verify that every piece of info appears in the injected data. This led to improved accuracy in responses. It’s a mix of retrieval (inject data) and verification (ensure the answer stays within it), and it was successful enough that the authors suggest using smaller domain-specific models with such pipelines rather than giant general models for these tasks.
In practice, many production LLM systems in 2025 use multi-step pipelines: retrieve info, generate draft, verify draft, maybe even second-pass refine, then output. While this adds latency and complexity, it dramatically increases reliability. Users of ChatGPT plugins have seen this: e.g. the “Browser” plugin essentially does an on-demand retrieval + verification by letting the model read search results and correct itself. Another example is Bing Chat’s approach: it queries Bing search and then uses an internal ranking of whether its answer is grounded in the snippets; if not, it adjusts the answer or cites sources.
Finally, an element of verification is transparency – if a model cites its sources (as some new systems do), the user can manually verify claims if needed. Anthropic announced that Claude 3 will support citations down to specific sentences in supporting documents (Introducing the next generation of Claude \ Anthropic), which allows an end-user to see exactly where a claim came from. This human-in-the-loop verification is not automatic, but it’s facilitated by the AI itself citing sources.
To wrap up, post-hoc verification acts as a safety net. It’s acknowledging that “even if my LLM sometimes goes off the rails, I have a mechanism to catch it before harm is done.” Much like an editor for a writer, the verifier reviews and fixes the content. This layered defense (generation + verification) is becoming the norm, especially as organizations deploy LLMs in sensitive workflows where unchecked hallucinations could be costly.
🔧 Fine-Tuning and Instruction Tuning Approaches
Beyond clever prompts and pipelines, a more direct way to reduce hallucinations is to train the model itself to be more truthful and cautious. Fine-tuning refers to updating the model’s weights on a task-specific dataset or with certain objectives, while instruction tuning usually refers to supervised fine-tuning on instruction-response pairs to make the model follow directions better. Over the past two years, fine-tuning strategies have been employed to make LLMs less prone to hallucination.
One prominent strategy is reinforcement learning from human feedback (RLHF), which was instrumental in training ChatGPT and similar models. Part of the human feedback criteria often includes factuality – human annotators prefer answers that are correct and useful over those that are wrong. By training on these preferences, the model learns to avoid obvious hallucinations (since those would be rated poorly). OpenAI’s GPT-4, for instance, underwent extensive RLHF which likely included penalizing outputs that contained evident falsehoods or unsubstantiated content. However, RLHF by itself is not a panacea; it tends to fix more style and safety issues than complex factual errors. Still, it contributes: GPT-4 is observably more factually reliable than GPT-3.5, which is partly due to such alignment tuning.
More targeted is supervised fine-tuning on curated data. If you can create a training dataset of question → correct answer (ground-truth) pairs, you can fine-tune the model to reproduce those answers exactly, thus reinforcing factual knowledge. Many open-source efforts have done this on Wikipedia or QA benchmarks. The risk here is that the model might overfit to the training distribution and not generalize, but if the dataset is broad, it can instill a general habit of accuracy. A twist introduced in late 2023 and used in 2024 is fine-tuning on negative examples as well – training the model on prompts with bad answers and explicitly teaching it “this is a hallucination – don’t do this”. For instance, give the model a prompt and a fake answer that a lesser model might produce, and train it to output a correction or a refusal. This is akin to adversarial fine-tuning.
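To make the data side concrete, here is an illustrative sketch of fine-tuning records in the chat-style JSONL format used by several fine-tuning APIs, mixing verified answers with explicit refusals for unverifiable questions; the example facts and filename are invented for illustration only.

```python
import json

# Illustrative fine-tuning records: questions with curator-verified answers get the
# correct response, while questions the curators could not verify get an explicit
# refusal, teaching the model to abstain rather than guess.
examples = [
    {"messages": [
        {"role": "user", "content": "When was the company founded?"},
        {"role": "assistant", "content": "The company was founded in 1998."},
    ]},
    {"messages": [
        {"role": "user", "content": "What was Q3 revenue in the Nordics?"},
        {"role": "assistant",
         "content": "I don't have verified figures for that, so I can't say."},
    ]},
]

with open("antihallucination_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```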
Open-source model developers (like Meta and Mistral) have explicitly focused on this. Mistral Large 2 (123B), released in 2024, involved a fine-tuning stage where the model was trained to be more discerning and to say when it doesn’t know something (Large Enough | Mistral AI). This was done by augmenting the training data with instructions that required the model to either find the correct info or respond with uncertainty. According to Mistral’s report, this substantially minimized the model’s tendency to generate plausible-sounding but factually incorrect content. In effect, they metaphorically adjusted the model’s temperature – making it a bit more conservative in its generations. They also reported improved performance on mathematical and reasoning benchmarks as a result, suggesting that focusing on correctness can also improve logical reasoning (since the model learns not to “wing it” as often, but rather to carefully deduce answers or stop if unsure).
Anthropic’s Constitutional AI (a technique from 2022 that carried into their 2024 Claude models) is another fine-tuning approach that can reduce hallucinations indirectly. By giving the model a set of principles (a “constitution”) to follow, including things like honesty and not misleading the user, the model can be guided to avoid making unsupported claims. For example, one principle might be “If you don’t know the answer for sure, it’s better to say you aren’t certain.” The model is then refined (via self-critique and RL) to adhere to those principles (Introducing the next generation of Claude \ Anthropic). While Constitutional AI is often discussed in the context of avoiding harmful content, many of the principles chosen (like avoiding deception) directly translate to fewer hallucinations.
Researchers have also tried knowledge-aware fine-tuning. This might involve linking the model to a knowledge base during fine-tuning, so it learns to consult it. An example from 2024 is a method where the model was fine-tuned with a retrieval component (the Meta/NYU/UCL work mentioned in CACM) so that it “learns” to use an external knowledge source (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). By baking the retrieval step into training, the model internalizes the ability to call for info rather than guessing. This is a bit like training a student to always check the textbook – after enough training, it becomes second nature for the model to draw facts from the tool. The result was a model better at knowledge-intensive tasks with fewer hallucinations.
Another approach is domain-specific fine-tuning. If you have a model that will be used strictly in, say, legal Q&A, you can fine-tune it on a corpus of legal texts and known Q&A pairs. This specialization often reduces hallucination because the model gains depth in that domain’s knowledge and terminology. For instance, there are legal LLMs in 2024 fine-tuned on case law; they are less likely to invent fake cases because they’ve seen the real cases in training (although, as the Stanford RegLab study showed, even those still hallucinate a lot if queries go beyond their training distribution (Hallucinating Law: Legal Mistakes with Large Language Models are Pervasive | Stanford HAI)). Similarly, in medicine, fine-tuning on medical Q&A or dialogues with doctor feedback helps the model learn what is correct and when it should abstain. Google’s Med-PaLM 2 is an example in healthcare – fine-tuned on medical knowledge and explicitly evaluated to reduce harmful inaccuracies.
A notable caution: fine-tuning must be done carefully, as studies have found it can sometimes increase hallucination if new information is introduced inconsistently. If you fine-tune an LLM on a small set of new facts, the model might paradoxically hallucinate more around those facts – one paper (Gekhman et al. 2024) observed that adding new knowledge via fine-tuning without sufficient context can lead to the model mixing up or over-generalizing that knowledge (Extrinsic Hallucinations in LLMs | Lil'Log). This is because the model might treat the fine-tune data as absolute but not integrate it well with its existing knowledge. The takeaway is that fine-tuning data should be high-quality and comprehensive; otherwise, you might just shift the hallucination problem around.
Looking at industry “official” models: OpenAI hasn’t publicly discussed fine-tuning specifically to reduce hallucinations, but they did launch a fine-tuning API for GPT-3.5/4 in 2024 allowing companies to supply data to adjust the model. Many assume a use-case is to feed company-specific factual Q&A so the model stops guessing about company info. OpenAI likely uses a lot of internal fine-tuning (plus RLHF) in every model iteration to address factuality among other things. Similarly, Meta’s Llama 2 Chat model (mid-2023) was instruction-tuned on high-quality data and had relatively better factual consistency than base models – continuing that trend, any Llama 3 or other future models will leverage even larger curated instruction datasets (including factual question answering) to reduce hallucinations out-of-the-box.
In essence, fine-tuning and instruction tuning allow us to bake anti-hallucination habits into the model parameters. It’s like training a student over time with feedback: the student gradually learns to be more truthful and careful. By 2025, the top models have undergone several rounds of such “education.” This is reflected in metrics – for instance, Anthropic noted Claude 3 nearly doubled accuracy on tricky factual questions versus Claude 2 after fine-tuning and other improvements (Introducing the next generation of Claude \ Anthropic). That is a huge gain in a short period, underscoring how rapidly fine-tuning can push down hallucination rates when guided by the right data and objectives. The challenge, however, is that fine-tuning large models is expensive and not everyone can do it, so it’s often combined with the other methods we’ve discussed (which don’t require changing the model weights).
🗃️ Data Curation and Training Strategies
An often under-appreciated factor in hallucination is the composition of the training data itself. The phrase “garbage in, garbage out” applies: if an LLM is trained on poorly curated text, it will inherit misinformation and then amplify it. Conversely, careful data curation can make a model inherently less prone to hallucinate because it has seen mostly correct and validated information during training.
One straightforward strategy is filtering training data for quality. Many 2024-era models filtered out sources known to be unreliable (e.g. certain forums, synthetic text, etc.) and emphasized high-quality knowledge sources like Wikipedia, academic articles, and verified reference texts. By training more on factual corpora, the model’s internal knowledge base becomes more accurate. It’s not a guarantee – the model can still hallucinate if asked something it never saw – but at least the baseline factual accuracy is higher. For example, if a model’s training set has multiple independent articles stating a particular scientific fact, the model is more likely to confidently recall it correctly, instead of making up a conflicting claim.
Data curation also means deduplication and consistency. If the training data contains the same fact phrased in many ways (and not contradicted elsewhere), the model gets a stronger signal that it is true. But if one source says X and another says Y (for example, outdated info vs. updated info), the model’s representation might average them out or be confused. 2024 research by OpenAI and others put effort into deduplicating training data to avoid such conflicts. Meta’s Llama-2 paper (2023) also mentioned that they filtered and cleaned data. Although details are sparse (as companies often guard their data pipeline secrets), it’s understood that removing obvious falsehoods from the pretraining corpora is a priority. We can infer success in some cases: the fact that newer models know more up-to-date information (like recognizing that Twitter was renamed to X in 2023) suggests their data was refreshed to include those facts, whereas older models hallucinated on such queries, e.g. insisting on the outdated name (Grounding LLMs to In-prompt Instructions: Reducing Hallucinations Caused by Static Pre-training Knowledge).
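A toy sketch of exact/near-exact deduplication by hashing normalized text is shown below; real pipelines use fuzzier matching such as MinHash, but the intent is the same.

```python
import hashlib
import re

def dedupe_corpus(docs: list[str]) -> list[str]:
    """Drop exact and near-exact duplicate documents by hashing normalized text,
    reducing repeated (and potentially conflicting) statements of the same content."""
    seen, kept = set(), []
    for doc in docs:
        normalized = re.sub(r"\W+", " ", doc.lower()).strip()  # case/punct-insensitive
        digest = hashlib.sha1(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```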
A specific example highlighting data issues: a 2024 study from the University of Cambridge (Weller et al., cited in an ACL paper) looked at LLMs being given new information in a prompt that conflicts with their training data. They found models often ignore the prompt and stick to their static pre-training knowledge, causing hallucinations (like continuing to claim Twitter wasn’t renamed even when the prompt says otherwise). To combat this, they experimented with representing knowledge in structured forms (like knowledge graphs) and grounding the model during training on those structures. Essentially, if the model at training time sees that certain knowledge is dynamic or comes from a knowledge graph, it can learn to defer to that rather than its parametric memory. Their results on healthcare and finance questions showed accuracy improvements of up to 24–28% through such prompt-grounding and knowledge-engineering techniques. This suggests that training (or fine-tuning) on representations that clearly distinguish enduring knowledge from prompt-provided facts can help. In practice, that could mean adding special tokens around factual statements in training to mark their source, or training with an auxiliary loss that encourages the model to copy from sources for factual questions.
Another data tactic is balanced representation of edge cases. If we know certain queries cause hallucination, we can include similar examples with correct answers in the fine-tuning data. For instance, models notoriously hallucinate references (fake book or paper citations) because in training data, references often appear as text and the model learns the format but not the authenticity. To fix this, one could fine-tune on a dataset of prompts asking for references where the correct behavior is to either provide a real reference or say “none available.” OpenAI reportedly did something along these lines for ChatGPT (after early users found it inventing academic references, they adjusted the model to refuse or correctly answer reference requests). As of 2025, ChatGPT usually says it cannot produce a reference if one isn’t available – a learned behavior likely from training on examples of that scenario.
Data augmentation is also used: generating additional training examples that target factual consistency. A creative method is to have an LLM generate a bunch of QA pairs from a knowledge source and then train on those. This “self-augmentation” was used in some 2024 works to inject more factual QAs into the fine-tune set, thereby reinforcing the truth. Care needs to be taken to verify those generations though, or else you could accidentally feed the model its own hallucinations in a loop.
Finally, continual training on new data helps reduce hallucinations about recent events. One reason models hallucinate is because their knowledge cutoff is old. If asked about something that happened after training, a model might just guess. By continuously updating the training data with new information (or using RAG as discussed), we mitigate that. OpenAI’s strategy with plugins and browsing is one way (retrieve instead of guess). Others have done periodic model refreshes: e.g. Meta might release updated Llama models with more recent data. The more current the training data, the less the model has to hallucinate to cover events it missed.
It’s also worth mentioning scale vs. data quality: a larger model with more parameters can memorize more data and thus reduce some hallucinations, but simply scaling up without good data only goes so far. In 2023 it was estimated that GPT-3.5 and GPT-4 still had hallucination rates around 20–30% in general knowledge tasks. By late 2024, those numbers had come down thanks to technique improvements, not just scaling. This reinforces that smart data curation and training strategy is as important as model size. As one article put it, no matter how much you scale up, even the biggest LLMs still hallucinate (Solving the Hallucination Problem Once and for All Using Smart ...) – which is why the focus has shifted to techniques like those we’re discussing.
In sum, the training data is the foundation: ensuring it is as factual, consistent, and up-to-date as possible gives the model a strong baseline. Then techniques like RAG and verification build on that foundation. When evaluating any LLM, it’s now common to ask the providers what data and process they used to address hallucination during training. For high-stakes models (medical, legal, etc.), often a specialty corpus is included in training (e.g. medical textbooks, law libraries) so that the model is less likely to “wing it” in those areas and more likely to pull from real knowledge. The combination of these behind-the-scenes efforts has contributed to the measurable drop in hallucination rates in the latest generation of LLMs.
⚠️ High-Stakes Domains: Why Hallucination Is Critical
In casual applications (like chatting about trivia or writing fiction), a hallucination from an LLM is relatively harmless – it might even be amusing. But in high-stakes domains such as healthcare, law, and finance, hallucinations can lead to serious consequences. The past two years have underscored why it is imperative to tackle hallucinations before deploying LLMs in these fields.
Healthcare: A hallucinated detail in a medical context could literally be life-threatening. Imagine an LLM-powered assistant summarizing a patient’s records – if it fabricates an allergy or omits a crucial symptom, it could mislead a physician. Or consider patient-facing applications: if a chatbot gives a patient incorrect medication instructions that sound authoritative, the patient might follow them and suffer harm. A 2024 study defines medical hallucination as any instance where an AI generates misleading medical content, and it emphasizes how these errors can impact clinical decisions and patient safety (Medical Hallucination in Foundation Models and Their Impact on Healthcare | medRxiv). Examples include hallucinating a patient history that wasn’t in the chart (potentially leading a doctor to a wrong diagnosis), or suggesting a non-existent treatment. Medical knowledge is vast and constantly evolving; no model can know it all, which is why hallucinations occur particularly when the model is unsure. The same study found that using chain-of-thought reasoning and search augmentation helped reduce medical hallucinations, but even then non-trivial levels of hallucination remained. This underscores that in medicine, even a small percentage of hallucinations is a serious risk. There’s an ethical imperative: if an AI is advising on healthcare, it must either be correct or clearly state uncertainty. Hallucinations break that rule by presenting falsehoods confidently. Regulatory bodies are starting to look at this – guidelines for medical AI may soon require a demonstration of extremely low hallucination rates and the presence of verification steps. Until AI can reliably avoid hallucinating, fully autonomous medical AI is off the table. Instead, current efforts like Google’s Med-PaLM and others focus on AI as an assistant that always provides supporting evidence and never overrules human judgment.
Legal: The legal domain got a dramatic example in mid-2023, when a lawyer submitted a brief written by ChatGPT that cited completely fictitious cases (Hallucinating Law: Legal Mistakes with Large Language Models are Pervasive | Stanford HAI). The incident (the “ChatGPT lawyer”) highlighted how an LLM’s hallucination – in this case inventing precedent – can wreak havoc in the legal process. In law, accuracy of citations and interpretation of facts is paramount; a hallucination can mean a mistrial, sanctions, or bad legal advice leading someone to break the law unknowingly. A comprehensive study in early 2024 by Stanford’s RegLab systematically tested GPT-3.5, PaLM 2, and Llama 2 on legal tasks and found disturbingly pervasive errors. The hallucination rates were shockingly high: 69% to 88% on specific, verifiable legal queries across the models tested. In other words, these LLMs most often could not be trusted with straightforward legal Q&A – they would fabricate statutes, misquote holdings, or assert false legal rules more often than not. They also noted the models lacked self-awareness of these mistakes and tended to state them confidently. This level of unreliability in a domain where outcomes affect people’s rights, freedom, or finances is unacceptable. The study further found that hallucination rates were worse for complex legal reasoning (like analyzing relationships between cases) and for less well-covered jurisdictions. Essentially, whenever the legal question went beyond surface-level facts (which the model might have memorized) into deeper reasoning, the models fell apart and guessed. Lower courts and niche jurisdictions had poorer accuracy, likely because the model saw less of that material in training. This suggests current LLMs don’t truly understand law; they just regurgitate patterns, which is dangerous in law. As a result, there is heavy caution in using LLMs in legal practice. Many firms use them only for first drafts with intense human review afterward, and there’s interest in smaller models fine-tuned on vetted legal databases to avoid the general models’ pitfalls (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). Judges and regulators, as noted by the Chief Justice, are very wary of AI “hallucinations” polluting legal filings. We can expect standards to emerge (perhaps via bar associations or AI oversight bodies) for validating any AI-generated legal document to ensure no hallucinated citations or arguments are included.
Finance: In finance and banking, factual errors can lead to monetary loss or regulatory violations. If an AI assistant to an investor hallucinated a false statistic (like a company’s earnings or a market trend), it could prompt a bad investment decision. Or an AI used in a bank’s customer service might hallucinate policy details, putting the bank out of compliance if the info is wrong. Researchers have noted that LLMs show deficiencies in finance, with empirical examinations finding frequent hallucinations in the financial domain (Medical Hallucination in Foundation Models and Their Impact on Healthcare | medRxiv). Financial language often requires precision and up-to-date data (stock prices, economic indicators), which models might not have, leading them to fabricate plausible-sounding numbers or fall back on outdated training data. A specific risk is that an AI might hallucinate the contents of a law or regulation when answering a compliance question – giving a false sense of security to a user. Or it might miscalculate something (e.g. a tax or interest formula) and not flag the error. Financial institutions are thus very cautious: any AI outputs typically go through validation. Some are exploring a retrieval-based approach here too: for example, connecting the LLM to a database of financial regulations and company financials, so that every answer is grounded. Hallucinations in finance could also move markets if taken seriously (imagine AI-generated news that’s false but causes trading activity). Therefore, truthfulness is not just nice-to-have but critical for AI in finance.
Other high-stakes areas include education (where a student using an AI tutor could be mis-taught a subject if the AI hallucinates an explanation – correct information is critical for learning) and scientific research (where an AI assistant might hallucinate experimental results or citations, potentially derailing research until the error is caught). A Nature article in 2024 pointed out that hallucinations in AI have the potential to degrade science by flooding scientific discourse with plausible-sounding falsehoods (LLM Hallucinations: A Bug or A Feature? – Communications of the ACM). It argued that researchers should treat LLM outputs as draft translations of knowledge rather than sources of truth – a perspective that, in science, one should always verify AI outputs against original data or literature.
In summary, the higher the stakes, the lower the tolerance for hallucination. The recent developments we’ve discussed (prompting, RAG, verification, etc.) are often driven by the needs of these domains. It’s no coincidence that we see benchmarks like the FACTS Grounding leaderboard and hallucination leaderboards (Google Research 2024: Breakthroughs for impact at every scale) – these are tools to measure progress in factual accuracy for precisely such sensitive applications. The good news is that by late 2024, models like Gemini 2 and Claude 3 score much better on factuality benchmarks than their predecessors, showing that the concerted efforts are paying off. But the work is ongoing: every reduction from, say, a 5% hallucination rate to a 1% rate in a medical assistant could save lives or prevent serious errors. The goal for high-stakes deployment is not just minimizing hallucinations, but also having robust systems to catch any that slip through. Until that is achieved, human oversight remains a requirement.