Distinguishing Ignorance from Error in LLM Hallucinations
Is your LLM lying because it's clueless or just being silly? WACK knows!
WACK helps distinguish LLM hallucinations caused by ignorance from those caused by computational errors
This method separates knowledge-based and processing-based hallucinations in LLMs
🤔 Original Problem:
LLMs hallucinate in two distinct ways: when they lack the relevant knowledge (HK-) and when they hold the knowledge but still produce a wrong answer (HK+). Current research does not distinguish between these types, which leads to ineffective mitigation strategies.
🔧 Solution in this Paper:
→ Introduces WACK (Wrong Answers despite having Correct Knowledge) to build model-specific datasets
→ Uses sampling at multiple temperatures to test whether the model knows the correct answer
→ Employs a bad-shots technique: adds incorrect QA pairs to the context to trigger HK+ hallucinations
→ Implements an Alice-Bob setup: uses persuasion and weak semantics as an alternative trigger
→ Trains probes on the model's inner states to detect hallucination types (see the probing sketch after this list)
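The probing step can be approximated with standard tooling. Below is a minimal sketch, assuming a HuggingFace causal LM and an already-built WACK-style dataset of (prompt, label) pairs; the model name, probed layer, tiny placeholder dataset, and integer label scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal probing sketch: classify hallucination type from a hidden state.
# Assumptions (not from the paper): model choice, probed layer, the tiny
# placeholder dataset, and the integer labels 0=correct, 1=HK-, 2=HK+.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-2-7b-hf"   # any causal LM with accessible hidden states
LAYER = 16                           # a middle layer; a common probing choice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probed layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]      # shape: (hidden_dim,)

# Placeholder examples; in practice these come from the WACK pipeline.
wack_examples = [
    ("Q: What is the capital of France? A:", 0),
    ("Q: Who wrote 'War and Peace'? A:", 2),
    ("Q: What is the tallest mountain on Venus? A:", 1),
]

X = torch.stack([last_token_state(p) for p, _ in wack_examples]).float().numpy()
y = [label for _, label in wack_examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)   # 3-way linear probe
print("training accuracy (use a held-out split in practice):", probe.score(X, y))
```

In the paper's setting, such probes are trained and evaluated on the model-specific WACK dataset with a proper train/test split, which is what yields the 3-way classification accuracy reported in the results below.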
💡 Key Insights:
→ HK- needs external knowledge sources, while HK+ can be fixed through internal computation
→ Models share 60-65% of their knowledge but differ in hallucination patterns
→ Inner states contain distinct representations for different hallucination types
→ Model-specific datasets outperform generic ones in hallucination detection
📊 Results:
→ Achieves 60-70% accuracy in 3-way classification of hallucination types
→ Shows 70%+ accuracy in binary classification between any two types
→ Demonstrates successful preemptive detection on the TriviaQA dataset
→ Generic datasets perform at random level in preemptive detection
The paper focuses on whether the model
(1) does not hold the correct answer in its parameters (HK-), or
(2) answers incorrectly despite having the required knowledge (HK+).
The paper argues that distinguishing these cases is crucial for detecting and mitigating hallucinations.
Specifically, case (2) may be mitigated by intervening in the model's internal computation, since the knowledge resides within the model's parameters.
In contrast, case (1) offers no parametric knowledge to leverage, so it should be addressed by resorting to an external knowledge source or by abstaining, as sketched below.
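To make the operational difference concrete, here is a short, purely illustrative sketch of how a detected hallucination type could route to different mitigations; every helper function is a hypothetical stand-in, and the paper itself focuses on detection rather than prescribing this API.

```python
# Illustrative routing only: once a hallucination type is predicted, the two
# cases call for different remedies. All helpers are hypothetical stand-ins.
CORRECT, HK_MINUS, HK_PLUS = 0, 1, 2

def retrieve_evidence(prompt: str):                        # stand-in for a retrieval system
    return None

def answer_with_context(prompt: str, ctx: str) -> str:     # stand-in for retrieval-augmented answering
    return f"(answer grounded in: {ctx})"

def steer_internal_computation(prompt: str) -> str:        # stand-in for an internal intervention
    return "(answer after intervening on the model's computation)"

def generate(prompt: str) -> str:                          # stand-in for ordinary decoding
    return "(normal answer)"

def handle(prompt: str, predicted_type: int) -> str:
    if predicted_type == HK_MINUS:
        # Case (1): the parameters lack the fact; fetch external knowledge or abstain.
        evidence = retrieve_evidence(prompt)
        return answer_with_context(prompt, evidence) if evidence else "I don't know."
    if predicted_type == HK_PLUS:
        # Case (2): the knowledge is in the weights; intervene in the internal computation.
        return steer_internal_computation(prompt)
    return generate(prompt)

print(handle("Q: Who painted the Mona Lisa? A:", HK_PLUS))
```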
🛠️ The WACK methodology workflow:
→ Tests the model's knowledge through multiple generations at different temperatures (see the labeling sketch after this list)
→ Labels examples as HK- if the model never generates the correct answer
→ Uses bad-shots and Alice-Bob prompting to induce HK+ hallucinations
→ The bad-shots method places incorrect QA pairs in the context
→ The Alice-Bob setup uses persuasion and weak semantics to trigger hallucinations
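A minimal sketch of this labeling loop, assuming a HuggingFace causal LM; the temperatures, sample counts, substring-match check, and the specific bad-shot pairs are illustrative assumptions, and only the bad-shots trigger (not the Alice-Bob prompt) is shown.

```python
# WACK-style labeling sketch: first test whether the model knows the answer by
# sampling at several temperatures, then try to induce an HK+ error by
# prepending "bad-shots" (deliberately incorrect QA pairs) to the prompt.
# Model, temperatures, sample counts, and the match heuristic are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

BAD_SHOTS = (
    "Q: What is the capital of France? A: Berlin\n"
    "Q: Who wrote Hamlet? A: Charles Dickens\n"
)  # deliberately wrong QA pairs used only to provoke HK+ hallucinations

def sample_answer(prompt: str, temperature: float) -> str:
    ids = tok(prompt, return_tensors="pt")
    gen_kwargs = dict(max_new_tokens=16, pad_token_id=tok.eos_token_id)
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature)  # sampled
    with torch.no_grad():
        out = model.generate(**ids, **gen_kwargs)                   # greedy if temp == 0
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

def label_example(question: str, gold: str) -> str:
    prompt = f"Q: {question} A:"
    # Knowledge test: one greedy answer plus several sampled ones.
    answers = [sample_answer(prompt, 0.0)]
    answers += [sample_answer(prompt, t) for t in (0.5, 1.0) for _ in range(3)]
    if not any(gold.lower() in a.lower() for a in answers):
        return "HK-"   # the model never produces the correct answer
    # The model knows the answer; check whether bad-shots flip it.
    perturbed = sample_answer(BAD_SHOTS + prompt, 0.0)
    return "HK+" if gold.lower() not in perturbed.lower() else "factually-correct"

print(label_example("What is the capital of Italy?", "Rome"))
```

Examples labeled this way (together with the Alice-Bob variant) form the model-specific dataset on which the probes from the earlier sketch are trained.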