Can Knowledge Editing Really Correct Hallucinations?
Popular LLM editing methods often fail to fix wrong answers. This paper builds a benchmark to measure how far they fall short.

A new dataset of 6,000+ verified hallucinations reveals large gaps in LLM knowledge correction and shows that editing methods fall short across multiple dimensions.
Original Problem 🔍:
Knowledge editing methods for LLMs lack proper evaluation on real hallucinations. Current datasets don't verify if models actually generate incorrect answers before editing, making it difficult to assess effectiveness in fixing real-world hallucinations.
In other words, current benchmarks try to "fix" LLM answers without first checking that those answers were wrong.
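The verification step described above can be sketched as a simple filter: ask the unedited model each question and keep only the cases it answers incorrectly. This is a minimal illustration, not the paper's pipeline. The names `keep_verified_hallucinations` and `ask_model`, and the use of normalized exact match as the correctness check, are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def keep_verified_hallucinations(
    qa_pairs: List[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (question, ground_truth, model_answer) for questions the
    pre-edit model gets wrong.

    `ask_model` is a hypothetical callable wrapping the unedited LLM.
    Exact match after normalization is an assumed correctness check.
    """
    verified = []
    for question, truth in qa_pairs:
        answer = ask_model(question)
        if normalize(answer) != normalize(truth):
            # The model really does hallucinate here, so the case is worth editing.
            verified.append((question, truth, answer))
    return verified
```

Only the items this filter returns would be passed to an editing method, so any post-edit accuracy gain reflects a corrected error rather than an answer the model already knew.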
Solution in this Paper 🛠️:
• Created HalluEditBench - a benchmark with 9 domains, 26 topics, 6,000+ verified hallucinations
• Tests 7 editing methods (FT-L, FT-M, MEMIT, ROME, LoRA, ICE, GRACE) on 3 LLMs
• Evaluates across 5 dimensions:
  • Efficacy: Tests correction accuracy
  • Generalization: Checks edit persistence across question types
  • Portability: Measures downstream reasoning effects
  • Locality: Assesses impact on unrelated knowledge
  • Robustness: Tests resistance to manipulations
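As a rough sketch of how the five dimensions could be turned into scores, the code below treats four of them as accuracy over a set of probe questions and treats Locality as agreement between pre-edit and post-edit answers on unrelated questions. The probe structure, the function names, and exact-match scoring are all assumptions for illustration. The benchmark's own prompts and metrics may differ.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # hypothetical wrapper: question -> answer
Probe = Tuple[str, str]        # (question, expected answer)


def _same(a: str, b: str) -> bool:
    """Case-insensitive, whitespace-trimmed comparison (an assumed match rule)."""
    return a.strip().lower() == b.strip().lower()


def accuracy(model: Model, probes: List[Probe]) -> float:
    """Fraction of probes the model answers with the expected answer."""
    if not probes:
        return 0.0
    return sum(_same(model(q), a) for q, a in probes) / len(probes)


def locality(pre: Model, post: Model, questions: List[str]) -> float:
    """Fraction of unrelated questions whose answer is unchanged by the edit."""
    if not questions:
        return 0.0
    return sum(_same(pre(q), post(q)) for q in questions) / len(questions)


def score_edit(
    pre: Model,
    post: Model,
    probes: Dict[str, List[Probe]],
    unrelated: List[str],
) -> Dict[str, float]:
    """Score one edit on the five dimensions.

    `probes` is assumed to hold, per dimension: the original question
    (efficacy), rephrased or reformatted questions (generalization),
    multi-step questions that depend on the edited fact (portability),
    and prompts that push back on the corrected answer (robustness).
    """
    return {
        "efficacy": accuracy(post, probes["efficacy"]),
        "generalization": accuracy(post, probes["generalization"]),
        "portability": accuracy(post, probes["portability"]),
        "locality": locality(pre, post, unrelated),
        "robustness": accuracy(post, probes["robustness"]),
    }
```

Keeping the dimensions as separate scores, rather than averaging them, makes trade-offs visible: a method can score well on efficacy while scoring poorly on robustness.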
Key Insights from this Paper 💡:
• Current assessment methods are unreliable - high performance on existing datasets doesn't reflect real hallucination correction
• No single editing method excels across all dimensions
• Performance heavily depends on domain and specific LLM
• Parameter-preserving methods (ICE, GRACE) show better efficacy but poor robustness
Results 📊:
• ICE and GRACE outperform others in Efficacy
• Only ICE improves Generalization performance
• Most methods underperform on Portability compared to pre-edit state
• FT-M and ICE lead in Locality (80% score on Mistral-v0.3-7B)
• ICE shows poor Robustness against manipulations
Knowledge editing techniques
No single editing method excels across all dimensions:
• ICE and GRACE perform best on Efficacy
• Only ICE improves Generalization
• Most methods underperform on Portability
• FT-M and ICE lead on Locality
• ICE shows poor Robustness