HALOGEN is a comprehensive hallucination benchmark with automated verifiers: it decomposes LLM outputs into atomic facts and checks each one against a trusted knowledge source to detect and classify hallucinations across diverse tasks.
https://arxiv.org/abs/2501.08292
Methods in this Paper 🔧:
→ HALOGEN tests LLMs on 9 different domains like coding, summarization, and scientific citations.
→ For each domain, it breaks down model outputs into atomic units (like package names or factual statements).
→ These units are then automatically verified against trusted knowledge sources (a minimal sketch of this step follows the list below).
→ Hallucinations are classified as Type A (incorrect recollection of correct training data), Type B (incorrect knowledge in the training data itself), or Type C (outright fabrication).
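To make the decompose-and-verify step concrete, here is a minimal sketch for the coding domain: extract the packages a generated snippet imports (the atomic units), check each against a trusted index, and score the response by the fraction that fail verification. The package set, function names, and regex below are illustrative assumptions, not the paper's actual verifier implementation.

```python
import re

# Hypothetical trusted knowledge source: a small set of known package names
# standing in for a real index such as PyPI (illustrative only).
KNOWN_PACKAGES = {"numpy", "pandas", "requests", "scipy"}

def extract_atomic_units(code: str) -> list[str]:
    """Decompose a generated code snippet into atomic units: the packages it imports."""
    pattern = r"^\s*(?:import|from)\s+([A-Za-z_]\w*)"
    return sorted({m.group(1) for m in re.finditer(pattern, code, re.MULTILINE)})

def verify_units(units: list[str]) -> dict[str, bool]:
    """Check each atomic unit against the trusted source; True means supported."""
    return {u: (u in KNOWN_PACKAGES) for u in units}

def hallucination_score(verdicts: dict[str, bool]) -> float:
    """Fraction of atomic units that fail verification."""
    if not verdicts:
        return 0.0
    return sum(not ok for ok in verdicts.values()) / len(verdicts)

if __name__ == "__main__":
    response = "import numpy\nimport fictional_pkg\nfrom pandas import DataFrame"
    verdicts = verify_units(extract_atomic_units(response))
    print(verdicts)                       # {'fictional_pkg': False, 'numpy': True, 'pandas': True}
    print(hallucination_score(verdicts))  # ~0.33: one of three atomic units is unsupported
```

Each of the nine domains swaps in its own extractor and knowledge source (e.g., citation databases for scientific attribution); classifying a failure as Type A, B, or C additionally requires inspecting the training data, which this sketch does not cover.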
Key Insights from this Paper:
→ Even the best LLMs hallucinate 4-86% of generated atomic facts, depending on the domain
→ No single domain predicts hallucination behavior in other domains
→ Larger models generally hallucinate less on the prompts they do respond to
→ Open-source models lag behind closed models in factual accuracy
Results 📊:
→ GPT-4 hallucination rates: 4% (coding) to 86% (scientific citations)
→ 91% verifier accuracy for summarization
→ 92% verifier accuracy for text simplification
→ 83% verifier accuracy for historical events