Want to know if an AI is lying? LLM-OASIS provides 81k training examples for evaluating the factual accuracy of AI outputs.
LLM-OASIS introduces the largest dataset for training factuality evaluators, created by extracting and falsifying information from Wikipedia articles. This enables end-to-end verification of AI-generated text accuracy.
-----
https://arxiv.org/abs/2411.19655
🤔 Original Problem:
LLMs still hallucinate facts in their outputs. Existing factuality evaluation resources are task-specific, small in scale, or limited to simple claim verification.
-----
🔧 Solution in this Paper:
→ LLM-OASIS extracts claims from Wikipedia passages using an LLM-based pipeline (see the sketch after this list).
→ The system falsifies selected claims by introducing subtle but critical factual errors.
→ It generates pairs of factual and unfactual texts based on the original and modified claims.
→ The dataset covers 81k Wikipedia pages with 681k claims for training factuality evaluators.
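A minimal Python sketch of what this three-stage pipeline could look like. The `call_llm` hook, the prompt wording, and the choice to falsify the first extracted claim are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of an LLM-OASIS-style data generation pipeline (illustrative only).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OasisExample:
    passage: str          # source Wikipedia passage
    claims: List[str]     # claims extracted from the passage
    falsified_claim: str  # one claim rewritten with a subtle factual error
    factual_text: str     # text faithful to the original claims
    unfactual_text: str   # text embedding the falsified claim


def build_example(passage: str, call_llm: Callable[[str], str]) -> OasisExample:
    """Run the three pipeline stages for a single passage."""
    # 1) Claim extraction: ask the LLM to list atomic factual claims.
    claims_raw = call_llm(
        "List the atomic factual claims in this passage, one per line:\n" + passage
    )
    claims = [c.strip("- ").strip() for c in claims_raw.splitlines() if c.strip()]
    if not claims:
        raise ValueError("no claims extracted from passage")

    # 2) Claim falsification: introduce a subtle but critical factual error
    #    into one selected claim (here, simply the first one).
    falsified_claim = call_llm(
        "Rewrite this claim so it contains a subtle but critical factual error, "
        "keeping the wording as close to the original as possible:\n" + claims[0]
    )

    # 3) Pair generation: a factual text from the original claims, and an
    #    unfactual text with the falsified claim substituted in.
    factual_text = call_llm(
        "Write a short paragraph that states only these claims:\n" + "\n".join(claims)
    )
    unfactual_text = call_llm(
        "Write a short paragraph that states only these claims:\n"
        + "\n".join([falsified_claim] + claims[1:])
    )
    return OasisExample(passage, claims, falsified_claim, factual_text, unfactual_text)
```

Plugging in any chat-completion function for `call_llm` and iterating over Wikipedia passages yields the factual/unfactual pairs used as training and evaluation data.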
-----
💡 Key Insights:
→ Task-agnostic factuality evaluation is possible with a large-scale synthetic dataset
→ Wikipedia provides reliable source material for generating factual/unfactual pairs
→ Human validation confirms high quality of automated data generation (90%+ accuracy)
-----
📊 Results:
→ GPT-4 achieves 60% accuracy on end-to-end factuality evaluation (task sketched below)
→ Accuracy rises to 68% with Retrieval-Augmented Generation
→ Human validation shows 96.78% accuracy for claim extraction
→ Dataset creation pipeline maintains 89-98% accuracy across all steps
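A hedged sketch of the end-to-end evaluation task these numbers refer to: the judge model sees a candidate text (optionally with retrieved evidence, for the RAG variant) and must label it factual or unfactual. The prompt wording and the `call_llm` / `retrieve` hooks are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative end-to-end factuality evaluation over factual/unfactual pairs.
from typing import Callable, List, Optional


def judge_factuality(
    text: str,
    call_llm: Callable[[str], str],
    retrieve: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Return True if the evaluator judges `text` to be factual."""
    evidence = ""
    if retrieve is not None:  # RAG variant: prepend retrieved passages
        evidence = "Evidence:\n" + "\n".join(retrieve(text)) + "\n\n"
    answer = call_llm(
        evidence
        + "Is the following text factually accurate? Answer 'factual' or 'unfactual'.\n"
        + text
    )
    return answer.strip().lower().startswith("factual")


def accuracy(pairs, call_llm, retrieve=None) -> float:
    """Score an evaluator on (factual_text, unfactual_text) pairs."""
    correct = 0
    for factual_text, unfactual_text in pairs:
        correct += judge_factuality(factual_text, call_llm, retrieve)
        correct += not judge_factuality(unfactual_text, call_llm, retrieve)
    return correct / (2 * len(pairs))
```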