"Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis"

The podcast on this paper is generated with Google's Illuminate.

Want to know when an AI is lying? LLM-Oasis helps evaluate the factual accuracy of AI outputs, providing 81k training examples.

LLM-Oasis introduces the largest dataset for training factuality evaluators, created by extracting claims from Wikipedia articles and falsifying a subset of them. This enables end-to-end evaluation of the factual accuracy of AI-generated text.

-----

https://arxiv.org/abs/2411.19655

🤔 Original Problem:

LLMs still produce hallucinations in their outputs, and existing factuality evaluation resources are limited: they are task-specific, small in size, or focused only on simple claim verification.

-----

🔧 Solution in this Paper:

→ LLM-Oasis extracts claims from Wikipedia passages using an LLM-based pipeline (sketched after this list).

→ The system falsifies selected claims by introducing subtle but critical factual errors.

→ It generates pairs of factual and unfactual texts based on the original and modified claims.

→ The dataset covers 81k Wikipedia pages with 681k claims for training factuality evaluators.
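
A minimal sketch of what this three-stage pipeline could look like in code. The `llm` helper, the prompts, and the function names are illustrative placeholders, not the paper's actual prompts or models.

```python
from dataclasses import dataclass


def llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM drives the pipeline."""
    raise NotImplementedError


@dataclass
class Pair:
    factual_text: str
    unfactual_text: str


def extract_claims(passage: str) -> list[str]:
    # Stage 1: decompose a Wikipedia passage into atomic claims.
    out = llm(f"List the atomic factual claims stated in:\n{passage}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]


def falsify_claim(claim: str) -> str:
    # Stage 2: introduce a subtle but critical factual error into one claim.
    return llm(f"Rewrite this claim so it contains one subtle factual error:\n{claim}")


def build_pair(passage: str) -> Pair:
    # Stage 3: write fluent texts grounded in the original vs. modified claims.
    claims = extract_claims(passage)
    falsified = claims.copy()
    falsified[0] = falsify_claim(claims[0])  # falsify one selected claim
    factual = llm("Write a short passage stating only these claims:\n" + "\n".join(claims))
    unfactual = llm("Write a short passage stating only these claims:\n" + "\n".join(falsified))
    return Pair(factual, unfactual)
```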

-----

💡 Key Insights:

→ Task-agnostic factuality evaluation is possible with a large-scale synthetic dataset

→ Wikipedia provides reliable source material for generating factual/unfactual pairs

→ Human validation confirms high quality of automated data generation (90%+ accuracy)

-----

📊 Results:

→ GPT-4 achieves 60% accuracy on end-to-end factuality evaluation

→ Accuracy rises to 68% when the evaluator is augmented with Retrieval Augmented Generation (see the sketch after this list)

→ Human validation shows 96.78% accuracy for claim extraction

→ Dataset creation pipeline maintains 89-98% accuracy across all steps
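
A rough sketch of how an end-to-end, retrieval-augmented factuality evaluator might be scored on such factual/unfactual pairs. The `retrieve` and `llm` helpers and the prompt wording are assumptions for illustration, not the paper's actual setup.

```python
def retrieve(text: str, k: int = 5) -> list[str]:
    """Hypothetical retriever returning k Wikipedia passages relevant to the text."""
    raise NotImplementedError


def llm(prompt: str) -> str:
    """Hypothetical LLM call (e.g., a chat-completion API)."""
    raise NotImplementedError


def is_factual(text: str) -> bool:
    # Ground the judgment in retrieved evidence, then ask for a binary verdict.
    evidence = "\n".join(retrieve(text))
    verdict = llm(
        "Using only the evidence below, answer FACTUAL or UNFACTUAL "
        f"for the candidate text.\n\nEvidence:\n{evidence}\n\nText:\n{text}"
    )
    return verdict.strip().upper().startswith("FACTUAL")


def accuracy(pairs: list[tuple[str, str]]) -> float:
    # Each pair is (factual_text, unfactual_text); the evaluator should
    # accept the first and reject the second.
    correct = sum(is_factual(f) + (not is_factual(u)) for f, u in pairs)
    return correct / (2 * len(pairs))
```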
