This benchmark forces LLMs to prove they can be trusted with long-document comprehension.
FACTS Grounding evaluates LLMs' ability to generate factually accurate responses to documents of up to 32,000 tokens while staying fully grounded in the provided context.
-----
https://arxiv.org/abs/2501.03200
🤔 Original Problem:
LLMs often struggle with factual accuracy when generating responses from long documents. Existing benchmarks focus on narrow use cases like summarization and lack comprehensive evaluation of factual grounding across diverse scenarios.
-----
💡 Solution in this Paper:
→ The benchmark tests LLMs through a two-phase evaluation system using automated judge models (see the sketch after this list)
→ Phase 1 disqualifies responses that fail to fulfill user requests
→ Phase 2 judges responses for factual accuracy based on strict grounding in provided documents
→ Multiple judge models (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet) evaluate responses to reduce bias
→ The benchmark includes 860 public and 859 private examples with documents averaging 2,500 tokens
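To make the two-phase flow concrete, here is a minimal Python sketch. It assumes hypothetical judge callables that wrap Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet behind your own API client; the function names, prompts, and majority-vote aggregation are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the two-phase judging flow (assumptions noted above).
from typing import Callable

# A judge receives (user_request, document, response) and returns True/False.
JudgeFn = Callable[[str, str, str], bool]


def evaluate_response(
    user_request: str,
    document: str,
    response: str,
    eligibility_judges: list[JudgeFn],
    grounding_judges: list[JudgeFn],
) -> bool:
    """Return True only if the response is eligible AND factually grounded."""
    # Phase 1: disqualify responses that fail to fulfill the user request.
    eligible_votes = [j(user_request, document, response) for j in eligibility_judges]
    if sum(eligible_votes) <= len(eligible_votes) // 2:  # assumed majority rule
        return False  # ineligible responses count as not factual

    # Phase 2: require every claim to be grounded in the provided document.
    grounded_votes = [j(user_request, document, response) for j in grounding_judges]
    return sum(grounded_votes) > len(grounded_votes) // 2  # assumed majority rule


# Example with trivial stub judges (swap in real LLM-judge calls):
if __name__ == "__main__":
    always_yes: JudgeFn = lambda req, doc, resp: True
    print(evaluate_response("Summarize the report.", "<doc text>", "<model answer>",
                            [always_yes] * 3, [always_yes] * 3))
```

Replace the stub judges with real LLM-judge calls to reproduce the flow end to end.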
-----
🔍 Key Insights:
→ Judge models tend to rate their own model's outputs higher (+3.23% bias)
→ Disqualifying ineligible responses reduces factuality scores by 1-5% (see the scoring sketch after this list)
→ Long-form response evaluation requires thorough inspection of each claim
→ Data contamination is addressed through novel user requests and system instructions
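A rough sketch of how such a leaderboard-style factuality score could be aggregated, assuming per-judge boolean grounding verdicts and a separate Phase-1 eligibility flag: counting disqualified responses as not grounded is what produces the 1-5% drop, and averaging over several judge models is the mitigation for self-rating bias. The data layout and helper below are hypothetical.

```python
def factuality_score(
    eligible: list[bool],                       # per-example Phase 1 outcome
    grounded_by_judge: dict[str, list[bool]],   # judge name -> per-example Phase 2 votes
) -> float:
    """Average per-judge accuracy, with ineligible responses scored as 0."""
    per_judge_scores = []
    for votes in grounded_by_judge.values():
        hits = sum(1 for ok, grounded in zip(eligible, votes) if ok and grounded)
        per_judge_scores.append(hits / len(votes))
    return sum(per_judge_scores) / len(per_judge_scores)


# Example: 4 responses, one disqualified in Phase 1.
score = factuality_score(
    eligible=[True, True, False, True],
    grounded_by_judge={
        "gemini-1.5-pro": [True, True, True, False],
        "gpt-4o":         [True, False, True, True],
        "claude-3.5":     [True, True, True, True],
    },
)
print(f"{score:.3f}")  # the ineligible example counts against every judge's score
```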
-----
📊 Results:
→ Gemini 2.0 Flash Experimental achieved 83.6% factuality score
→ Gemini 1.5 Flash ranked second at 82.9%
→ OpenAI models showed lower performance, at around 62%
------
Are you into AI and LLMs❓ Join me and 52K+ others on X/Twitter to stay on the bleeding edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai