Real-world context is messier than our test datasets, and DRUID proves it
This paper introduces DRUID, a dataset that tests how well LLMs actually use real-world context in retrieval tasks, revealing that synthetic test data often misrepresents real scenarios.
-----
https://arxiv.org/abs/2412.17031v1
Original Problem 🤔:
Current studies of how LLMs use retrieved context rely on artificial datasets that don't reflect real-world complexity. This limits our understanding of true RAG performance.
-----
Solution in this Paper 🔍:
→ Created DRUID dataset with real-world claims and evidence from fact-checking scenarios
→ Developed a new metric called ACU (Accumulated Context Usage) to measure how well LLMs utilize context (a hedged sketch of one way such a score could be computed follows this list)
→ Compared DRUID against synthetic datasets like CounterFact and ConflictQA
→ Analyzed multiple context characteristics including relevance, stance, readability, and source reliability
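As a rough illustration only, and not the paper's exact ACU formulation, context usage can be sketched as the shift in log-probability that prepended evidence induces toward the context-supported answer. The model choice, prompt templates, and function names below are assumptions for the sketch:

```python
# Hedged sketch: score how much prepended evidence shifts the model toward
# the context-supported answer. This is NOT the paper's ACU definition,
# just an illustrative proxy; model and prompts are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"  # any causal LM works; Pythia is openly available
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probs the model assigns to the answer tokens after the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i-1.
    return sum(
        logprobs[0, i - 1, full_ids[0, i]].item()
        for i in range(prompt_len, full_ids.shape[1])
    )

def context_usage(claim: str, evidence: str, context_answer: str) -> float:
    """Positive: evidence pulls the model toward the context-supported verdict;
    near zero: the retrieved context is effectively ignored."""
    closed_book = f"Claim: {claim}\nVerdict: "
    open_book = f"Evidence: {evidence}\nClaim: {claim}\nVerdict: "
    return answer_logprob(open_book, context_answer) - answer_logprob(closed_book, context_answer)
```

Aggregating this shift over a dataset (e.g., its mean) would then summarize how much a given model leans on retrieved evidence overall.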
-----
Key Insights from this Paper 💡:
→ Synthetic datasets exaggerate context characteristics rare in real retrieved data
→ Real-world contexts are more complex, longer, and have more uncertainty markers
→ Memory conflicts occur less frequently in real scenarios (58%) than in synthetic ones (97%); see the conflict-detection sketch after this list
→ Source characteristics matter more than individual context properties
→ LLMs show different behavior patterns with real vs synthetic contexts
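To make the memory-conflict idea concrete, here is a hedged sketch of one way such conflicts could be flagged: compare the model's closed-book verdict on a claim with the stance of the retrieved evidence, and count a conflict when they disagree. The models, prompt, and parsing below are assumptions, not the paper's protocol:

```python
# Hedged sketch (not the paper's protocol): flag a context-memory conflict
# when the stance of retrieved evidence disagrees with the model's
# closed-book verdict on the claim. Models and prompts are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/pythia-1.4b")
nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def closed_book_verdict(claim: str) -> str:
    """Ask the LM about the claim with no evidence; crude true/false parsing."""
    prompt = f"Claim: {claim}\nIs this claim true or false? Answer:"
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    return "supports" if "true" in out.lower().split("answer:")[-1] else "refutes"

def evidence_stance(claim: str, evidence: str) -> str:
    """Use an off-the-shelf NLI model to get the evidence's stance on the claim."""
    res = nli({"text": evidence, "text_pair": claim})
    if isinstance(res, list):  # return type varies across transformers versions
        res = res[0]
    label = res["label"].lower()
    return {"entailment": "supports", "contradiction": "refutes"}.get(label, "neutral")

def has_memory_conflict(claim: str, evidence: str) -> bool:
    stance = evidence_stance(claim, evidence)
    return stance != "neutral" and stance != closed_book_verdict(claim)

# Example usage with a made-up claim/evidence pair:
print(has_memory_conflict(
    "The Great Wall of China is visible from the Moon.",
    "Astronauts report that the Great Wall is not visible from the Moon with the naked eye.",
))
```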
-----
Results 📊:
→ Synthetic datasets show inflated context utilization scores
→ Real-world context usage patterns differ significantly from synthetic test results
→ Llama shows better context usage than Pythia across all datasets
→ Context-repulsion, where providing context pushes the model away from the context-supported answer, is rarer in realistic scenarios
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/