Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
SummHay, proposed in this paper, reveals how LLMs and RAG systems struggle with large-scale document comprehension.
Shows that current models miss roughly 50% of key insights when processing large document collections
Original Problem 🎯:
Current evaluation methods for long-context LLMs and Retrieval-Augmented Generation (RAG) systems lack complexity and depth; simple tasks like needle-in-a-haystack retrieval no longer effectively differentiate model capabilities.
Solution in this Paper 🔧:
• Introduces "Summary of a Haystack" (SummHay) task requiring systems to process ~100k tokens
• Synthesizes document collections with controlled information distribution
• Implements an automatic evaluation protocol measuring Coverage and Citation accuracy (see the sketch after this list)
• Generates Haystacks in conversation and news domains
• Each Haystack contains ~100 documents with specific insights repeating across documents
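To make the evaluation protocol concrete, here is a minimal Python sketch of what such reference-based scoring can look like. It assumes an LLM judge that maps each reference insight to its best-matching summary bullet for Coverage, and compares cited document IDs against the gold documents for Citation; the `llm_judge_match` callable and the way the two scores are combined into a joint score are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    text: str
    gold_docs: set[str]   # documents that actually contain this insight

@dataclass
class Bullet:
    text: str
    cited_docs: set[str]  # document IDs the system cited for this bullet

def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 of cited documents against the gold set for one insight."""
    if not cited or not gold:
        return 0.0
    tp = len(cited & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cited), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate_summary(insights, bullets, llm_judge_match):
    """Score a bulleted, cited summary against reference insights.

    `llm_judge_match(insight, bullet)` is a hypothetical callable
    (e.g. backed by an LLM judge prompt) returning 1.0 for full
    coverage, 0.5 for partial coverage, and 0.0 otherwise.
    """
    coverage_scores, citation_scores = [], []
    for insight in insights:
        # Best-matching bullet for this insight, according to the judge.
        match, best = max(
            ((llm_judge_match(insight, b), b) for b in bullets),
            key=lambda pair: pair[0],
            default=(0.0, None),
        )
        coverage_scores.append(match)
        # Citation quality only counts for insights the summary covers.
        citation_scores.append(
            citation_f1(best.cited_docs, insight.gold_docs) if match > 0 else 0.0
        )
    n = len(insights)
    return {
        "coverage": sum(coverage_scores) / n,
        "citation": sum(citation_scores) / n,
        # One simple way to combine the two at the insight level.
        "joint": sum(c * z for c, z in zip(coverage_scores, citation_scores)) / n,
    }
```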
Key Insights 💡:
• Even with oracle document-relevance signals, systems underperform the human benchmark by 10+ points
• RAG systems improve citation quality but often sacrifice insight coverage
• Confirms position bias: LLMs favor information at the extremities of the context window
• Advanced RAG components like Cohere's Rerank3 boost end-to-end performance
• Human annotators achieve a 56% Joint Score, vs. 44.6% for the best system
Results 📊:
• Long-context LLMs score below 20% Joint Score without a retriever
• Top models achieve a 30.8–36.0 Joint Score with Rerank3-based RAG
• Claude 3 Opus leads in Coverage (76.5%)
• Gemini 1.5 Pro excels in Citation (49.7%)
• Human annotators achieve 74.5% Coverage and 73.9% Citation
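As a rough consistency check on how Coverage and Citation relate to the Joint Score (an approximation, not the paper's exact aggregation): if the Joint Score behaved like the per-insight product of the two, the human numbers above would give about 0.745 × 0.739 ≈ 0.55, close to the reported 56%.

```python
# Back-of-the-envelope check using the human scores reported above.
# Approximation only: the paper defines the exact Joint Score aggregation.
human_coverage, human_citation = 0.745, 0.739
approx_joint = human_coverage * human_citation
print(f"approximate human Joint Score: {approx_joint:.1%}")  # ~55%, near the reported 56%
```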