Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
SummHay, proposed in this paper, reveals how LLMs and RAG systems struggle with large-scale document comprehension.
Shows that current models miss roughly 50% of key insights when processing large document collections
Original Problem 🎯:
Current evaluation methods for long-context LLMs and Retrieval-Augmented Generation (RAG) systems lack complexity and depth; simple tasks like needle-in-a-haystack retrieval no longer effectively differentiate model capabilities.
Solution in this Paper 🔧:
• Introduces "Summary of a Haystack" (SummHay) task requiring systems to process ~100k tokens
• Synthesizes document collections with controlled information distribution
• Implements an automatic evaluation protocol measuring Coverage and Citation accuracy (see the sketch after this list)
• Generates Haystacks in conversation and news domains
• Each Haystack contains ~100 documents with specific insights repeating across documents
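To make the evaluation protocol concrete, here is a minimal Python sketch of what such reference-based scoring can look like. It assumes an LLM judge that maps each reference insight to its best-matching summary bullet for Coverage, and compares cited document IDs against the gold documents for Citation; the `llm_judge_match` callable and the way the two scores are combined into a joint score are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    text: str
    gold_docs: set[str]   # documents that actually contain this insight

@dataclass
class Bullet:
    text: str
    cited_docs: set[str]  # document IDs the system cited for this bullet

def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 of cited documents against the gold set for one insight."""
    if not cited or not gold:
        return 0.0
    tp = len(cited & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cited), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate_summary(insights, bullets, llm_judge_match):
    """Score a bulleted, cited summary against reference insights.

    `llm_judge_match(insight, bullet)` is a hypothetical callable
    (e.g. backed by an LLM judge prompt) returning 1.0 for full
    coverage, 0.5 for partial coverage, and 0.0 otherwise.
    """
    coverage_scores, citation_scores = [], []
    for insight in insights:
        # Best-matching bullet for this insight, according to the judge.
        match, best = max(
            ((llm_judge_match(insight, b), b) for b in bullets),
            key=lambda pair: pair[0],
            default=(0.0, None),
        )
        coverage_scores.append(match)
        # Citation quality only counts for insights the summary covers.
        citation_scores.append(
            citation_f1(best.cited_docs, insight.gold_docs) if match > 0 else 0.0
        )
    n = len(insights)
    return {
        "coverage": sum(coverage_scores) / n,
        "citation": sum(citation_scores) / n,
        # One simple way to combine the two at the insight level.
        "joint": sum(c * z for c, z in zip(coverage_scores, citation_scores)) / n,
    }
```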
Key Insights 💡:
• Even with oracle document-relevance signals, systems underperform the human benchmark by 10+ points
• RAG systems improve citation quality but often sacrifice insight coverage
• Confirms position bias: LLMs favor information at the extremities of the context window
• Advanced RAG components like Cohere's Rerank3 boost end-to-end performance
• Human annotators achieve a 56% Joint Score, vs. 44.6% for the best system
Results 📊:
• Long-context LLMs score below 20% Joint Score without a retriever
• Top models achieve a 30.8–36.0 Joint Score with Rerank3-based RAG
• Claude 3 Opus leads in Coverage (76.5%)
• Gemini 1.5 Pro excels in Citation (49.7%)
• Human annotators achieve 74.5% Coverage and 73.9% Citation
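As a rough consistency check on how Coverage and Citation relate to the Joint Score (an approximation, not the paper's exact aggregation): if the Joint Score behaved like the per-insight product of the two, the human numbers above would give about 0.745 × 0.739 ≈ 0.55, close to the reported 56%.

```python
# Back-of-the-envelope check using the human scores reported above.
# Approximation only: the paper defines the exact Joint Score aggregation.
human_coverage, human_citation = 0.745, 0.739
approx_joint = human_coverage * human_citation
print(f"approximate human Joint Score: {approx_joint:.1%}")  # ~55%, near the reported 56%
```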