LongBench v2 tests whether LLMs actually understand long texts or just play matching games
LongBench v2 introduces 503 challenging multiple-choice questions requiring deep reasoning over contexts ranging from 8k to 2M words, setting a new standard for evaluating LLMs' true comprehension abilities.
-----
https://arxiv.org/abs/2412.15204
Original Problem 🤔:
Current benchmarks fail to test whether LLMs truly understand long texts beyond simple information extraction. Most focus on basic retrieval tasks that modern models easily solve.
-----
Solution in this Paper 🛠️:
→ Created a comprehensive benchmark of 503 multiple-choice questions across 6 major categories, including document QA, code understanding, and structured data reasoning
→ Employed 97 highly educated annotators and 24 expert reviewers to ensure question quality and difficulty
→ Implemented a rigorous data-collection process combining automated and manual reviews
→ Questions require deep reasoning rather than simple pattern matching or retrieval (a minimal evaluation sketch follows below)
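For a concrete picture of the setup, here is a minimal sketch of loading one benchmark item and turning it into a four-option prompt. It assumes the dataset is published on Hugging Face as THUDM/LongBench-v2 with fields like context, question, choice_A..choice_D, and answer; check the official dataset card for the exact schema.

```python
# Minimal sketch: format one LongBench v2 item as a multiple-choice prompt.
# Assumes a Hugging Face dataset "THUDM/LongBench-v2" with fields
# "context", "question", "choice_A".."choice_D", "answer" (verify against
# the official dataset card before relying on these names).
from datasets import load_dataset

ds = load_dataset("THUDM/LongBench-v2", split="train")

def build_prompt(item: dict) -> str:
    """Turn one benchmark record into a 4-option multiple-choice prompt."""
    return (
        f"{item['context']}\n\n"
        f"Question: {item['question']}\n"
        f"A. {item['choice_A']}\n"
        f"B. {item['choice_B']}\n"
        f"C. {item['choice_C']}\n"
        f"D. {item['choice_D']}\n"
        "Answer with a single letter (A, B, C, or D)."
    )

prompt = build_prompt(ds[0])
print(prompt[:500])              # peek at the start of a (very long) prompt
print("Gold answer:", ds[0]["answer"])
```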
-----
Key Insights 💡:
→ Human experts achieve only 53.7% accuracy under a 15-minute time constraint
→ Chain-of-thought prompting improves performance by 3.4% for open-source models (see the prompting sketch after this list)
→ Models perform best on contexts under 32k words but struggle with longer contexts
→ Performance varies significantly across tasks: models match humans on document QA but lag on structured data
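The direct-answer vs. chain-of-thought comparison essentially comes down to two prompt suffixes plus an answer extractor. The sketch below reuses build_prompt from the earlier snippet; call_model is a hypothetical stand-in for whatever chat-completion client you use, and the templates are illustrative rather than the paper's exact prompts.

```python
# Sketch of the two query styles: direct answering vs. chain-of-thought.
# `call_model` is a hypothetical callable (prompt str -> response str).
import re

DIRECT_SUFFIX = "\nRespond with only the letter of the correct option."
COT_SUFFIX = (
    "\nThink step by step about the evidence in the context, then finish "
    "with a final line of the form 'Answer: <letter>'."
)

def extract_choice(text: str):
    """Pull the last standalone A/B/C/D letter out of a model response."""
    letters = re.findall(r"\b([ABCD])\b", text.upper())
    return letters[-1] if letters else None

def accuracy(items, call_model, cot=False):
    """Score a model over benchmark items under one prompting mode."""
    suffix = COT_SUFFIX if cot else DIRECT_SUFFIX
    correct = sum(
        extract_choice(call_model(build_prompt(item) + suffix)) == item["answer"]
        for item in items
    )
    return correct / len(items)
```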
-----
Results 📊:
→ The best model (o1-preview) achieves 57.7% accuracy when given longer reasoning
→ The direct-answer approach yields only 50.1% accuracy for the best model
→ Models struggle most with contexts between 32k and 128k words
→ RAG approaches show limited effectiveness (a rough retrieval sketch follows below)
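As a rough illustration of what a RAG-style baseline means here: split the long context into chunks, embed them, and keep only the top-k chunks most similar to the question before prompting the model. The chunk size, k, and embedding model below are arbitrary choices for the sketch, not the paper's configuration.

```python
# Naive retrieval sketch: keep only the k context chunks most similar to
# the question, by embedding cosine similarity. Parameters are arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(context: str, question: str, chunk_words: int = 300, k: int = 8) -> str:
    """Return the k most question-relevant chunks, in document order."""
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                 # cosine similarity (normalized)
    top = np.argsort(scores)[-k:]
    return "\n...\n".join(chunks[i] for i in sorted(top))
```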
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/