"LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks"

Podcast on this paper generated with Google's Illuminate.

LongBench v2 tests whether LLMs actually understand long text or just play matching games

LongBench v2 introduces 503 challenging multiple-choice questions requiring deep reasoning over contexts ranging from 8k to 2M words, setting a new standard for evaluating LLMs' true long-context comprehension.

-----

https://arxiv.org/abs/2412.15204

Original Problem 🤔:

Current benchmarks fail to test if LLMs truly understand long texts beyond simple information extraction. Most focus on basic retrieval tasks that modern models easily solve.

-----

Solution in this Paper 🛠️:

→ Created a comprehensive benchmark of 503 multiple-choice questions across 6 major task categories, including document QA, code repository understanding, and structured data reasoning (a minimal evaluation sketch follows this list)

→ Employed 97 highly educated annotators and 24 expert reviewers to ensure question quality and difficulty

→ Implemented a rigorous data-collection process combining automated and manual reviews

→ Questions require deep reasoning rather than simple pattern matching or retrieval
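
Evaluating on this format reduces to prompting a model with the full context plus the four options and checking the extracted letter against the gold answer. Below is a minimal sketch of such a loop, assuming the HuggingFace dataset id THUDM/LongBench-v2, the field names context / question / choice_A..choice_D / answer, and a stub ask_model() that you would replace with a real long-context LLM call; this is an illustration, not the paper's official evaluation code.

```python
# Minimal sketch of an accuracy loop over LongBench v2's multiple-choice format.
# Assumptions: the dataset id "THUDM/LongBench-v2", the field names
# context / question / choice_A..choice_D / answer, and the ask_model() stub.
from datasets import load_dataset


def ask_model(prompt: str) -> str:
    # Stub so the script runs end to end; always guessing "A" lands near the
    # 25% random baseline. Swap in a real call to the LLM you want to test.
    return "A"


def extract_choice(reply: str) -> str:
    # Take the first A/B/C/D letter that appears in the model's reply.
    for ch in reply.upper():
        if ch in "ABCD":
            return ch
    return ""


data = load_dataset("THUDM/LongBench-v2", split="train")  # assumed id and split

correct = 0
for ex in data:
    prompt = (
        f"{ex['context']}\n\n"
        f"Question: {ex['question']}\n"
        f"A. {ex['choice_A']}\nB. {ex['choice_B']}\n"
        f"C. {ex['choice_C']}\nD. {ex['choice_D']}\n"
        "Answer with a single letter (A, B, C, or D)."
    )
    correct += extract_choice(ask_model(prompt)) == ex["answer"]

print(f"accuracy = {correct / len(data):.3f}")
```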

-----

Key Insights 💡:

→ Human experts achieve only 53.7% accuracy under a 15-minute time constraint

→ Chain-of-thought prompting improves performance by 3.4% for open-source models (see the prompt sketch after this list)

→ Models perform best on contexts <32k words but struggle with longer contexts

→ Performance varies significantly across tasks: models match humans on document QA but lag on structured data
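
The gap between answering directly and answering after chain-of-thought (see Results below) comes down to the prompt. A rough sketch of the two styles follows; the wording is an assumption for illustration, not the paper's exact templates.

```python
# Sketch of the two prompting modes compared in the paper: direct answering
# vs. chain-of-thought (reason first, then state the letter). The template
# wording here is illustrative, not the paper's exact prompts.

def format_options(choices: dict[str, str]) -> str:
    # Render {"A": "...", "B": "...", ...} as lettered option lines.
    return "\n".join(f"{k}. {v}" for k, v in choices.items())


def direct_prompt(context: str, question: str, choices: dict[str, str]) -> str:
    return (
        f"{context}\n\nQuestion: {question}\n{format_options(choices)}\n"
        "Reply with only the letter (A, B, C, or D) of the correct option."
    )


def cot_prompt(context: str, question: str, choices: dict[str, str]) -> str:
    return (
        f"{context}\n\nQuestion: {question}\n{format_options(choices)}\n"
        "Think step by step about the relevant parts of the text, then end "
        "with a line of the form 'The answer is X', where X is A, B, C, or D."
    )
```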

-----

Results 📊:

→ Best model (o1-preview), which performs longer test-time reasoning, achieves 57.7% accuracy

→ Best-performing model reaches only 50.1% accuracy when answering directly, without chain-of-thought

→ Models struggle most with contexts between 32k and 128k words

→ RAG approaches show limited effectiveness

-----

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/
