LongBench v2 tests whether LLMs actually understand long texts or just play matching games
LongBench v2 introduces 503 challenging multiple-choice questions requiring deep reasoning over contexts ranging from 8k to 2M words, setting a new standard for evaluating LLMs' true comprehension abilities.
-----
https://arxiv.org/abs/2412.15204
Original Problem 🤔:
Current benchmarks fail to test whether LLMs truly understand long texts beyond simple information extraction. Most focus on basic retrieval tasks that modern models easily solve.
-----
Solution in this Paper 🛠️:
→ Created a comprehensive benchmark of 503 multiple-choice questions across 6 major categories, including document QA, code understanding, and structured data reasoning
→ Employed 97 highly educated annotators and 24 expert reviewers to ensure question quality and difficulty
→ Implemented a rigorous data-collection process combining automated and manual reviews
→ Questions require deep reasoning rather than simple pattern matching or retrieval (a minimal evaluation sketch follows below)
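For a concrete picture of the setup, here is a minimal sketch of loading one benchmark item and turning it into a four-option prompt. It assumes the dataset is published on Hugging Face as THUDM/LongBench-v2 with fields like context, question, choice_A..choice_D, and answer; check the official dataset card for the exact schema.

```python
# Minimal sketch: format one LongBench v2 item as a multiple-choice prompt.
# Assumes a Hugging Face dataset "THUDM/LongBench-v2" with fields
# "context", "question", "choice_A".."choice_D", "answer" (verify against
# the official dataset card before relying on these names).
from datasets import load_dataset

ds = load_dataset("THUDM/LongBench-v2", split="train")

def build_prompt(item: dict) -> str:
    """Turn one benchmark record into a 4-option multiple-choice prompt."""
    return (
        f"{item['context']}\n\n"
        f"Question: {item['question']}\n"
        f"A. {item['choice_A']}\n"
        f"B. {item['choice_B']}\n"
        f"C. {item['choice_C']}\n"
        f"D. {item['choice_D']}\n"
        "Answer with a single letter (A, B, C, or D)."
    )

prompt = build_prompt(ds[0])
print(prompt[:500])              # peek at the start of a (very long) prompt
print("Gold answer:", ds[0]["answer"])
```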
-----
Key Insights 💡:
→ Human experts achieve only 53.7% accuracy under a 15-minute time constraint
→ Chain-of-thought prompting improves performance by 3.4% for open-source models (see the prompting sketch after this list)
→ Models perform best on contexts under 32k words but struggle with longer contexts
→ Performance varies significantly across tasks: models match humans on document QA but lag on structured data
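The direct-answer vs. chain-of-thought comparison essentially comes down to two prompt suffixes plus an answer extractor. The sketch below reuses build_prompt from the earlier snippet; call_model is a hypothetical stand-in for whatever chat-completion client you use, and the templates are illustrative rather than the paper's exact prompts.

```python
# Sketch of the two query styles: direct answering vs. chain-of-thought.
# `call_model` is a hypothetical callable (prompt str -> response str).
import re

DIRECT_SUFFIX = "\nRespond with only the letter of the correct option."
COT_SUFFIX = (
    "\nThink step by step about the evidence in the context, then finish "
    "with a final line of the form 'Answer: <letter>'."
)

def extract_choice(text: str):
    """Pull the last standalone A/B/C/D letter out of a model response."""
    letters = re.findall(r"\b([ABCD])\b", text.upper())
    return letters[-1] if letters else None

def accuracy(items, call_model, cot=False):
    """Score a model over benchmark items under one prompting mode."""
    suffix = COT_SUFFIX if cot else DIRECT_SUFFIX
    correct = sum(
        extract_choice(call_model(build_prompt(item) + suffix)) == item["answer"]
        for item in items
    )
    return correct / len(items)
```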
-----
Results 📊:
→ The best model (o1-preview) achieves 57.7% accuracy when given longer reasoning
→ The direct-answer approach yields only 50.1% accuracy for the best model
→ Models struggle most with contexts between 32k and 128k words
→ RAG approaches show limited effectiveness (a rough retrieval sketch follows below)
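As a rough illustration of what a RAG-style baseline means here: split the long context into chunks, embed them, and keep only the top-k chunks most similar to the question before prompting the model. The chunk size, k, and embedding model below are arbitrary choices for the sketch, not the paper's configuration.

```python
# Naive retrieval sketch: keep only the k context chunks most similar to
# the question, by embedding cosine similarity. Parameters are arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(context: str, question: str, chunk_words: int = 300, k: int = 8) -> str:
    """Return the k most question-relevant chunks, in document order."""
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                 # cosine similarity (normalized)
    top = np.argsort(scores)[-k:]
    return "\n...\n".join(chunks[i] for i in sorted(top))
```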
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/