BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
BABILong shows that popular LLMs effectively use only 10-20% of their advertised context window
The paper also shows that recurrent models can process up to 11M tokens, while attention-based LLMs struggle beyond 16K
Original Problem 🎯:
Current benchmarks test LLMs only up to about 40,000 tokens, while models claim to handle hundreds of thousands of tokens. Existing "needle-in-a-haystack" tests are too simplistic to meaningfully evaluate long-context processing capabilities.
Solution in this Paper 🔍:
• Introduces the BABILong benchmark with 20 reasoning tasks testing fact chaining, induction, deduction, and counting
• Hides the relevant facts inside background text drawn from PG19 books (a construction sketch follows this list)
• Scales tasks to any desired length to evaluate new models
• Provides splits up to 1 million tokens
• Tests capabilities like multi-hop reasoning and temporal understanding
• Open-source implementation available at github.com/booydar/babilong
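The construction idea is simple: interleave the facts a task needs into as much distractor text as you want. A minimal sketch, assuming a HuggingFace-style tokenizer with an `encode` method; the function name and sentence handling here are illustrative, not the repo's actual API:

```python
import random

def make_babilong_sample(fact_sentences, question, background_sentences,
                         tokenizer, target_tokens=64_000, seed=0):
    """Hide task facts inside irrelevant background text (the BABILong idea).

    fact_sentences       -- the bAbI-style facts needed to answer `question`
    background_sentences -- distractor sentences, e.g. drawn from PG19 books
    target_tokens        -- desired total context length; scales the haystack
    """
    rng = random.Random(seed)

    # Fill with background text up to the token budget,
    # reserving room for the facts and the question.
    budget = target_tokens - len(tokenizer.encode(" ".join(fact_sentences + [question])))
    haystack, used = [], 0
    for sent in background_sentences:
        n = len(tokenizer.encode(sent))
        if used + n > budget:
            break
        haystack.append(sent)
        used += n

    # Insert each fact at a random position so the model must search for it.
    for fact in fact_sentences:
        haystack.insert(rng.randrange(len(haystack) + 1), fact)

    context = " ".join(haystack)
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Varying `target_tokens` renders the same question at 4K, 64K, or 1M+ tokens, which is how the benchmark scales to arbitrary lengths.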
Key Insights 💡:
• Popular LLMs effectively use only 10-20% of their claimed context window
• Performance drops sharply with increased reasoning complexity
• Retrieval-Augmented Generation (RAG) achieves only ~60% accuracy on single-fact questions (see the retrieval sketch after this list)
• Small fine-tuned models (RMT 137M, Mamba 130M) outperform larger models
• Recurrent approaches handle extremely long sequences better than attention-based models
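For context, the RAG numbers refer to a retrieve-then-read baseline over the haystack. A rough sketch, assuming chunked retrieval with a sentence-transformer embedder; the chunk size, embedding model, and `llm` callable are assumptions, not the paper's exact setup:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def rag_answer(question, context, llm, chunk_words=512, top_k=5):
    """Retrieve-then-read baseline: split the haystack into chunks, embed them,
    and feed only the chunks most similar to the question to the LLM."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Naive fixed-size chunking by whitespace words (an assumption, not the paper's setup).
    words = context.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

    # Rank chunks by cosine similarity to the question (embeddings are normalized).
    chunk_emb = embedder.encode(chunks, normalize_embeddings=True)
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_emb @ q_emb
    retrieved = [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

    prompt = "Context:\n" + "\n---\n".join(retrieved) + f"\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)  # `llm` is any callable that maps a prompt to a completion
```

Single-fact questions often survive this pipeline because one retrieved chunk can contain the needle; multi-hop questions need several facts that rarely land in the same top-k chunks, which is consistent with RAG failing on those tasks.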
Results 📊:
• RMT processes sequences up to 11 million tokens effectively
• GPT-4 maintains ~85% accuracy only up to 16K tokens
• Most models drop below 30% accuracy beyond 64K tokens (an evaluation-loop sketch follows this list)
• Fine-tuned Mamba achieves the best overall results but is limited to 128K tokens
• RAG methods fail on multi-hop reasoning tasks
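Curves like these come from scoring the same tasks at increasing context lengths. A minimal evaluation loop; `load_babilong_split` and `model_answer` are hypothetical stand-ins for a dataset loader and the model under test:

```python
def accuracy_by_length(model_answer, load_babilong_split,
                       task="qa1", lengths=("4k", "32k", "64k", "128k", "1M")):
    """Score a model on the same reasoning task at increasing context lengths.

    model_answer(prompt) -> str        : hypothetical wrapper around the LLM under test
    load_babilong_split(task, length)  : hypothetical loader yielding (prompt, target) pairs
    """
    results = {}
    for length in lengths:
        samples = list(load_babilong_split(task, length))
        # Lenient matching: count a sample correct if the target string
        # appears anywhere in the model's answer.
        correct = sum(
            target.lower() in model_answer(prompt).lower()
            for prompt, target in samples
        )
        results[length] = correct / max(len(samples), 1)
    return results

# Example: print the degradation curve for one model.
# for length, acc in accuracy_by_length(my_llm, my_loader).items():
#     print(f"{length:>5}: {acc:.1%}")
```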
🌟 BABILong contains 20 reasoning tasks testing capabilities such as:
• Fact chaining
• Simple induction
• Deduction
• Counting
• Handling lists and sets