New tests reveal the true effective context limits of leading LLMs.
→ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts
→ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)
https://arxiv.org/abs/2411.05000
🎯 Original Problem:
Current benchmarks for evaluating LLMs' long-context capabilities are inadequate: they saturate at perfect scores, test only a limited range of context lengths, or offer no granular insight into specific model behaviors.
-----
🔬 Solution in this Paper:
→ Introduced a series of increasingly complex retrieval tasks using synthetic UUID key-value pairs to test 17 leading LLMs
→ Created novel "needle threading" tasks where models must follow chains of linked information through contexts up to 900k tokens
→ Developed tasks like Single Needle (basic retrieval), Multiple Needles (concurrent retrieval), and Threading (following information chains)
→ Introduced Multi-Threading to test whether models can track multiple information threads simultaneously (a sketch of the task setup follows this list)
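To make the task structure concrete, here is a minimal sketch of how such a haystack and thread could be generated. The function names, pair format, and prompt wording are illustrative assumptions, not the paper's exact setup:

```python
import random
import uuid

def make_haystack(n_pairs: int) -> dict[str, str]:
    """Distractor context: random UUID key-value pairs."""
    return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n_pairs)}

def plant_thread(haystack: dict[str, str], hops: int) -> tuple[str, str]:
    """Insert a chain where each value is the key of the next pair.
    Returns (start_key, final_value): the model must hop from
    start_key through the chain to recover final_value."""
    keys = [str(uuid.uuid4()) for _ in range(hops)]
    for k, nxt in zip(keys, keys[1:]):
        haystack[k] = nxt                    # value -> next key in the chain
    final_value = str(uuid.uuid4())
    haystack[keys[-1]] = final_value         # last hop holds the answer
    return keys[0], final_value

def serialize(haystack: dict[str, str]) -> str:
    """Shuffle so thread pairs are scattered among the distractors."""
    items = list(haystack.items())
    random.shuffle(items)
    return "\n".join(f"{k}: {v}" for k, v in items)

haystack = make_haystack(5_000)              # padding controls context length
start, answer = plant_thread(haystack, 5)    # a 5-hop thread
prompt = serialize(haystack)
# The task: "Starting from key <start>, follow the chain of values.
# What is the final value?" Expected answer: `answer`.
```

Scaling `n_pairs` sets the context length, `hops` sets thread depth, and planting several independent chains gives the Multi-Threading variant.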
-----
💡 Key Insights:
→ Most models' effective context limit is shorter than their advertised context length
→ Models follow forward-moving threads (where each link appears later in the context than the previous one) more reliably than backward-moving threads
→ Many models are remarkably "thread-safe" - can follow multiple threads without performance loss
→ Different tokenizers count the same text very differently, so direct token-count comparisons across models can be misleading (see the snippet after this list)
→ Performance generally decreases towards the middle of the context window
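A quick illustration of the tokenizer point, using OpenAI's tiktoken as an assumed example (the paper's own cross-vendor comparison is broader):

```python
import tiktoken  # pip install tiktoken

text = "Follow the chain of UUID keys through the haystack."
for name in ("cl100k_base", "o200k_base"):   # GPT-4 and GPT-4o encodings
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
# Different encodings yield different counts for identical text, so a
# "32k-token" context holds different amounts of text per model.
```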
-----
📊 Results:
→ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts
→ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)
→ Closed-source models consistently outperform open-source alternatives
→ Most models show significant accuracy drop beyond their effective context limit