"Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?"

The podcast on this paper is generated with Google's Illuminate.

New tests reveal the true effective context limits of leading LLMs.

https://arxiv.org/abs/2411.05000

🎯 Original Problem:

Current benchmarks for evaluating LLMs' long-context capabilities are inadequate: they saturate at perfect scores, cover only limited context lengths, or lack granular insight into specific model behaviors.

-----

🔬 Solution in this Paper:

→ Introduced a series of increasingly complex retrieval tasks using synthetic UUID key-value pairs to test 17 leading LLMs

→ Created novel "needle threading" tasks where models must follow chains of linked information through contexts of up to 900k tokens (a construction sketch follows this list)

→ Developed tasks like Single Needle (basic retrieval), Multiple Needles (concurrent retrieval), and Threading (following information chains)

→ Introduced Multi-Threading to test if models can track multiple information threads simultaneously
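
To make the task setup concrete, here is a minimal sketch (not the authors' code; all names are illustrative) of how a Threading haystack might be generated: random UUID key-value pairs with one hidden chain in which each value is the next key to look up.

```python
import json
import random
import uuid

def _uuid(rng: random.Random) -> str:
    """Deterministic random UUID string from a seeded RNG."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

def build_haystack(n_pairs: int, thread_len: int, seed: int = 0):
    """Return (haystack_text, thread). `thread[i+1]` is stored as the value
    under key `thread[i]`, so answering requires hopping key -> value -> key."""
    rng = random.Random(seed)
    pairs = {_uuid(rng): _uuid(rng) for _ in range(n_pairs)}
    thread = rng.sample(sorted(pairs), thread_len)  # keys to chain together
    for cur, nxt in zip(thread, thread[1:]):
        pairs[cur] = nxt                            # each value points to the next key
    return json.dumps(pairs, indent=0), thread

haystack, thread = build_haystack(n_pairs=1_000, thread_len=5)
# Prompt idea: "Start at key <thread[0]>, follow the chain of values,
# and report the final value you reach."
```

A forward thread places each hop later in the context than the previous one; Multi-Threading simply embeds several such chains and queries them together.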

-----

💡 Key Insights:

→ Most models' effective context limit is shorter than their advertised context length

→ Models follow forward-moving threads (where each hop appears later in the context) more reliably than backward threads

→ Many models are remarkably "thread-safe": they can follow multiple threads concurrently without a performance drop

→ Different tokenizers count tokens very differently, so direct cross-model comparisons at a fixed token count can be misleading (see the sketch after this list)

→ Performance generally decreases towards the middle of the context window
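
As a quick illustration of the tokenizer caveat, the same string can tokenize to quite different lengths. This sketch assumes `tiktoken` and Hugging Face `transformers` are installed; the encodings chosen are just examples, not the paper's exact setup.

```python
import tiktoken
from transformers import AutoTokenizer

# One repeated UUID-style key-value line, similar in spirit to the haystacks above.
text = "8c6a1f2e-3b4d-4e5f-8a9b-0c1d2e3f4a5b: 1d2c3b4a-5e6f-4a7b-8c9d-0e1f2a3b4c5d\n" * 1_000

cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
o200k = tiktoken.get_encoding("o200k_base")    # GPT-4o-era encoding
gpt2 = AutoTokenizer.from_pretrained("gpt2")   # classic BPE baseline

print("cl100k_base:", len(cl100k.encode(text)))
print("o200k_base :", len(o200k.encode(text)))
print("gpt2       :", len(gpt2.encode(text)))
```

A "900k-token" haystack under one tokenizer may be substantially shorter or longer under another, which is why fixed-token-count comparisons across models need care.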

-----

📊 Results:

→ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts

→ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)

→ Closed-source models consistently outperform open-source alternatives

→ Most models show a significant accuracy drop beyond their effective context limit (a hypothetical measurement sweep follows this list)
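
A hypothetical sweep for locating that effective limit, reusing `build_haystack` from the sketch above (`ask_model` is a stand-in for whatever API client you use; it is not from the paper):

```python
def accuracy_at_size(ask_model, n_pairs: int, trials: int = 20) -> float:
    """Fraction of single-hop retrievals answered correctly at this haystack size."""
    hits = 0
    for seed in range(trials):
        haystack, thread = build_haystack(n_pairs=n_pairs, thread_len=2, seed=seed)
        # With thread_len=2, the ground-truth value under thread[0] is thread[1].
        reply = ask_model(f"{haystack}\n\nWhat value is stored under key {thread[0]}?")
        hits += thread[1] in reply
    return hits / trials

# Sweep haystack sizes and watch for the drop-off:
# for n_pairs in (30, 120, 500, 2_000, 8_000):
#     print(n_pairs, accuracy_at_size(ask_model, n_pairs))
```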
