NYT-CONNECTIONS puzzles expose the gap between pattern matching and true reasoning in modern LLMs.
This paper introduces NYT-CONNECTIONS, a benchmark of 358 word-grouping puzzles that tests whether LLMs can perform deliberate System-2 reasoning rather than relying on quick, intuitive responses.
The benchmark reveals significant performance gaps between humans and LLMs, highlighting limitations in current AI reasoning capabilities.
-----
https://arxiv.org/abs/2412.01621
🤔 Original Problem:
Current benchmarks fail to isolate specific cognitive abilities in LLMs, often allowing models to exploit shortcuts rather than demonstrate true reasoning capabilities. Many tasks combine multiple cognitive processes, making it hard to evaluate distinct abilities.
-----
🔬 Solution in this Paper:
→ The paper introduces NYT-CONNECTIONS, derived from the New York Times Connections game, requiring grouping 16 words into 4 sets of related terms.
→ The benchmark deliberately tempts incorrect System-1 responses while requiring System-2 thinking for correct solutions.
→ The evaluation tests six recent LLMs across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints.
→ The benchmark uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to prevent data leakage.
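The task structure above can be sketched in a few lines: a puzzle hides four groups of four words among 16 candidates, and a guessed group gets game-style feedback. This is an illustrative sketch only; the words, group labels, and `check_guess` helper are invented here, not taken from the benchmark.

```python
def check_guess(guess, solution):
    """Return NYT-Connections-style feedback for one guessed set of 4 words.

    solution: dict mapping a group label to its set of 4 words.
    """
    guess = set(guess)
    # Largest overlap between the guess and any hidden group.
    best = max(len(guess & group) for group in solution.values())
    if best == 4:
        return "correct"
    if best == 3:
        return "one away"   # the game's hint for a near-miss
    return "incorrect"

# Hypothetical puzzle: 4 groups x 4 words = 16 words total.
solution = {
    "fish":         {"BASS", "PIKE", "SOLE", "PERCH"},
    "voice types":  {"ALTO", "TENOR", "SOPRANO", "BARITONE"},
    "drums":        {"SNARE", "BONGO", "TOM", "CONGA"},
    "colors":       {"RED", "BLUE", "GREEN", "GOLD"},
}

print(check_guess(["BASS", "PIKE", "SOLE", "PERCH"], solution))  # correct
print(check_guess(["BASS", "PIKE", "SOLE", "ALTO"], solution))   # one away
```

The System-1 trap the paper describes arises when a word like BASS plausibly fits more than one intuitive category (a fish, a voice, an instrument), so the first grouping that comes to mind is often wrong.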
-----
💡 Key Insights:
→ Even top LLMs like GPT-4 struggle with tasks requiring deliberate reasoning over pattern matching
→ Advanced prompting techniques show diminishing returns as task difficulty increases
→ Contextual hints improve human performance but have limited impact on LLMs
→ Simple heuristic approaches achieve comparable performance to some advanced LLMs
-----
📊 Results:
→ Top-performing LLMs fall short of human performance by nearly 30%
→ GPT-4 achieves only 35.5% accuracy without hints, compared to humans' 56%
→ Chain-of-Thought and Self-Consistency show limited improvement on harder puzzles