Reflection-Bench: probing AI intelligence with reflection
LLMs struggle with basic human-like reflection abilities
Reflection-Bench tests whether LLMs can actually learn from their mistakes the way humans do
Original Problem 🤔:
Current benchmarks for evaluating LLM intelligence lack a unified, biologically inspired framework. Most evaluations focus on isolated capabilities without considering how genuine intelligence adapts to unexpected outcomes.
Solution in this Paper 🛠️:
• Introduces Reflection-Bench: 7 cognitive tasks adapted from neuroscience to test LLMs' reflection abilities
• Tasks evaluate:
  • Perception (oddball paradigm)
  • Memory (n-back task)
  • Belief updating (probabilistic reversal learning; see the sketch after this list)
  • Decision-making (Wisconsin Card Sorting Test, WCST)
  • Prediction (weather prediction task)
  • Counterfactual thinking (double-choice Iowa Gambling Task)
  • Meta-reflection (meta-bandit task)
• Each task has adjustable difficulty levels so the benchmark can keep pace with future AI progress
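To make the task format concrete, here is a minimal sketch of one task family, probabilistic reversal learning, run as a chat loop. The trial count, reward probabilities, reversal schedule, and the `query_llm` helper are illustrative assumptions, not the benchmark's actual prompts or parameters.

```python
# Illustrative probabilistic reversal learning (PRL) loop (assumed setup,
# not Reflection-Bench's actual implementation). The model repeatedly picks
# one of two arms whose reward contingencies silently reverse mid-session;
# belief updating shows up in how quickly choices track the reversal.
import random

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call."""
    return random.choice(["A", "B"])  # placeholder policy

def run_prl(n_trials: int = 40, p_reward: float = 0.8, reversal_at: int = 20) -> float:
    """Run one PRL session and return the fraction of trials on the better arm."""
    good_arm = "A"                       # arm with the higher reward probability
    history, on_good_arm = [], 0
    for t in range(n_trials):
        if t == reversal_at:             # contingencies reverse without warning
            good_arm = "B" if good_arm == "A" else "A"
        prompt = (
            "Choose arm A or B to maximize reward. Reply with a single letter.\n"
            f"Recent (choice, reward) history: {history[-10:]}"
        )
        choice = query_llm(prompt).strip().upper()[:1]
        p = p_reward if choice == good_arm else 1 - p_reward
        reward = int(random.random() < p)
        history.append((choice, reward))
        on_good_arm += int(choice == good_arm)
    return on_good_arm / n_trials

if __name__ == "__main__":
    print(f"fraction on better arm: {run_prl():.2f}")
```

A natural metric in this setup is how many trials after the reversal the model needs before it switches arms, which isolates belief updating from raw reward maximization.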
Key Insights 💡:
• LLMs show basic surprise detection and working memory but struggle with flexible adaptation
• Chain-of-thought improves performance but at 60% higher computational cost
• All tested models completely lack meta-reflection abilities
• Current LLMs fall short of human-level reflection capabilities
Results 📊:
• Evaluated 13 LLMs, including OpenAI o1, GPT-4, and Claude 3.5
• o1-preview topped most tasks but showed "MMN deficits" (weak mismatch-negativity-like surprise responses) in the oddball task
• No model could recognize the underlying pattern in the meta-bandit task (0 points); see the sketch below
• The performance gap between o1-preview and the other models was significant only on 2-back (100%) and WCST (85.29%)
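For context on the meta-bandit result, here is a rough sketch of what "recognizing the pattern" could mean, under the assumption that the rewarded arm reverses on a fixed schedule (the paper's actual schedule and scoring may differ): a purely reactive agent switches only after observing a reward drop, whereas pattern recognition means switching in anticipation of the scheduled reversal.

```python
# Illustrative meta-bandit pattern score (assumed formulation). Reversals
# happen every `period` trials; an agent gets credit for a reversal only if
# it switches arms on the reversal trial itself, i.e., before any feedback
# could have signaled the change.
def pattern_score(choices: list[str], period: int = 4) -> float:
    """Fraction of scheduled reversals the agent anticipated."""
    reversal_trials = range(period, len(choices), period)
    hits = sum(1 for t in reversal_trials if choices[t] != choices[t - 1])
    return hits / (len(reversal_trials) or 1)

if __name__ == "__main__":
    reactive     = list("AAAAABBBBAAA")  # switches one trial late -> 0.0
    anticipatory = list("AAAABBBBAAAA")  # switches on schedule    -> 1.0
    print(pattern_score(reactive), pattern_score(anticipatory))
```

Under a score like this, a policy that only reacts after each reward drop lands at 0, which is one way to read the across-the-board 0 reported above.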