New benchmark exposes the true reasoning capabilities of LLMs using dynamic puzzle generation
The Knights and Knaves (K&K) benchmark proposed in this paper reveals how LLMs balance memorization and genuine reasoning in logical problem-solving
📚 https://arxiv.org/abs/2410.23123
🤖 Original Problem:
LLMs show puzzling behavior on reasoning tasks: excellent performance on complex problems, yet basic mistakes on simple ones. This raises the question of whether they truly reason or merely recall memorized training data.
-----
🔧 Solution in this Paper:
→ Introduces Knights and Knaves (K&K) puzzles as a dynamic benchmark for testing logical reasoning
→ Develops the Local Inconsistency-based Memorization Score (LiMem), which quantifies memorization by comparing performance on original puzzles against locally perturbed versions (a hedged sketch of the score appears after the Key Insights section)
→ Creates two key modules:
- Abstract Module: Generates puzzles with specified complexity
- Natural Language Module: Converts abstract puzzles to natural text
→ Implements systematic perturbation tests at both the mathematical and linguistic levels (a simplified pipeline sketch follows this list)
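To make the pipeline concrete, here is a minimal Python sketch of the two modules plus one math-level perturbation. It is an illustration under simplifying assumptions, not the paper's released code: the statement grammar (leaf role claims plus conjunctions), the `solve` and `flip_leaf` helpers, and the name list are all hypothetical.

```python
from itertools import product

# --- Abstract Module (sketch) ----------------------------------------------
# A statement is either a leaf claim ("role", person_idx, claims_knight)
# or a conjunction ("and", stmt, stmt). A puzzle assigns one statement to
# each person; a solution is a knight(True)/knave(False) assignment under
# which every knight's statement is true and every knave's is false.

def eval_stmt(stmt, roles):
    if stmt[0] == "role":
        _, target, claims_knight = stmt
        return roles[target] == claims_knight
    _, left, right = stmt                      # "and" node
    return eval_stmt(left, roles) and eval_stmt(right, roles)

def solve(statements):
    """Brute-force all role assignments and keep the consistent ones."""
    n = len(statements)
    return [roles for roles in product([True, False], repeat=n)
            if all(eval_stmt(s, roles) == roles[i] for i, s in enumerate(statements))]

# --- Natural Language Module (sketch) ---------------------------------------
NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]

def stmt_to_text(stmt):
    if stmt[0] == "role":
        _, target, claims_knight = stmt
        return f"{NAMES[target]} is a {'knight' if claims_knight else 'knave'}"
    return f"{stmt_to_text(stmt[1])} and {stmt_to_text(stmt[2])}"

def puzzle_to_text(statements):
    lines = ["On this island, knights always tell the truth and knaves always lie."]
    for i, s in enumerate(statements):
        lines.append(f'{NAMES[i]} says: "{stmt_to_text(s)}."')
    return "\n".join(lines)

# --- Math-level perturbation (sketch): flip one leaf claim ------------------
def flip_leaf(stmt):
    if stmt[0] == "role":
        return ("role", stmt[1], not stmt[2])
    return ("and", flip_leaf(stmt[1]), stmt[2])  # flip a leaf in the left branch

if __name__ == "__main__":
    # Classic 2-person puzzle: Alice claims "Alice is a knave and Bob is a knave";
    # Bob claims "Alice is a knave". Unique solution: Alice is a knave, Bob a knight.
    puzzle = [("and", ("role", 0, False), ("role", 1, False)),
              ("role", 0, False)]
    print(puzzle_to_text(puzzle))
    print("solutions:", solve(puzzle))                        # [(False, True)]
    print("perturbed:", solve([flip_leaf(puzzle[0]), puzzle[1]]))
```

Because the abstract puzzle is generated programmatically, flipping a single leaf claim changes the ground-truth answer while leaving the surface text almost identical, which is what lets the benchmark separate recall from reasoning.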
-----
💡 Key Insights:
→ LLMs can simultaneously use memorization and genuine reasoning
→ Fine-tuning improves generalization even as memorization increases
→ Models can develop reasoning skills even when trained only on question-answer pairs
→ More complex puzzles show higher memorization scores
→ Language-level perturbations affect models less than mathematical structure changes
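
Below is a hedged sketch of how a LiMem-style score could be computed. It assumes LiMem is the accuracy on original puzzles scaled by the fraction of correctly solved puzzles the model fails once they are locally perturbed; `limem_score` and its arguments are illustrative names, and the exact definition should be taken from the paper.

```python
# Hedged sketch of a Local Inconsistency-based Memorization (LiMem) style score.
# Assumption (verify against the paper): LiMem = accuracy on original puzzles
# multiplied by the fraction of correctly solved puzzles that the model gets
# wrong after local perturbation.
def limem_score(model_answers, perturbed_answers, gold, perturbed_gold):
    """All arguments are equal-length lists, aligned by puzzle index."""
    assert len(model_answers) == len(perturbed_answers) == len(gold) == len(perturbed_gold)
    solved = [i for i in range(len(gold)) if model_answers[i] == gold[i]]
    accuracy = len(solved) / len(gold)
    if not solved:
        return 0.0
    # Inconsistency: solved the original but not its perturbed counterpart.
    inconsistent = sum(perturbed_answers[i] != perturbed_gold[i] for i in solved)
    return accuracy * inconsistent / len(solved)

# Toy example: 3/4 originals correct, 2 of those 3 fail after perturbation
# -> 0.75 * 2/3 = 0.5
print(limem_score(["A", "B", "C", "D"], ["A", "X", "Y", "D"],
                  ["A", "B", "C", "Z"], ["A", "B", "C", "D"]))
```

A high score means the model answers the original puzzle correctly but breaks when the same puzzle is minimally changed, which is the signature of memorization rather than reasoning.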
-----
📊 Results:
→ Only advanced LLMs achieve >70% accuracy on 2-person puzzles
→ Performance drops to 11% for 8-person puzzles
→ Fine-tuned GPT-4o-mini reaches near-100% training accuracy on 3- and 5-person puzzles
→ LiMem scores of ~50% on 8-person puzzles indicate heavy memorization
→ Models show 80% memorization under role-flipping perturbations