
"On Memorization of Large Language Models in Logical Reasoning"

The podcast on this paper is generated with Google's Illuminate.

New benchmark exposes the true reasoning capabilities of LLMs using dynamic puzzle generation

The Knights and Knaves (K&K) benchmark proposed in this paper reveals how LLMs balance memorization and reasoning in logical problem-solving

📚 https://arxiv.org/abs/2410.23123

🤖 Original Problem:

LLMs show puzzling behavior on reasoning tasks: strong performance on some complex problems, yet basic mistakes on simple ones. This raises the question of whether they truly reason or merely recall their training data.

-----

🔧 Solution in this Paper:

→ Introduces a dynamically generated benchmark built from Knights and Knaves (K&K) puzzles for testing logical reasoning

→ Develops the Local Inconsistency-based Memorization score (LiMem), which contrasts a model's accuracy on puzzles it has seen with its accuracy on locally perturbed versions of the same puzzles

→ Creates two key modules:

- Abstract Module: Generates puzzles with a specified number of people and statement complexity

- Natural Language Module: Converts abstract puzzles to natural text

→ Implements systematic perturbation tests at both the mathematical level (e.g., flipping a speaker's claim) and the language level (e.g., renaming speakers); a minimal sketch of a K&K solver and a role-flip perturbation follows below
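
A minimal sketch of how these pieces fit together, assuming a lambda-based encoding of statements (the function names and encoding here are illustrative, not the paper's actual implementation):

```python
from itertools import product

# Hypothetical encoding of the Abstract Module's output: a puzzle maps each
# person to a statement, written as a function that takes an assignment
# (person -> True for knight, False for knave) and returns whether the
# statement is true under that assignment.

def solve(puzzle):
    """Brute-force all role assignments consistent with every speaker's role:
    a knight's statement must be true, a knave's must be false."""
    people = list(puzzle)
    solutions = []
    for bits in product([True, False], repeat=len(people)):
        assignment = dict(zip(people, bits))
        if all(puzzle[p](assignment) == assignment[p] for p in people):
            solutions.append(assignment)
    return solutions

# Example 2-person puzzle:
#   A says: "B is a knave."
#   B says: "A and I are of the same kind."
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: a["A"] == a["B"],
}
print(solve(puzzle))  # [{'A': True, 'B': False}] -> A is a knight, B is a knave

# Math-level perturbation (role flip): negate what A claims about B, then
# re-solve so the perturbed puzzle carries a fresh ground-truth answer.
flipped = dict(puzzle, A=lambda a: a["B"])  # A now says "B is a knight."
print(solve(flipped))  # [{'A': True, 'B': True}] -> both are knights
```

A language-level perturbation would instead re-render the same abstract puzzle with different names or statement order, leaving the solution unchanged.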

-----

💡 Key Insights:

→ LLMs can simultaneously use memorization and genuine reasoning

→ Fine-tuning improves generalization even as memorization increases

→ Models can develop reasoning skills even when fine-tuned only on question-answer pairs, without intermediate reasoning steps

→ More complex puzzles show higher memorization scores

→ Language-level perturbations affect models less than mathematical structure changes

-----

📊 Results:

→ Only advanced LLMs achieve >70% accuracy on 2-person puzzles

→ Performance drops to 11% for 8-person puzzles

→ GPT-4o-mini reaches near-100% training accuracy on 3- and 5-person puzzles after fine-tuning

→ LiMem scores of ~50% on 8-person puzzles indicate heavy memorization (a toy computation of the score is sketched below)

→ Models show memorization scores of about 80% under role-flipping perturbations
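
As a rough illustration of how a LiMem-style number such as the ~50% above could be computed, here is a hedged sketch; the `limem` helper and its exact normalization are assumptions made for illustration and may differ from the paper's precise definition:

```python
def limem(results):
    """Toy Local Inconsistency-style Memorization score (illustrative only).

    `results` holds (correct_on_original, correct_on_perturbed) pairs for the
    same puzzles before and after a local perturbation. High accuracy on the
    originals combined with a low survival rate under perturbation is read as
    memorization rather than reasoning.
    """
    n = len(results)
    acc = sum(orig for orig, _ in results) / n          # accuracy on originals
    solved = [pert for orig, pert in results if orig]   # puzzles solved pre-perturbation
    if not solved:
        return 0.0
    inconsistency = sum(1 for pert in solved if not pert) / len(solved)
    return acc * inconsistency

# Toy usage: the model solves 8/10 originals, but only 4 of those 8 survive
# a role-flip perturbation -> score = 0.8 * 0.5 = 0.4
toy = [(True, True)] * 4 + [(True, False)] * 4 + [(False, False)] * 2
print(limem(toy))  # 0.4
```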
