LLMs crack only 11% of cryptic crosswords while humans solve 99% - here's why.
This paper investigates why LLMs struggle with cryptic crosswords, analyzing their performance in definition extraction, wordplay recognition, and reasoning processes.
https://arxiv.org/abs/2412.09012
🧩 Original Problem:
→ While LLMs excel in many language tasks, they perform poorly on cryptic crosswords, achieving only 11.4% accuracy compared to human experts' 99%
→ Previous research hasn't deeply analyzed why LLMs struggle with these puzzles
-----
🔍 Solution in this Paper:
→ The researchers evaluated three LLMs (Gemma2, LLaMA3, ChatGPT) on individual cryptic clues rather than complete grids
→ They broke down the solving process into three key components: definition extraction, wordplay type identification, and solution explanation
→ They created a new dataset with annotated wordplay types to analyze model performance
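The three sub-tasks could each be posed to a model as its own prompt; below is a minimal sketch of what such prompt templates might look like. The wording and the wordplay-type list here are illustrative guesses, not the paper's actual prompts:

```python
# Illustrative prompt templates for the three sub-tasks the paper probes.
# The exact wording is a guess; the paper's real prompts may differ.

WORDPLAY_TYPES = ["anagram", "hidden word", "double definition", "assemblage"]

def definition_prompt(clue: str, length: int) -> str:
    """Ask the model to locate the definition part of the clue."""
    return (f"Cryptic clue: {clue} ({length})\n"
            "Which word or phrase in the clue serves as the definition "
            "of the answer? Reply with that word or phrase only.")

def wordplay_prompt(clue: str, length: int) -> str:
    """Ask the model to classify the clue's wordplay type."""
    options = ", ".join(WORDPLAY_TYPES)
    return (f"Cryptic clue: {clue} ({length})\n"
            f"Which wordplay type does this clue use? Choose one of: {options}.")

def explanation_prompt(clue: str, answer: str) -> str:
    """Ask the model to explain how the wordplay yields a given answer."""
    return (f"Cryptic clue: {clue}\nAnswer: {answer}\n"
            "Explain step by step how the wordplay produces the answer.")
```

Scoring each sub-task separately is what lets the authors say where in the pipeline a model fails, rather than only whether the final answer was right.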
-----
💡 Key Insights:
→ Models perform better at definition extraction (41.2%) than at solving full clues end-to-end (11.4%)
→ LLMs struggle most with complex wordplay operations and character manipulation
→ Double definition clues are easier for models due to similarity with standard language tasks
→ Models tend to over-predict certain wordplay types (anagrams, hidden words) while rarely identifying others (assemblage)
-----
📊 Results:
→ ChatGPT achieved 11.4% accuracy in zero-shot settings
→ Performance improved to 16.2% when given the definition
→ Definition extraction reached 41.2% accuracy
→ Wordplay type detection peaked at 44.5% accuracy