
"What Makes Cryptic Crosswords Challenging for LLMs?"

The podcast on this paper is generated with Google's Illuminate.

LLMs crack only 11.4% of cryptic crossword clues while human experts solve 99% - here's why.

This paper investigates why LLMs struggle with cryptic crosswords, analyzing their performance in definition extraction, wordplay recognition, and reasoning processes.

https://arxiv.org/abs/2412.09012

🧩 Original Problem:

→ While LLMs excel in many language tasks, they perform poorly on cryptic crosswords, achieving only 11.4% accuracy compared to human experts' 99%

→ Previous research hasn't deeply analyzed why LLMs struggle with these puzzles

-----

🔍 Solution in this Paper:

→ The researchers evaluated three LLMs (Gemma2, LLaMA3, ChatGPT) on individual cryptic clues rather than complete grids

→ They broke down the solving process into three key components: definition extraction, wordplay type identification, and solution explanation

→ They created a new dataset with annotated wordplay types to analyze model performance

-----

💡 Key Insights:

→ Models perform better at definition extraction (41.2%) than complete puzzle solving (11.4%)

→ LLMs struggle most with complex wordplay operations and character manipulation

→ Double definition clues are easier for models due to similarity with standard language tasks

→ Models tend to over-predict certain wordplay types (anagrams, hidden words) while rarely identifying others (assemblage)
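To make the wordplay types above concrete, here is a toy Python sketch (not from the paper, and not how the LLMs were evaluated) of the mechanical checks behind two of the types the models over-predict: anagrams, where the answer's letters are a rearrangement of some "fodder" words in the clue, and hidden words, where the answer appears as a contiguous substring running across the clue's words.

```python
def is_anagram(fodder: str, answer: str) -> bool:
    """Anagram wordplay: the answer's letters are exactly a
    rearrangement of the fodder's letters (spaces ignored)."""
    norm = lambda s: sorted(s.replace(" ", "").lower())
    return norm(fodder) == norm(answer)

def hides(clue: str, answer: str) -> bool:
    """Hidden-word wordplay: the answer is a contiguous run of
    letters spanning the clue's words (punctuation/spaces ignored)."""
    letters = "".join(c.lower() for c in clue if c.isalpha())
    return answer.lower() in letters

# Classic anagram: "dirty room" rearranges into DORMITORY
print(is_anagram("dirty room", "dormitory"))        # True
# Hidden word: "Part of ham, let me see" conceals HAMLET
print(hides("Part of ham, let me see", "hamlet"))   # True
```

These checks are trivial for a program but require precise character-level manipulation, which is exactly where the paper finds LLMs weakest; double definitions, by contrast, need no letter juggling at all, which is consistent with models finding them easier.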

-----

📊 Results:

→ ChatGPT achieved 11.4% accuracy in zero-shot settings

→ Performance improved to 16.2% when given the definition

→ Definition extraction reached 41.2% accuracy

→ Wordplay type detection peaked at 44.5% accuracy
