This paper evaluates the strategic reasoning capabilities of six LLMs using games from behavioral economics.
It measures their reasoning depth and compares performance across games and against human subjects.
https://arxiv.org/abs/2412.13013
Methods in this Paper 💡:
→ The paper tests six LLMs (3 ChatGPT versions, 3 Claude versions) on behavioral economics games (p-Beauty Contest, Guessing Game, and 11-20 Money Request Game).
→ It analyzes their responses using established hierarchical models of reasoning (level-k theory and cognitive hierarchy theory) to quantify their strategic sophistication (see the sketch after this list).
→ Multiple rounds with feedback are included to assess whether the models learn from repeated play.
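To make the level-k framing concrete, here is a minimal Python sketch (not the paper's code) of the textbook level-k predictions, assuming a 2/3-beauty contest with a level-0 anchor of 50 and the standard undercutting logic of the 11-20 money request game; the paper fits hierarchical models of this kind to the LLMs' stated choices.

```python
# Minimal sketch of canonical level-k predictions (illustrative assumptions:
# p = 2/3 and a uniform level-0 anchor of 50 for the beauty contest).

def level_k_guess(k: int, p: float = 2 / 3, anchor: float = 50.0) -> float:
    """A level-k player best responds to level-(k-1): guess = p^k * anchor."""
    return (p ** k) * anchor

def level_k_request(k: int, top: int = 20, bottom: int = 11) -> int:
    """In the 11-20 game, level-0 asks for 20; each higher level undercuts by 1."""
    return max(top - k, bottom)

if __name__ == "__main__":
    for k in range(4):
        print(f"level-{k}: beauty-contest guess = {level_k_guess(k):.1f}, "
              f"11-20 request = {level_k_request(k)}")
```

Deeper levels of iterated best response imply lower guesses and lower requests, which is how the paper infers each model's reasoning depth from its choices.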
-----
Key Insights from this Paper 🔑:
→ Most LLMs struggle with higher-order strategic reasoning, even when they demonstrate knowledge of the games.
→ The models show some learning over repeated rounds with feedback, but their reasoning still falls short of human subjects, with GPT-o1 as the exception.
→ GPT-o1, which is trained for complex reasoning, consistently outperforms the other LLMs and human subjects.
-----
Results 💯:
→ GPT-o1 demonstrates high-level strategic reasoning across the games, while the other LLMs show limited reasoning depth.
→ Human subjects exhibit stronger strategic reasoning than the other LLMs, except on certain games where GPT-o1 excels.