Gaming becomes a transparent window into how LLMs reason about complex problems.
GAMEBoT introduces a novel benchmark for evaluating LLM reasoning through interactive gaming environments, decomposing complex decisions into verifiable subproblems so the assessment itself becomes transparent.
-----
https://arxiv.org/abs/2412.13602
🤔 Original Problem:
Current LLM reasoning benchmarks face challenges with interpretability, performance saturation, and data contamination. Existing game-based evaluations measure only final outcomes, missing crucial insight into the decision-making process.
-----
🎮 Solution in this Paper:
→ GAMEBoT breaks down complex game decisions into 2-3 modular subproblems, requiring explicit intermediate reasoning steps.
→ The framework employs strategically guided Chain-of-Thought prompts infused with domain expertise.
→ Rule-based algorithms automatically validate the LLMs' intermediate answers against ground truth (see the sketch after this list).
→ Eight diverse games spanning board games, action games, card games, and game theory exercise different cognitive abilities.
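The validation step lends itself to a compact illustration. Below is a minimal Python sketch, assuming hypothetical subproblem tags and answer strings; the paper's actual prompts, games, and rule-based solvers differ.

```python
# Minimal sketch of GAMEBoT-style intermediate-step validation. Tag names,
# subproblems, and the game itself are hypothetical, not the paper's exact setup.
import re


def extract_answer(llm_output: str, tag: str) -> str | None:
    """Pull the LLM's answer for one subproblem from a tagged span like <sub1>...</sub1>."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", llm_output, re.DOTALL)
    return match.group(1).strip() if match else None


def score_decision(llm_output: str, subproblem_truths: dict[str, str]) -> float:
    """Score one turn: each subproblem's extracted answer is compared against
    ground truth produced by a rule-based solver; subproblems weigh equally."""
    correct = sum(
        extract_answer(llm_output, tag) == truth
        for tag, truth in subproblem_truths.items()
    )
    return correct / len(subproblem_truths)


# Usage: two subproblems for one Connect-Four-style turn (hypothetical answers).
truths = {"sub1": "column_4", "sub2": "block_opponent"}
output = "Reasoning... <sub1>column_4</sub1> then <sub2>block_opponent</sub2> <move>4</move>"
print(score_decision(output, truths))  # 1.0
```

Because the ground truth for each subproblem is computed algorithmically from the game state, this kind of scoring needs no human grading and no LLM judge.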
-----
💡 Key Insights:
→ Intermediate step evaluation strongly correlates with final game outcomes
→ Competition-based assessment provides more diverse state exposure than fixed-policy opponents
→ Domain-specific prompting outperforms generic Chain-of-Thought approaches
-----
📊 Results:
→ GPT-4o achieved the highest average score of 0.52 on intermediate reasoning tasks
→ Closed-source models consistently outperformed open-source alternatives
→ Even top models struggled with complex reasoning subproblems