TurtleBench evaluates LLMs using dynamic, real-world puzzles, focusing on reasoning over knowledge recall.
📚 https://arxiv.org/pdf/2410.05262
Original Problem 🔍:
Existing LLM evaluation benchmarks rely on static datasets, struggle to assess dynamic interactions, and often depend on specific background knowledge, which makes it hard to isolate and measure logical reasoning ability.
-----
Solution in this Paper 🧪:
• TurtleBench: A new evaluation benchmark using real user guesses from an online Turtle Soup Puzzle platform
• Collects 1,532 user guesses with correctness annotations
• Focuses on reasoning capabilities without relying on external knowledge
• Provides quantifiable results through clear correct/incorrect judgments (see the sketch after this list)
• Dynamically updates evaluation data to reduce the risk of model cheating
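A minimal sketch of what such a correct/incorrect evaluation loop could look like; the `call_llm` function, data fields, and prompt wording are illustrative assumptions, not the paper's actual code:

```python
# Hypothetical TurtleBench-style evaluation loop: judge each user guess
# against the hidden story and compare to the human annotation.
from typing import Callable

JUDGE_PROMPT = (
    "You are the host of a Turtle Soup puzzle.\n"
    "Surface (known to the guesser): {surface}\n"
    "Bottom (hidden full story): {bottom}\n"
    "Player guess: {guess}\n"
    "Based only on the hidden story, answer with exactly one word: "
    "Correct or Incorrect."
)

def evaluate(entries: list[dict], call_llm: Callable[[str], str]) -> float:
    """entries: dicts with 'surface', 'bottom', 'guess', and a human 'label'
    ('Correct' / 'Incorrect'). Returns accuracy of the model's judgments."""
    hits = 0
    for e in entries:
        prompt = JUDGE_PROMPT.format(
            surface=e["surface"], bottom=e["bottom"], guess=e["guess"]
        )
        verdict = call_llm(prompt).strip().lower()
        predicted = "Correct" if verdict.startswith("correct") else "Incorrect"
        hits += predicted == e["label"]
    return hits / len(entries) if entries else 0.0
```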
-----
Key Insights from this Paper 💡:
• Dynamic data collection aligns evaluations more closely with genuine user needs
• Quantifiable yes/no format enhances reliability of evaluations
• Larger parameter counts don't always correlate with better performance
• 2-shot prompting generally improves performance over 0-shot (prompt assembly sketched below)
• OpenAI's o1 models underperformed, possibly due to reliance on trivial Chain-of-Thought techniques
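A rough sketch of how 0-shot vs. 2-shot prompts might be assembled; the demonstration guesses and wording are placeholders, not the paper's actual few-shot examples:

```python
# Placeholder few-shot demonstrations (labels shown to the model as examples).
FEW_SHOT = [
    {"guess": "The man had eaten turtle soup once before.", "label": "Correct"},
    {"guess": "The soup was poisoned.", "label": "Incorrect"},
]

def build_prompt(surface: str, bottom: str, guess: str, shots: int = 0) -> str:
    """shots=0 gives a 0-shot prompt; shots=2 prepends two solved examples."""
    header = (
        f"Surface: {surface}\nBottom: {bottom}\n"
        "Judge each guess as 'Correct' or 'Incorrect'.\n"
    )
    demos = "".join(
        f"Guess: {d['guess']}\nAnswer: {d['label']}\n" for d in FEW_SHOT[:shots]
    )
    return header + demos + f"Guess: {guess}\nAnswer:"
```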
-----
Results 📊:
• Evaluated 9 top LLMs on TurtleBench
• Claude-3.5-Sonnet and GPT-4o performed best, with >87% accuracy
• OpenAI o1 models underperformed expectations
• 2-shot prompting improved performance by ~2% for most models
• Dataset and evaluation code available on GitHub