TURTLEBENCH: EVALUATING TOP LANGUAGE MODELS VIA REAL-WORLD YES/NO PUZZLES

The podcast on this paper is generated with Google's Illuminate.

TurtleBench evaluates LLMs using dynamic, real-world puzzles, focusing on reasoning over knowledge recall.

📚 https://arxiv.org/pdf/2410.05262

Original Problem 🔍:

Existing LLM evaluation benchmarks rely on static datasets, struggle to assess dynamic interactions, and often depend on specific background knowledge, complicating measurement of logical reasoning capabilities.

-----

Solution in this Paper 🧪:

• TurtleBench: A new evaluation benchmark using real user guesses from an online Turtle Soup Puzzle platform

• Collects 1,532 user guesses with correctness annotations

• Focuses on reasoning capabilities without relying on external knowledge

• Provides quantifiable results through clear correct/incorrect judgments (a minimal scoring sketch follows this list)

• Dynamically updates evaluation data to reduce the risk of benchmark contamination (models "cheating" by memorizing test items)
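
The sketch below shows how a TurtleBench-style yes/no evaluation could be scored: the model sees a puzzle story plus a real user guess and must answer Correct or Incorrect, and accuracy is the fraction of verdicts matching the human annotation. The prompt wording, the demo data, and the `query_model` stub are illustrative assumptions, not the authors' code; the real benchmark supplies 1,532 annotated guesses and its own prompts.

```python
from dataclasses import dataclass

@dataclass
class GuessItem:
    surface: str   # the publicly visible part of the Turtle Soup story
    bottom: str    # the hidden truth behind the story
    guess: str     # a real user's yes/no guess about the story
    label: str     # gold annotation: "Correct" or "Incorrect"

def build_prompt(item: GuessItem) -> str:
    # The model sees the full story plus the user's guess and must
    # reply with a single verdict: Correct or Incorrect.
    return (
        f"Story (surface): {item.surface}\n"
        f"Story (bottom): {item.bottom}\n"
        f"Player guess: {item.guess}\n"
        "Is the guess Correct or Incorrect? Answer with one word."
    )

def query_model(prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g., a chat-completion request).
    # It always answers "Correct" here so the script runs end to end.
    return "Correct"

def parse_verdict(reply: str) -> str:
    text = reply.strip().lower()
    # "incorrect" contains "correct", so check the longer token first.
    return "Incorrect" if "incorrect" in text else "Correct"

def accuracy(items: list[GuessItem]) -> float:
    # Accuracy = fraction of guesses where the model's verdict matches
    # the human correctness annotation.
    hits = sum(parse_verdict(query_model(build_prompt(it))) == it.label for it in items)
    return hits / len(items)

if __name__ == "__main__":
    demo = [  # tiny made-up examples, not items from the actual dataset
        GuessItem("A man orders turtle soup, tastes it, then leaves.", "(hidden backstory)",
                  "The man had tasted this soup before under different circumstances.", "Correct"),
        GuessItem("A man orders turtle soup, tastes it, then leaves.", "(hidden backstory)",
                  "The soup was poisoned.", "Incorrect"),
    ]
    print(f"Accuracy: {accuracy(demo):.2%}")
```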

-----

Key Insights from this Paper 💡:

• Dynamic data collection aligns evaluations more closely with genuine user needs

• Quantifiable yes/no format enhances reliability of evaluations

• Larger parameter counts don't always correlate with better performance

• 2-shot prompting generally improves performance over 0-shot (see the prompt sketch after this list)

• OpenAI's o1 models underperformed, possibly because they rely on relatively trivial Chain-of-Thought techniques
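
As a rough illustration of the 0-shot vs. 2-shot comparison, 2-shot prompting simply prepends two solved guess/verdict pairs before the new guess. The example wording below is assumed for illustration and is not the paper's exact prompt template.

```python
# Two worked guess/verdict demonstrations (made-up examples).
FEW_SHOT_EXAMPLES = [
    ("Player guess: The man recognized the taste of the soup.", "Correct"),
    ("Player guess: The waiter poisoned the man.", "Incorrect"),
]

def zero_shot_prompt(story: str, guess: str) -> str:
    # 0-shot: the story and the new guess only.
    return f"{story}\nPlayer guess: {guess}\nAnswer Correct or Incorrect."

def two_shot_prompt(story: str, guess: str) -> str:
    # 2-shot: the same prompt, preceded by two solved examples.
    demos = "\n".join(f"{g}\nAnswer: {v}" for g, v in FEW_SHOT_EXAMPLES)
    return f"{story}\n{demos}\nPlayer guess: {guess}\nAnswer Correct or Incorrect."
```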

-----

Results 📊:

• Evaluated 9 top LLMs on TurtleBench

• Claude-3.5-Sonnet and GPT-4o performed best, with >87% accuracy

• OpenAI o1 models underperformed expectations

• 2-shot prompting improved performance by ~2% for most models

• Dataset and evaluation code available on GitHub
