POKERBENCH is a new benchmark for evaluating (and fine-tuning) LLMs at optimal poker play, with 11,000 carefully curated scenarios that test complex decision-making under uncertainty and incomplete information.
https://arxiv.org/abs/2501.08328
Methods in this Paper 💡:
→ POKERBENCH introduces a comprehensive benchmark with 11,000 poker scenarios split between pre-flop and post-flop play.
→ The scenarios are carefully filtered from billions of possible game states using board texture analysis and optimal action probability thresholds.
→ The benchmark evaluates both action accuracy (correct fold/call/raise decisions) and exact match accuracy (correct decision plus precise bet sizing); a sketch of both metrics follows this list.
→ POKERBENCH is validated through extensive gameplay testing between models with different benchmark scores.
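The paper's scoring code is not reproduced here, but the two metrics are simple to state. Below is a minimal Python sketch under assumed conventions: each scenario carries a gold action string such as "fold", "call", or "raise 12" (size in big blinds). The field names and string format are illustrative assumptions, not the actual POKERBENCH schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Hypothetical field; the real POKERBENCH schema may differ.
    gold_action: str  # e.g. "fold", "call", or "raise 12"

def parse(action: str) -> tuple[str, str | None]:
    """Split an action string into (verb, size), e.g. 'raise 12' -> ('raise', '12')."""
    parts = action.strip().lower().split()
    return parts[0], parts[1] if len(parts) > 1 else None

def score(scenarios: list[Scenario], predictions: list[str]) -> dict[str, float]:
    """Action accuracy counts the verb alone; exact match also requires the bet size."""
    action_hits = exact_hits = 0
    for scen, pred in zip(scenarios, predictions):
        gold_verb, gold_size = parse(scen.gold_action)
        pred_verb, pred_size = parse(pred)
        if gold_verb == pred_verb:
            action_hits += 1
            # A sizeless action (fold/call) needs no size; a raise must match the size too.
            if gold_size is None or gold_size == pred_size:
                exact_hits += 1
    n = len(scenarios)
    return {"action_accuracy": action_hits / n, "exact_match_accuracy": exact_hits / n}
```

In this sketch, predicting "raise 10" when the gold action is "raise 12" earns credit for action accuracy but not for exact match, which is the distinction the benchmark's two headline numbers are meant to capture.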
-----
Key Insights 🔑:
→ All current LLMs significantly underperform at poker compared to their capabilities in other domains
→ Fine-tuning on POKERBENCH dramatically improves poker performance
→ Higher POKERBENCH scores correlate strongly with better gameplay results
→ Simple supervised learning alone may be insufficient for optimal poker strategy
-----
Results 📊:
→ GPT-4: 53.55% overall accuracy (best among pre-trained models)
→ Fine-tuned Llama-3-8B: 78.26% accuracy, outperforming GPT-4
→ Win rate correlation: in head-to-head play, models with higher POKERBENCH scores consistently beat lower-scoring models (see the sketch below)
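The paper's full match protocol is not reproduced here; the following is a toy Python sketch of the consistency check that the gameplay validation implies, namely that in every heads-up match the model with the higher POKERBENCH score should finish with positive winnings. The model names, data layout, and the bb/100 figure are illustrative assumptions, not results from the paper (only the two accuracy scores quoted above are from the post).

```python
# Hypothetical inputs: POKERBENCH accuracies per model and heads-up match logs
# as (model_a, model_b, big blinds won per 100 hands by model_a).
pokerbench_score = {"gpt-4": 53.55, "ft-llama-3-8b": 78.26}
matches = [("ft-llama-3-8b", "gpt-4", 5.2)]  # 5.2 bb/100 is an illustrative value

def consistent_with_benchmark(matches, scores) -> bool:
    """True if, in every match, the higher-scoring model also wins chips."""
    for a, b, bb_per_100_a in matches:
        if (scores[a] > scores[b]) != (bb_per_100_a > 0):
            return False
    return True

print(consistent_with_benchmark(matches, pokerbench_score))  # -> True in this toy case
```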