SimpleQA tests whether LLMs can answer short factual questions without hallucinating
https://arxiv.org/abs/2411.04368v1
🎯 Original Problem:
Existing hallucination benchmarks are either too easy for frontier models or too complex to grade reliably.
-----
🔧 Solution in this Paper:
→ SimpleQA benchmark with 4,326 short fact-seeking questions, adversarially collected against GPT-4
→ Questions designed to have single, indisputable answers that won't change over time
→ Two-stage verification process with independent AI trainers to ensure answer correctness
→ Questions cover diverse topics like science, politics, art, and technology
→ Automated grading via a ChatGPT classifier that labels each response as correct, incorrect, or not attempted (sketched below)
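A minimal sketch of what that grading step could look like, assuming the OpenAI chat API as the grader backend; the prompt wording, model name (gpt-4o-mini), and label handling are illustrative guesses, not the paper's exact grader.

```python
# Illustrative SimpleQA-style autograder: classify one model response
# against the gold answer as CORRECT / INCORRECT / NOT_ATTEMPTED.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_PROMPT = """You are grading a question-answering system.
Question: {question}
Gold answer: {gold}
Predicted answer: {predicted}

Reply with exactly one word:
CORRECT       - the prediction fully contains the gold answer without contradicting it
INCORRECT     - the prediction contradicts the gold answer
NOT_ATTEMPTED - the prediction declines to answer or gives no usable answer"""

def grade(question: str, gold: str, predicted: str, model: str = "gpt-4o-mini") -> str:
    """Return one of CORRECT / INCORRECT / NOT_ATTEMPTED for a single response."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, gold=gold, predicted=predicted)}],
    )
    label = resp.choices[0].message.content.strip().upper()
    # Fall back to NOT_ATTEMPTED if the grader returns anything unexpected.
    return label if label in {"CORRECT", "INCORRECT", "NOT_ATTEMPTED"} else "NOT_ATTEMPTED"
```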
-----
💡 Key Insights:
→ Larger models show better calibration between stated confidence and actual accuracy (see the calibration sketch after this list)
→ Models consistently overstate their confidence in answers
→ How often a model repeats the same answer across resamples correlates with accuracy, especially in more advanced models
→ Claude models attempt fewer questions yet post F-scores comparable to GPT models (the F-score metric is sketched after the results)
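A quick, generic way to check a calibration claim like this is to bucket answers by the model's stated confidence and compare each bucket's average confidence with its observed accuracy. The snippet below is a sketch of that binning, not the paper's evaluation script.

```python
# Generic reliability check: bucket (stated confidence, is_correct) pairs and
# compare each bucket's average confidence to its observed accuracy.
from collections import defaultdict

def calibration_table(records, n_bins: int = 10):
    """records: iterable of (stated_confidence in [0, 1], is_correct: bool)."""
    bins = defaultdict(list)
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        table.append((avg_conf, accuracy, len(items)))
    return table  # a well-calibrated model has avg_conf ~ accuracy in every row

# Toy example: a model that says "~90% sure" but is right 3 times out of 4 is overconfident.
print(calibration_table([(0.9, True), (0.9, False), (0.92, True), (0.88, False), (0.95, True)]))
```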
-----
📊 Results:
→ OpenAI o1-preview achieves the highest score, with 42.7% correct answers
→ GPT-4o scores 38.2% correct with only 1% not attempted
→ Claude-3-opus reaches 23.5% correct with 39.6% not attempted
→ Inherent error rate of the benchmark's reference answers is estimated at roughly 3%
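As I read the paper, the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions; plugging in the numbers reported above shows how the headline figure trades correctness off against abstention. The helper below is an illustrative reconstruction of that metric, not the paper's released scoring code.

```python
# F-score as the harmonic mean of overall accuracy and accuracy over attempted
# questions (my reading of the paper's metric; percentages from the results above).
def simpleqa_f_score(pct_correct: float, pct_not_attempted: float) -> float:
    overall = pct_correct / 100
    attempted = 1 - pct_not_attempted / 100
    correct_given_attempted = overall / attempted
    return 2 * overall * correct_given_attempted / (overall + correct_given_attempted)

print(f"GPT-4o:        {simpleqa_f_score(38.2, 1.0):.3f}")
print(f"Claude-3-opus: {simpleqa_f_score(23.5, 39.6):.3f}")
```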