A framework that ensures LLMs can't cheat by memorizing test data
AntiLeak-Bench prevents data contamination in LLM evaluation by automatically constructing test samples with knowledge updated after model cutoff dates.
https://arxiv.org/abs/2412.13670
🔍 Original Problem:
→ Current LLM evaluation benchmarks suffer from data contamination when test data leaks into newer models' training sets
→ Existing solutions that collect new data can't guarantee contamination-free evaluation and require intensive human effort
-----
🛠️ Solution in this Paper:
→ AntiLeak-Bench identifies real-world knowledge from Wikidata that was updated after an LLM's knowledge cutoff
→ It automatically constructs question-answer pairs about this updated knowledge, using Wikipedia articles as supporting documents (see the sketch after this list)
→ The framework supports both single-hop and multi-hop reasoning questions
→ The entire process is fully automated, eliminating human labor in benchmark maintenance
-----
💡 Key Insights:
→ LLMs show performance drops after their cutoff dates, indicating data contamination in pre-cutoff evaluations
→ Multiple-choice testing is significantly easier for models than free-form generation
→ Longer contexts and multi-hop reasoning pose greater challenges for LLMs
→ Proprietary models outperform open-source ones by large margins
-----
📊 Results:
→ Human verification rates over 96% of the automatically constructed samples as correct
→ GPT-4o leads with 77.2% exact-match (EM) and 87.9% F1 scores (metrics sketched below)
→ Most open-source models score below 50% EM on generation tasks
-----
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/