"AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge"

A podcast on this paper was generated with Google's Illuminate.

A framework that ensures LLMs can't cheat by memorizing test data

AntiLeak-Bench prevents data contamination in LLM evaluation by automatically constructing test samples with knowledge updated after model cutoff dates.

https://arxiv.org/abs/2412.13670

🔍 Original Problem:

→ Current LLM evaluation benchmarks suffer from data contamination when test data leaks into newer models' training sets

→ Existing solutions that collect new data can't guarantee contamination-free evaluation and require intensive human effort

-----

🛠️ Solution in this Paper:

→ AntiLeak-Bench identifies, from Wikidata, real-world knowledge that was updated after an LLM's training cutoff

→ It automatically constructs question-answer pairs about this updated knowledge using Wikipedia articles as supporting documents

→ The framework supports both single-hop and multi-hop reasoning questions

→ The entire process is fully automated, eliminating human labor from benchmark construction and maintenance (a minimal sketch of the core step follows)
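
Conceptually, the construction step reduces to diffing two Wikidata snapshots (one from before the model's cutoff, one from after) and templating questions over the triples whose object changed. Here is a minimal Python sketch using toy in-memory snapshots; the snapshot format, `QUESTION_TEMPLATES`, and `build_antileak_samples` are hypothetical stand-ins for the paper's actual pipeline, which works over full Wikidata dumps:

```python
# Toy stand-ins for two Wikidata dumps: one taken before the model's
# training cutoff, one taken after. Keys are (subject, relation) pairs,
# values are the object of the triple. Real dumps are far richer.
SNAPSHOT_BEFORE_CUTOFF = {
    ("Qatar", "head of government"): "Khalid bin Khalifa Al Thani",
}
SNAPSHOT_AFTER_CUTOFF = {
    ("Qatar", "head of government"): "Mohammed bin Abdulrahman Al Thani",
}

# Hypothetical single-hop question template per relation.
QUESTION_TEMPLATES = {
    "head of government": "Who is the head of government of {subject}?",
}

def build_antileak_samples(before, after, templates):
    """Keep only triples whose object changed after the cutoff, so the
    gold answer cannot appear in any pre-cutoff training corpus."""
    samples = []
    for (subject, relation), new_object in after.items():
        old_object = before.get((subject, relation))
        if old_object is not None and old_object != new_object:
            samples.append({
                "question": templates[relation].format(subject=subject),
                "answer": new_object,           # post-cutoff gold answer
                "outdated_answer": old_object,  # what a contaminated model may say
            })
    return samples

for s in build_antileak_samples(SNAPSHOT_BEFORE_CUTOFF,
                                SNAPSHOT_AFTER_CUTOFF,
                                QUESTION_TEMPLATES):
    print(s["question"], "->", s["answer"])
```

Supporting Wikipedia articles for the changed entities are then attached as context documents, and multi-hop variants chain two or more such triples.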

-----

💡 Key Insights:

→ LLMs' scores drop sharply on knowledge updated after their cutoff dates, which indicates that pre-cutoff evaluations are inflated by data contamination

→ Multiple-choice testing is significantly easier for models than free-form generation (see the format sketch after this list)

→ Longer contexts and multi-hop reasoning pose greater challenges for LLMs

→ Proprietary models outperform open-source ones by large margins
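
To see why the two formats differ in difficulty, compare what each asks of the model. These prompt templates are illustrative assumptions, not the paper's exact wording:

```python
# Hypothetical prompt templates for the two evaluation formats.

question = "Who is the head of government of Qatar?"
choices = [
    "Mohammed bin Abdulrahman Al Thani",  # post-cutoff gold answer
    "Khalid bin Khalifa Al Thani",        # outdated distractor
]

# Multiple-choice: the model only has to discriminate among options,
# so partial knowledge is often enough.
mc_prompt = (
    f"Question: {question}\n"
    + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    + "\nAnswer with the letter of the correct option."
)

# Free-form generation: the model must produce the entity name itself,
# which is then scored with EM/F1 and is considerably harder.
gen_prompt = f"Question: {question}\nAnswer:"

print(mc_prompt)
print(gen_prompt)
```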

-----

📊 Results:

→ Human verification confirms over 96% of the automatically generated samples are accurate

→ GPT-4o leads with 77.2% exact match (EM) and 87.9% F1 (metrics sketched after this list)

→ Most open-source models score below 50% EM on generation tasks
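
EM and F1 here are presumably the standard extractive-QA metrics: EM scores 1 only when the normalized prediction matches the gold answer exactly, while F1 gives partial credit for token overlap. A minimal sketch of that standard scoring:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the usual SQuAD-style answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    # 1.0 only if the normalized strings are identical.
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    # Token-level overlap between prediction and gold answer.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the PM is Mohammed", "Mohammed"))  # 0.0
print(f1_score("the PM is Mohammed", "Mohammed"))     # 0.5, partial credit
```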

-----

Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest-signal AI developments. ↓↓

🎉 https://rohanpaul.substack.com/