New datasets and methods expose true LLM mathematical reasoning.
This paper tackles the issue of contaminated evaluation datasets for Large Language Models (LLMs).
-----
Paper - https://arxiv.org/abs/2501.14002
Original Problem 🧐:
→ Current LLM evaluations may be flawed.
→ Benchmark datasets used for evaluation can appear verbatim in LLM training data.
→ Such contamination inflates scores and makes reported performance unreliable.
-----
Solution in this Paper 💡:
→ The paper uses Llama 2 as the base LLM because it was released before OpenWebMath, a potential source of contamination.
→ They build new evaluation datasets, GAOKAO and ZHONGKAO, from recent Chinese college and high-school entrance exams.
→ Both datasets were created after Llama 2's release, which sharply reduces the risk of evaluation-data contamination.
→ MinHash deduplication is applied to the training data, removing duplicate documents and overlapping text (a sketch follows this list).
→ This deduplication improves training-data quality and minimizes contamination of the evaluation sets.
→ Evaluation uses zero-shot and few-shot Chain-of-Thought (CoT) prompting.
→ An answer-comparison model judges whether a model's free-form output matches the reference answer, handling inconsistent output formats.
→ For each dataset, the higher of the zero-shot and few-shot accuracies is reported, making results less sensitive to prompt format (see the evaluation sketch below).
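A minimal sketch of MinHash-based near-duplicate filtering, assuming the `datasketch` Python library. The shingle size, permutation count, and similarity threshold below are illustrative choices, not the paper's exact settings:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of hash permutations (illustrative)
THRESHOLD = 0.8   # Jaccard similarity above which two docs count as duplicates (illustrative)

def minhash_of(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature from word n-gram shingles."""
    tokens = text.split()
    sig = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(tokens) - ngram + 1)):
        shingle = " ".join(tokens[i:i + ngram])
        sig.update(shingle.encode("utf-8"))
    return sig

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents kept after near-duplicate filtering."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):  # an existing near-duplicate was found: drop this doc
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```

The same LSH index can also be queried with signatures of the GAOKAO and ZHONGKAO questions to flag training documents that overlap with the evaluation sets.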
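And a hedged sketch of the evaluation protocol: zero-shot and few-shot CoT prompting, an answer-comparison step, and reporting the better of the two accuracies. Here `generate` and `answers_match` are hypothetical stand-ins for the LLM call and the paper's answer-comparison model:

```python
# Placeholder for worked examples with step-by-step solutions.
FEW_SHOT_PREFIX = "..."

def build_prompt(question: str, few_shot: bool) -> str:
    """Wrap a question in a CoT prompt, optionally prefixed with worked examples."""
    prefix = FEW_SHOT_PREFIX if few_shot else ""
    return f"{prefix}Question: {question}\nLet's think step by step."

def accuracy(dataset, generate, answers_match, few_shot: bool) -> float:
    correct = 0
    for question, reference in dataset:
        output = generate(build_prompt(question, few_shot))
        # The comparison model judges semantic equivalence, tolerating
        # format differences (e.g. "1/2" vs "0.5").
        if answers_match(output, reference):
            correct += 1
    return correct / len(dataset)

def evaluate(dataset, generate, answers_match) -> float:
    zero_shot = accuracy(dataset, generate, answers_match, few_shot=False)
    few_shot = accuracy(dataset, generate, answers_match, few_shot=True)
    return max(zero_shot, few_shot)  # report the higher accuracy per dataset
```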
-----
Key Insights from this Paper 🤔:
→ Dataset contamination poses a significant threat to accurate LLM evaluation.
→ Using evaluation datasets created after the LLM's training cutoff reduces contamination risk.
→ Deduplication and decontamination of training data are crucial for reliable evaluation.
→ Diverse, recent datasets like GAOKAO and ZHONGKAO offer a more comprehensive assessment of mathematical reasoning ability.
-----
Results 📊:
→ Deduplication removed 2594 contaminated documents from OpenWebMath, showing the substantial impact decontamination has on evaluation.
→ Combined with the newer GAOKAO and ZHONGKAO datasets, this methodological rigor strengthens the reliability of the reported LLM evaluation results.