Leveraging a math olympiad forum, this paper automatically creates 600K high-quality math QA pairs for training stronger Large Language Models.
It also introduces a contamination-resistant benchmark: a continuously evolving, timestamped evaluation set built from live forum data.
Paper - https://arxiv.org/abs/2501.14275
Original Problem 🤔:
→ Existing datasets for training and evaluating LLMs on Olympiad-level math problems are limited in size and quality.
→ Creating large-scale datasets for advanced math reasoning is labor-intensive and requires expert knowledge.
→ Current benchmarks are susceptible to contamination, making evaluations unreliable due to potential pre-training data overlap.
-----
Solution in this Paper 💡:
→ The paper introduces an automated pipeline to extract question-answer pairs from the Art of Problem Solving (AoPS) forum.
→ This pipeline utilizes open-source LLMs to process and filter forum content, creating a high-quality dataset named AoPS-Instruct.
→ AoPS-Instruct contains over 600,000 question-answer pairs focused on Olympiad-level math problems.
→ The paper also proposes LiveAoPSBench, a continuously evolving evaluation benchmark derived from the latest AoPS forum data with timestamps.
→ Because LiveAoPSBench is built only from newly posted problems, they are unlikely to appear in any model's pre-training data, which is what makes the benchmark contamination-resistant. A minimal sketch of both ideas follows this list.
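The paper's exact prompts and filtering rules aren't reproduced here; the following is a minimal sketch of the two steps, assuming a hypothetical `rewrite_with_llm` helper standing in for the open-source LLM call and `ForumPost` records scraped from AoPS threads:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ForumPost:
    question: str
    replies: list[str]      # candidate solutions posted in the thread
    posted_at: datetime

def rewrite_with_llm(question: str, reply: str) -> str | None:
    """Hypothetical: ask an open-source LLM to rewrite a raw forum reply
    into a clean step-by-step solution; return None if the reply is not
    actually a solution (chit-chat, links, partial hints)."""
    raise NotImplementedError("plug in your own model call here")

def build_qa_pairs(posts: list[ForumPost]) -> list[dict]:
    """Pipeline sketch: turn forum threads into filtered QA pairs."""
    pairs = []
    for post in posts:
        for reply in post.replies:
            solution = rewrite_with_llm(post.question, reply)
            if solution:  # LLM-based filter keeps only genuine solutions
                pairs.append({
                    "question": post.question,
                    "answer": solution,
                    "timestamp": post.posted_at.isoformat(),
                })
    return pairs

def live_benchmark_split(pairs: list[dict], cutoff: datetime) -> list[dict]:
    """Contamination-resistant eval set: keep only QA pairs posted after
    the cutoff (e.g., the newest model's training-data cutoff date)."""
    return [p for p in pairs
            if datetime.fromisoformat(p["timestamp"]) > cutoff]
```

The timestamp on every pair is what lets the benchmark evolve: re-running the split with a later cutoff yields a fresh evaluation set that post-dates any given model's training data.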
-----
Key Insights from this Paper:
→ High-quality, large-scale datasets for advanced math reasoning can be automatically generated from online forums using LLMs.
→ Timestamped, continuously updated benchmarks like LiveAoPSBench are crucial for reliable LLM evaluation and for mitigating contamination.
→ LLM performance on older datasets might be inflated due to pre-training data exposure, rather than genuine reasoning ability.
→ Performance decline on newer, contamination-resistant benchmarks suggests the need for more robust reasoning capabilities in LLMs.
-----
Results 📊:
→ Fine-tuning LLMs on AoPS-Instruct improves their math reasoning, yielding higher scores on benchmarks such as MATH and GSM8K.
→ LLM accuracy on LiveAoPSBench declines significantly on more recently posted problems, suggesting that scores on older, static benchmarks are inflated by contamination.
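One way to see this decline concretely is to bucket LiveAoPSBench results by the month each problem was posted and compute per-bucket accuracy. This is a sketch of that analysis, not the paper's evaluation code; the `results` record format is an assumption:

```python
from collections import defaultdict
from datetime import datetime

def accuracy_by_month(results: list[dict]) -> dict[str, float]:
    """results: [{"timestamp": ISO-8601 str, "correct": bool}, ...]
    Returns accuracy per year-month bucket, in chronological order."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        month = datetime.fromisoformat(r["timestamp"]).strftime("%Y-%m")
        totals[month] += 1
        hits[month] += int(r["correct"])
    return {m: hits[m] / totals[m] for m in sorted(totals)}

# A flat curve suggests genuine reasoning; a sharp drop after the model's
# training cutoff suggests earlier scores were inflated by contamination.
```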