
"Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation"

The podcast below was generated with Google's Illuminate.

Leveraging an online math olympiad forum, this paper automatically creates 600K high-quality math QA pairs for training stronger Large Language Models.

It also builds a contamination-resistant benchmark: a continuously evolving, timestamped evaluation set drawn from live forum data.

Paper - https://arxiv.org/abs/2501.14275

Original Problem 🤔:

→ Existing datasets for training and evaluating LLMs on Olympiad-level math problems are limited in size and quality.

→ Creating large-scale datasets for advanced math reasoning is labor-intensive and requires expert knowledge.

→ Current benchmarks are susceptible to contamination, making evaluations unreliable due to potential pre-training data overlap.

-----

Solution in this Paper 💡:

→ The paper introduces an automated pipeline to extract question-answer pairs from the Art of Problem Solving (AoPS) forum.

→ This pipeline utilizes open-source LLMs to process and filter forum content, creating a high-quality dataset named AoPS-Instruct.

→ AoPS-Instruct contains over 600,000 question-answer pairs focused on Olympiad-level math problems.

→ The paper also proposes LiveAoPSBench, a continuously evolving evaluation benchmark derived from the latest AoPS forum data with timestamps.

→ LiveAoPSBench acts as a contamination-resistant benchmark because it draws on newly posted problems, reducing the likelihood of overlap with pre-training data. Both the extraction pipeline and the timestamp cutoff are sketched below.
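To make the two mechanisms concrete, here is a minimal Python sketch of the extract-rewrite-filter loop and the timestamp cutoff. This is not the paper's code: the thread fields, the `llm_rewrite` callable, and the `NOT_A_SOLUTION` marker are hypothetical stand-ins for whatever open-source LLM and prompts the authors actually use.

```python
# Minimal sketch of the two ideas above, NOT the paper's implementation.
# Assumptions (hypothetical): raw forum threads arrive as dicts with
# "problem", "replies", and "post_date" fields, and `llm_rewrite` wraps
# any open-source instruction-tuned LLM behind a text-in/text-out call.

from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass
class QAPair:
    question: str
    answer: str
    post_date: date  # timestamp kept for contamination-resistant splits


def build_qa_pairs(threads: list[dict],
                   llm_rewrite: Callable[[str], str]) -> list[QAPair]:
    """Turn raw forum threads into cleaned question-answer pairs."""
    pairs = []
    for thread in threads:
        question = thread["problem"].strip()
        for reply in thread["replies"]:
            # Ask the LLM to rewrite a terse forum reply into a full,
            # step-by-step solution; drop replies it judges non-solutions.
            solution = llm_rewrite(
                "Rewrite this forum reply as a complete step-by-step "
                "solution to the problem, or output NOT_A_SOLUTION.\n\n"
                f"Problem: {question}\n\nReply: {reply}"
            )
            if "NOT_A_SOLUTION" in solution:
                continue
            pairs.append(QAPair(question, solution, thread["post_date"]))
    return pairs


def timestamp_split(pairs: list[QAPair], cutoff: date) -> list[QAPair]:
    """Keep only problems posted after `cutoff` (e.g. a model's training
    data cut-off), so the evaluation set is unlikely to be contaminated."""
    return [p for p in pairs if p.post_date > cutoff]
```

A real pipeline would also deduplicate against existing benchmarks and run further quality filtering; the sketch only shows the shape of the extract-rewrite-filter loop and the timestamp cutoff.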

-----

Key Insights from this Paper:

→ High-quality, large-scale datasets for advanced math reasoning can be automatically generated from online forums using LLMs.

→ Timestamped, continuously updated benchmarks like LiveAoPSBench are crucial for reliable LLM evaluation and for mitigating contamination.

→ LLM performance on older datasets might be inflated due to pre-training data exposure, rather than genuine reasoning ability.

→ Performance decline on newer, contamination-resistant benchmarks suggests the need for more robust reasoning capabilities in LLMs.

-----

Results 📊:

→ Fine-tuning LLMs on AoPS-Instruct improves their mathematical reasoning, yielding better performance on benchmarks such as MATH and GSM8K.

→ LLM performance declines significantly on LiveAoPSBench over time, highlighting contamination issues in static benchmarks.
