"RedStone: Curating General, Code, Math, and QA Data for Large Language Models"

The podcast accompanying this paper was generated with Google's Illuminate.

Common Crawl becomes a goldmine for AI training with RedStone's smart filtering.

RedStone introduces a pipeline that extracts high-quality LLM training data from Common Crawl, producing datasets for a general domain and for specialized domains such as code, math, and QA. The extraction and filtering modules yield 3.48 trillion tokens in total, and models trained on the resulting data show significantly improved performance across a range of tasks.

-----

https://arxiv.org/abs/2412.03398

🔍 Original Problem:

→ Creating high-quality training datasets for LLMs is expensive and time-consuming, especially for domain-specific knowledge.

→ Existing approaches rely heavily on proprietary resources or manually annotated data, limiting dataset size and scope.

-----

🛠️ Solution in this Paper:

→ RedStone introduces a two-module pipeline, Extraction followed by Filtering (sketched after this list).

→ The Extraction module processes raw web data using pattern recognition and NLP techniques.

→ The Filtering module refines content through keywords, regular expressions, and machine learning models.

→ RedStone creates specialized datasets: RedStone-Web (3.17T tokens), RedStone-Code (250.2B tokens), RedStone-Math (15.9B tokens), and RedStone-QA (51.4B tokens).
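
To make the two-module design concrete, here is a minimal Python sketch of how such an extract-then-filter pipeline could be wired together. The function names, regex patterns, keyword cues, and thresholds are illustrative assumptions rather than the paper's actual implementation, and `scorer` stands in for whatever learned quality classifier is used.

```python
# A minimal sketch of a RedStone-style extract-then-filter pipeline.
# Function names, regexes, cues, and thresholds are illustrative assumptions,
# not the paper's actual implementation.
import re

def extract_text(html: str) -> str:
    """Extraction module: strip markup and keep visible text (greatly simplified)."""
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)       # drop remaining tags
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def rule_filter(text: str, min_words: int = 50) -> bool:
    """Filtering stage 1: cheap keyword/regex rules reject obviously low-value pages."""
    if len(text.split()) < min_words:
        return False
    if "lorem ipsum" in text.lower():          # boilerplate cue
        return False
    return True

def model_filter(text: str, scorer, threshold: float = 0.5) -> bool:
    """Filtering stage 2: a learned quality scorer (e.g. a small text classifier)
    keeps only documents scoring above a threshold."""
    return scorer(text) >= threshold

def route(text: str) -> str:
    """Send a kept document to a specialized subset based on crude keyword cues."""
    lowered = text.lower()
    if any(cue in lowered for cue in ("def ", "#include", "public static")):
        return "code"
    if any(cue in text for cue in ("\\frac", "\\begin{equation}", "$$")):
        return "math"
    return "web"

def process(html: str, scorer):
    """Full pipeline: extract, filter in two stages, then route to a subset."""
    text = extract_text(html)
    if rule_filter(text) and model_filter(text, scorer):
        return route(text), text
    return None
```

The key design point the sketch illustrates is ordering: the cheap rule-based stage runs first, so the learned model only scores documents that survive it.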

-----

💡 Key Insights:

→ Common Crawl contains rich domain-specific knowledge with contextual information

→ Web data often includes valuable annotations and explanations

→ Two-stage filtering (rule-based + model-based) ensures data quality

-----

📊 Results:

→ RedStone-Web outperforms existing datasets on commonsense reasoning tasks (PIQA: 0.6795, HellaSwag: 0.3722)

→ RedStone-Code improves HumanEval pass@1 from 0.0125 to 0.0555 (the pass@1 metric is sketched after this list)

→ RedStone-Math achieves lower (better) perplexity on GSM8k (3.1125) than OpenWebMath (3.2503)
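
For reference, pass@1 in these results follows the standard HumanEval convention: generate n samples per problem and estimate the probability that at least one of k samples passes the unit tests. Below is a short sketch of the usual unbiased estimator; variable names are illustrative.

```python
# Standard unbiased pass@k estimator (HumanEval convention); names are illustrative.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, i.e. the fraction of passing samples.
```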
