
"Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset"

The podcast on this paper was generated with Google's Illuminate.

Nemotron-CC transforms Common Crawl into a high-quality, diverse dataset for long-horizon LLM pretraining.

It achieves a better accuracy-quantity trade-off through classifier ensembling, synthetic data generation, and reduced heuristic filtering, enabling state-of-the-art results when training over 15T tokens.

-----

https://arxiv.org/abs/2412.02595

🔍 Original Problem:

Existing English Common Crawl datasets such as FineWeb-Edu and DCLM remove roughly 90% of the data to maximize benchmark accuracy, leaving too few unique tokens for long-horizon (multi-trillion-token) LLM training.

-----

🛠️ Solution in this Paper:

→ Nemotron-CC ensembles multiple quality classifiers so that documents are judged from diverse perspectives when identifying high-quality content (a minimal sketch follows this list).

→ It uses synthetic data generation to rephrase low-quality content and create diverse variants of high-quality data.

→ The method reduces reliance on heuristic filters, especially for high-quality content, to preserve more unique tokens.

→ This approach creates a 6.3T token dataset with 4.4T globally deduplicated original tokens and 1.9T synthetic tokens.
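
A minimal sketch of what the classifier-ensembling and quality-conditioned filtering could look like, assuming each classifier returns a quality score in [0, 1]. The bucket count, cutoff, and max-aggregation are illustrative assumptions, not the paper's exact pipeline:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    text: str
    quality_bucket: int = 0  # higher bucket = higher estimated quality


def ensemble_bucket(text: str,
                    classifiers: List[Callable[[str], float]],
                    num_buckets: int = 20) -> int:
    """Score the text with every classifier, keep the maximum, and map it to
    an integer bucket. Taking the max raises recall: a document only needs
    one classifier to rate it highly to land in a high bucket."""
    best = max(clf(text) for clf in classifiers)
    return min(int(best * num_buckets), num_buckets - 1)


def keep(doc: Document,
         classifiers: List[Callable[[str], float]],
         heuristic_filters: List[Callable[[str], bool]],
         high_quality_cutoff: int = 15) -> Optional[Document]:
    """Apply heuristic filters only to lower buckets, so high-quality
    documents (as judged by the ensemble) keep more of their unique tokens."""
    doc.quality_bucket = ensemble_bucket(doc.text, classifiers)
    if doc.quality_bucket >= high_quality_cutoff:
        return doc  # skip heuristic filtering for high-quality content
    if all(f(doc.text) for f in heuristic_filters):
        return doc
    return None  # dropped by heuristic filtering
```

With this routing, heuristic filters never touch documents the ensemble already rates highly, which is one way to preserve more unique tokens from the crawl.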

-----

💡 Key Insights from this Paper:

→ Ensembling different classifiers increases the recall of high-quality documents

→ Rephrasing can effectively reduce noise in low-quality data and produce diverse variants of high-quality data (see the rephrasing sketch after this list)

→ Disabling traditional heuristic filters for high-quality data can boost token yield without hurting accuracy
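
A similarly hedged sketch of the rephrasing idea: low-quality text is routed to a noise-reducing rewrite prompt, while high-quality text is routed to a diversity-adding prompt. The `generate` callable and both prompt templates are placeholders rather than the paper's actual prompts or model:

```python
from typing import Callable

# Hypothetical prompt templates: low-quality text gets a cleaner,
# encyclopedia-style rewrite; high-quality text gets a diverse variant
# (here, question-and-answer pairs covering the same knowledge).
REPHRASE_LOW_QUALITY = (
    "Rewrite the following text as a clear, well-structured encyclopedia "
    "passage, keeping all factual content:\n\n{text}"
)
DIVERSIFY_HIGH_QUALITY = (
    "Write a list of question-and-answer pairs that cover the knowledge in "
    "the following text:\n\n{text}"
)


def synthesize(text: str,
               quality_bucket: int,
               generate: Callable[[str], str],
               high_quality_cutoff: int = 15) -> str:
    """Route a document to a synthetic-data prompt based on its ensemble
    quality bucket (see the earlier sketch), then generate the new text."""
    if quality_bucket >= high_quality_cutoff:
        prompt = DIVERSIFY_HIGH_QUALITY.format(text=text)
    else:
        prompt = REPHRASE_LOW_QUALITY.format(text=text)
    return generate(prompt)
```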

-----

📊 Results:

→ 8B-parameter model trained on 1T tokens: +5.6 MMLU over DCLM

→ Full dataset: 4x more unique real tokens than DCLM, while matching its MMLU performance

→ 15T-token training: outperforms Llama 3.1 8B (+5 MMLU, +3.1 ARC-Challenge, +0.5 average across 10 tasks)
