Nemotron-CC transforms Common Crawl into a high-quality, diverse dataset for long-horizon LLM pretraining.
It achieves a better accuracy-quantity trade-off through classifier ensembling, synthetic data rephrasing, and reduced heuristic filtering, enabling state-of-the-art results when training over a 15T-token horizon.
-----
https://arxiv.org/abs/2412.02595
🔍 Original Problem:
Existing English Common Crawl datasets such as FineWeb-Edu and DCLM discard roughly 90% of the data, trading token quantity for quality and limiting their suitability for long-horizon (multi-trillion-token) LLM training.
-----
🛠️ Solution in this Paper:
→ Nemotron-CC employs classifier ensembling to identify high-quality documents from diverse perspectives (see the sketch after this list).
→ It uses synthetic data generation to rephrase low-quality content into cleaner text and to create diverse variants of high-quality data.
→ The method reduces reliance on heuristic filters, especially for high-quality content, to preserve more unique tokens.
→ This approach creates a 6.3T token dataset with 4.4T globally deduplicated original tokens and 1.9T synthetic tokens.
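Below is a minimal Python sketch of the classifier-ensembling idea. It is illustrative only: the percentile-rank normalization, the max-combination rule, and the bucket count are assumptions about the general technique, not the paper's exact classifiers or thresholds.

```python
from typing import Callable

import numpy as np


def ensemble_quality_buckets(
    docs: list[str],
    classifiers: list[Callable[[str], float]],  # each returns a raw quality score
    n_buckets: int = 5,                         # assumed bucket count
) -> np.ndarray:
    """Assign each document a coarse quality bucket via classifier ensembling.

    Raw scores from each classifier are converted to percentile ranks so that
    differently calibrated models become comparable; the final score is the
    MAX rank across classifiers, so a document only needs one classifier to
    rate it highly -- this is what raises recall of high-quality documents.
    """
    n = len(docs)
    ranks = np.zeros((len(classifiers), n))
    for i, clf in enumerate(classifiers):
        scores = np.array([clf(d) for d in docs])
        # Rank within this classifier's own score distribution, scaled to [0, 1].
        ranks[i] = scores.argsort().argsort() / max(n - 1, 1)
    best = ranks.max(axis=0)  # ensemble by max, not mean
    # Map the ensembled rank into buckets 0..n_buckets-1 (0 = lowest quality).
    return np.minimum((best * n_buckets).astype(int), n_buckets - 1)
```

Downstream, the top buckets can skip most heuristic filters (preserving unique tokens), while the lowest buckets are routed to LLM rephrasing, matching the reduced-filtering idea above.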
-----
💡 Key Insights from this Paper:
→ Ensembling different classifiers increases the recall of high-quality documents
→ Rephrasing can effectively reduce noise in low-quality data and produce diverse variants of high-quality data (see the rephrasing sketch below)
→ Disabling traditional heuristic filters for high-quality data can boost token yield without hurting accuracy
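A minimal sketch of the rephrasing step, under stated assumptions: the prompt wording, the `generate` callable, and the bucket cutoff are hypothetical placeholders, not the paper's actual prompts or models.

```python
from typing import Callable

# Illustrative prompt -- not the paper's exact wording.
REPHRASE_PROMPT = (
    "Rewrite the following text as clear, well-structured encyclopedic "
    "prose. Preserve every fact and do not add new information.\n\n{text}"
)


def rephrase_low_quality(
    bucketed_docs: list[tuple[str, int]],  # (text, quality bucket)
    generate: Callable[[str], str],        # any LLM completion function
    low_quality_cutoff: int = 1,           # hypothetical threshold
) -> list[str]:
    """Denoise low-quality documents by LLM rephrasing.

    Documents at or below the cutoff bucket are rewritten, reducing noise
    while keeping their information; higher-quality documents pass through
    unchanged. (The paper also generates diverse variants of high-quality
    text; that branch is omitted here.)
    """
    out: list[str] = []
    for text, bucket in bucketed_docs:
        if bucket <= low_quality_cutoff:
            out.append(generate(REPHRASE_PROMPT.format(text=text)))
        else:
            out.append(text)
    return out
```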
-----
📊 Results:
→ 8B parameter model trained on 1T tokens: +5.6 MMLU improvement over DCLM
→ Full dataset: 4x more unique real tokens than DCLM while matching MMLU performance
→ 15T token training: Outperforms Llama 3.1 8B (+5 MMLU, +3.1 ARC-Challenge, +0.5 average across 10 tasks)