Table of Contents
📂 Data Collection & Initial Cleaning
🧹 Filtering Noise, PII, and Duplicates
⚖️ Balancing Languages and Domains
🔒 Data Governance: Licensing & Transparency
🛠️ New Curation Tools in 2024–2025
🏢 Open-Source vs Enterprise Pipelines
🤖 Case Studies: Recent LLM Dataset Curation
📂 Data Collection & Initial Cleaning
Gathering a massive text corpus is the first step in pretraining LLMs. In 2024, most projects leverage large-scale web crawls (e.g. Common Crawl dumps) combined with curated sources like Wikipedia, open-access books, code repositories, and academic papers (here). Modern pipelines often start with tens or hundreds of terabytes of raw text (HTML, JSON) and extract clean text content using robust scrapers. For example, the Dolma corpus (by AllenAI) sourced ~200 TB of raw web, code, and scholarly text and curated it down to an 11 TB clean dataset (~3 trillion tokens). This involves parsing documents (removing HTML tags, boilerplate, and non-text content) and normalizing text encoding. Tools like trafilatura or custom HTML parsers are used to isolate the main text from clutter.
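To make the extraction step concrete, here is a minimal sketch using trafilatura; the example URL is a placeholder, and real pipelines typically read WARC records from Common Crawl dumps rather than fetching live pages.

```python
import trafilatura

# Hypothetical example URL; production pipelines read WARC records from Common Crawl instead
url = "https://example.com/some-article"

downloaded = trafilatura.fetch_url(url)  # raw HTML, or None on failure
if downloaded:
    # extract() strips boilerplate (menus, footers, ads) and returns the main text, or None
    text = trafilatura.extract(downloaded, include_comments=False)
    if text and len(text.split()) > 50:  # drop extremely short documents
        print(text[:500])
```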
Initial cleaning also means discarding extremely short or empty texts and non-UTF-8 content, and enforcing language detection. Common practice is to detect the language of each document (e.g. using fastText language ID) and remove texts in languages that are not needed or that can't be recognized (e.g. garbled content). Large open datasets such as RedPajama-Data-v2 (released late 2023) demonstrate these collection methods: it gathered over 100 billion web documents from 84 CommonCrawl snapshots, then filtered and deduplicated them to 30 billion documents (~30 trillion tokens) (RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models). This scale is unprecedented, but it shows that 2024-era LLM projects aim to approach the data quantities used by industry leaders (for reference, Meta's Llama 2 was trained on roughly 2 T tokens of curated data).
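A minimal sketch of the language-detection filter described above, using fastText's pretrained language-ID model (lid.176.bin); the confidence threshold and language whitelist are illustrative assumptions.

```python
import fasttext

# Pretrained language-ID model (lid.176.bin), downloadable from the fastText website
lid_model = fasttext.load_model("lid.176.bin")

KEEP_LANGS = {"en", "de", "fr", "es", "it"}  # illustrative whitelist

def keep_by_language(text: str, min_conf: float = 0.65) -> bool:
    # fastText expects a single line of text, so newlines are flattened
    labels, probs = lid_model.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    # Drop documents in unwanted languages or with low-confidence predictions (often garbled text)
    return lang in KEEP_LANGS and probs[0] >= min_conf
```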
Apart from raw web crawling, other public data sources are integrated to boost quality and diversity. These include Wikipedia and other Wikimedia texts, which provide well-structured knowledge; large code corpora (e.g. permissively licensed GitHub code, Stack Exchange Q&A) to imbue models with programming knowledge; public domain books (e.g. Project Gutenberg); and scientific articles (arXiv or Semantic Scholar Open Research). Modern open datasets tend to be multi-source by design. For example, Dolma's 3T-token corpus draws from six source categories: web pages, code, Reddit discussions, scientific papers, books, and encyclopedic content (here). By combining diverse sources, the training set covers a wide range of styles and domains, which is crucial for a generalist LLM.
🧹 Filtering Noise, PII, and Duplicates
After raw data collection, an extensive filtering pipeline cleanses the corpus. Noise removal targets content that is low-value or nonsensical for language modeling: e.g. HTML boilerplate (“Privacy Policy”, navigation menus), random strings or hashes, or pages with extremely high repetition of the same phrase. Scripts or regex rules can strip boilerplate and junk. Many pipelines also apply perplexity-based filtering, where a smaller language model scores each text; if a document's perplexity is too high (meaning the text is too random or gibberish), it gets discarded as likely noisy or machine-generated. Recent research emphasizes the effectiveness of model-based filtering: the DataComp-LM benchmark found that using an LLM or classifier to judge text quality yields better downstream models than simple heuristics (DataComp-LM: In search of the next generation of training sets for language models). In practice, this might involve training a classifier to predict whether a document is “high-quality”, or using an instruction-tuned model to rate coherence. For example, the FineWeb-Edu project (Hugging Face, 2024) trained a classifier on quality annotations from a Llama 3 70B model to score web pages by educational value, then kept only top-scoring documents (oumi.datasets.pretraining — Oumi). This aggressive filtering produced a 1.3 T-token subset of exceptionally high-quality educational text.
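As a sketch of perplexity-based filtering, the snippet below scores documents with a small causal LM (GPT-2, chosen purely for illustration) via the transformers library; the perplexity threshold is an assumed value that would be tuned per corpus.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Truncate to the model's context window and compute the language-modeling loss
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def keep_document(text: str, max_ppl: float = 1000.0) -> bool:
    # Assumed threshold: very high perplexity usually indicates gibberish or heavily garbled text
    return perplexity(text) < max_ppl
```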
Another critical filter is for personal or sensitive information (PII). To responsibly use public data, many pipelines now detect and remove PII like phone numbers, email addresses, physical addresses, or personal names in contexts that look private. This is often done with regex patterns (for phone numbers, SSNs, etc.) and Named Entity Recognition to catch person names. Some teams incorporate dedicated PII detection libraries or models (e.g. Microsoft Presidio or custom classifiers) to scrub out any confidential data. For instance, in the Dolma corpus, the portion derived from software repositories (commits, issue discussions) was processed with PII removal to exclude things like author emails in commit logs. Automatic PII filtering isn’t perfect, but it substantially reduces the chance that a model memorizes someone’s personal details.
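Below is a minimal regex-based scrubbing sketch for the simplest PII classes (emails and phone numbers); the patterns are deliberately coarse illustrations, and production pipelines layer NER models or tools like Presidio on top.

```python
import re

# Deliberately coarse patterns for illustration; real pipelines use broader, locale-aware rules
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def scrub_pii(text: str) -> str:
    # Replace matches with placeholder tokens rather than deleting them,
    # so sentence structure stays intact for language modeling
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or call (555) 123-4567."))
# -> "Contact [EMAIL] or call [PHONE]."
```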
Deduplication is another filtering stage that became a must in 2024. Duplicate texts can cause overfitting and disproportionately influence a model if repeated many times. Thus, large datasets undergo deduplication at both the document and span level. A common approach is hashing each document's text (or paragraphs) and dropping exact duplicates. More advanced pipelines use fuzzy deduplication with techniques like MinHash or SimHash to catch near-duplicates (e.g. the same article with minor differences) (oumi.datasets.pretraining — Oumi). For example, the Falcon RefinedWeb dataset (built by TII for the Falcon LLM) performed large-scale deduplication on CommonCrawl data, reducing ~1 billion raw pages to 2.8 TB of “clean” unique text. Similarly, DeepSeek's team noted they “employed advanced deduplication techniques to remove duplicate data across multiple dumps”, maximizing the unique information in their 2 trillion-token dataset (DeepSeek AI: Advancing Open-Source LLMs with MoE & Reinforcement Learning | DeepSeek-R1 & V3 Explained). Deduplication can remove 20–30% of raw web data as redundant, so it's a massive but necessary cut. The trend in 2025 is to prefer fewer, diverse tokens over sheer volume, since including too many duplicates or boilerplate pages yields diminishing returns.
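A sketch of fuzzy deduplication with MinHash LSH using the datasketch library; the shingle size, number of permutations, and Jaccard threshold are assumed values, and at trillion-token scale this logic would run distributed rather than in a single loop.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # Shingle the document into word 5-grams and hash each shingle
    tokens = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today !",  # near-duplicate
    "a completely different document about curating pretraining data for large language models",
]

# Keep a document only if no already-kept document has estimated Jaccard similarity >= 0.8
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in enumerate(corpus):
    m = minhash_of(text)
    if not lsh.query(m):  # no near-duplicate among kept documents
        lsh.insert(str(doc_id), m)
        kept.append(doc_id)

print(kept)  # typically prints [0, 2]
```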
In summary, filtering pipelines in 2024–2025 include language detection, heuristic cleaning, model-based quality filtering, PII scrubbing, and multi-level deduplication. These steps dramatically improve the signal-to-noise ratio. One study showed that a 7B model trained on a carefully filtered set (the DataComp-LM baseline) outperformed models trained on 40% more tokens of unfiltered data (DataComp-LM: In search of the next generation of training sets for language models). The investment in filtering pays off in better generalization and less unwanted memorization.
⚖️ Balancing Languages and Domains
Ensuring the dataset isn't overly dominated by a single language or domain is another best practice. Many LLM efforts in 2024 aim for multilingual and multi-domain balance. For multilingual corpora, this means controlling the proportion of each language to achieve the desired model competencies. For example, Cohere's latest Command models explicitly include training data in dozens of languages – English, major European languages, Chinese, Arabic, etc. – to broaden the model's capabilities (The Command R Model (Details and Application) — Cohere). They include both a core set of ~10 high-resource languages where performance is optimized, and additional languages for basic understanding. To balance languages, one approach is to up-sample underrepresented languages or down-sample English (which naturally dominates web data) so that the final token mix isn't 99% English. The RedPajama-2 dataset, for instance, extracted five languages (English, French, German, Spanish, Italian) from CommonCrawl with an explicit breakdown (e.g. ~20.5 T English tokens, ~3 T tokens each for de/fr/es) (GitHub - togethercomputer/RedPajama-Data: The RedPajama-Data repository contains code for preparing large datasets for training large language models.). By capping and sampling from each language, it achieved a spread that ensures the resulting model will be truly multilingual.
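One common rebalancing recipe is exponent-based (temperature) sampling over per-language token counts, sketched below; the token counts mirror the approximate RedPajama-2 figures cited above, while the exponent is an illustrative assumption.

```python
# Exponent-based language rebalancing: raising token counts to a power alpha < 1
# flattens the distribution, up-weighting lower-resource languages relative to English.
token_counts = {"en": 20.5e12, "de": 3.0e12, "fr": 3.0e12, "es": 3.0e12}  # approximate
alpha = 0.7  # 1.0 keeps proportions as-is; 0.0 yields a uniform mix (illustrative choice)

weights = {lang: count ** alpha for lang, count in token_counts.items()}
total = sum(weights.values())
sampling_probs = {lang: w / total for lang, w in weights.items()}

for lang, p in sampling_probs.items():
    print(f"{lang}: {p:.1%}")
```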
Domain balancing is about mixing different genres and sources in appropriate ratios. A model should see a healthy blend of conversational text, formal writing, code, literature, etc., rather than, say, 80% forum discussions. Earlier open datasets like The Pile (2021) used fixed percentages from categories (books, GitHub, arXiv, Wikipedia, etc.). In 2024, with giant web-derived corpora, domain balancing can be done by segmenting the data by source or content type and then weighting the segments. For example, after initial filtering, one might have buckets: news articles, Wikipedia, web forums, legal texts, code, etc. Sampling uniformly from each bucket can prevent any single type from overwhelming the rest. DeepSeek followed a “remixing stage” in data curation to adjust composition and ensure broad representation across different domains (DeepSeek AI: Advancing Open-Source LLMs with MoE & Reinforcement Learning | DeepSeek-R1 & V3 Explained). This likely involved analyzing the token share of each domain and re-weighting sources so that the final training mix covers diverse knowledge areas without heavy bias.
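The remixing idea can be sketched as deriving per-domain token draws from target mixture ratios and a total token budget; all numbers below are illustrative assumptions, not any project's actual composition.

```python
# Derive per-domain token draws from target mixture ratios and a total token budget.
available = {"web": 8.0e12, "code": 1.0e12, "papers": 0.5e12, "books": 0.3e12, "wiki": 0.05e12}
target_mix = {"web": 0.60, "code": 0.15, "papers": 0.10, "books": 0.10, "wiki": 0.05}
budget = 3.0e12  # total training tokens

for domain, share in target_mix.items():
    want = share * budget
    epochs = want / available[domain]  # > 1.0 means the source must be repeated (up-sampled)
    print(f"{domain:7s} draw {want / 1e12:.2f}T tokens ({epochs:.2f} epochs over the source)")
```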
Another aspect is avoiding over-concentration of particular websites or authors. Large web crawls often contain many pages from a few popular sites. Curation pipelines sometimes cap the number of documents included per domain (hostname) so the model does not effectively memorize one site's style, which keeps the training distribution more general.
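A per-hostname cap can be implemented as a simple streaming filter, sketched below; the cap value is an illustrative assumption.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_DOCS_PER_HOST = 10_000  # illustrative cap

def cap_by_hostname(docs):
    """docs: iterable of (url, text) pairs; yields at most MAX_DOCS_PER_HOST docs per hostname."""
    counts = defaultdict(int)
    for url, text in docs:
        host = urlparse(url).netloc.lower()
        if counts[host] < MAX_DOCS_PER_HOST:
            counts[host] += 1
            yield url, text
```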
In multi-domain datasets, coverage matters: including a bit of everything (coding questions, scientific papers, casual social media, etc.) makes the model more versatile. However, 2024 results suggest that simply adding more data isn't always better – careful selection yields better performance. Projects like FineWeb-Edu and DCLM chose to drop a huge fraction of data (discarding the lowest-quality ~90%) to maximize average quality (here). The art of balancing is to give the model a rich diet of varied text without diluting quality or skewing too far towards any single content domain or language.
🔒 Data Governance: Licensing & Transparency
Alongside technical filtering, data governance has become a key concern. By 2024, there is intense scrutiny of what data LLMs are trained on, especially regarding copyright and consent. Best practices now push for using openly licensed or public domain data whenever possible (Mozilla, EleutherAI publish research on open datasets for LLM training). For example, EleutherAI and Mozilla convened an open dataset initiative to promote training on Creative Commons or otherwise permissively licensed content. The Dolma corpus was built under an open-data philosophy: all its sources are accessible to the public (Common Crawl, etc.), and the released dataset is under an Open Data Commons Attribution license (oumi.datasets.pretraining — Oumi). Users of Dolma must still respect original source licenses, but the idea is to ensure the training data is legally shareable and auditable.
Large companies have faced legal challenges for using copyrighted text without permission. Notably, in early 2024 a lawsuit revealed that MosaicML's MPT models were partly trained on Books3, a pirated ebook collection (Databricks, Inc. Large Language Model Litigation). This led to the removal of Books3 from open access and highlighted the risks of including copyrighted material. Now, open projects avoid non-consensual datasets like Books3. Instead, they lean on alternatives such as Stack Exchange (which is CC BY-SA), GitHub code with permissive licenses, or OpenWebText (Reddit-sourced web content referencing only freely available URLs). Data governance also means implementing opt-out policies. Some web publishers tag content with HTML metadata (like <meta name="robots" content="noai">) to signal that AI training use is disallowed. Responsible dataset curators respect these signals by filtering out such pages. Likewise, projects may maintain blocklists of websites known to contain mostly copyrighted or private data (for example, removing pages from fanfiction sites or paywalled journals).
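As a sketch of honoring such opt-out signals, the check below scans robots meta tags for a "noai" directive using BeautifulSoup; the exact set of directives to respect is an assumption, and some publishers express opt-outs via robots.txt or X-Robots-Tag headers instead.

```python
from bs4 import BeautifulSoup

def is_opted_out(html: str) -> bool:
    # True if the page carries a robots-style meta tag with a "noai" directive
    soup = BeautifulSoup(html, "html.parser")
    for meta in soup.find_all("meta"):
        name = (meta.get("name") or "").lower()
        content = (meta.get("content") or "").lower()
        if name == "robots" and ("noai" in content or "noimageai" in content):
            return True
    return False

html = '<html><head><meta name="robots" content="noai, noimageai"></head><body>...</body></html>'
print(is_opted_out(html))  # True -> exclude this page from the training set
```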
Transparency is another pillar of governance. Open LLM efforts in 2024 are far more transparent about their data than earlier models. Teams now publish data documentation or datasheets enumerating the sources and content breakdown. For instance, the Dolma paper provides a detailed table of contents by source (Common Crawl, GitHub, etc.) (here), and Llama 2's model card (2023) described the filtering steps and the proportion of each domain (though not releasing the data itself). Researchers argue that sharing such details helps with accountability and reproducibility (Towards Best Practices for Open Datasets for LLM Training). Mozilla's 2025 report calls out the harm caused by secrecy in training data and advocates openness to enable public scrutiny.
Finally, ethical governance includes honoring opt-outs from content creators and providing ways to remove data if legitimate owners object. While this is still evolving, we see a movement toward more consent-based dataset building. Common Voice (Mozilla’s project for speech data) is one example in the audio domain; for text, initiatives are just beginning. In summary, 2024–2025 dataset curators pay much closer attention to legal rights, give attribution (Dolma’s license requires crediting original sources) (oumi.datasets.pretraining — Oumi), and strive to be transparent about what goes into the training mix. This is both to mitigate legal risk and to build public trust in open AI development.
🛠️ New Curation Tools in 2024–2025
The scale of data now demands specialized tools and frameworks. One major development was the release of the Dolma Toolkit in 2024 – an open-source, high-performance pipeline for assembling trillion-token datasets (here). This toolkit, from AllenAI's OLMo project, allows researchers to replicate Dolma's dataset or build their own, with components for downloading CommonCrawl, filtering, deduplicating, and analyzing the data. Open infrastructure like this lowers the barrier for others to experiment with data curation at scale.
Another influential framework is DataComp-LM (DCLM), introduced in 2024 as a controlled benchmark for dataset curation strategies (DataComp-LM: In search of the next generation of training sets for language models). DCLM provides a standardized 240 T-token pool sampled from CommonCrawl and encourages participants to apply different filtering and mixing strategies, then train models to compare outcomes. It essentially treats data selection as a competition – highlighting which curation methods yield the best model performance under equal compute. Early results from DataComp-LM underscore that model-based quality filtering and clever data mixing can beat simple baselines. The DCLM effort also released open-source tools for processing large text corpora (based on the "OpenLM" codebase used for model training and data prep).
In terms of datasets, 2024 saw the public release of several next-generation corpora:
Falcon RefinedWeb (TII/UAE) is a massive web-crawl derivative (~1B webpages, ~2.8 TB of text) that underwent stringent quality filtering and deduplication (oumi.datasets.pretraining — Oumi). Its creators demonstrated that training solely on filtered web data (no manually curated subsets) can rival more curated mixes in model performance. They even open-sourced a 600 B-token extract for the community (NeurIPS Poster The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only).
RedPajama-Data v2 (Together), as mentioned, scaled up to 30 T tokens and provided 40+ quality annotations per document (RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models). Those annotations include things like language, a perplexity score, a toxicity score, etc., giving researchers rich signals to further filter or weight the data. RedPajama's codebase (available on GitHub) also popularized the use of CCNet (a pipeline originally by Facebook) to automatically filter CommonCrawl by language and quality.
FineWeb-Edu (Hugging Face) emerged in mid-2024 as a targeted dataset focusing on educational web content. It used a refined classifier to pick out ~1.3 T tokens of high-quality instructional material from the general web. This showcased a workflow for theme-specific dataset curation – a trend that could grow (e.g. curated medical text corpora built with domain-specific filters).
On the enterprise side, companies like MosaicML (now part of Databricks) developed internal tools for streaming data from cloud storage during training, so that multi-terabyte datasets can be used without storing a single giant file. MosaicML's training platform in 2023–2024 emphasized shuffling data on the fly and reading from object stores, which influenced how academic projects handle data as well. We also see integration of frameworks like the Hugging Face Datasets library to manage large pretraining corpora in a memory-efficient way. In fact, many of the new open datasets (Dolma, RedPajama, etc.) are distributed via Hugging Face Datasets, enabling easy streaming and usage by others.
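A minimal streaming sketch with the Hugging Face datasets library; the dataset identifier and field name are illustrative (each corpus documents its own configs and schema), but streaming=True lets filtering or training code iterate over a multi-terabyte corpus without downloading it in full first.

```python
from datasets import load_dataset

# Stream a large pretraining corpus without materializing it on disk.
# "allenai/dolma" is used illustratively; check each dataset card for its configs and field names.
ds = load_dataset("allenai/dolma", split="train", streaming=True)

for i, example in enumerate(ds):
    text = example.get("text", "")
    # ... apply filtering / tokenization here ...
    if i >= 4:
        break
```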
In summary, the toolkit ecosystem is catching up with the scale of LLM data. Open-source contributions such as the Dolma toolkit and DataComp-LM's recipes are empowering more labs to curate their own custom datasets rather than relying solely on Google-scale resources. This is leading to rapid innovation in data filtering techniques, as evidenced by projects like FineWeb-Edu and the continuous updates to RedPajama.
🏢 Open-Source vs Enterprise Pipelines
While the fundamental techniques overlap, there are notable differences in focus between academic/open-source and enterprise data pipelines for LLMs. Open-source projects prioritize transparency and replicability. They publish detailed breakdowns and often release portions (if not all) of the data for others to inspect or use. The emphasis is on using legally shareable content (to avoid takedowns) and on building community trust. For instance, open models like Falcon or MPT described their training data mix and tried to avoid sensitive sources (though MosaicML initially included Books3, they openly acknowledged it, leading to community discussion and legal action (Databricks, Inc. Large Language Model Litigation)). Open projects are also more willing to experiment with novel data curation methods since they can share outcomes publicly (e.g. FineWeb-Edu's extreme filtering to boost quality).
Enterprise pipelines, on the other hand, are often larger and may incorporate proprietary data. Companies such as OpenAI, Cohere, or Anthropic have access to subscription-based or private datasets – for example, news archives, professional journals, or usage data (if users opt in). These are rarely shared or even disclosed. Indeed, Mistral AI (a startup) released a strong 7B open model in 2023 but stated “we're unable to share details about the training datasets (extracted from the open web) due to the highly competitive nature of the field” (mistralai/Mistral-7B-v0.1 · Training data?). This highlights a common enterprise stance: keeping the exact recipe secret for competitive advantage. Enterprises also put heavy resources into compliance filtering – aggressively removing defamatory, explicit, or harmful content to protect their brand and to align with usage policies. An open project might leave mild profanity in the data, but a corporate model likely filters it out during training or at least handles it via post-processing.
Another difference is scale and diversity: a big company can afford to gather everything (all of CommonCrawl, large licensed databases, etc.) and then filter down, whereas open projects operate with limited compute and storage and carefully select subsets to train on. That said, the gap narrowed in 2024 with efforts like DeepSeek (an open-weight model series) reaching multi-trillion-token scale (DeepSeek LLM: Scaling Open-Source Language Models with Longtermism). Enterprises often supplement web data with niche sources: for example, rumors suggest some have licensed book collections or Reddit's full data directly. They might also bring human feedback into the data pipeline itself, having annotators label a small subset of documents for quality to train filters – a luxury of budget. Open groups instead rely on volunteer evaluations or automated signals.
One important contrast is how success is measured. Open data pipelines are evaluated by the community – e.g. via the Hugging Face Open LLM Leaderboard or academic benchmarks. Enterprise models are ultimately judged by customers and closed evaluations. This means an enterprise might tweak its data pipeline to optimize specific internal test sets or use-case performance (say, legal document comprehension), whereas open models often aim for broad benchmark gains. For instance, Cohere's Command models are tuned for business applications and thus include a lot of business-related text in training (reports, emails, etc.) (The Command R Model (Details and Application) — Cohere), which isn't explicitly the focus of most open models.
In terms of governance, enterprises face greater legal risk. In 2023–2024 we saw multiple lawsuits against OpenAI, Meta, and others over data usage. Thus, an enterprise will have a legal team vetting dataset sources, possibly removing anything with an uncertain license. Open projects also care about legality, but they may rely on the umbrella of fair use or research usage in ways a commercial provider cannot. This means enterprise datasets may exclude some public data due to unclear permission (for example, OpenAI reportedly steers away from certain copyrighted lyrics or code to avoid DMCA issues).
Overall, open-source and academic pipelines trailblaze transparency and innovative curation (e.g., releasing new filtered datasets), whereas enterprises operate with scale and secrecy, focusing on data that gives them a competitive edge in quality while minimizing legal and PR risks. Despite these differences, there’s convergence in recognizing what quality data looks like, thanks to published research. We see companies like Meta publishing papers (e.g. Llama 2’s data card) and open groups pushing scale (DeepSeek, RedPajama), so the lines are blurring as everyone adopts the best practices available.
🤖 Case Studies: Recent LLM Dataset Curation
To solidify these concepts, let’s look at a few prominent 2024-era LLM projects and how they curated their training data:
DeepSeek LLM (2024) – DeepSeek built extremely large open-weight models (up to 67B parameters) by scaling data. They assembled a continually growing dataset, starting with 2 trillion tokens and later expanding towards 8–15 T tokens across many sources (DeepSeek AI: Advancing Open-Source LLMs with MoE & Reinforcement Learning | DeepSeek-R1 & V3 Explained). DeepSeek emphasized data diversity: their V3 model's corpus has a rich mix of text types, including a “significant portion of high-quality multilingual data.” They also performed heavy deduplication and semantic filtering. According to their technical report, they applied “linguistic and semantic evaluations to maintain dataset quality” – likely using language models to weed out irrelevant or low-information content. They also balanced the data by remixing to fix domain imbalances. The result was a well-documented (though not publicly released) dataset that enabled DeepSeek's 67B model to slightly outperform Meta's LLaMA-2 70B on various benchmarks (DeepSeek LLM: Scaling Open-Source Language Models with Longtermism). This case shows an open-weight effort pushing scale (trillions of tokens) while employing state-of-the-art filtering and deduplication to keep quality high.
Mistral AI (2023–2024) – Mistral made waves by releasing a top-tier 7B model (Mistral 7B) in late 2023. Their data pipeline was less public, but they stated the model was trained on an unspecified dataset “extracted from the open web” (mistralai/Mistral-7B-v0.1 · Training data?). This implies they likely gathered a large web crawl (possibly similar to RefinedWeb or RedPajama) and filtered it internally. Given the model’s strong performance, it’s believed they did rigorous filtering for quality and possibly included multiple languages (the model shows good multilingual ability). However, as a venture-backed startup, Mistral kept exact details proprietary. By 2024, they expanded to larger “Mistral Large” models, presumably scaling up the token count. Mistral’s approach exemplifies an enterprise-like strategy in a startup: use as much web data as possible, apply advanced filtering (they had key team members with experience in data pipelines), but do not disclose the dataset to maintain a competitive edge. The success of Mistral 7B indicated that a well-filtered open-web dataset (no proprietary data) can rival models trained on secret corpora. It also highlighted the community’s desire for transparency – many asked for their data details, but Mistral explicitly declined to share due to competitive reasons.
Cohere (Command Models) – Cohere's Command series (especially Command R, 2024) are examples of enterprise-curated datasets tailored for commercial use. Cohere has said that Command is “trained on a massive corpus of diverse texts in multiple languages” (The Command R Model (Details and Application) — Cohere). They focus on quality sources that cover a broad range of tasks. Likely, Cohere's team curates data from the web (similar to others) but also adds a lot of in-house data: for example, well-moderated Q&A pairs, technical documentation, and other high-value text suited to business applications. They also likely filter out content that violates safety standards more strictly than open projects do. The multilingual aspect is strong – as shown by the list of languages in their documentation – meaning they put effort into collecting non-English text (possibly crawling non-English websites or using translated texts). While specifics aren't public, Cohere's case demonstrates an enterprise pipeline optimizing for certain domains (e.g. professional documents) and ensuring the model performs safely out of the box for clients. It underscores how the goal of the model (general chatbot vs. domain expert) can influence dataset makeup. An enterprise might overweight legal or medical text if targeting those industries, whereas open models aim for broad general knowledge.
MosaicML (MPT) – MosaicML's open models (MPT-7B, MPT-30B) were trained on what they called a “Mosaic-curated mix” of data, including the RedPajama base and additional sources (Databricks, Inc. Large Language Model Litigation). Essentially, they started with open data (the recreation of LLaMA's training set via RedPajama, which has CommonCrawl, Wikipedia, etc.) and augmented it. Mosaic focused on high-quality subsets, but as noted earlier, this included the problematic Books3 subset (under RedPajama's “Books” component). After the legal action, MosaicML (now Databricks) likely adjusted their pipeline to exclude such content going forward. Mosaic also pioneered a streaming data loader, allowing them to use extremely large datasets (e.g. the 1 T-token training run for MPT-30B) efficiently on their cloud infrastructure. They also shared a table of MPT-30B's data composition, acknowledging each source and its approximate size, showing an open approach to documentation. Mosaic's case is instructive: it straddles the line between open and enterprise. They open-sourced their models and wanted an open-data ethos, but being a company, they faced legal scrutiny and had to navigate the realities of copyrighted material. The takeaway is that even well-meaning efforts must rigorously vet data sources – which Mosaic and others now do by excluding anything not clearly licensed.