Table of Contents
Selecting and Preparing Training Data for LLMs 2024-2025
Data Diversity and Coverage
Quality Filtering and Data Cleaning
Dataset Size: Scale vs. Quality vs. Efficiency
Data Sources: Structured, Semi-Structured, and Unstructured
Computational Trade-offs and Efficiency
Domain-Specific LLMs: Special Considerations
Conclusion
Large Language Models (LLMs) like GPT-4 and Llama owe much of their prowess to the data they are trained on. Recent research and industry reports highlight that high-quality, diverse training data is as crucial as model architecture or hardware. In fact, “high-quality data can accelerate training faster than expensive hardware upgrades and more effectively than better algorithms” (Rotational Labs | Recapping PyTorch: Key Takeaways from the 2024 Conference). This literature review summarizes key 2024–2025 insights on curating LLM training data, covering general-purpose models and briefly touching on domain-specific (medical, legal) models. Key themes include data diversity, quality filtering, dataset scale, source types, and compute trade-offs, with practical techniques and real-world considerations.
Data Diversity and Coverage
Ensuring diverse and heterogeneous data sources is vital for building robust, generalized LLMs. Models trained on a wide variety of text styles, topics, and languages tend to generalize better and handle a broader range of inputs. However, many current datasets are skewed – “voice and text datasets powering AI dramatically under-represent 99%+ of global languages, dialects, and many marginalized communities” (Best Practices for Open Datasets for LLM Training Draft - Dataset Convening 2024). To serve global applications, training data must span multiple languages, dialects, and viewpoints, rather than concentrating on English or Western-centric content. For example, open initiatives emphasize adding more non-English web text, local literature, and cultural content to avoid blind spots.
Diversity is not only about language, but also content domains and formats. Modern LLMs ingest everything from formal encyclopedic text to casual social media chatter, from code snippets to dialogue transcripts. This heterogeneity exposes models to varied linguistic patterns (narrative, argumentative, conversational, etc.) and knowledge domains (science, law, everyday life). Recent research confirms that greater data diversity directly improves model performance. One 2024 study introduced a metric for synthetic data diversity and found “increased diversity correlates positively with model performance, particularly in downstream fine-tuning tasks” (On the Diversity of Synthetic Data and its Impact on Training Large Language Models). In other words, models trained on a more diverse corpus achieved better results on benchmarks. These findings echo the industry practice of mixing sources (e.g. internet forums, news, code, Wikipedia) to produce more robust, general-purpose LLMs.
Practical implications: When assembling training data, aim for a balance of sources: web crawl data for breadth, high-quality reference texts (Wikipedia, books) for factual coverage, code and technical data for reasoning, conversational data for dialog abilities, and multilingual content for global reach. Diverse data acts as a regularizer, reducing overfitting to any single style and improving the model’s ability to handle unexpected queries.
Quality Filtering and Data Cleaning
Raw data “in the wild” is dirty and varied – it contains duplicates, spam, offensive content, private information, and other noise. Quality filtering is therefore a critical stage in LLM data preparation, involving multiple techniques to curate a cleaner, safer dataset:
Deduplication: Removing or down-weighting duplicate and highly redundant texts helps prevent a model from over-memorizing or overweighting those patterns. Duplicate content can come from web crawls that index the same page multiple times or repeated boilerplate text. A common approach is exact or fuzzy matching to drop duplicates. For instance, the GneissWeb dataset (IBM 2024) applied “sharded exact sub-string deduplication” as a first step in processing 10 trillion tokens (GneissWeb: Preparing High Quality Data for LLMs at Scale). New research also proposes softer deduplication methods: SoftDedup (2024) avoids outright deletion and instead reweights recurring data to preserve information while reducing redundancy. By lowering the sampling probability of highly common passages, SoftDedup achieved the same perplexity with 26% fewer training steps and even boosted downstream accuracy by ~1.8% (SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training). This suggests deduplication not only improves quality but can significantly speed up training.
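To make the two approaches concrete, here is a minimal Python sketch of exact hash-based deduplication alongside a soft reweighting scheme in the spirit of SoftDedup. The normalization, hashing, and the `alpha` down-weighting exponent are illustrative assumptions, not the published implementations (which typically rely on sharded substring or MinHash-based machinery to scale to trillions of tokens).

```python
import hashlib
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def doc_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def exact_dedup(docs):
    """Hard deduplication: keep only the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = doc_hash(doc)
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def soft_dedup_weights(docs, alpha: float = 0.5):
    """Soft deduplication: keep every document but shrink its sampling weight
    according to how often its content recurs (alpha is an assumed exponent)."""
    counts = Counter(doc_hash(d) for d in docs)
    return [1.0 / (counts[doc_hash(d)] ** alpha) for d in docs]

corpus = ["The cat sat.", "the cat  sat.", "A unique sentence."]
print(exact_dedup(corpus))         # drops the near-identical second entry
print(soft_dedup_weights(corpus))  # down-weights the repeated content instead of dropping it
```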
Toxicity and Harmful Content Filtering: Since LLMs learn from whatever text they see, it’s important to filter out hate speech, extreme profanity, harassment, or other toxic content that we don’t want the model to imitate. Many teams use automatic toxicity classifiers or blocklists to remove such data. However, simplistic filters can backfire – e.g. a naive profanity list might flag and drop medical articles about anatomy or LGBTQ discussions, thus “filtering out non-toxic content” inadvertently (Best Practices for Open Datasets for LLM Training Draft - Dataset Convening 2024). Modern pipelines favor nuanced filtering: open-source efforts like Toxicity of the Commons (2024) use multi-level classifiers that label content by severity and category (e.g. racism, sexism, violence) (Toxicity of the Commons: Curating Open-Source Pre-Training Data). Egregiously toxic samples can even be transformed rather than thrown away – for example, re-writing highly toxic passages in a “detoxified” manner to retain their linguistic context without the harmful language. Industry also pushes for efficient filters: IBM’s Granite HAP filter, released in 2024, is a lightweight 38M-parameter model that runs in real time to catch hate, abuse, and profanity (HAP) in training data. “IBM’s new filter is small enough to run on a CPU, enabling data checks at each phase of the LLM pipeline,” making it practical to “ensure no toxic language slips through” (IBM open sources fast HAP filter on Hugging Face - IBM Research). In short, robust toxicity filtering (potentially with content warnings or substitutions for borderline cases) is now standard to align LLMs with ethical and legal norms.
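As a rough illustration, a classifier-based filter can be wired up in a few lines with the Hugging Face `pipeline` API. The model ID and the label/threshold conventions below are assumptions made for this sketch – check the card of whichever HAP/toxicity model you actually use (e.g. IBM’s Granite HAP checkpoints on Hugging Face) before trusting the labels.

```python
# Sketch of a classifier-based toxicity gate; model id and label names are assumed.
from transformers import pipeline

hap_filter = pipeline("text-classification",
                      model="ibm-granite/granite-guardian-hap-38m")  # assumed model id

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the document should be kept (i.e. not flagged as HAP)."""
    result = hap_filter(text[:2048])[0]      # truncate very long documents for speed
    is_toxic = result["label"] == "LABEL_1"  # label convention assumed; check the model card
    return not (is_toxic and result["score"] >= threshold)

docs = ["A clinical note describing anatomy.", "An abusive rant targeting a group."]
clean_docs = [d for d in docs if keep_document(d)]
```

Because such classifiers are small and fast, they can be applied at several points in the pipeline rather than only once on the raw crawl.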
Privacy and PII Removal: Responsible datasets should excise personally identifiable information (PII) like people’s names, emails, phone numbers, addresses, etc., unless it’s from a public domain source intended for inclusion. Techniques include regex and classifier-based detection of PII. For instance, the FineWeb and Dolma pipelines remove or anonymize emails and phone numbers at scale, and FineWeb released its cleaning code and classifiers (such as the FineWeb-Edu quality classifier) so the curation steps are reproducible. Ensuring privacy not only mitigates legal risk but also prevents LLMs from memorizing and regurgitating private data.
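A bare-bones regex pass illustrates the idea; production pipelines layer trained PII detectors on top of patterns like these, and the patterns and placeholder tokens here are illustrative assumptions rather than any project’s actual rules.

```python
import re

# Minimal PII scrubbing: replace matched identifiers with typed placeholders
# instead of deleting the surrounding text.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IPV4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scrub_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or call +1 555-123-4567."))
# -> "Contact <EMAIL> or call <PHONE>."
```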
Heuristic Quality Filters: Beyond the above, data engineers apply a host of rules to cull low-quality content. An AWS engineering blog (2024) suggests filtering out documents by various signals (An introduction to preparing your own dataset for LLM training | AWS Machine Learning Blog): e.g. drop texts with suspicious metadata or URLs (indicative of spam/phishing sites), remove content with gibberish or repetitive character patterns, exclude documents that are too short or too long, and enforce language filters (only keep languages relevant to the model’s goals). It’s common to remove machine-generated or translated text that can inflate corpus size without adding true diversity. Many pipelines also compute simple scores like average word length or sentence complexity, flagging outliers (e.g. pages full of random numbers or a single sentence repeated). By combining multiple filtering passes, one can substantially refine a raw crawl into a high-quality corpus. For example, the FineWeb team systematically tested various filtering rules by training small models on samples with/without each filter, and retained only the rules that yielded performance gains (Noteworthy LLM Research Papers of 2024).
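The sketch below bundles a few such heuristics into a single pass; the specific thresholds (word counts, repetition ratio, alphabetic ratio, spam phrases) are placeholder assumptions rather than values from any published pipeline, and in practice each rule would be validated with ablations the way FineWeb did.

```python
import re

def passes_heuristics(text: str,
                      min_words: int = 50,
                      max_words: int = 100_000,
                      max_repetition: float = 0.30,
                      min_alpha_ratio: float = 0.60) -> bool:
    """Return True if a document clears a set of simple quality heuristics."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False                                  # too short or too long
    if 1.0 - len(set(words)) / len(words) > max_repetition:
        return False                                  # highly repetitive boilerplate
    if sum(c.isalpha() for c in text) / max(len(text), 1) < min_alpha_ratio:
        return False                                  # mostly digits/symbols/gibberish
    if re.search(r"lorem ipsum|click here to subscribe", text, re.I):
        return False                                  # common spam/boilerplate phrases
    return True
```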
Regulatory Compliance: A newer aspect of “quality” is legal compliance of the data. The past year saw increased scrutiny on copyright and consent in training data. Many LLMs have been trained on scraped internet text without permission, triggering lawsuits from content creators (Towards Best Practices for Open Datasets for LLM Training - Mozilla Foundation). As a result, organizations are now cautious about including copyrighted material or private user data. Datasets composed of public domain or properly licensed data are gaining attention. Recent guidelines call for “interoperable standards for data governance” to let content creators opt out their data and to document what goes into models (Best Practices for Open Datasets for LLM Training Draft - Dataset Convening 2024). In practice, this means curators might exclude certain sources known to be copyrighted (e.g. news articles behind paywalls) or implement removal requests. While this can slightly limit data volume, it improves transparency and avoids future legal hurdles. For instance, Mozilla’s 2024 convening on open LLM datasets recommends working with communities like Wikipedia, Creative Commons, and libraries to source data ethically and honor removal requests if needed.
Overall, quality filtering is about balancing thorough cleaning with caution – eliminating truly harmful or useless data, but not pruning so aggressively that valuable information is lost. Open data best-practice reports stress that “the goal is not a ‘perfect’ dataset, but developing standards that let data contributors declare preferences and report issues”, enabling continuous refinement. Every filtering choice introduces some bias, so transparency in what was removed is crucial for reproducibility and fairness.
Dataset Size: Scale vs. Quality vs. Efficiency
How much data is enough for training an LLM? This is a central question with nuanced trade-offs. On one hand, more data tends to yield better models, following the now-classic scaling law paradigm. On the other hand, data quantity must be weighed against quality and the computational cost of training on billions of tokens.
Scaling laws provide a rule of thumb: for a given model size and training compute budget, there is an optimal amount of data to use. The Chinchilla compute-optimal law (DeepMind 2022) suggested roughly a 20:1 ratio of training tokens to model parameters (Noteworthy LLM Research Papers of 2024). Deviating from this “Chinchilla-optimal” ratio means either under-training a large model (if data is insufficient) or over-training on an overly large corpus with diminishing returns. For example, 360 billion tokens might sound huge, but at a 20:1 ratio it is the compute-optimal budget for a model of only ~18B parameters; a model like GPT-3 (175B params) would need trillions of tokens to fully utilize its capacity. Recent open datasets illustrate this scaling: FineWeb’s 15-trillion-token corpus is theoretically optimal for models up to ~500B parameters by these laws.
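A quick back-of-the-envelope helper makes the 20:1 heuristic easy to apply; the ratio is the rule of thumb cited above, and published compute-optimal fits vary around it.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training tokens for a given parameter count."""
    return n_params * tokens_per_param

for params in (1.7e9, 8e9, 18e9, 70e9, 175e9):
    print(f"{params / 1e9:>6.1f}B params -> ~{chinchilla_tokens(params) / 1e12:.2f}T tokens")
# 18B params -> ~0.36T tokens (the 360B-token example above);
# 175B params -> ~3.50T tokens, i.e. trillions for a GPT-3-scale model.
```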
In practice, today’s leading models often exceed these prescriptions when extra compute is available. Meta’s Llama 3 models (2024) were reportedly trained on 15T tokens even for an 8B–70B parameter range, overshooting the 20:1 ratio (GneissWeb: Preparing High Quality Data for LLMs at Scale). Why use more data than the “optimal” amount? Because high-quality data can still improve performance even past the knee of the curve, and companies with massive compute are willing to push for every bit of gain. That said, returns diminish – beyond a certain scale, adding more low-quality text might not help and could even hurt by introducing more noise than signal.
This leads to the strategy many have adopted: two-stage or multi-stage training. In a first stage, train on a very large corpus (to maximize coverage of knowledge and linguistics) potentially including some lower-quality material; in a second stage, continue training on a smaller, cleaner dataset to polish the model’s abilities. This approach was noted in IBM’s GneissWeb report: “Stage-1: train on a very large corpus to cover breadth, followed by Stage-2 on much higher-quality but smaller data to further improve the model”. For instance, an LLM might ingest 10T tokens of general web crawl, then be refined on 0.5T tokens of curated books, Wikipedia, and verified content. This yields a model that has both breadth and depth. Stage-1 ensures the model has seen a bit of everything; Stage-2 ensures its final weights focus on the most reliable information (improving factual accuracy, reasoning, etc.). Many closed-model pipelines likely do something similar (though details are proprietary).
Bigger is not always better – what matters is the combination of scale and quality. A salient example: the open RedPajama dataset contains ~20T tokens (one of the largest publicly) but was only lightly filtered, whereas FineWeb has 15T tokens with heavy curation. The FineWeb paper found that models trained on RedPajama achieved lower quality than those on the smaller FineWeb, due to the difference in filtering (Noteworthy LLM Research Papers of 2024). In other words, an extra 5 trillion tokens of noisy data did not help and even hurt performance compared to a slightly smaller but cleaner dataset. This underscores the trade-off between quantity and quality: beyond a point, dumping more data in the mix can introduce enough garbage or redundancy that it slows learning or confuses the model. The best results come from finding the sweet spot where the dataset is large and predominantly high-quality. Techniques like those mentioned (aggressive de-duplication, ensemble filtering pipelines) are employed to allow scaling up the token count without proportionally increasing noise.
From a compute efficiency standpoint, dataset size directly impacts training cost. Training on 1 trillion tokens might take tens of thousands of GPU-hours; jumping to 10+ trillion can be prohibitive for all but a few organizations. Therefore, curators consider how to maximize utility per token. One trend is “data mixing”, where a core of high-quality data is repeated or sampled more often, and lower-quality long-tail data is sampled less frequently. This effectively gives more weight to good data without discarding the long tail entirely. Another trend is using synthetic data (generated by smaller models) to augment scarce types of data, though ensuring diversity in such synthetic additions is crucial (On the Diversity of Synthetic Data and its Impact on Training Large Language Models). The bottom line is that there’s an economic trade-off: at some point it may be better to stop adding data and save the compute for model refinement or for training a bigger model on the existing data. Research in late 2024 on “Precision Scaling Laws” even suggests new factors to consider, like training at lower numerical precision to get more effective updates per FLOP. All these factors feed into deciding the optimal dataset size.
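Source-level data mixing is often implemented as weighted sampling. The sketch below shows the idea with made-up source names and weights; real mixture ratios are tuned empirically (and sometimes annealed over training).

```python
import random

# Illustrative sources and weights; a real mixture would hold shard paths, not strings.
SOURCES = {
    "wikipedia":    {"weight": 3.0, "docs": ["wiki doc 1", "wiki doc 2"]},
    "books":        {"weight": 2.0, "docs": ["book chunk 1"]},
    "common_crawl": {"weight": 1.0, "docs": ["web page 1", "web page 2", "web page 3"]},
}

def sample_batch(batch_size: int, rng: random.Random) -> list:
    """Draw documents with probability proportional to each source's weight."""
    names = list(SOURCES)
    weights = [SOURCES[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(SOURCES[source]["docs"]))
    return batch

print(sample_batch(5, random.Random(0)))  # high-weight sources appear more often
```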
In summary, larger datasets generally improve LLMs, but only if the data is sufficiently rich. Careful curation and following scaling law insights (like Chinchilla’s ratio) help avoid wasting effort. Many 2024-era models use on the order of trillions of tokens, with open efforts converging around the 10–20T mark for top-tier models (GneissWeb: Preparing High Quality Data for LLMs at Scale). This is approaching the total pool of high-quality text available publicly, raising questions about how to gather even more data (or whether to generate it). For most use cases, focusing on a moderately sized but clean dataset yields a better accuracy/compute trade-off than indiscriminately scraping everything.
Data Sources: Structured, Semi-Structured, and Unstructured
Training data for LLMs can come from a huge range of source types, broadly categorized by how structured or “clean” the text is:
Structured sources – These include databases, knowledge graphs, and other highly organized information. In LLM training context, structured data is often converted to text form. A prime example is Wikipedia: it has a consistent structure (articles, sections, infoboxes) and undergoes human curation, making it a reliable textual resource. Likewise, transcripts of parliamentary debates or legal codes are structured by nature and provide high-quality language data in specific styles. LLM training can also leverage structured Q&A pairs, dictionaries, or tables by linearizing them into text. While purely structured databases (like SQL tables) aren’t directly ingested, their contents can be turned into natural language statements for training. Including such fact-dense, structured text helps models acquire knowledge that is less present in the free-form web. (E.g. Wikipedia is a staple in many LLM datasets because it offers broad coverage of entities and events in a relatively uniform, verified manner.)
Semi-structured sources – This category covers most web content, documents, and books. The text is largely natural language, but often embedded in formatting or metadata. HTML web pages are a prime example: the raw crawl includes HTML tags, navigation menus, etc. that need parsing. An LLM data pipeline must extract the main text and discard boilerplate (An introduction to preparing your own dataset for LLM training | AWS Machine Learning Blog). PDFs or Word documents similarly contain layout or images around the text. The AWS data prep blog describes using OCR for scanned PDFs and specialized parsers for HTML or Office files, then cleaning out non-textual elements and encoding issues. Books (which may be in EPUB/PDF) are semi-structured as well – they contain chapters and formatting, but the narrative text is the key part. Research papers (from arXiv or others) have LaTeX markup that needs removal to get plain text. Semi-structured data thus requires preprocessing to yield clean text paragraphs. Many pipelines use open-source tools (BeautifulSoup for HTML, Apache Tika or PDFMiner for PDFs, etc.) to handle this at scale. These sources are extremely valuable because they comprise the majority of written knowledge online (news articles, blog posts, literature, etc.). Projects like Common Crawl gather billions of web pages; from these, filters (like those in C4 or RefinedWeb) select the ones that look like substantive text and strip away HTML to feed LLM training (Noteworthy LLM Research Papers of 2024). In 2024, datasets like Dolma (Allen AI’s open corpus for OLMo) and Common Corpus (a large open public-domain corpus initiative) combined web crawl text with other semi-structured corpora like books and periodicals (Toxicity of the Commons: Curating Open-Source Pre-Training Data). Including semi-structured sources ensures models learn a mix of formal and informal registers, as well as domain-specific jargon present in news, scientific papers, etc.
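As a small illustration of the extraction step, the sketch below strips boilerplate from an HTML page with BeautifulSoup. Production crawl pipelines typically rely on dedicated extractors (e.g. trafilatura or resiliparse), so treat this as a toy version of that stage.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def html_to_text(html: str) -> str:
    """Extract main text from an HTML page, discarding boilerplate elements."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()                      # drop boilerplate-heavy elements
    lines = [line.strip() for line in soup.get_text(separator="\n").splitlines()]
    return "\n".join(line for line in lines if line)

page = "<html><nav>Menu | Login</nav><article><p>Actual content.</p></article></html>"
print(html_to_text(page))  # -> "Actual content."
```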
Unstructured sources – This refers to text that is highly informal, irregular, or conversational, such as social media posts, discussion forums, chat logs, and raw transcripts of speech. Such data often lacks the polish of edited text: it may contain typos, slang, inconsistent grammar, and use of memes or context-dependent language. While messy, unstructured data is a goldmine for modeling how real people communicate. Reddit discussions, Twitter feeds, and public chat logs have been used (carefully filtered) to inject a conversational tone and knowledge of colloquialisms into LLMs. For example, OpenAI’s earlier GPT models famously used a WebText dataset that included Reddit links, capturing a lot of informal dialogue. Unstructured data can also include user-generated Q&A (like StackExchange posts for technical domains) or customer support logs (for enterprise models), etc. These sources help an LLM not sound overly “academic” and can improve its ability to follow instructions or answer questions in a human-like way. The challenge is that unstructured sources require heavy cleaning – removing profanity, HTML entities, or irrelevant content. They might also need annotation (e.g. who is the speaker) if used for dialogue modeling. Still, incorporating some proportion of raw, unstructured text gives the model exposure to the noise and richness of human language as it is used in practice.
Crucially, a blend of all the above source types yields the best general-purpose model. Each source contributes something: structured data contributes accuracy and factual grounding; semi-structured contributes breadth of knowledge and literate writing; unstructured contributes conversational ability and cultural references. Industry best practices emphasize maximizing source diversity – for example, open dataset compendiums often combine Common Crawl (unstructured web) with a curation of “high-quality” sources like GitHub code, Wikipedia, books, news articles, and journals (GneissWeb: Preparing High Quality Data for LLMs at Scale). The Common Crawl provides sheer volume and breadth, while curated sources act as a quality backbone. In the Pile dataset (an open 22-component corpus from EleutherAI), the mixture included everything from Wikipedia and PubMed articles to Reddit threads, user fiction, and Hacker News comments, deliberately spanning different styles. This heterogeneity helps the resulting model handle inputs ranging from “Explain quantum physics” to “Why won’t my code compile? (with code snippet)” to “Any good jokes about robots?”.
Preparation techniques: Depending on source, different extraction methods apply (as noted, OCR for scanned text, HTML parsing for web, etc.). After extraction, normalization is done: lowercasing (sometimes), Unicode normalization, and removal of odd characters or scripts not intended (e.g. filtering out binary data or huge chunks of code if not desired). Language identification is run on many pipelines to sort text by language and ensure each is handled appropriately (keeping an intended set, dropping others or shunting them to their own corpus). The AWS blog outlined this multi-source ingestion: “read data from multiple sources, extract text using OCR for PDFs, HTML parsers for web documents, ... remove non-textual elements” (An introduction to preparing your own dataset for LLM training | AWS Machine Learning Blog). Once cleaned to plain text, all sources can be concatenated into one giant training corpus or structured into segments (some projects maintain separate “domains” and sample from each with set ratios). In sum, the pipeline must unify structured, semi-structured, and unstructured inputs into a consistent format (usually one sentence per line or similar) ready for tokenization.
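A compressed version of the normalization and language-identification steps might look like the sketch below. The `langdetect` package and the set of kept languages are stand-ins for whatever tooling a given pipeline actually uses; large crawls typically run fastText language-ID models instead.

```python
import unicodedata
from langdetect import detect  # pip install langdetect

KEEP_LANGS = {"en", "es", "fr"}  # illustrative target languages

def normalize_doc(text: str) -> str:
    """Unify Unicode representations and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def keep_language(text: str) -> bool:
    try:
        return detect(text) in KEEP_LANGS
    except Exception:        # langdetect raises on empty or featureless input
        return False

docs = ["A normal English sentence.", "Une phrase en français.", "1234 5678"]
kept = [normalize_doc(d) for d in docs if keep_language(d)]
print(kept)
```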
By leveraging a rich variety of data sources and formats, LLM developers ensure that the model isn’t narrow or brittle. Diversity in data sources future-proofs the model to some extent – it will have seen multiple ways information can be presented (a Wikipedia summary, a bullet list, a tweet, a code comment, etc.), so it can adapt its generation or comprehension to the style appropriate for a task.
Computational Trade-offs and Efficiency
Selecting and curating data is tightly coupled with computational considerations. The goal is not just to maximize model quality, but to do so within feasible training time and to yield an efficient model at inference. Several trade-offs come into play:
Training Efficiency: The more data you use, the longer training takes – but smarter data selection can reduce the needed training steps. We saw this with SoftDedup, where reweighting redundant data cut training steps by 26% for the same result (SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training). Similarly, filtering out low-quality or uninformative data means the model doesn’t waste updates on it. One PyTorch Conference 2024 takeaway was that focusing on better data can be more effective than scaling up compute blindly. Developers are increasingly aware of the compute budget, so they strive to get “more bang for the buck” from data. Techniques like curriculum learning (feeding higher quality data more in later epochs) or progressive filtering (start with broad data, then gradually restrict to higher quality) aim to spend compute where it matters most. There is also interest in streamlined data pipelines – for instance, IBM’s high-throughput data loader for PyTorch was released to ensure that reading in billions of text samples doesn’t bottleneck multi-node training (PyTorch 2024: Training AI models faster than ever - IBM Research). By distributing data loading and caching effectively, one can keep GPUs fed with data and reduce idle time.
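The loading side can be sketched generically (this is not IBM’s loader): a PyTorch `IterableDataset` that streams sharded text files and splits shards across DataLoader workers so the GPUs are not left waiting on I/O. The one-document-per-line shard layout is an assumption for the sketch.

```python
from pathlib import Path
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedTextDataset(IterableDataset):
    """Stream documents (one per line) from a directory of text shards."""
    def __init__(self, shard_dir: str):
        self.shards = sorted(Path(shard_dir).glob("*.txt"))

    def __iter__(self):
        info = get_worker_info()
        # Partition shards across workers so each document is read exactly once.
        shards = self.shards if info is None else self.shards[info.id::info.num_workers]
        for shard in shards:
            with open(shard, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield line.strip()

loader = DataLoader(ShardedTextDataset("corpus_shards/"), batch_size=32, num_workers=4)
```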
Model Size vs Data Size: If one has a fixed compute budget, there’s a trade-off between training a larger model on less data or a smaller model on more data. The Chinchilla analysis demonstrated that, for a given compute budget, a smaller model trained on more tokens can outperform a bigger model on fewer tokens (Noteworthy LLM Research Papers of 2024). This is essentially a compute-allocation decision. Recent work continues to refine these scaling laws, even incorporating how mixed-precision training affects the optimal point. For practitioners, the implication is: don’t over-index on model size without ensuring you have enough data to justify it (and vice versa). Many 2024 results (including FineWeb’s ablations) highlight that data quality improvements can lift performance equivalently to adding billions of parameters. An efficient strategy might be to use a slightly smaller model architecture but train it on a cleaner, larger corpus – yielding a model that is cheaper to run (fewer params) yet just as strong. This has downstream benefits: smaller models are faster and cheaper at inference, so achieving a given accuracy with fewer parameters via better data is a big win for deployment.
Inference Performance and Compression: The data used in training can also influence inference behavior indirectly. Models trained on extremely large and varied data might need to be correspondingly large themselves (to absorb it), which makes serving them harder. Conversely, focusing on the most relevant data for your task can allow a leaner model that runs faster. There is also the issue of model compression: researchers have noted that as models scale with more data and parameters (e.g. LLaMA-3 family on huge datasets), they can be “harder to quantize to low-precision formats” without losing accuracy. High-quality data might mitigate this by allowing a smaller model (easier to quantize), whereas a model bloated from ingesting too much noise might be more brittle when compressed. Another consideration is memorization – if a model memorized a lot of extraneous data, it may carry around “unneeded” capacity that doesn’t generalize, again making it larger or slower than necessary. Thus, rigorous deduplication and filtering (to avoid rote learning of training data) not only help accuracy but can improve inference efficiency by yielding a model that truly generalizes rather than stores facts.
Online/Continual Updates vs Static Training: In some real-world settings, one might add data and update models continuously (for example, incorporating newly available texts). A well-prepared dataset with clear metadata and modular structure can facilitate efficient updates – e.g., if you know which data chunk corresponds to which source, you can just update that part. This is a bit beyond standard one-off training, but it’s worth noting that data preparation choices (like preserving source IDs or metadata) can make future retraining or incremental training more efficient, instead of starting from scratch.
In summary, computational efficiency in LLM training is achieved by carefully balancing data size and quality with model capacity. It’s a dance between not underfeeding the model (to avoid undertraining) and not overfeeding it with junk (to avoid needless computation). The end goal is to get the best model per unit of compute. As one commentary put it, perhaps it’s time to update the adage to “better data beats bigger models”, capturing how targeted data curation can yield a more efficient and performant system (Rotational Labs | Recapping PyTorch: Key Takeaways from the 2024 Conference).
Domain-Specific LLMs: Special Considerations
So far we focused on general-purpose LLMs trained for broad knowledge. When developing domain-specific models (e.g. medical or legal LLMs), the principles above still apply but with some twists:
Domain Data Focus: A specialized LLM is typically pre-trained or fine-tuned on large amounts of in-domain text. For example, a medical LLM would use biomedical literature, clinical notes, health-related forums, etc. A recent survey highlights common sources: “the corpus may include electronic health records (EHR), clinical notes, and medical literature” for pre-training medical LLMs (A survey of datasets in medicine for large language models). Specifically, large open datasets like PubMed abstracts, PubMed Central (PMC) full-text papers, and MIMIC-III clinical records are pivotal resources for medical LLMs. These contain the technical terminology and knowledge needed. Likewise for legal LLMs, one might gather case law texts, statutes, contracts, and legal analysis documents. SaulLM-7B (2024), a law-focused model, was trained on “an English legal corpus of over 30 billion tokens” of case texts and legislation (SaulLM-7B: A pioneering Large Language Model for Law). This dedicated 30B-token legal dataset allowed SaulLM to achieve state-of-the-art performance on legal understanding tasks with only a 7B parameter model – underscoring that targeted data can yield high pay-off in the domain.
General Foundation + Domain Adaptation: Typically, domain models benefit from standing on the shoulders of a general model. A common recipe is: start with a general LLM pre-trained on broad data, then continue pre-training it on your domain corpus, and finally fine-tune for specific tasks. This gives the best of both worlds: the model has general linguistic and commonsense knowledge from the broad data, and then it specializes by ingesting domain-specific knowledge. For instance, many medical models start from a GPT or LLaMA baseline and then train further on PubMed articles to create a “MedGPT.” The survey of medical LLM datasets notes examples like BlueBERT and BioBERT, which took the general BERT architecture and continued pre-training it on medical corpora (PubMed, PMC, MIMIC) to adapt it to biomedical language. More recently, Google’s Med-PaLM 2 and other clinical models followed this strategy (though specifics are proprietary). In law, SaulLM built on the Mistral 7B base model and then continued pre-training on legal texts. This approach is compute-efficient (since you don’t train from scratch) and tends to produce better results than training only on domain data (which might be too limited to learn the basics of language from scratch).
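In code, continued pre-training on a domain corpus follows the standard causal language-modeling recipe. The sketch below uses Hugging Face `transformers` with GPT-2 purely as a small stand-in base model; the dataset path and hyperparameters are illustrative assumptions, not the recipe behind any of the models mentioned above (which start from much stronger open bases such as LLaMA or Mistral).

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                       # small stand-in base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Assumed layout: a folder of plain-text files holding the domain corpus.
domain = load_dataset("text", data_files={"train": "domain_corpus/*.txt"})["train"]
tokenized = domain.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```

After this continued pre-training stage, the model would then be fine-tuned on domain-specific supervised tasks, as described above.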
Quality and Compliance in Domain Data: Domain-specific data often requires extra curation for quality and compliance. In medicine, patient records must be de-identified (removing names, IDs, and other identifiers) due to privacy laws such as HIPAA. Clinical text might have sensitive or biased information that needs filtering. In the legal domain, documents need to be up-to-date and jurisdiction-specific – mixing laws from different countries could confuse a model unless they are clearly separated. There’s also the issue of bias: e.g. medical texts might underrepresent certain demographics, so model developers take care to include diverse sources (global health data, not just one country’s). Another example: toxic language may appear in clinical notes (e.g. patient abuse cases) or legal records (hate speech evidence); filtering or annotating this appropriately is important so the model learns context and not the undesirable language. Domain data can be smaller in quantity, so deduplication is done carefully to not remove genuinely unique but semantically similar case reports or articles – perhaps leaning more on soft dedup to keep valuable information.
Dataset Size for Domains: Domain LLMs usually train on fewer total tokens than general models (simply because domain-specific text is limited). For instance, PubMed has on the order of 10^9 tokens, and MIMIC-III a few times 10^8 – far less than Common Crawl’s trillions. To compensate, domain corpora are sometimes augmented with general data (to avoid catastrophic forgetting of general language) or with synthetic data. If the domain is extremely narrow (e.g. legal case law), the model might cycle through the same data multiple times (multiple epochs) to reach a good performance. There’s evidence that beyond a point, reusing high-quality domain data is better than adding low-quality out-of-domain text. Some legal LLM projects have synthesized Q&A pairs from statutes to expand their fine-tuning set. Ultimately, the scale trade-off here is different: it’s about using as much of the domain data as possible (since it’s finite), while not drifting too far from general language proficiency. Techniques like mixing a small percentage of general data during domain training can keep the model’s English fluent while still focusing on the domain specifics.
Evaluation and Iteration: Domain LLMs require domain-specific evaluation, which in turn informs data selection. For example, a medical LLM might be evaluated on answering doctors’ board exam questions or summarizing medical reports. If it fails certain categories (say, pediatrics), that indicates a gap in the training data for that subdomain and suggests obtaining more pediatric texts. The 2024 MedExamLLM platform and others are emerging to systematically test medical LLMs on real exam questions (Large Language Models in Worldwide Medical Exams: Platform ...), which then drives improvements in the dataset composition. Similarly, the LegalBench initiative compiles a suite of legal reasoning tasks. By analyzing where the model underperforms, curators can iteratively enrich the training dataset with additional texts or annotations for those weak areas. This feedback loop is especially important in specialized domains, where the ultimate measure is whether the model accurately handles domain-specific queries.
In brief, domain-specific LLMs magnify the importance of targeted data selection. Every data point counts more when you have less of it, so curating a comprehensive, high-quality domain corpus is key. The success of models like BioGPT, Med-PaLM, and SaulLM shows that with the right data (even if smaller in scale), a focused model can outperform a generic one on specialist tasks. It also highlights an encouraging point: you don’t always need trillions of tokens – for domain models, tens of billions of well-chosen tokens can suffice to reach strong performance (SaulLM-7B: A pioneering Large Language Model for Law). This makes domain LLM development accessible to smaller organizations, provided they leverage best practices in data prep and start from a good base model.
Conclusion
“Better data beats better algorithms” is an emerging mantra in the LLM community, and the 2024–2025 literature underscores this point (Rotational Labs | Recapping PyTorch: Key Takeaways from the 2024 Conference). Selecting and preparing training data is not a trivial preprocessing step – it is a cornerstone of LLM development that demands strategic planning and technical finesse. By curating a diverse yet high-quality corpus, developers can train models that are both powerful and reliable across applications. Key takeaways include the importance of heterogeneous data sources (to cover all facets of language use), rigorous filtering for quality and compliance (to remove noise, duplication, and unsafe content), and smart choices about dataset size (balancing scale with practicality and efficiency).
The evolving best practices – from using multi-stage training sets, to leveraging new tools like fast toxicity filters (IBM open sources fast HAP filter on Hugging Face - IBM Research), to conducting ablations on filtering rules (Noteworthy LLM Research Papers of 2024) – all aim at one outcome: models that generalize better, hallucinate less, and handle real-world inputs more gracefully. Industry leaders like OpenAI and Meta have been secretive about their exact data recipes, but the open research community (Mozilla, EleutherAI, Hugging Face, IBM, academic groups, etc.) has made great strides in demystifying what makes a “good” LLM dataset (Towards Best Practices for Open Datasets for LLM Training - Mozilla Foundation). As we move into 2025, expect to see even more focus on data-centric AI for LLMs – including standardized dataset documentation, tools for dataset versioning and governance, and possibly collaborative efforts to assemble large open datasets that rival proprietary collections in scale and quality. By following the insights from recent papers and sharing best practices, practitioners can ensure their LLM’s training data is a strength rather than a limitation, paving the way for safer and more capable AI systems.
Sources:
Mozilla & EleutherAI, “Towards Best Practices for Open Datasets for LLM Training,” Jan 2025. (Recommendations on data diversity, harm reduction, and openness) (Best Practices for Open Datasets for LLM Training Draft - Dataset Convening 2024)
Chen et al., “On the Diversity of Synthetic Data and its Impact on Training LLMs,” arXiv Oct 2024. (Introduces diversity metric; finds diversity boosts performance) (On the Diversity of Synthetic Data and its Impact on Training Large Language Models)
Rotational Labs (R. Bilbro), “Recapping PyTorch Conference 2024 – Key Takeaways,” Sep 2024. (Emphasizes data quality: “high-quality data > hardware or algorithms”) (Rotational Labs | Recapping PyTorch: Key Takeaways from the 2024 Conference)
Emami et al., “GneissWeb: Preparing High Quality Data for LLMs at Scale,” arXiv Feb 2025. (IBM’s 10T-token dataset with dedup & ensemble filters; outperforms smaller open sets) (GneissWeb: Preparing High Quality Data for LLMs at Scale)
Penedo et al., “The FineWeb Datasets: Decanting the Web for the Finest Text Data,” arXiv June 2024. (HuggingFace’s 15T-token web dataset; thorough filtering & ablations) (The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale)
He et al., “SoftDedup: Efficient Data Reweighting for LLM Pre-training,” arXiv July 2024. (Proposes soft deduplication to speed up training without information loss) (SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training)
Duque et al., “Toxicity of the Commons: Curating Open-Source Pre-Training Data,” arXiv Oct 2024. (Open-source pipeline for toxicity filtering with multi-level classifier and content re-writing) (Toxicity of the Commons: Curating Open-Source Pre-Training Data)
IBM Research, “A Toxic Language Filter Built for Speed,” IBM Blog Sep 2024. (Details the Granite Guardian 38M/125M models for real-time hate speech filtering in LLM pipelines) (IBM open sources fast HAP filter on Hugging Face - IBM Research)
AWS Machine Learning Blog, “Preparing your own dataset for LLM training,” Dec 2024. (Practical guide to data extraction from PDFs, HTML, etc., and common filtering steps) (An introduction to preparing your own dataset for LLM training | AWS Machine Learning Blog)
Raschka, “Noteworthy LLM Research Papers of 2024,” Jan 2025. (Highlights key 2024 papers; discusses scaling laws, FineWeb vs others, Llama-3, etc.) (Noteworthy LLM Research Papers of 2024)
Ying et al., “A Survey of Datasets in Medicine for LLMs,” Intell. Robot. Dec 2024. (Reviews medical training corpora; notes PubMed, MIMIC, etc. as crucial datasets) (A survey of datasets in medicine for large language models)
Colombo et al., “SaulLM-7B: A Large Language Model for Law,” arXiv Mar 2024. (Legal domain model trained on 30B tokens of legal text; demonstrates domain fine-tuning) (SaulLM-7B: A pioneering Large Language Model for Law)