ML Interview Q Series: What are some common practices for preprocessing data for LLMs?
Comprehensive Explanation
Preprocessing for large language models (LLMs) typically entails several steps to ensure the data is consistent, relevant, clean, and standardized. High-quality preprocessing can improve model performance and training stability, reduce noise, and optimize computational resources.
Data collection and data inspection. It is critical to inspect raw data, understand its format, and evaluate potential issues such as the presence of HTML tags, irrelevant metadata, or mixed languages. In particular, large public datasets often contain a mix of text types, including code, forum threads, and social media content.
Cleaning and normalization. Common text cleaning tasks include:
Removing excessive white spaces or invalid characters.
Converting special Unicode characters to a standard format.
Stripping out HTML, XML, or other markup if it is not desired in the training data.
Normalizing punctuation and accent marks to maintain consistency.
Deciding whether to convert text to lowercase (depending on the language and the type of model, this may or may not be desirable, especially for multi-lingual or code-related contexts).
Deduplication. Data deduplication is typically crucial for large corpora. Repetitive or duplicated sequences can bias the model or waste compute. Approaches range from straightforward string hashing to more sophisticated near-duplicate detection (e.g., hashing sub-sequences to find partial duplicates).
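Below is a minimal sketch of exact-match deduplication via content hashing; the light normalization before hashing (lowercasing, collapsing whitespace) is an illustrative choice, and near-duplicate detection would require the fingerprinting techniques discussed later.

import hashlib

def deduplicate_exact(documents):
    # Keep the first occurrence of each document; drop exact duplicates.
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        # Light normalization so trivial whitespace/case differences
        # still count as duplicates (an illustrative choice).
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs

print(deduplicate_exact(["Hello  world", "hello world", "A different document"]))
# ['Hello  world', 'A different document']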
Filtering out low-quality data and applying domain-specific heuristics. A common practice is removing content that is purely random or noise, such as large blocks of repeated text or machine-generated text from older logs. If building a domain-specific model, it might be necessary to remove content outside that domain.
Segmenting and chunking. LLMs have maximum sequence length constraints. In many cases, documents must be split into segments that fit the context window. Careful chunking strategies can maintain continuity (e.g., chunking on sentence boundaries or paragraphs) to prevent abrupt transitions that might harm the model’s ability to capture context.
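As a rough illustration, the snippet below splits a document on blank-line paragraph boundaries and greedily packs consecutive paragraphs into chunks under a character budget; the character limit is a stand-in for the token limit a real pipeline would enforce with its tokenizer.

def chunk_by_paragraph(document, max_chars=2000):
    # Split on blank lines (paragraph boundaries) and pack consecutive
    # paragraphs into chunks that stay below max_chars.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks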
Tokenization. Modern LLMs frequently rely on subword-based tokenizers (e.g., Byte Pair Encoding, WordPiece, SentencePiece). These tokenizers handle out-of-vocabulary terms by splitting them into smaller subword units. The tokenizer is typically trained on a large corpus to build a subword vocabulary, then consistently applied to transform raw text into integer token IDs.
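For instance, a hedged sketch of training and applying a subword tokenizer with the SentencePiece library might look as follows (this assumes the sentencepiece package is installed; the corpus path, model prefix, and vocabulary size are placeholder choices):

import sentencepiece as spm

# Train a BPE subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path
    model_prefix="tokenizer",    # produces tokenizer.model / tokenizer.vocab
    vocab_size=8000,             # placeholder size
    model_type="bpe",
)

# Load the trained model and convert raw text into integer token IDs.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
token_ids = sp.encode("Preprocessing text for LLMs.", out_type=int)
print(token_ids)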
Removing personal identifiable information (PII). For ethical and sometimes legal reasons, it is common to either remove or mask PII such as phone numbers, emails, addresses, or full names. This helps ensure responsible use and mitigates privacy concerns.
Language identification and consistency. In multilingual settings, it is helpful to detect the language of each text snippet to maintain either separate language-based pipelines or consistent tokenization strategies. This can reduce noise and confusion when different scripts or languages are mixed.
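A lightweight language-routing step might look like the sketch below, assuming the third-party langdetect package is available (any comparable detector, such as a fastText language-ID model, could be substituted):

from langdetect import detect  # third-party package, assumed to be installed

def route_by_language(snippets):
    # Group text snippets by detected language so each language can get
    # its own normalization/tokenization pipeline downstream.
    buckets = {}
    for text in snippets:
        try:
            lang = detect(text)
        except Exception:  # langdetect raises on empty or undetectable input
            lang = "unknown"
        buckets.setdefault(lang, []).append(text)
    return buckets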
Handling special tokens and metadata. The training process may require special tokens such as <bos>, <eos>, <sep>, or custom prompt templates. Preprocessing ensures that these markers are inserted consistently to guide the model. Where relevant, metadata like “page title” or “document ID” can be converted to standardized tokens to let the model learn from contextual information.
Train-validation-test split. Even with massive corpora, it is crucial to maintain a clean separation between training, validation, and test sets to evaluate performance reliably and prevent data leakage. For LLMs, this split is sometimes done by random sampling of documents, ensuring that entire documents remain within one split.
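One simple way to keep every record from a document in a single split is to hash a stable document identifier and assign splits by the hash value, as in the sketch below (the 98/1/1 ratios are arbitrary):

import hashlib

def assign_split(doc_id, train_frac=0.98, valid_frac=0.01):
    # Hash the document ID into [0, 1) so the assignment is deterministic
    # and all content from the same document lands in the same split.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) / 16**32
    if bucket < train_frac:
        return "train"
    if bucket < train_frac + valid_frac:
        return "validation"
    return "test"

print(assign_split("https://example.com/article/42"))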
Data compression and caching. Given that training LLMs can require billions of tokens, storing the preprocessed data efficiently in a compressed or binary format (e.g., TFRecords, Hugging Face Datasets) can speed up reading times and reduce disk space usage.
Below is a brief example of a basic Python snippet demonstrating partial text cleaning:
import re

def basic_clean_text(text):
    # Remove HTML tags
    text_no_html = re.sub(r"<.*?>", "", text)
    # Normalize whitespace
    text_no_whitespace = " ".join(text_no_html.split())
    # Optionally remove non-alphanumeric characters (depends on use case)
    text_alphanumeric = re.sub(r"[^a-zA-Z0-9\s.,!?;:]", "", text_no_whitespace)
    return text_alphanumeric

sample_text = "<p>Hello World!!! </p> <br> This is a test!!"
cleaned_text = basic_clean_text(sample_text)
print(cleaned_text)
This example shows a simplified approach. Real-world pipelines often require far more sophisticated preprocessing, including language identification, advanced tokenization, and domain-specific heuristics for quality filtering.
What if the data contains multiple languages or code?
One must consider whether to maintain a single multilingual tokenizer or train specialized tokenizers per language. In many practical setups:
A single SentencePiece or WordPiece tokenizer is trained on a mix of languages to capture relevant subwords across scripts.
Language detection is used to separate or tag data, especially if certain languages have minimal representation. This ensures each language gets enough coverage in the model’s learned representations.
For code, some datasets may preserve indentation, syntax tokens, or special formatting. This can enhance the model’s ability to generate and reason about code accurately.
How do we avoid data leakage and contamination?
Data contamination occurs when test or validation data inadvertently ends up in the training set. For large-scale corpora, it is easy to have overlaps if data is web-scraped from multiple sources. Deduplication and careful dataset splitting by URL or document ID help mitigate contamination. It is wise to maintain unique identifiers for each document or data record so that all content from a given source is constrained to one split.
How to handle very large documents?
When working with books or extremely long documents, chunking is typically done at the paragraph or section level. The trade-off is ensuring that each chunk retains enough context while respecting the maximum sequence length. Overlapping windows can be used if context near the boundaries is important. In practice, too much overlap can inflate the dataset size, so a balance is necessary.
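The sketch below shows one way to slice a long sequence of token IDs into fixed-size windows with a configurable overlap; the window and overlap sizes are illustrative.

def sliding_windows(token_ids, window_size=1024, overlap=128):
    # Window starts advance by (window_size - overlap), so neighboring
    # chunks share `overlap` tokens of boundary context.
    stride = window_size - overlap
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break
    return windows

print([len(w) for w in sliding_windows(list(range(3000)))])  # [1024, 1024, 1024, 312]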
How do we handle sensitive content or PII?
Many organizations develop custom filters to identify sensitive personal data such as phone numbers, home addresses, or email addresses. Potential strategies include:
Masking (e.g., substituting the information with a placeholder).
Complete removal of lines or paragraphs containing PII.
Automated scripts (regex-based or ML-based) to detect and remove such content; a simplified regex-based sketch appears below.
There can also be legal or ethical compliance requirements, such as GDPR or HIPAA, necessitating specialized pipelines for data filtering and logging.
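The snippet below is a minimal regex-based masking pass; the patterns are simplified illustrations that will miss many real-world formats, and production pipelines typically combine such rules with ML-based entity detection.

import re

PII_PATTERNS = {
    # Simplified illustrative patterns; production rules are far broader.
    "<EMAIL>": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "<PHONE>": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text):
    # Replace each detected PII span with its placeholder token.
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(mask_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# Contact <EMAIL> or <PHONE>.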
Could you provide an example of advanced data deduplication?
A robust approach might involve hashing entire documents or paragraphs. More advanced techniques compare n-gram fingerprints. For instance:
Split text into overlapping n-grams.
Hash each n-gram to create a fingerprint.
Compare fingerprints among documents or paragraphs for near-duplicate identification.
If the overlap surpasses a threshold (e.g., 80% common hashed n-grams), the text can be considered a duplicate or near-duplicate, as illustrated in the sketch below. This ensures partial overlaps or minor modifications do not flood the training set with repetitive content.
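A small sketch of this idea using word-level n-gram fingerprints and a Jaccard-style overlap is shown below; the n-gram size and 0.8 threshold are illustrative, and production systems usually rely on MinHash/LSH to avoid exhaustive pairwise comparison.

def ngram_fingerprint(text, n=5):
    # Hash every overlapping word n-gram into a set of fingerprints.
    words = text.lower().split()
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

def is_near_duplicate(text_a, text_b, n=5, threshold=0.8):
    # Compare fingerprint sets with Jaccard overlap; above the threshold,
    # treat the pair as near-duplicates.
    fp_a, fp_b = ngram_fingerprint(text_a, n), ngram_fingerprint(text_b, n)
    if not fp_a or not fp_b:
        return False
    overlap = len(fp_a & fp_b) / len(fp_a | fp_b)
    return overlap >= threshold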
How do we handle domain-specific special tokens and metadata?
When data contains extra fields—like user ID, timestamps, or location info—some scenarios benefit from exposing these fields as special tokens so the model can leverage that context. In other cases, the fields might not be beneficial and are removed to prevent confusion. If used, these tokens might be prefixed or bracketed, for example: <user_id_123> Hello everyone. Consistency is key, so the model reliably learns how to interpret them.
Follow-up Question: How might we decide if lowercasing is necessary?
Lowercasing can reduce vocabulary size for certain languages (e.g., English), simplifying modeling. However, it removes case-specific signals, which might be crucial for named entities, acronyms, or code. If the LLM is intended for formal text, preserving case is typically beneficial. For languages where capitalization itself carries grammatical information (e.g., German, which capitalizes all nouns), lowercasing can degrade performance. An empirical approach is often best: train smaller prototypes with and without lowercasing, evaluate, and compare downstream tasks.
Follow-up Question: What are some strategies to ensure our tokenizer handles rare or new words effectively?
Training a subword tokenizer is often done on large corpora to capture frequent and semi-frequent word segments. For truly rare or emerging words:
The subword mechanism naturally decomposes them into smaller units.
A careful choice of vocabulary size helps balance coverage of frequent tokens with the ability to represent infrequent or out-of-vocabulary (OOV) words.
If frequent domain-specific terms exist (e.g., biomedical jargon), it can be advantageous to include domain data when training the tokenizer, ensuring subword merges align well with these specialized tokens.
Follow-up Question: What if we have extremely noisy data, like forum chats with random characters or emojis?
Forum or social media data can contain irregularities, such as repeated punctuation, emoticons, or random ASCII art. Several approaches may be taken:
Removing or mapping emojis to textual placeholders, for example converting “:smile:” to <smile_emoji>.
Trimming or replacing repeated characters like “heyyyyy” to “hey.”
Preserving them if they convey sentiment or valuable context.
Which option is right depends on the LLM’s intended usage domain (e.g., a chatbot might benefit from the presence of emojis or elongated words). Always evaluate the effect of these transformations in context. Over-sanitizing data can reduce the model’s understanding of how real users interact in online spaces.
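As a sketch, the snippet below collapses elongated characters and maps a couple of emoticons to placeholder tokens; the mapping table is purely illustrative and far from exhaustive.

import re

# Purely illustrative mapping; a real pipeline would cover far more symbols.
EMOJI_MAP = {
    ":)": "<smile_emoji>",
    ":smile:": "<smile_emoji>",
    ":(": "<sad_emoji>",
}

def clean_noisy_text(text):
    # Map known emoticons/shortcodes to placeholder tokens.
    for symbol, placeholder in EMOJI_MAP.items():
        text = text.replace(symbol, f" {placeholder} ")
    # Collapse characters repeated 3+ times ("heyyyyy" -> "heyy").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Normalize whitespace introduced by the substitutions above.
    return " ".join(text.split())

print(clean_noisy_text("heyyyyy :) that was sooooo good!!!!"))
# heyy <smile_emoji> that was soo good!!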
Follow-up Question: Are there risks in over-filtering data?
Yes. Overly aggressive filtering can remove helpful linguistic structures, domain-specific expressions, or subtle cues in text. For instance, if a filter incorrectly identifies certain domain-specific jargon as “noise,” the model may lose valuable data. Similarly, filtering out all non-English segments could hamper a multilingual model’s performance. An iterative approach—evaluate the impact of filters, measure downstream performance, and refine thresholds—helps strike a balance between cleanliness and coverage.
Below are additional follow-up questions
How do we handle outlier content or "shock content" that might appear in web-scraped data?
One strategy is to implement specialized filtering and safety checks. For instance, you could develop classification models or lexicon-based approaches to detect offensive or highly disturbing content. Once detected, you might remove it entirely from your training corpus, or if partially relevant (e.g., discussing a controversial topic), mask or redact specific segments while preserving broader context.
An important pitfall is the risk of over-filtering. If the classification threshold is too sensitive, you might strip away valuable context about how real users communicate, which could degrade a chatbot’s ability to handle difficult or sensitive conversations. Conversely, under-filtering could result in harmful or shocking content persisting in the dataset, potentially causing the model to learn undesirable behaviors or produce offensive outputs. Thorough testing, refining classification thresholds, and regular human-in-the-loop audits are best practices for ensuring the correct balance between preserving diverse, authentic data and removing harmful extremes.
If we have multiple versions of the same data with different formatting, how do we unify them?
The key is developing a normalization layer that consistently processes formatting differences before tokenization. This can include:
Stripping or homogenizing markup (e.g., removing extra HTML tags or converting them to a uniform format).
Converting bullet lists or enumerations to a consistent representation, possibly preserving some structure if it aids downstream tasks.
Reconciling conflicting metadata fields or ensuring the same metadata field has a standardized name across versions.
A subtle pitfall is accidentally altering the meaning of text if, for instance, you remove certain markup that indicates emphasis or important domain-specific notes. For example, in code-related data, whitespace or indentation could be critical. In a multilingual setting, you must ensure that normalization rules do not inadvertently remove crucial diacritical marks. Ideally, you measure the final distribution of data features (like average length, presence of markup tokens) to confirm each version has been homogenized appropriately without losing key context.
In terms of memory usage and computational constraints, how do we optimize preprocessing for extremely large corpora?
One optimization is streaming-based processing. Rather than loading the entire dataset in memory, you can process data in small batches (e.g., line by line) and write out intermediate results incrementally. This way, you avoid memory exhaustion. Further efficiency gains come from:
Parallel processing: Splitting the data into chunks across multiple CPU cores or machines.
Using distributed systems like Apache Spark or Ray to handle large-scale text transformations.
Compressing the intermediate outputs with efficient formats (e.g., TFRecord, Parquet) that support fast sequential reads.
A common pitfall is making unnecessary multiple passes over the data. If tokenization and cleaning steps can be combined into one pass, you save considerable time and resources. Also, if deduplication is implemented naively (e.g., storing huge sets of n-grams in memory), you can run out of space quickly. Careful selection of hashing or approximate data structures (such as Bloom filters or MinHash) can reduce memory overhead.
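A rough sketch of single-pass, streaming-style cleaning and exact deduplication is shown below; an in-memory set of content hashes stands in for the Bloom filter or MinHash index a production system might use.

import hashlib

def stream_preprocess(input_path, output_path):
    # Read line by line, clean, drop exact duplicates, and write incrementally,
    # so memory usage is bounded by the set of content hashes.
    seen = set()  # a Bloom filter could replace this to cap memory further
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = " ".join(line.split())
            if not cleaned:
                continue
            digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            dst.write(cleaned + "\n")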
How do we systematically evaluate the quality of our preprocessed data before training an LLM?
One approach is to establish quantitative and qualitative benchmarks. Quantitatively, you might compute:
Token distribution: Are the top tokens as expected? Do we see an overabundance of special tokens or strange fragments?
Vocabulary coverage: Check if common words or subwords for your target domain are included at a reasonable frequency.
Document length statistics: Ensure the average/median length aligns with expectations.
Language distribution: In a multilingual dataset, confirm each language is represented proportionally.
Qualitatively, you can sample random documents and manually inspect them to see if the text is coherent, free of undesired noise or personal data, and consistent with your domain goals. A pitfall here is only sampling a small fraction; if your dataset is massive, you might miss corner cases. Automated anomaly detection (e.g., computing perplexity using a smaller “sanity-check” model or a language detection model for out-of-place text) can help flag suspicious clusters.
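A small sketch of the quantitative side of such checks appears below, computing document-length statistics and the most frequent tokens over a sample; whitespace splitting is a crude stand-in for the real tokenizer, and a full audit would track far richer metrics.

from collections import Counter
import statistics

def corpus_stats(documents, top_k=20):
    # Whitespace tokenization here is only a rough proxy for the real tokenizer.
    lengths = [len(doc.split()) for doc in documents]
    token_counts = Counter(token for doc in documents for token in doc.split())
    return {
        "num_docs": len(documents),
        "mean_len": statistics.mean(lengths) if lengths else 0,
        "median_len": statistics.median(lengths) if lengths else 0,
        "top_tokens": token_counts.most_common(top_k),
    }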
Are there any pitfalls with text standardization across languages that do not share the same script?
Yes. When languages rely on different character sets (e.g., Arabic, Chinese, Russian), naive approaches such as removing diacritics might break essential letters or cause loss of meaning. Similarly, normalizing text that uses compound characters or different forms of the same character might inadvertently alter meaning. If you apply a single standard across languages, you risk damaging data for languages that rely on script-specific markers or accent marks.
A best practice is to maintain separate normalization pipelines or rulesets per language group. For instance, keep diacritics in French, Spanish, or Vietnamese text but handle them differently for transliterated content in informal settings. The pitfall is having partial coverage of your languages or mixing them inadvertently. You might rely on language identification to route text to the correct normalization procedure, reducing the chance of misapplication of rules.
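One way to express this routing is a per-language table of normalization functions, as sketched below; the specific rules (NFC everywhere, accent stripping only for a hypothetical transliterated-ASCII route) are illustrative rather than recommendations.

import unicodedata

def normalize_default(text):
    # Canonical composition only; preserves diacritics and scripts.
    return unicodedata.normalize("NFC", text)

def normalize_strip_accents(text):
    # Illustrative rule for a transliterated/ASCII-oriented route only;
    # applying this to French or Vietnamese text would destroy information.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

NORMALIZERS = {
    "fr": normalize_default,
    "vi": normalize_default,
    "ar": normalize_default,
    "ascii-translit": normalize_strip_accents,  # hypothetical route name
}

def normalize(text, lang):
    # Route each snippet to its language-specific ruleset, defaulting to NFC.
    return NORMALIZERS.get(lang, normalize_default)(text)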
How do we address the issue of misaligned text and labels or metadata?
Misalignment often arises when text data has been shifted, truncated, or incorrectly joined with corresponding labels or metadata. One example is a dataset of news articles linked to incorrect publication dates or authors. Another is a conversation dataset where user and system messages are out of sequence.
Prevention is best: maintain robust data pipelines that preserve alignment from the data’s original source. If reindexing or splitting occurs, store an ID with each text snippet to track it back to original metadata.
When misalignment is discovered, you could:
Build automated checks that compare text features (like mention of certain date references) to the associated metadata (the date field).
Use cross-checking with external references to see if the text content’s domain matches the labeled domain.
Implement re-alignment scripts that attempt to match partial text to known meta fields.
The major pitfall is ignoring mild misalignments, which can systematically degrade training. Models trained on misaligned data might learn spurious correlations (e.g., incorrectly associating specific authors with certain text).
How do we handle noise from user-generated content in structured or semi-structured data?
User-generated content often appears in platforms like forums or comment sections, where posts may be peppered with markup artifacts, emoticons, or incomplete sentences. A robust approach includes:
Parsing out HTML or JSON structures if the data is stored in a structured format.
Applying cleaning heuristics (e.g., removing repeated punctuation or random keyboard spam).
Preserving useful indicators like emoticons or hashtags if they convey sentiment or topic context, but standardizing them to consistent placeholders if they vary wildly in form.
A hidden pitfall is removing content that might reflect actual user language patterns, which are helpful for a chatbot or social media analysis. Another pitfall is failing to remove insecure user tokens or session IDs that might appear in the data and lead to privacy breaches. Carefully balancing data authenticity and cleanliness is essential.
How do we handle domain adaptation effectively through preprocessing steps?
Domain adaptation typically involves mixing general corpora with domain-specific data (for instance, medical text) in a controlled ratio. The preprocessing steps may differ for each domain. For the specialized domain, you might preserve more technical jargon, avoid lowercasing acronyms, or retain special formatting (e.g., chemical formulas). For the general domain, you might apply more aggressive cleaning.
To ensure the model does not forget general language patterns, consider strategies like:
Maintaining separate tokenizers or vocab expansions that incorporate domain-specific terms.
Fine-tuning with domain data but using a general pretraining corpus for initialization.
A subtle pitfall is that domain data might be significantly smaller, leading the model to overspecialize if it is overrepresented in training. Conversely, if domain data is too diluted within a large general corpus, the model may not adapt sufficiently. Monitoring domain-relevant metrics and adjusting mixing ratios is advisable.
How do we handle text from older sources or historical contexts that uses archaic language, and what challenges might arise from that?
Historical texts or older documents often contain archaic spelling, typography, or obsolete characters. If you want your model to understand or generate text from that era:
You might need a specialized tokenizer or a custom mapping that recognizes archaic spellings or glyphs. For instance, older English texts might use “ſ” (long s) or archaic forms of words.
You could treat archaic words as distinct tokens to capture their historical usage, or map them to modern equivalents if modern interpretation is desired.
A major challenge is that these texts might have inconsistent orthography even within the same document. The pitfall is inadvertently splitting a single archaic term into random subwords or normalizing away archaic tokens that are actually central to understanding. If the goal is modern textual modeling, partial normalization might be okay, but if the goal is historical textual analysis, losing these archaic markers undermines authenticity. For bilingual or multilingual historical texts, careful language identification is crucial, as archaic forms might be poorly recognized by standard detection methods.
How do we handle dynamic or streaming data sources that might continuously be updated, and how does that affect our preprocessing pipelines?
When new data arrives in real time (e.g., social media streams, logs), your preprocessing pipeline must be adaptive and possibly incremental. Key steps include:
Incremental tokenizer updates or dynamic vocab expansions (though this can be non-trivial for subword tokenizers, as their vocab is typically fixed after training).
Automated filtering for PII, shock content, or other undesired material that appears in the new data.
Scheduling frequent re-runs or “mini-batches” of data ingestion and cleaning to keep the training corpus fresh.
One pitfall is version control of the tokenizer. If you significantly modify your tokenizer mid-training, it can disrupt consistency with earlier data. Another potential issue is concept drift—if the domain or language usage changes rapidly (for example, new slang terms, trending topics, or new named entities), the model might fail to keep up. Monitoring domain performance metrics and periodically re-training or fine-tuning is an effective strategy.