ML Interview Q Series: How do you perform feature engineering on unknown features?
Comprehensive Explanation
When encountering unknown features, it usually means that at inference time or during late integration of data, new features (or previously unseen categories) appear that did not exist or were not recognized during training. These could be completely new columns in tabular data, new tokens in text data, or new categories in a categorical feature. Handling these unknown features effectively is important to ensure your model stays robust and can handle production data that differs from the training distribution.
One strategy is to identify whether these unknown features carry meaningful signal or are simply noise or errors. If they are noise, they might be excluded. If they do carry signal, they require a systematic method of integration into the model. Common techniques include hashing-based transformations, dynamic embeddings, or fallback encoding strategies.
Hashing Trick for Handling Unknown Categorical Features
A popular approach to deal with unknown categories or dynamic feature spaces is to use a hashing-based technique. Instead of maintaining a dedicated vocabulary or one-hot encoder for every possible new feature, you can compute a hash value for the feature and map it into a fixed-sized vector or a fixed index. This method is commonly called the Hashing Trick. It provides a stable way to transform large or unseen feature sets into manageable embeddings or indexes.
The core formula for mapping a feature value into one of K possible hash buckets can be written as:

index = hash(feature_value) mod K

Here, feature_value is the categorical string or other identifier representing the feature, and hash() is any chosen hashing function (e.g., MD5, SHA, or a fast rolling hash). K is the number of hash buckets, which defines the dimension or indexing range. The resulting index is the bucket number that can be used as a column index in a feature vector or for embedding lookups.
Once the bucket index has been calculated, you can either store a 1 in that index of a sparse vector for a linear model or retrieve an embedding from a trainable embedding table if you are using a deep learning model.
Below is a simple Python code snippet that demonstrates a hashing approach for a feature value, returning an index in a fixed range:
import hashlib

def hash_feature_value(feature_value, num_buckets=1000):
    # Convert the string to bytes for hashing, then take the hexdigest,
    # convert it to an integer, and take the modulus.
    return int(hashlib.sha256(feature_value.encode('utf-8')).hexdigest(), 16) % num_buckets

# Example usage:
feature_val = "some_new_category"
index = hash_feature_value(feature_val, num_buckets=1000)
print("Hashed index:", index)
Handling Unknown Features in Text Data
In text data, you often face out-of-vocabulary tokens that were not present during training. Modern tokenization approaches, like subword tokenizers (e.g., Byte-Pair Encoding or WordPiece), automatically break new words into known subwords. This mitigates the unknown token issue. With older or simpler tokenizers, you might map any unknown token to a special [UNK] bucket, or apply the hashing trick at the character or n-gram level.
Incremental Learning and Dynamic Embeddings
In some systems, you might have the capability to update your model on-the-fly or at regular intervals with newly collected data. This approach allows you to incorporate newly discovered features into the model vocabulary over time. Dynamic embedding tables in deep learning frameworks can expand to accommodate new categories if the underlying architecture and training pipeline support it. However, this can be computationally expensive and complicated to maintain in a production environment, so hashing-based solutions are still favored when the feature space can grow arbitrarily.
Important Considerations and Best Practices
When handling unknown features, be aware of potential collisions if you rely on hashing. Different feature values might hash to the same bucket. If collisions are excessive, it may degrade model performance. Increasing the number of buckets can reduce collision probability at the cost of a higher dimensional feature space or a larger embedding table.
In many real-world pipelines, you might store a “fallback” category such as “UNKNOWN” for all unseen categories, which is then learned as a single embedding or index. This method is simpler but coarser and may lose potentially valuable information from distinct unknown categories.
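A minimal sketch of this fallback strategy, assuming a small category_to_index mapping built from the training data (the names and values here are purely illustrative):

# Vocabulary built at training time; index 0 is reserved for UNKNOWN.
category_to_index = {"UNKNOWN": 0, "red": 1, "green": 2, "blue": 3}

def encode_category(value, mapping=category_to_index):
    # Any category not seen during training collapses into the single UNKNOWN index.
    return mapping.get(value, mapping["UNKNOWN"])

print(encode_category("green"))   # 2
print(encode_category("purple"))  # 0 -> UNKNOWN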
Follow-up Question: How do you choose between hashing trick and maintaining a dictionary of feature values?
Maintaining a dictionary of every possible feature value requires your system to know in advance all values it will encounter. This can be infeasible when the feature space is very large or if you expect new feature categories to appear over time.
The hashing trick is advantageous if you have:
Extremely large or unbounded feature spaces.
High likelihood of new feature categories at inference time.
Memory constraints that make storing a huge dictionary prohibitive.
On the other hand, if your feature space is stable and you have sufficient resources to store a mapping of all categories, maintaining a dictionary might produce better performance by avoiding collisions. You also gain interpretability when you have an explicit mapping of feature categories to indexes.
Follow-up Question: How can you handle unknown feature columns that appear after model deployment?
Unknown feature columns can appear if your data pipeline changes or if new types of data start to be collected. The best strategy depends on whether these new columns contain genuinely valuable information or might be spurious or correlated with existing features.
One approach is to retrain or fine-tune your model to incorporate those new columns if they are relevant. Before that, you might need to perform exploratory data analysis to see if the new columns are reliable and correlated with your target. If dynamic retraining is expensive or infeasible, you can build a fallback procedure that maps any unknown column to a special representation, possibly ignoring it or applying a generic transformation (like hashing it into a single representation).
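If retraining is not yet an option, a minimal defensive sketch is to align incoming rows to the training-time schema, logging and dropping unexpected columns and filling missing ones. The column names below are hypothetical and pandas is assumed:

import pandas as pd

# Columns the model was trained on, saved alongside the model artifact.
TRAIN_COLUMNS = ["age", "country", "num_purchases"]

def align_to_training_schema(df: pd.DataFrame) -> pd.DataFrame:
    extra = set(df.columns) - set(TRAIN_COLUMNS)
    if extra:
        # Log unknown columns so they can be evaluated for a future retrain.
        print(f"Ignoring unknown columns: {sorted(extra)}")
    # Keep only known columns, in training order; fill any missing ones with 0.
    return df.reindex(columns=TRAIN_COLUMNS, fill_value=0)

batch = pd.DataFrame([{"age": 31, "country": "DE", "new_sensor_reading": 0.7}])
print(align_to_training_schema(batch))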
Follow-up Question: What are potential pitfalls of hashing for unknown features?
Although hashing is powerful, the following pitfalls can arise:
Collisions: With large numbers of possible feature values (or new columns), collisions can become frequent if the number of buckets is not sufficiently large. This can degrade model performance.
Interpretability: Hashing removes the direct mapping from a feature value to an interpretable index. Debugging and understanding model behavior becomes more complex.
Distribution Shifts: If your training data did not contain certain key feature values or columns, your model may fail to generalize well to them, even if hashed. Representation learning is only as good as the data it sees during training.
Follow-up Question: How do you ensure your model remains robust as more unknown features appear over time?
Some best practices:
Monitor the distribution of features in production to see if completely new categories or columns appear. Automated checks can flag distribution shifts (a minimal monitoring sketch follows at the end of this answer).
Consider a pipeline for periodic model re-training if new features or categories become pervasive in the data.
Use embeddings or hashing approaches that gracefully handle previously unseen categories.
Build fallback representations for unknown or anomalous features. For instance, map them to an “unknown embedding” or a single index until re-training is possible.
This strategy makes your system more adaptive and less likely to fail catastrophically when encountering unexpected input.
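As a minimal sketch of the monitoring point above (the category names and alert threshold are illustrative), you can compare the categories seen in a recent production window against those seen at training time and raise a flag when the unseen fraction grows:

def unseen_category_rate(production_values, training_categories):
    # Fraction of production rows whose category was never seen during training.
    unseen = [v for v in production_values if v not in training_categories]
    return len(unseen) / max(len(production_values), 1)

training_categories = {"red", "green", "blue"}
production_values = ["red", "purple", "blue", "teal", "teal"]

rate = unseen_category_rate(production_values, training_categories)
if rate > 0.10:  # alerting threshold is a tunable assumption
    print(f"Warning: {rate:.0%} of values are unseen categories; consider retraining.")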
Below are additional follow-up questions
How can you determine if unknown features are simply noise or if they might contain important information?
The distinction between noise and potentially valuable information typically hinges on a combination of domain knowledge, statistical tests, and model-based checks:
Domain Knowledge and Data Exploration:
Consult with domain experts who can explain whether certain new features (columns or categories) are plausible or spurious. For instance, a new categorical feature that measures an aspect of user interaction might be relevant, whereas a newly introduced internal logging artifact might be irrelevant.
Perform basic descriptive statistics (mean, variance, missing-value rate) to gauge whether a new column exhibits suspiciously random patterns or actual variation correlated with known quantities.
Statistical or Correlation Tests:
Use chi-square tests (or ANOVA when comparing a categorical feature against a continuous target) to see whether the feature relates significantly to the target. A high p-value suggests no detectable association, while a low p-value points to structure worth modeling.
Check correlation or mutual information with the target. If the new feature displays strong correlation, it might be important.
Model-based Checks:
Retrain a small, simple model (like a decision tree or logistic regression) with and without the new feature and compare performance metrics on a validation set or via cross-validation (see the sketch at the end of this answer). Any meaningful boost may indicate the feature has genuine predictive power.
Use feature importance or SHAP values in a tree-based model to see if the feature ranks highly in importance.
Edge Cases and Pitfalls:
Overfitting: A new feature that appears valuable but is extremely sparse or only relevant for a small portion of data might cause overfitting on certain subsets.
Data Leakage: Some unknown features might inadvertently leak future information. Always verify that the new feature is genuinely available at inference time.
Shifts in Distribution: A sudden spike in unknown features might be a sign of data pipeline issues. If everything suddenly becomes “unknown,” this could be a data integration error.
By combining these strategies—domain knowledge, simple statistical checks, and model-based verification—you can make a more informed decision on the usefulness of unknown features.
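To make the model-based check above concrete, here is a minimal sketch using scikit-learn on synthetic data: it compares cross-validated accuracy with and without the new feature (the data and model choice are placeholders for your own pipeline):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X_known = rng.normal(size=(n, 3))           # existing features
x_new = rng.normal(size=(n, 1))             # the newly appeared feature
y = (X_known[:, 0] + 0.5 * x_new[:, 0] > 0).astype(int)  # toy target

baseline = cross_val_score(LogisticRegression(), X_known, y, cv=5).mean()
with_new = cross_val_score(LogisticRegression(),
                           np.hstack([X_known, x_new]), y, cv=5).mean()

print(f"Accuracy without new feature: {baseline:.3f}")
print(f"Accuracy with new feature:    {with_new:.3f}")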
What strategies can be used to reduce collisions in hashing-based encoding besides increasing the number of buckets?
When employing hashing-based techniques, collisions occur when multiple distinct feature values map to the same index. While increasing the number of buckets (K) is a straightforward way to reduce collisions, other methods exist:
Using Multiple Hash Functions (Bloom Embeddings):
Instead of using one hash function to map a feature value to a single index, you can use several independent hash functions. Each feature value is assigned multiple indices, giving the model a more distributed representation and reducing the probability that two distinct values collide across all hash functions (a minimal sketch appears at the end of this answer).
Revisiting Feature Grouping:
If certain subsets of features are known to be related, you can separate them into different namespaces or groups, each with its own hashing space. For example, hashing user-based features into one set of buckets, item-based features into another, and so on.
Frequency-Based Schemes:
Maintain counts of each feature’s frequency and dedicate separate indices for extremely common features (akin to storing them in a partial dictionary). Only less frequent or unseen features are hashed. This hybrid approach can significantly reduce collisions for popular categories.
Adaptive Bucket Allocation:
Some advanced systems dynamically reallocate buckets or split them if collisions exceed a certain threshold, although this is more complex and may require partial re-training or re-indexing.
Edge Cases and Pitfalls:
Memory/Speed Trade-off: Using multiple hashing spaces or significantly expanding the number of buckets may increase memory usage and slow down training/inference.
Implementation Complexity: Maintaining multiple hash functions, partial vocabularies, or dynamic bucket allocation adds engineering overhead.
Distribution Shifts: If new categories appear frequently, the hashing distribution might shift over time, necessitating re-checking collision rates.
By combining multiple hashing strategies with a good understanding of your data distribution, you can minimize collisions without infinitely expanding the number of buckets.
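Here is a minimal sketch of the multiple-hash-function idea from the list above, deriving several indices per value by salting a SHA-256 hash (the salts, bucket count, and number of hashes are arbitrary choices):

import hashlib

def multi_hash_indices(feature_value, num_buckets=1000, num_hashes=3):
    # Derive several "independent" hash functions by salting the input differently.
    indices = []
    for seed in range(num_hashes):
        salted = f"{seed}:{feature_value}".encode("utf-8")
        indices.append(int(hashlib.sha256(salted).hexdigest(), 16) % num_buckets)
    return indices

# Two distinct values are unlikely to collide on all of their indices;
# a model can sum or average the embeddings stored at these indices.
print(multi_hash_indices("some_new_category"))
print(multi_hash_indices("another_category"))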
How can dynamic embeddings be implemented in practice, and what complexities are involved?
Dynamic embeddings aim to adapt an embedding table over time to accommodate new features (e.g., new categorical values, new words in text data). In practice, you can implement them as follows:
Initial Embedding Table + Expansion Mechanism:
Start with an embedding table that includes all known categories from training. When new categories appear, allocate new rows in the embedding matrix on-the-fly.
You might initialize these new rows randomly or based on similar existing categories (a minimal sketch appears at the end of this answer).
Incremental Training or Fine-Tuning:
Update the model parameters—especially the newly added embedding rows—using recent data that includes the new categories.
Fine-tune incrementally if a full retraining is too costly. This typically involves re-running backpropagation on only the affected embeddings or on a smaller subset of the network.
Version Control and Backward Compatibility:
You must maintain a mechanism to handle lookups for older categories while adding new ones. If your production environment uses the same model weights, ensure that the newly added embedding rows do not shift the positions of existing embeddings.
Complexities & Pitfalls:
Unbounded Growth: If new categories keep appearing, embedding tables can grow to an unmanageable size, requiring pruning or consolidation.
Synchronization: Dynamic updates must be synchronized across distributed systems—both the inference engine and the training pipeline.
Catastrophic Forgetting: Focusing too heavily on updating new embeddings without continuing to see old examples can degrade performance on previously learned categories.
Data Pipeline Complexity: Handling dynamic updates in real-time or near real-time often requires sophisticated pipeline orchestration to ensure consistency.
Dynamic embeddings can be powerful for continually evolving data scenarios, but they come with substantial engineering overhead and risk of performance degradation if not carefully managed and periodically retrained or re-evaluated.
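A minimal PyTorch sketch of the expansion mechanism described above: new, randomly initialized rows are appended to the embedding matrix while existing rows keep their indices and values:

import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, num_new_rows: int) -> nn.Embedding:
    old_weight = old_emb.weight.data                 # (old_vocab, dim)
    dim = old_weight.shape[1]
    # Initialize new rows with small random values; existing rows are untouched.
    new_rows = torch.randn(num_new_rows, dim) * 0.01
    new_weight = torch.cat([old_weight, new_rows], dim=0)
    return nn.Embedding.from_pretrained(new_weight, freeze=False)

emb = nn.Embedding(num_embeddings=100, embedding_dim=16)
emb = expand_embedding(emb, num_new_rows=5)   # now supports indices 0..104
print(emb.weight.shape)                       # torch.Size([105, 16])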
How can partial incremental learning be integrated while preserving performance on the original data distribution?
Partial incremental learning (or online learning) allows you to update model parameters as new data arrives without a full retrain. For unknown features, you might add partial increments to handle them. Here’s a detailed outline:
Selective Parameter Updates:
Restrict fine-tuning to certain parts of the model (e.g., newly introduced embeddings or the last few layers) while freezing the rest (see the sketch at the end of this answer). This reduces the risk of catastrophic forgetting for the original training distribution.
Replay or Regularization:
Replay Mechanism: Keep a representative sample of past data (or synthetic examples) to periodically mix with new data. This helps the model retain knowledge of the older distribution.
Regularization Techniques: Apply constraints (e.g., L2 regularization, knowledge distillation) to penalize large deviations from previously learned weights.
Validation Across Multiple Splits:
Maintain separate validation sets representing both the old distribution and the newly appeared data. Optimize performance across both, ensuring you don’t ruin older accuracy while accommodating new features.
Edge Cases and Pitfalls:
Data Drift: If new data drastically differs from old data, partial incremental updates may fail to converge. A complete retraining might be needed eventually.
Limited Capacity: Neural models have limited capacity; a continuous stream of new features might lead to a scenario where the model needs a larger parameter space.
Complex Engineering: Implementing a well-structured replay buffer, scheduling incremental updates, and orchestrating data flows can be non-trivial in production.
Adopting partial incremental learning requires careful planning around data sampling, regularization, and validation to ensure older knowledge remains intact while accommodating new information.
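A minimal PyTorch sketch of the selective-update idea above, assuming the new categories live in an embedding table: freeze all parameters, then re-enable gradients only for that table before incremental fine-tuning:

import torch.nn as nn

model = nn.Sequential(
    nn.Embedding(num_embeddings=1000, embedding_dim=16),  # table holding new rows
    nn.Linear(16, 1),
)

# Freeze everything, then unfreeze only the embedding table.
for param in model.parameters():
    param.requires_grad = False
for param in model[0].parameters():
    param.requires_grad = True

print([name for name, p in model.named_parameters() if p.requires_grad])
# -> ['0.weight']  (only the embedding table will be updated)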
What are best practices to debug or interpret hashed features in production systems?
Debugging hashed features can be challenging since the mapping from feature values to bucket indices is opaque. To interpret them:
Keep a Logging Interface:
Maintain a log of the raw feature values and their corresponding hash indices during inference. If you notice suspicious model behavior, you can trace back which original values collided.
Hash Collisions Analysis:
Periodically run an offline analysis on a sample of production data to detect whether multiple high-frequency values share the same hash bucket (a minimal sketch appears at the end of this answer). If collisions are high, consider adjusting your hashing strategy.
Bucket-Level Statistics:
Compute metrics for each bucket (e.g., average label, average prediction, coverage ratio) to see if certain buckets are too broad or too diverse. This can help pinpoint collisions or uninformative buckets.
Partial Vocabulary for Important Features:
For interpretability on top categories, maintain a small dictionary that handles the most common or most impactful features explicitly. This preserves interpretability for a significant portion of your data while fallback hashing handles the rest.
Pitfalls and Edge Cases:
False Interpretation: Because multiple distinct values may share a bucket, any analysis at the bucket level might incorrectly conflate different categories. Use caution in drawing conclusions.
Data Pipeline Changes: If you alter the hashing function or the number of buckets, your old debug logs will no longer map consistently, making historical comparisons harder.
Efficient logging and periodic collision checks help you remain aware of how hashed features behave and where your system might require adjustments.
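A minimal sketch of the collision analysis described above, assuming you have logged a sample of raw feature values from production (the values and bucket count are illustrative):

import hashlib
from collections import defaultdict

def bucket_of(value, num_buckets=1000):
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16) % num_buckets

# Raw feature values sampled from production logs (illustrative).
logged_values = ["red", "green", "blue", "purple", "teal", "magenta", "navy"]

values_per_bucket = defaultdict(set)
for v in logged_values:
    values_per_bucket[bucket_of(v, num_buckets=5)].add(v)

for bucket, values in sorted(values_per_bucket.items()):
    if len(values) > 1:
        print(f"Collision in bucket {bucket}: {sorted(values)}")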
In what scenarios might ignoring unknown features be beneficial, and how can you systematically evaluate that choice?
Sometimes ignoring unknown features can be the most pragmatic strategy, especially if the cost of integration outweighs the potential gain. Here are some scenarios:
High Uncertainty About Data Quality:
If newly appearing features or categories seem to be artifacts of data corruption or logging issues, filtering them out altogether might preserve model stability.
Negligible Coverage:
When the unknown features show up in a minuscule fraction of data with no significant correlation to the target, ignoring them may simplify your pipeline without hurting performance.
Resource Constraints:
Handling unknown features can expand feature engineering and embedding sizes. If resources (memory, processing power) are severely limited, ignoring them might be necessary.
Systematic Evaluation:
Ablation Study: Compare model performance on a validation set with and without the unknown features integrated. If the difference is marginal, ignoring might be acceptable.
Risk Assessment: Evaluate the potential negative impact if some truly important signals are missed by ignoring unknown features.
Practical Logging: Continuously log these ignored features. If you see them becoming more frequent or correlated with target changes over time, reevaluate your decision.
Edge Cases and Pitfalls:
Future Value: A currently rare unknown feature might become more prevalent. By ignoring it now, you risk missing emerging patterns.
Mixed Approaches: You might partially ignore them (e.g., map them all to a special “ignored” category) while monitoring frequency. This is a middle ground that doesn’t fully discard them but treats them as a single fallback category.
Always weigh the potential predictive gain against engineering complexity and resource constraints to decide whether ignoring unknown features is justified.
How do you handle unknown features in time-series or streaming data contexts?
Time-series and streaming data often present an evolving feature space, making robust handling of unknown features essential:
Sliding Window Training:
Continuously update or retrain your model on the most recent data segment while discarding outdated data. Incorporate a mechanism to handle new categories or columns that appear in the window.
Online Hashing:
For fast-moving streams, rely on hashing to encode new feature categories quickly. This ensures real-time or near real-time adaptability. Monitor collision rates over time.
Adaptive Feature Pruning:
If you get repeated unknown columns that show minimal predictive power, implement an automated mechanism to prune them from your feature set to keep your model lean.
Temporal Drift Checks:
Maintain metrics on how often new features or categories appear. Sudden spikes could indicate distribution shifts or data integrity problems.
Edge Cases and Pitfalls:
Delayed Label Availability: In streaming contexts, the label might be available only after some time. Handling unknown features in a real-time system may thus require storing them until labels arrive, which complicates training.
Resource Management: Time-series or streaming data can be large. Continuous learning can become expensive if you need to incorporate every unknown feature. A hashing approach is often more scalable than building a dictionary for ephemeral features.
Ensuring robust pipeline monitoring and version control for time-series or streaming data solutions is crucial for preventing silent failures when new features arise unexpectedly.
How do you approach unknown features that emerge in a completely new domain (e.g., multi-modal system)?
Multi-modal systems combine various data types, such as text, images, audio, or tabular data. New features might arise from a domain that was never previously integrated:
Domain Transfer or Zero-Shot Learning:
If your model is designed for multi-modal inputs (e.g., text and images) and a new modality appears (e.g., audio), you may need a domain transfer approach. Some frameworks allow zero-shot learning for new domains by repurposing general embeddings or pretrained encoders.
Bridge Feature Extraction:
Convert new domain data into a familiar representation. For instance, if new image-like data appears, you could still apply a standard vision model to produce embeddings that your main model can interpret as another “feature vector.”
Exploratory Pilot Model:
Build a smaller, specialized model for the new domain to assess feasibility and potential gain. If it shows promise, integrate it into the main pipeline.
Edge Cases and Pitfalls:
Model Architecture Constraints: If your existing architecture was never built to handle new modalities, you might need a significant redesign.
Insufficient Training Data: A new domain might lack enough labeled examples, causing your integrated model to underperform or overfit.
Cross-Domain Feature Interactions: Subtle interactions between old and new domains can be hard to capture if the model was not designed for cross-modality fusion from the outset.
When new domains emerge, a thorough pilot study helps determine whether the cost and complexity of integrating them outweigh the potential performance benefits.
What are advanced out-of-vocabulary (OOV) detection methods for NLP beyond subword tokenization?
Although subword tokenization (BPE, WordPiece) has become standard, there are more advanced methods to handle OOV words:
Contextual Embeddings (e.g., ELMo, BERT, GPT):
These models assign embeddings based on surrounding context. Even if a word is rare or unseen, the model leverages context to generate a plausible embedding. This effectively reduces the unknown token problem.
Character-Level or Morphological Models:
Some NLP architectures (e.g., CharCNNs, morphological analyzers) break words into individual characters or morphological segments, allowing robust handling of unseen words. This is especially useful in morphologically rich languages.
Hybrid Approaches:
Combine subword tokenization with a fallback character-level encoder. If a word is not in the subword vocabulary, the character-level model handles it. This captures morphological nuances and reduces unknown tokens to almost zero (a minimal character-n-gram sketch appears at the end of this answer).
Neural Spelling Correction or Normalization:
In user-generated text scenarios, ephemeral words or misspellings can be mapped to the closest known tokens with a neural spelling corrector. This technique can reduce the amount of truly unknown tokens.
Edge Cases and Pitfalls:
Ambiguity: Rare tokens might appear in contexts that do not disambiguate them well, resulting in less reliable embeddings.
Computational Overhead: Character-level or morphological models can be computationally heavy. Hybrid methods may need specialized implementations.
Multiple Languages: If your data is multi-lingual, you might need multi-lingual subword vocabularies or language identification before OOV handling.
These advanced methods minimize the necessity of an explicit [UNK] token and provide richer handling of unexpected text inputs.
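As a minimal illustration of the hybrid idea above (not a production tokenizer), an out-of-vocabulary word can be decomposed into character trigrams that are hashed into a fixed index space, so even an unseen word still receives usable indices for embedding lookups:

import hashlib

def char_ngram_indices(word, n=3, num_buckets=5000):
    # Pad the word so boundary characters form their own n-grams.
    padded = f"<{word}>"
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return [int(hashlib.sha256(g.encode("utf-8")).hexdigest(), 16) % num_buckets
            for g in ngrams]

vocab = {"the", "model", "learns"}       # toy subword/word vocabulary
word = "hyperparametrization"            # assume this word is out-of-vocabulary
if word not in vocab:
    print(char_ngram_indices(word)[:5])  # indices usable for embedding lookups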
What are the computational trade-offs when using hashing-based approaches versus dictionary-based approaches for unknown features?
Deciding between hashing and dictionary-based methods involves balancing multiple factors:
Memory Footprint:
Hashing: You decide on a fixed number of buckets. Memory usage scales with your chosen dimension and does not grow with the number of unique feature values.
Dictionary: Memory usage scales linearly with the size of your vocabulary or number of categories. This can explode if new values are continually discovered.
Lookup and Training Complexity:
Hashing: A single hash operation (plus modulus) is generally O(1). Collisions, however, can degrade model performance.
Dictionary: Requires building and maintaining a map from feature values to indices. Lookups are O(1) with a hash map, but adding new categories means updating the dictionary and potentially re-indexing your data pipeline.
Collision vs. Accuracy:
Hashing: Collisions can negatively impact accuracy. Minimizing collisions often requires increasing the number of buckets, which increases memory usage.
Dictionary: There are no collisions by definition, so dictionary approaches can be more accurate when your feature space is not excessively large.
Handling Unseen Categories:
Hashing: Easily handles unseen categories without code changes, as they map to a bucket automatically.
Dictionary: Unseen categories need an update to the dictionary (or a fallback “unknown” category), complicating real-time systems.
Edge Cases and Pitfalls:
Large-Scale Systems: For extremely large data sets or online pipelines, dictionary-based approaches may become impractical if memory is insufficient or if new categories appear frequently.
Changing Data Distribution: If some categories become dominant or many new categories appear, the chosen hashing or dictionary approach might need revision to avoid performance drops or excessive memory usage.
Explainability: Dictionaries preserve a clear mapping for interpretability. Hashing obscures direct interpretation but is often more flexible.
Careful empirical testing on real production data typically clarifies which method—or hybrid approach—offers the best balance of accuracy, memory cost, and engineering complexity.