ML Interview Q Series: How do you encode Categorical Data which consists of both Ordinal and Nominal data types?
Comprehensive Explanation
Categorical features can be broadly grouped into ordinal and nominal. Ordinal data has an inherent ranking or order (for example, “low < medium < high”), whereas nominal data has no ordering (for example, “dog”, “cat”, “bird”). The key distinction between these two categories of features is whether you must preserve a meaningful ordering in the encoding. If the ordering is lost (as it would be if you simply assigned arbitrary integers to your ordinal categories), you lose valuable information. On the other hand, if you impose an ordering on nominal data, you introduce misleading relationships into your model.
Ordinal Data
The most straightforward approach for ordinal data is to represent each distinct category by an integer that preserves the ordering. For instance, if you have size categories {XS, S, M, L, XL}, you could map them as XS=0, S=1, M=2, L=3, XL=4. By assigning these numeric encodings, you retain the knowledge that XL > L > M > S > XS. This method is often called label encoding for ordinal data, but with the explicit intention of capturing the order. The numerical distances between these encoded values might not always directly translate to real-world distances, but at the very least, you have respected the rank relationship.
One subtlety is that if your ordinal categories are cyclical (for example, days of the week or months of the year), a naive integer mapping might still introduce a mismatch at the boundaries. In such cases, a cyclical transformation (like using trigonometric functions) can help better preserve the cyclic ordering. However, for simple ordinal data such as a satisfaction rating {“very unhappy”, “unhappy”, “neutral”, “happy”, “very happy”}, a direct integer mapping is usually sufficient.
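As a quick illustration of the cyclical case, here is a minimal sketch of a sine/cosine transform, assuming days of the week are coded as integers 0 through 6 (the column names are made up for this example):
import numpy as np
import pandas as pd
# Hypothetical column: day_of_week coded 0 (Monday) through 6 (Sunday)
days = pd.DataFrame({'day_of_week': [0, 2, 5, 6]})
# Map the cycle onto the unit circle so that 6 (Sunday) ends up adjacent to 0 (Monday)
period = 7
days['dow_sin'] = np.sin(2 * np.pi * days['day_of_week'] / period)
days['dow_cos'] = np.cos(2 * np.pi * days['day_of_week'] / period)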
Nominal Data
For nominal data, the categories have no explicit ordering. You can use a variety of encoding techniques that do not impose any rank. Some of the common approaches:
One-Hot Encoding
This is a popular method where each category in a nominal feature is mapped to a vector dimension (also known as a dummy variable). Suppose the feature “animal” can be “dog”, “cat”, or “bird”. One-hot encoding would create three binary columns. Each column corresponds to one possible category. For example:
dog -> [1, 0, 0]
cat -> [0, 1, 0]
bird -> [0, 0, 1]
This representation ensures that all categories are treated equally, without imposing any ordering. One downside is that if a feature has many possible categories, one-hot encoding can inflate the dimensionality.
Dummy Coding
This is effectively the same concept as one-hot encoding except that you typically drop one of the categories (sometimes called reference coding). For example, if you have three categories, you would only keep two columns. This helps avoid linear dependencies in linear models (the “dummy variable trap”).
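In pandas, reference coding can be obtained with get_dummies and drop_first=True (a small sketch; the dropped category is simply whichever level comes first):
import pandas as pd
animals = pd.DataFrame({'animal': ['dog', 'cat', 'bird', 'dog']})
# drop_first=True keeps k-1 dummy columns, using the dropped category as the reference level
dummies = pd.get_dummies(animals['animal'], prefix='animal', drop_first=True)
# Columns: animal_cat, animal_dog (animal_bird is the dropped reference)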
Count Encoding, Frequency Encoding, and Target Encoding
For high-cardinality features, you might resort to alternative encoding techniques. Count encoding replaces each category by the number of times it appears in the dataset. Frequency encoding replaces each category by its relative frequency. Target encoding replaces each category by some function of the target variable (e.g., the mean or probability of success). These methods reduce dimensionality but can risk overfitting if not handled carefully (for example, by using cross-validation or adding regularization).
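A minimal pandas sketch of count and frequency encoding (the 'city' column and its values are illustrative):
import pandas as pd
data = pd.DataFrame({'city': ['NY', 'SF', 'NY', 'LA', 'NY', 'SF']})
counts = data['city'].value_counts()                       # category -> number of occurrences
data['city_count'] = data['city'].map(counts)              # count encoding
data['city_freq'] = data['city'].map(counts / len(data))   # frequency encoding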
Embedding-Based Encoding
In deep learning contexts, an embedding can be learned for nominal categories, effectively mapping them into a dense vector space. This is especially common in recommendation systems or any scenario with massive numbers of discrete categories.
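As a rough sketch of the idea, assuming PyTorch (not used elsewhere in this article), a nominal feature can be integer-indexed and passed through an embedding layer that is trained with the rest of the network:
import torch
import torch.nn as nn
num_categories = 10_000   # e.g., number of distinct product IDs (illustrative)
embedding_dim = 16        # size of the dense vector, a tunable hyperparameter
embedding = nn.Embedding(num_categories, embedding_dim)
# A batch of integer-encoded category IDs
category_ids = torch.tensor([3, 42, 9999])
dense_vectors = embedding(category_ids)  # shape (3, 16); values are learned during training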
Practical Implementation Example
Below is a small Python snippet illustrating how one might separately encode an ordinal feature (like a satisfaction rating) and a nominal feature (like an animal type), using pandas and scikit-learn:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example DataFrame
df = pd.DataFrame({
    'satisfaction': ['very unhappy', 'unhappy', 'neutral', 'happy', 'very happy', 'neutral'],
    'animal': ['dog', 'cat', 'cat', 'bird', 'dog', 'dog']
})

# Ordinal mapping for satisfaction (order-preserving integers)
ordinal_map = {
    'very unhappy': 0,
    'unhappy': 1,
    'neutral': 2,
    'happy': 3,
    'very happy': 4
}
df['satisfaction_encoded'] = df['satisfaction'].map(ordinal_map)

# One-hot encoding for the nominal feature
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False)
animal_encoded = encoder.fit_transform(df[['animal']])

# Create a DataFrame with the encoded columns
animal_encoded_df = pd.DataFrame(animal_encoded, columns=encoder.get_feature_names_out(['animal']))

# Combine the original df with the encoded columns
df_combined = pd.concat([df, animal_encoded_df], axis=1)
print(df_combined)
By doing it this way, you ensure that your ordinal data’s natural ordering is preserved (using integer mapping) and your nominal data is represented without imposing any ranking (one-hot vectors).
Why Not Simply Label-Encode Everything?
If you label-encode nominal categories, say, {dog=1, cat=2, bird=3}, the model might treat “bird” as being larger than “cat”, which is larger than “dog”, even though there is no true ordering in the domain. This artificial ordering can mislead many algorithms (especially linear models and distance-based methods like k-nearest neighbors). That’s why one-hot encoding or similar encoding is almost always preferred for purely nominal data.
Handling Large Cardinality in Nominal Data
When you have a nominal feature with a large number of possible categories, one-hot encoding might become impractical because it produces a very high-dimensional, sparse feature space. In such scenarios, you may turn to techniques like count encoding, frequency encoding, target encoding, or learned embeddings. These methods help reduce dimensionality but must be applied carefully to avoid overfitting or leakage of target information.
Follow-up Questions
How do you deal with ordinal features that are not strictly linear in spacing?
You can consider monotonic transformations or use more domain-specific knowledge about the spacing. If the ordinal scale wraps around (like times of day or other cyclical data), transforming it with sine/cosine can capture the cyclical structure. Another approach is to use a custom mapping based on domain knowledge to reflect meaningful distances between categories.
What if an ordinal feature has missing categories or new unseen categories at inference?
If you encounter categories that were not in your original ordinal mapping, you could either map them to a special “unknown” category or try to infer where they fit in the existing order. Ordinal encoders in libraries can be configured to handle unknown categories gracefully, though the default behavior often raises an error or ignores them.
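For instance, scikit-learn's OrdinalEncoder can be told the full ordered domain up front and map anything unseen to a sentinel value (a sketch; the size categories and the sentinel -1 are arbitrary choices):
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(
    categories=[['S', 'M', 'L', 'XL']],      # declare the full ordered domain explicitly
    handle_unknown='use_encoded_value',
    unknown_value=-1                         # sentinel for categories outside the domain
)
encoder.fit([['S'], ['M'], ['L']])           # training data happens to lack XL
encoder.transform([['XL'], ['XXL']])         # XL -> 3.0 (still in order), XXL (unseen) -> -1.0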
Why might we sometimes prefer embeddings for nominal data in deep learning?
For high-cardinality nominal features, learned embeddings capture relationships in a lower-dimensional space, helping deep networks discover more nuanced patterns. Embeddings are especially useful in recommendation systems, language models, and any context with massive discrete sets (e.g., millions of product IDs).
Are there any potential pitfalls to target encoding?
Target encoding can cause target leakage if used improperly. If you replace each category directly with the mean of the target over the entire training set, the model could memorize these means. The recommended approach is to use techniques like cross-fold averaging, smoothing, or adding random noise to mitigate leakage and overfitting.
What if you have prior domain knowledge about ordinal spacing?
If you have strong domain knowledge that “very happy” is exactly twice as positive as “happy”, you could encode them as 2 and 1, respectively. This approach makes sense only if the domain knowledge is robust. Otherwise, it is safer just to reflect the order without imposing a rigid spacing assumption.
By using the appropriate encoding for each type of categorical data, you preserve the most crucial information. For ordinal data, maintain the order. For nominal data, avoid imposing any artificial ordering. In practice, you choose your encoding method depending on data size, cardinality, domain knowledge, and the specific machine learning algorithm you plan to use.
Below are additional follow-up questions
How do you handle multi-label categorical data in a single column?
In some datasets, instead of having exactly one label (like “cat” or “dog”), you might have multiple categories attached to the same instance (for example, “Comedy” and “Action” as genres for the same movie). Handling this scenario requires a different approach than simply using one-hot encoding.
One option is to create a multi-hot representation, where each possible category is given a column, and a row can have multiple 1’s if the instance contains multiple categories. This leads to a sparse representation but preserves the multi-label nature. Another approach is to treat each category combination as a separate entity. However, if the number of combinations is large, this becomes intractable.
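A multi-hot representation can be produced with scikit-learn's MultiLabelBinarizer, for example (a sketch with made-up movie genres):
from sklearn.preprocessing import MultiLabelBinarizer
genres = [['Comedy', 'Action'], ['Drama'], ['Action', 'Drama', 'Thriller']]
mlb = MultiLabelBinarizer()
multi_hot = mlb.fit_transform(genres)
# mlb.classes_ -> ['Action', 'Comedy', 'Drama', 'Thriller']
# multi_hot[0] -> [1, 1, 0, 0]  (both Action and Comedy set for the first movie)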
A subtle pitfall is that multi-label data can artificially inflate the dimensionality, especially if the number of possible categories is large. Methods such as hashing or embedding-based approaches might be considered to handle the curse of dimensionality. Also, you need to be sure that your model can handle multi-label tasks (e.g., a multi-label classification framework) rather than a traditional single-label classifier.
What if some ordinal levels are missing in your training set but appear in the test set?
Ordinal data relies on a defined order among categories, but real-world data might be incomplete. Suppose you have an ordinal scale [1, 2, 3, 4, 5], but only [1, 3, 5] appear in your training set. At test time, you could see an example with level 2. A naive label encoder will not have a mapping for that value.
One way to handle this is to include placeholders for all possible levels in the domain, even if the training data lacks some levels. Alternatively, you can dynamically handle unseen categories by assigning them to “unknown.” A more nuanced approach might attempt to guess where the missing level fits in the ordering. However, this often requires domain knowledge or external data.
This scenario underscores the importance of consistent data collection and robust production pipelines. Before deployment, ensure your transformations are robust to any out-of-vocabulary or missing ordinal levels.
How do you interpret feature importance when using one-hot encoding for nominal data?
Tree-based models like Random Forests or Gradient Boosted Trees can sometimes dilute the importance of nominal features encoded via one-hot. Each split might consider only a single one-hot column at a time, making the “spread” of importance across multiple dummy variables tricky to interpret.
One strategy is to aggregate feature importance across all dummy variables that originated from the same nominal feature. You can sum their importances to get a holistic view. Another approach is to use permutation importance, which more reliably captures how the entire group of dummy variables impacts predictive performance.
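A rough sketch of the aggregation idea, assuming the dummy columns were named with a shared prefix such as animal_ (the names and numbers below are invented for illustration):
import pandas as pd
# Hypothetical feature names and importances from a fitted tree ensemble
feature_names = ['animal_dog', 'animal_cat', 'animal_bird', 'age']
importances = [0.10, 0.05, 0.02, 0.83]
imp = pd.Series(importances, index=feature_names)
# Group dummy columns back to their source feature (the prefix before the first underscore)
grouped = imp.groupby(imp.index.str.split('_').str[0]).sum()
# grouped -> age: 0.83, animal: 0.17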
A subtlety arises if the nominal feature has many levels. Some levels might appear infrequently, and a model may never or rarely split on those corresponding dummy variables. This can reduce the measured importance, even if those categories might matter in niche situations. Carefully inspect how the model is splitting on nominal features in real practice.
If you have hundreds of columns, each with high-cardinality nominal features, how can you reduce dimensionality effectively?
With a large number of categorical features, each having many categories, straightforward one-hot encoding can produce an enormous sparse matrix. This can lead to severe computational and memory burdens.
One approach is frequency or count encoding. Instead of creating many dummy variables, you replace each category with the frequency of that category in the dataset. Another is hashing-based methods, where you hash categories into a fixed number of “buckets,” controlling dimensionality.
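A sketch of the hashing approach using scikit-learn's FeatureHasher (the bucket count of 16 and the city values are illustrative):
from sklearn.feature_extraction import FeatureHasher
# Each row is a dict of {feature_name: category_value}
rows = [{'city': 'NY'}, {'city': 'SF'}, {'city': 'some-rare-city'}]
hasher = FeatureHasher(n_features=16, input_type='dict')
hashed = hasher.transform(rows)   # sparse matrix of shape (3, 16); collisions are possible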
In deep learning contexts, category embeddings are particularly useful. Each category is mapped into a learned embedding vector, significantly reducing dimensionality compared to one-hot. Overfitting can be mitigated by regularization techniques such as dropout on the embedding layer.
In all these methods, you must watch for potential loss of interpretability or introduction of collision errors (as in hashing). Proper cross-validation is essential to confirm the benefit of these encodings.
How do ensemble models deal with ordinal vs. nominal encodings differently?
Ensemble models, especially tree-based methods, can naturally handle many encoding schemes, but there are subtle differences. For nominal data, one-hot encoding is generally more reliable because trees can isolate categories by splitting on the corresponding dummy variable. However, decision trees might sometimes handle integer-coded nominal values suboptimally by imposing thresholds where none truly exist.
For ordinal data, if you have a numeric encoding that aligns with the order, tree splits can leverage that linear arrangement. On the flip side, if your ordinal data is improperly encoded (e.g., randomly assigned integers that do not reflect the real order), the tree might treat it like nominal data and not exploit the underlying progression.
In ensemble methods that rely on distance metrics (for instance, certain custom ensemble approaches or nearest-neighbor-based components), imposing ordinal relationships incorrectly could distort distance calculations. The key is to ensure your encoding accurately represents the relationships in the data.
What approaches would you recommend for encoding rare categories in nominal features?
Rare categories in nominal data pose a risk of fragmentation: you end up with many categories that appear only a few times. Standard one-hot encoding might create columns that have almost all zeros and very few ones.
One approach is to group all rare categories into an “other” bucket, which helps consolidate categories that might not carry enough data on their own. Another tactic is frequency or count encoding, which inherently collapses rare categories into a common range of low frequencies. Target encoding with regularization can also help, but it must be applied carefully to avoid overfitting.
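A simple pandas sketch of the “other” bucket idea (the rarity threshold of 2 occurrences is arbitrary for this toy example):
import pandas as pd
data = pd.DataFrame({'product_id': ['A', 'B', 'A', 'C', 'A', 'B', 'D']})
counts = data['product_id'].value_counts()
min_count = 2                                  # arbitrary rarity threshold
rare = counts[counts < min_count].index
# Keep frequent categories as-is, collapse the rest into 'other'
data['product_grouped'] = data['product_id'].where(~data['product_id'].isin(rare), 'other')
# A and B survive; C and D (each seen once) become 'other'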
A nuanced issue arises if some rare categories are semantically meaningful (for instance, certain medical codes or specialized product IDs). Simply lumping them into “other” could mask critical distinctions. In such cases, you might want to domain-engineer groupings to preserve important information while consolidating truly infrequent or unimportant categories.
If your dataset contains mixed ordinal and nominal variables, how do you approach automated feature engineering?
You might build a pipeline that first identifies which columns are ordinal vs. nominal. For example, you can maintain a dictionary that maps each ordinal column to its ranking, or you can rely on domain knowledge or metadata that flags columns as ordinal. Then, you apply a custom transformer that encodes ordinal columns numerically in an order-preserving way, while applying one-hot (or another suitable method) to nominal columns.
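A compact sketch of such a pipeline using scikit-learn's ColumnTransformer (the column names and category order reuse the earlier toy example and are otherwise assumptions):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
satisfaction_order = ['very unhappy', 'unhappy', 'neutral', 'happy', 'very happy']
preprocessor = ColumnTransformer([
    # Order-preserving integer codes for the ordinal column
    ('ordinal', OrdinalEncoder(categories=[satisfaction_order]), ['satisfaction']),
    # One-hot (no implied order) for the nominal column; ignore unseen animals at inference
    ('nominal', OneHotEncoder(handle_unknown='ignore'), ['animal']),
])
# X_encoded = preprocessor.fit_transform(df)   # df as in the earlier snippet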
An important consideration is whether to do feature generation (e.g., polynomial features) differently for ordinal vs. nominal data. For ordinal data, polynomial expansions might preserve meaningful interactions, while for nominal data, you might create interaction terms between categories. This can explode in dimensionality if handled naively. Automated feature engineering tools like featuretools can do this, but it’s crucial to apply domain logic so you don’t end up with an intractable feature set.
How do you prevent data leakage when encoding nominal variables using target encoding?
Data leakage can occur if you encode each category’s mean target (or other statistics) using the entire dataset, including the record you’re trying to predict. This artificially inflates performance because it includes information from the test record in the encoding.
A standard way to prevent this is to use out-of-fold strategies:
Split your training data into folds.
For each fold, compute the target statistics using only data from the other folds.
Replace the categories in the held-out fold with these statistics.
Combine the folds afterward.
Adding random noise or smoothing the target values is also common to reduce variance and avoid overfitting. Even after these precautions, you should carefully cross-validate to confirm you haven’t inadvertently leaked target data.
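A minimal sketch of the out-of-fold recipe above (smoothing and noise omitted for brevity; the columns and values are invented):
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
train = pd.DataFrame({
    'city':   ['NY', 'SF', 'NY', 'LA', 'SF', 'NY', 'LA', 'SF'],
    'target': [1, 0, 1, 0, 1, 0, 1, 1]
})
global_mean = train['target'].mean()
train['city_te'] = np.nan
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    # Category means computed only from the other folds
    fold_means = train.iloc[fit_idx].groupby('city')['target'].mean()
    # Encode the held-out fold; categories unseen in the other folds fall back to the global mean
    train.loc[train.index[enc_idx], 'city_te'] = (
        train.iloc[enc_idx]['city'].map(fold_means).fillna(global_mean).values
    )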