ML Interview Q Series: Explain what One-Hot Encoding and Label Encoding are, and discuss how each transformation impacts the dimensionality of the dataset.
Comprehensive Explanation
One-Hot Encoding and Label Encoding are commonly used techniques to handle categorical data in machine learning pipelines. Their primary purpose is to transform non-numeric data (categorical variables) into numeric representations so that many algorithms can process them effectively. Although both aim to encode categories as numbers, they differ in how they represent categories and in their impact on the dimensionality of the dataset.
One-Hot Encoding
One-Hot Encoding converts each categorical value into a vector with a 1 in the position corresponding to the category and 0 in all other positions. Suppose you have a feature, say Color, which can take values like Red, Green, or Blue. With One-Hot Encoding, we create three new features: Color_Red, Color_Green, and Color_Blue. For a sample whose color is Green, the encoded vector becomes [0, 1, 0].
This encoding is essential for nominal variables (categories without inherent ordinal relationships). By placing a single 1 among all zeros, it ensures that no numeric relationship (e.g., Green being "larger" or "smaller" than Red) is implied.
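As a quick illustration, here is a minimal sketch of the Color example using sklearn's OneHotEncoder (the sparse_output argument assumes scikit-learn 1.2 or newer):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Red'], ['Green'], ['Blue']])
encoder = OneHotEncoder(sparse_output=False)  # dense array for readability
encoded = encoder.fit_transform(colors)
print(encoder.categories_)  # columns ordered alphabetically: Blue, Green, Red
print(encoded[1])           # the 'Green' row: [0. 1. 0.]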
One-Hot Encoding typically increases the number of columns in the dataset by introducing new binary features. For a dataset with multiple categorical columns, the total number of new columns is

$$\text{New columns} = \sum_{k=1}^{d} \text{Cardinality}(X_k)$$

Here:
d is the total number of categorical features.
X_k is the k-th categorical feature.
Cardinality(X_k) is the number of unique categories in the k-th feature.
Thus, if the original dataset already had some numeric features, you still add extra columns for every categorical level. As a result, dimensionality usually increases when applying One-Hot Encoding, especially if any categorical feature has a large number of distinct values.
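As a quick sanity check of this formula, the sketch below counts the unique levels per categorical column with pandas and sums them (toy data made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green'],  # Cardinality = 3
    'Size': ['S', 'M', 'L', 'M'],                # Cardinality = 3
})
new_columns = sum(df[col].nunique() for col in ['Color', 'Size'])
print(new_columns)  # 6 binary columns after one-hot encoding (d = 2 features)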
Label Encoding
Label Encoding replaces each unique category with an integer code. If Color can be Red, Green, or Blue, you might assign:
Red -> 0
Green -> 1
Blue -> 2
This approach keeps the dimensionality the same because we are merely replacing the categorical feature with an integer-coded column. However, this representation imposes an ordering (0 < 1 < 2). While that might be acceptable for ordinal features (like Low, Medium, High), it can be misleading for nominal variables, because the numeric encoding could imply that Blue is somehow "greater" than Red.
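If you want control over which integer each category receives (encoders often default to alphabetical order), an explicit dictionary mapping is a simple option; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
mapping = {'Red': 0, 'Green': 1, 'Blue': 2}  # order chosen explicitly by us
df['Color'] = df['Color'].map(mapping)
print(df)  # one column in, one column out: dimensionality unchanged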
Label Encoding is computationally simpler and more compact, since the number of columns does not increase. However, for algorithms that rely on distances or numeric magnitudes (such as linear models or nearest-neighbor methods), the arbitrary integer codes can be problematic if the feature is purely nominal; even tree-based models, though generally more robust to integer codes, may need extra splits to separate categories that the encoding happens to place next to each other.
Effects on Dimensionality
One-Hot Encoding increases the number of columns because each unique category expands into its own binary column. In contrast, Label Encoding leaves the dimensionality unchanged: the same number of columns, just different numeric values. So the short answer is:
One-Hot Encoding: typically leads to a higher dimensional feature space.
Label Encoding: generally does not affect the number of features; it merely recodes the categorical values as integers.
Implementation in Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Example DataFrame
data = {
'Color': ['Red', 'Green', 'Blue', 'Green'],
'Size': ['S', 'M', 'L', 'M']
}
df = pd.DataFrame(data)
# One-Hot Encoding with pandas.get_dummies (simple approach)
df_one_hot = pd.get_dummies(df, columns=['Color', 'Size'])
# Label Encoding with sklearn
# (fit_transform refits the encoder on each column, so a single instance can be
# reused here; keep one encoder per column if you need inverse_transform later)
label_encoder = LabelEncoder()
df_label_encoded = df.copy()
df_label_encoded['Color'] = label_encoder.fit_transform(df_label_encoded['Color'])
df_label_encoded['Size'] = label_encoder.fit_transform(df_label_encoded['Size'])
print("Original DataFrame:")
print(df)
print("\nOne-Hot Encoded DataFrame:")
print(df_one_hot)
print("\nLabel Encoded DataFrame:")
print(df_label_encoded)
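One detail worth noting in the output: sklearn's LabelEncoder assigns codes in sorted order of the category strings, so here Blue -> 0, Green -> 1, Red -> 2 (and L -> 0, M -> 1, S -> 2), not in order of appearance.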
Common Pitfalls and Best Practices
High Cardinality: One-Hot Encoding can vastly increase the number of features if a categorical variable has many unique levels. This can lead to the "curse of dimensionality."
Ordinal vs. Nominal: Label Encoding can be misleading for purely nominal features, as it artificially imposes an ordering. However, if a feature is truly ordinal, label encoding can capture the ordinal relationship efficiently.
Memory Usage: With One-Hot Encoding, as the number of categories grows, so does the memory footprint of the dataset.
Sparse Representations: Often one-hot encoded features are sparse (mostly zeros). Efficient data structures (like sparse arrays) or alternative encodings (like target encoding) can help; see the sketch after this list.
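A minimal sketch of the sparsity point: sklearn's OneHotEncoder returns a SciPy sparse matrix by default, which stores only the nonzero entries:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
encoded = OneHotEncoder().fit_transform(colors)  # sparse output by default
print(type(encoded))      # a SciPy sparse matrix, storing only the 1s
print(encoded.toarray())  # densify only for inspection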
Follow-Up Questions
How would you handle a categorical feature with a very large number of unique categories?
When a categorical feature has thousands (or more) unique categories, one-hot encoding can blow up the feature space. In such cases, techniques like target encoding or the hashing trick can keep dimensionality manageable. Target encoding replaces a category with an aggregated statistic of the target (e.g., the mean target value for rows with that category). Hash encoding uses a hashing function to assign categories to a fixed number of buckets, capping the dimensional explosion.
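For illustration, a minimal sketch of mean target encoding with pandas (the column names are made up; a production version would add smoothing and fit the means on training folds only, to avoid target leakage):

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
                   'target': [1, 0, 1, 0, 1]})
city_means = df.groupby('city')['target'].mean()  # mean target per category
df['city_encoded'] = df['city'].map(city_means)
print(df)  # one numeric column, regardless of how many cities exist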
What is a practical concern of using Label Encoding for nominal data in certain algorithms?
Label Encoding can be problematic if your algorithm relies on distance measures (e.g., nearest-neighbor methods) or treats features as numeric magnitudes (e.g., linear models). The encoded integers imply a numeric order that does not exist in reality, so the model may treat "2" as greater than "1," even though these are just arbitrary category labels.
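A tiny numeric illustration: under label encoding, the distances between categories are artifacts of the arbitrary codes, while one-hot vectors keep every pair of distinct categories equidistant:

import numpy as np

# Label-encoded (Red=0, Green=1, Blue=2): distances depend on arbitrary codes
print(abs(0 - 1), abs(0 - 2))  # Red-Green: 1, Red-Blue: 2

# One-hot: every pair of distinct categories is the same distance apart
red, green, blue = np.eye(3)
print(np.linalg.norm(red - green), np.linalg.norm(red - blue))  # both ~1.414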
Under what circumstances is Label Encoding sufficient or even beneficial?
Label Encoding is sufficient when the categorical variable is ordinal (e.g., education level “High School” < “Undergraduate” < “Master’s” < “PhD”). It can also be beneficial when the algorithm is inherently robust to integer-coded categories, such as certain tree-based models that can learn splits correctly (though care must still be taken if the category is nominal).
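When the order is known, it is worth making it explicit rather than relying on alphabetical defaults; a sketch using sklearn's OrdinalEncoder with the education example:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'education': ['PhD', 'High School', "Master's"]})
order = [['High School', 'Undergraduate', "Master's", 'PhD']]  # intended ranking
encoder = OrdinalEncoder(categories=order)
df['education_code'] = encoder.fit_transform(df[['education']]).ravel()
print(df)  # High School -> 0.0, Master's -> 2.0, PhD -> 3.0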
Can we use both One-Hot and Label Encoding together on different features?
Yes. You might apply One-Hot Encoding to nominal features to avoid imposing an artificial ordering, while using Label Encoding for naturally ordinal features. This mixed strategy is common in real-world data processing, where some categorical variables are ordinal and others are purely nominal.
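sklearn's ColumnTransformer is a convenient way to route different columns to different encoders in one step; a minimal sketch with the toy data from above (assuming Size is ordinal S < M < L, and scikit-learn 1.2+ for sparse_output):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green'],
                   'Size': ['S', 'M', 'L', 'M']})
preprocess = ColumnTransformer([
    ('nominal', OneHotEncoder(sparse_output=False), ['Color']),          # no order implied
    ('ordinal', OrdinalEncoder(categories=[['S', 'M', 'L']]), ['Size']), # order preserved
])
print(preprocess.fit_transform(df))  # 3 one-hot columns + 1 ordinal column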
What alternatives exist if we want to reduce dimensionality after One-Hot Encoding?
Dimensionality reduction methods such as PCA, autoencoders, or feature selection can be applied to shrink the expanded feature set. Alternatively, embedding layers (often used in deep learning) can map high-cardinality categories into lower-dimensional continuous vectors. This approach is popular in recommendation systems and other large-scale settings because it balances representational capacity against dimensional efficiency.
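As a sketch of the embedding idea (assuming PyTorch is available; the sizes here are made up), integer-coded categories are mapped to dense vectors that are trained jointly with the rest of the model:

import torch
import torch.nn as nn

num_categories, embedding_dim = 10000, 16   # e.g., 10k product IDs -> 16-d vectors
embedding = nn.Embedding(num_categories, embedding_dim)

category_ids = torch.tensor([3, 42, 9999])  # integer-coded categories
vectors = embedding(category_ids)           # trainable lookup, shape (3, 16)
print(vectors.shape)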