ML Interview Q Series: How do Gini Impurity and Entropy differ when constructing Decision Trees?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
One of the central steps in building decision trees involves choosing the best feature (or split) at each node. In classification trees, criteria such as Gini Impurity and Entropy are commonly used to measure the homogeneity of the classes in each potential split. Although they both serve the same purpose—guiding the selection of splits—they have differences in how they measure impurity.
Gini Impurity
Gini Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if we randomly assign a label according to the distribution of labels in the subset. If p_i represents the proportion of class i in a particular node among C classes, the Gini Impurity can be calculated with the following formula:

Gini = 1 - \sum_{i=1}^{C} p_i^2

where p_i is the fraction of samples belonging to class i. The expression 1 - \sum_i p_i^2 measures how mixed the classes are in the node. If all instances in a node belong to the same class, the Gini Impurity becomes 0.
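To make the formula concrete, here is a minimal NumPy sketch; the function name gini_impurity and the toy label arrays are purely illustrative.

import numpy as np

def gini_impurity(labels):
    # Proportion of each class present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini = 1 - sum(p_i^2)
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(np.array([0, 0, 1, 1])))  # maximally mixed binary node -> 0.5
print(gini_impurity(np.array([1, 1, 1, 1])))  # pure node -> 0.0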
Entropy
Entropy measures the level of uncertainty or unpredictability in the distribution of classes. Its popular form in decision trees is the Shannon Entropy. For a node that contains a mixture of classes, the Entropy is calculated as:

Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)

where p_i is the fraction of samples in class i. If the distribution at a node is perfectly homogeneous (all samples belong to one class), Entropy becomes 0, indicating no randomness or uncertainty.
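A matching sketch for Entropy, again with an illustrative function name and toy labels. Note that classes absent from the node never appear in the counts, so the logarithm is always well defined here.

import numpy as np

def shannon_entropy(labels):
    # Proportion of each class present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Entropy = -sum(p_i * log2(p_i))
    return -np.sum(p * np.log2(p))

print(shannon_entropy(np.array([0, 0, 1, 1])))  # maximally mixed binary node -> 1.0
print(shannon_entropy(np.array([1, 1, 1, 1])))  # pure node -> 0 (may display as -0.0)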
Comparison
Gini Impurity and Entropy often produce similar outcomes in terms of which splits they consider optimal. However, there are subtle differences:
Computational Complexity: Entropy uses a logarithmic term, which can be slightly more computationally expensive to calculate than Gini Impurity. In modern computing environments, this difference is usually negligible, but in extremely large-scale scenarios, Gini might be marginally faster.
Penalizing Class Imbalance: Because of the logarithm, Entropy places relatively more weight on low-probability classes, so it is generally thought to penalize highly imbalanced class distributions more strongly. That can lead to deeper splits in certain cases. Gini also penalizes imbalance but is often considered to favor splits that quickly isolate the most frequent class into its own partition.
Practical Performance: In practice, Gini Impurity and Entropy-based splits often result in very similar decision trees, both in structure and accuracy. Some frameworks use Gini by default (e.g., scikit-learn’s DecisionTreeClassifier defaults to criterion='gini'), since it tends to be a bit faster in calculation.
Practical Implementation in Python
Below is a minimal example using scikit-learn’s DecisionTreeClassifier. You can change the criterion to either "gini" or "entropy" to see how it affects the tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load a sample dataset
iris = load_iris()
X = iris.data
y = iris.target
# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using Gini
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_gini.fit(X_train, y_train)
# Using Entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
clf_entropy.fit(X_train, y_train)
# Evaluate
pred_gini = clf_gini.predict(X_test)
pred_entropy = clf_entropy.predict(X_test)
print("Gini Accuracy:", accuracy_score(y_test, pred_gini))
print("Entropy Accuracy:", accuracy_score(y_test, pred_entropy))
Often, the results will be similar, but slight variations can occur due to the difference in how each criterion evaluates impurity.
What are some typical follow-up questions?
How do we decide which criterion to use in practice?
In most real-world scenarios, the choice between Gini and Entropy rarely leads to drastically different performance. Some rules of thumb are:
If your dataset is huge: Gini might be marginally faster to compute because it avoids calculating logarithms.
If you care about interpretability in terms of information gain: Entropy aligns directly with the concept of information theory, which sometimes appeals to those who want to quantify splits via an information-gain perspective.
Try both: Empirically evaluating both on your dataset is often the most reliable way to decide.
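For the "try both" option, a small cross-validated comparison is usually enough. The sketch below assumes scikit-learn and reuses the Iris data from the earlier example.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare the two criteria with 5-fold cross-validation
for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)
    print(criterion, "mean CV accuracy:", scores.mean())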
Can the choice of Gini or Entropy lead to different depths in the tree?
Yes. Gini and Entropy differ slightly in how they penalize mixed class distributions. Entropy might lead to deeper branches in certain edge cases. However, the difference is often minor, and other hyperparameters such as max_depth, min_samples_split, or pruning methods will have a more significant impact on the final depth of the tree.
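If you want to check this on your own data, you can grow both trees without a depth limit and compare their sizes directly. The sketch below assumes scikit-learn and the Iris data from above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow each tree fully and report its depth and leaf count
for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42).fit(X, y)
    print(criterion, "depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())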
Why do some decision trees default to Gini Impurity rather than Entropy?
The primary reasons include:
Computational speed: Though marginal, Gini is faster since it avoids the log operation.
Similar results: Gini and Entropy are functionally very similar in how they pick splits. Thus, many implementations choose Gini as the default for simplicity.
Historical choice: Early CART (Classification and Regression Trees) implementations adopted Gini Impurity, making it a convention that spread.
What happens if a class has probability 0 with Entropy?
If p_i = 0 for some class i, the term p_i log_2(p_i) is conventionally treated as 0. In practical implementations, the code often checks if p_i is zero before taking the logarithm. This ensures numerical stability and avoids log(0).
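One common way to implement that check is to mask out zero-probability entries before taking the logarithm. A minimal sketch, assuming the node's class distribution is available as a plain NumPy probability vector:

import numpy as np

def safe_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability classes: 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(safe_entropy([0.5, 0.5, 0.0]))  # 1.0, same as the two-class case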
Could we use other metrics, such as Misclassification Error, in decision trees?
Yes. In principle, a decision tree can use other metrics, such as Misclassification Error, which is 1 - max(p_i). However, Gini Impurity and Entropy are favored because they are more sensitive to changes in the class distribution: Misclassification Error can stay nearly unchanged even when a split makes the child nodes noticeably purer, which makes it a less informative criterion during training.
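To see the difference in sensitivity, the sketch below (with illustrative helper names) evaluates all three criteria on a binary node as the majority-class proportion shifts.

import numpy as np

def binary_gini(p):
    return 1 - (p**2 + (1 - p)**2)

def binary_entropy(p):
    return -(p*np.log2(p) + (1 - p)*np.log2(1 - p)) if 0 < p < 1 else 0.0

def binary_misclass(p):
    return 1 - max(p, 1 - p)

# Misclassification error falls linearly, while Gini and Entropy are strictly concave
for p in [0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"p={p:.1f}  gini={binary_gini(p):.3f}  "
          f"entropy={binary_entropy(p):.3f}  misclass={binary_misclass(p):.3f}")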
Could these criteria be extended to regression problems?
For regression trees, different metrics are used, most commonly Mean Squared Error (MSE) or Mean Absolute Error (MAE). While there are analogies in concept, Gini Impurity and Entropy specifically measure class distribution purity and are thus not applied in typical regression trees.
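For completeness, a regression-tree counterpart in scikit-learn simply swaps the criterion. A minimal sketch using the diabetes dataset; the criterion names ("squared_error", "absolute_error") are those used in recent scikit-learn versions.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "squared_error" (MSE) or "absolute_error" (MAE) play the role that
# "gini" and "entropy" play for classification trees
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=42)
reg.fit(X_train, y_train)
print("R^2 on held-out split:", reg.score(X_test, y_test))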
How does tree pruning interact with the choice of Gini or Entropy?
Tree pruning (or applying min_samples_split, min_samples_leaf, or max_depth) aims to reduce overfitting by controlling the complexity of the tree. Regardless of whether you use Gini or Entropy, these regularization parameters will have a larger effect on the final model’s complexity than the choice of impurity metric itself. Both Gini and Entropy can lead to overfitting if the tree is allowed to grow unchecked, so pruning is crucial in practice.
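As a sketch of how pruning layers on top of either criterion, scikit-learn's cost-complexity pruning (the ccp_alpha parameter) can be applied identically to both; the alpha value below is arbitrary and chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for criterion in ["gini", "entropy"]:
    # ccp_alpha > 0 prunes subtrees whose complexity is not worth their impurity reduction
    clf = DecisionTreeClassifier(criterion=criterion, ccp_alpha=0.01, random_state=42)
    clf.fit(X_train, y_train)
    print(criterion, "depth:", clf.get_depth(), "test accuracy:", clf.score(X_test, y_test))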
Are there any boundary cases or numerical stability concerns?
Yes:
Very small datasets: If a split results in very few samples in a child node, estimates of p_i can be unreliable, and numerical computations can become noisy.
Extremely large datasets: For Entropy, repeated logarithm calculations can sometimes introduce floating-point precision issues if not implemented carefully. Libraries like scikit-learn handle this internally, so it typically is not a problem in everyday use.
These considerations typically do not prevent you from using either criterion but are worth keeping in mind for industrial-scale applications or extremely skewed class distributions.