ML Interview Q Series: How is Entropy employed when building Decision Trees?
Comprehensive Explanation
Entropy, in the context of decision trees, measures how pure or impure a dataset is with respect to the classification labels it contains. A perfectly pure dataset (all instances belonging to the same class) has zero entropy, whereas a dataset whose classes are mixed in equal proportions has maximal entropy. Decision tree algorithms commonly use entropy, or a related metric such as the Gini Index, to evaluate how well a particular feature splits the dataset into purer subsets.
The entropy of a set S with n distinct classes can be expressed with the following core formula:

E(S) = -Σ_{i=1}^{n} p_i log_2(p_i)
In this formula:
S is the set of samples.
n is the number of distinct classes in S.
p_i is the fraction of items in S that belong to class i.
The logarithm is taken base 2 (hence the term “log_2”), reflecting an information-theoretic perspective of “bits” of information.
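For example, if a node contains 2 samples of class "cat" and 3 of class "dog", then p_cat = 0.4 and p_dog = 0.6, so E(S) = -(0.4 log_2 0.4 + 0.6 log_2 0.6) ≈ 0.971 bits. A pure node has E(S) = 0, while a perfect 50/50 two-class mix has E(S) = 1 bit, the maximum possible for two classes.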
When using entropy within a decision tree, the goal is to choose the split that achieves the largest reduction in impurity (i.e., the largest decrease in entropy). This reduction is known as information gain. Information gain for a split on feature A is measured by comparing the original entropy of the dataset E(S) to the weighted average entropy of the subsets created by splitting on A:

IG(S, A) = E(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) E(S_v)
Here:
IG(S, A) is the information gain obtained by splitting on feature A.
Values(A) is the set of all possible values that feature A can take in the dataset.
S_v is the subset of S where feature A has value v.
|S_v| is the number of samples in subset S_v, and |S| is the total number of samples in the original set.
E(S_v) is the entropy of subset S_v.
The decision tree construction algorithm will choose the feature (and corresponding splitting threshold for continuous attributes) that yields the highest information gain for the current node. After performing this split, the algorithm proceeds recursively on each child, eventually creating a tree structure that attempts to maximize purity at every level.
This approach leads to a tree that uses entropy in a greedy manner: at each step, it only considers how well a feature splits the dataset at that node without accounting for future splits. Despite its greedy nature, using entropy and information gain is an effective heuristic, and it forms the basis of well-known decision tree algorithms such as ID3, C4.5, and their extensions.
Below is a short Python snippet demonstrating how one might compute entropy for a single set of class labels:
import math
from collections import Counter

def compute_entropy(labels):
    """
    labels: a list of class labels
    returns: computed entropy as a float
    """
    label_counts = Counter(labels)
    total_count = len(labels)
    entropy = 0.0
    for count in label_counts.values():
        p = count / total_count
        entropy -= p * math.log2(p)
    return entropy

# Example usage:
labels_example = ["cat", "cat", "dog", "dog", "dog"]
print(compute_entropy(labels_example))
In this code:
We count the occurrences of each label.
We compute the proportion p of each label.
We sum over -p * log2(p) to get the total entropy.
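Building on compute_entropy, the sketch below computes the information gain of splitting on a categorical feature. The function name information_gain and its interface are illustrative (not taken from any particular library); it assumes feature values and labels are aligned, one per sample.

def information_gain(feature_values, labels):
    """
    feature_values: list of feature values, one per sample
    labels: list of class labels, aligned with feature_values
    returns: information gain IG(S, A) of splitting on this feature
    """
    total = len(labels)
    parent_entropy = compute_entropy(labels)

    # Group the labels of the samples by their feature value
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)

    # Weighted average entropy of the child subsets
    weighted_child_entropy = sum(
        (len(subset) / total) * compute_entropy(subset)
        for subset in subsets.values()
    )
    return parent_entropy - weighted_child_entropy

# Example usage: this feature separates the classes perfectly,
# so the gain equals the parent entropy (~0.971)
feature_example = ["small", "small", "large", "large", "large"]
print(information_gain(feature_example, labels_example))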
Why is the logarithm taken base 2?
Using a base-2 logarithm in the entropy formula aligns naturally with measuring information in bits. Entropy in information theory is typically expressed in bits when the log base is 2, which makes interpretation intuitive: more bits of entropy mean more uncertainty. Any logarithm base could be used (the entropy would simply be rescaled by a constant factor), but base 2 is the conventional choice.
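As a quick check (reusing compute_entropy and labels_example from the snippet above), changing the base only rescales the result by a constant factor:

import math

bits = compute_entropy(labels_example)  # ~0.971 bits with log base 2
nats = bits * math.log(2)               # the same uncertainty expressed in nats, ~0.673
print(bits, nats)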
How does Entropy compare to Gini Index?
Some decision tree implementations (like CART) use the Gini Index instead of Entropy. Both metrics measure impurity, but they have slightly different properties:
Entropy tends to penalize highly impure (evenly mixed) distributions somewhat more heavily.
The Gini Index is slightly simpler to compute (no logarithms) and is often marginally faster in practice.
Despite these differences, both measures usually yield similar performance. In practice, the choice is more a matter of convention and slight computational efficiency differences.
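For comparison, here is a minimal sketch of the Gini Index, Gini(S) = 1 - Σ_i p_i^2, written in the same style as compute_entropy above; the function name compute_gini is illustrative:

from collections import Counter

def compute_gini(labels):
    """
    labels: a list of class labels
    returns: Gini impurity as a float
    """
    label_counts = Counter(labels)
    total_count = len(labels)
    return 1.0 - sum((count / total_count) ** 2 for count in label_counts.values())

# Example usage: ~0.48 for the 2-cat / 3-dog example
# (note the different scale: Gini maxes out at 0.5 for two classes, entropy at 1 bit)
print(compute_gini(labels_example))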
What are potential edge cases when using Entropy in Decision Trees?
One edge case arises when a feature has many possible values (particularly in datasets with high cardinality or many splits). In such scenarios, the feature might artificially yield high information gain just because it happens to isolate individual data points, leading to overfitting. Techniques like pruning or applying a minimum sample split threshold help mitigate this issue.
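To see this effect concretely, the short sketch below reuses the illustrative information_gain function from earlier and gives every sample a unique, ID-like feature value. The split isolates each point into its own pure subset, so the reported gain equals the full parent entropy even though the feature has no real predictive value:

# A feature with one distinct value per sample (e.g., a row ID)
id_like_feature = ["id_0", "id_1", "id_2", "id_3", "id_4"]
print(information_gain(id_like_feature, labels_example))  # ~0.971, same as the parent entropy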
Another edge case is when the class distribution is extremely skewed. If most examples belong to one class, the parent entropy is already low, so even good splits yield only small information gains. In such cases, additional techniques such as class weights or balanced sampling might be necessary to guide the split decisions more effectively.
Can Entropy be used for regression trees?
Entropy is primarily used for classification tasks. In regression trees, we often rely on measures like mean squared error or mean absolute error to measure the “impurity” of a node (i.e., the variance of numerical targets). Entropy, being a measure of categorical uncertainty, does not naturally extend to continuous targets without discretization.
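As a rough analogue for regression, a node's impurity can be measured by the variance of its targets (the mean squared error around the node mean); the sketch below is illustrative and not taken from any particular library:

def mse_impurity(targets):
    """
    targets: a list of numeric target values
    returns: mean squared error around the node mean (i.e., the variance)
    """
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

# A regression split is scored by the weighted reduction in this impurity,
# playing the same role that information gain plays for classification.
print(mse_impurity([1.0, 1.5, 5.0, 5.5]))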
How can overfitting be controlled in a tree built with Entropy?
Decision trees can overfit by growing too deep, perfectly classifying every training point. To mitigate overfitting:
Pruning (either pre-pruning or post-pruning) limits depth or removes branches that do not contribute significantly to predictive power.
Minimum-samples-per-split or per-leaf constraints prevent splits on subsets too small to yield reliable information gain.
Limiting the maximum depth sets a cap on how many splits can be made.
These methods help the model generalize better to new, unseen data.
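As an illustration, scikit-learn's DecisionTreeClassifier exposes entropy-based splitting together with these controls; the specific values below are arbitrary examples, not recommendations:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Entropy-based splits with several overfitting controls:
# - max_depth caps how many successive splits are allowed
# - min_samples_split / min_samples_leaf block splits on tiny subsets
# - ccp_alpha enables cost-complexity (post-)pruning
clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    ccp_alpha=0.01,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())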