ML Interview Q Series: Under what conditions would you favor Hierarchical Clustering over k-Means?
Comprehensive Explanation
Hierarchical clustering and k-means are distinct approaches to partitioning data into clusters, and there are scenarios where hierarchical clustering is preferred over k-means. The reasoning behind such a choice is detailed below.
Hierarchical Clustering does not require pre-specifying the number of clusters. Instead, it constructs a dendrogram that represents the nested grouping of data points, allowing inspection at different levels of granularity. You can choose where to “cut” this dendrogram to decide how many clusters you want. In contrast, k-means requires that you explicitly define k in advance, and this can be challenging when you lack strong prior knowledge about the appropriate number of clusters in your dataset.
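As a minimal sketch (assuming SciPy is available and that X is a small NumPy feature matrix), the merge tree is built once and can then be cut at several levels without rerunning anything:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # toy feature matrix; substitute your own data

Z = linkage(X, method="average")      # build the full merge tree once

# Cut the same tree at different levels after the fact, without refitting.
labels_3 = fcluster(Z, t=3, criterion="maxclust")   # at most 3 clusters
labels_5 = fcluster(Z, t=5, criterion="maxclust")   # at most 5 clusters

# dendrogram(Z) plots the tree for visual inspection (requires matplotlib).
```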
Another benefit of hierarchical clustering is that it can capture non-spherical clusters better than k-means, whose objective implicitly assumes clusters are roughly spherical and separated primarily by their means. k-means tends to work best when clusters are compact and well separated in a Euclidean sense. When the actual distribution deviates significantly from spherical shapes (e.g., elongated, nested, or arbitrarily shaped clusters), hierarchical clustering with a suitable linkage can often yield more intuitive groupings: single linkage follows chained, elongated structures, while average and complete linkage favor more compact groupings.
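A rough illustration of this point, assuming scikit-learn is installed: on the two-moons toy data, whose clusters are elongated and non-convex, single-linkage agglomerative clustering typically recovers the true grouping while k-means splits each moon.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hc = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

print("k-means ARI:     ", adjusted_rand_score(y, km.labels_))
print("hierarchical ARI:", adjusted_rand_score(y, hc.labels_))
```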
Hierarchical clustering is also beneficial for smaller or moderately sized datasets because you can visually inspect the dendrogram to see how clusters form, merge, and relate to each other. For large datasets, however, hierarchical clustering can become prohibitively expensive: standard agglomerative algorithms need O(n^2) memory for the pairwise distance matrix and roughly O(n^2) to O(n^3) time in the number of data points.
k-means is popular because of its computational efficiency on large datasets, but if interpretability is important, or you need a more flexible approach to reveal intrinsic data patterns at various scales, hierarchical clustering often provides deeper insight.
The core objective of k-means is typically given by minimizing the sum of squared distances of points to their respective cluster centroids:

$$J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

Here, k is the specified number of clusters, x is a data point belonging to cluster C_j, and mu_j is the centroid of cluster C_j. Hierarchical clustering, on the other hand, relies on a linkage criterion to iteratively merge or split clusters, and does not require an explicit centroid-based objective.
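As a small sketch in plain NumPy (the function name kmeans_objective is just illustrative), the objective above can be computed directly for any assignment and set of centroids:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    # Within-cluster sum of squares: sum over j of ||x - mu_j||^2
    # for every point x assigned to cluster j.
    return sum(
        np.sum((X[labels == j] - mu) ** 2)
        for j, mu in enumerate(centroids)
    )
```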
When data has many outliers, k-means can be sensitive to them because the mean is not a robust statistic, so a few extreme points can drag a centroid away from the bulk of its cluster; in very high-dimensional spaces, Euclidean distances also become less informative, which affects any distance-based method. Hierarchical clustering can be more flexible because you can choose a linkage method that is less sensitive to these issues (for example, complete linkage can be more robust to outliers than a mean-based approach).
Overall, you might choose hierarchical clustering when:
You do not want to specify the exact number of clusters in advance.
You care about understanding the hierarchical relationships among clusters.
You have smaller or moderately sized datasets and want to visualize the structure using a dendrogram.
Your dataset might contain non-spherical clusters that k-means could fail to accurately partition.
You need an approach that is more robust to outliers (depending on the chosen linkage method).
What kind of linkage criteria exist in Hierarchical Clustering, and how do they differ?
Common linkage methods include single linkage, complete linkage, and average linkage. Single linkage measures the distance between two clusters as the minimum distance between any point in one cluster and any point in the other cluster. Complete linkage uses the maximum distance between any point in one cluster and any point in the other. Average linkage takes the mean distance across all pairs of points in the two clusters. These different linkage criteria can produce significantly different cluster shapes in the dendrogram.
Single linkage can lead to long “chains” of points, especially if there are points in the dataset that act as a bridge between clusters. Complete linkage tends to produce more spherical clusters and can handle outliers better. Average linkage is a compromise, often yielding balanced clustering outcomes.
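A quick sketch, assuming SciPy, of how the three criteria can assign the same points differently when the resulting trees are cut into three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))          # toy data for illustration

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # build the tree with this linkage
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut it into 3 clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])
```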
How do you determine the number of clusters in Hierarchical Clustering?
You typically observe the dendrogram and look for meaningful “cuts.” You might see a large jump in the vertical distance at a certain level, indicating a natural separation. Alternatively, you can use metrics such as the silhouette score or the cophenetic correlation coefficient to evaluate cluster quality at different cuts. This approach is more exploratory than in k-means, where you either rely on prior knowledge or test different values of k.
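A minimal sketch, assuming SciPy and scikit-learn, of scoring several cuts of one dendrogram with the silhouette score and computing the cophenetic correlation of the tree itself:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))         # toy data for illustration

Z = linkage(X, method="average")

# How faithfully the dendrogram preserves the original pairwise distances.
coph_corr, _ = cophenet(Z, pdist(X))
print("cophenetic correlation:", coph_corr)

# Silhouette score for several candidate cuts of the same tree.
for k in (2, 3, 4, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters -> silhouette:", silhouette_score(X, labels))
```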
How do you handle large datasets when Hierarchical Clustering might be expensive?
You can use approximate or hybrid approaches. For example, you could apply a faster method like mini-batch k-means or standard k-means to partition the data into smaller groups first, and then run hierarchical clustering on the centroids of those groups. This hierarchical “summary” method can significantly reduce computational costs while preserving some structure of the data.
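A rough sketch of this hybrid idea, assuming scikit-learn and SciPy; the number of intermediate centroids (200) and the final cut (5 clusters) are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(100_000, 5))              # "large" toy dataset

# Step 1: compress the data into a few hundred representative centroids.
mbk = MiniBatchKMeans(n_clusters=200, n_init=3, random_state=0).fit(X)
centroids = mbk.cluster_centers_

# Step 2: hierarchical clustering on the centroids only (cheap: 200 points).
Z = linkage(centroids, method="average")
centroid_labels = fcluster(Z, t=5, criterion="maxclust")

# Step 3: map every original point to the cluster of its nearest centroid.
point_labels = centroid_labels[mbk.labels_]
```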
Does Hierarchical Clustering require feature scaling?
Distance-based clustering methods (like hierarchical clustering) usually benefit from feature scaling, especially if the features in your dataset have different ranges or units. Standard scaling (subtract mean, divide by standard deviation) or min-max scaling often helps prevent features with larger numeric ranges from dominating the distance metric. In hierarchical clustering, if one feature is on a much larger scale than another, the cluster assignments will be disproportionately influenced by that feature’s variation.
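A short sketch, assuming scikit-learn, of standardizing features before agglomerative clustering so the large-scale feature does not dominate the Euclidean distances:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
X = np.column_stack([
    rng.normal(0, 1, 300),        # feature on a small scale
    rng.normal(0, 1000, 300),     # feature on a much larger scale
])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
```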
How do outliers affect Hierarchical Clustering?
Outliers can affect the distance calculations between clusters, and certain linkage criteria are more sensitive than others. For instance, average linkage can dilute the effect of a single outlier, while single linkage might create an undesired chaining effect through outliers. In practice, it is useful to detect and handle outliers (e.g., by trimming or capping) before clustering, or to use a linkage criterion that reduces the influence of extreme values.
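A small sketch of the capping idea in plain NumPy; the percentile bounds are illustrative, not prescriptive:

```python
import numpy as np

def cap_outliers(X, lower=1.0, upper=99.0):
    # Clip each feature to its [lower, upper] percentile range before clustering,
    # so that extreme values no longer dominate the pairwise distances.
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)
```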
What if we only care about spherical clusters?
If your data distribution and problem context suggest that clusters are roughly spherical, or if you have a large dataset where interpretability of a dendrogram is less critical, k-means can be more efficient and straightforward. Hierarchical clustering might be unnecessary in these cases. You should match the method to the geometry and scale of your problem to ensure both accuracy and computational feasibility.
How might you validate or compare the results of Hierarchical Clustering vs. k-means?
Common cluster validation approaches include:
Silhouette scores, which measure how similar points are within the same cluster compared to points in other clusters.
Purity or Adjusted Rand Index (ARI), if you have ground-truth labels.
Comparison of cluster centroids and size distributions to domain knowledge.
Because hierarchical clustering can be more exploratory, you might also visualize the dendrogram and see whether the splits reflect natural divisions in the data. If ground truth labels are available, you can compute external metrics (like accuracy or ARI) for both k-means and hierarchical clustering and compare which approach yields clusters that align best with known labels.
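A minimal sketch, assuming scikit-learn, of comparing both algorithms on the same data with an internal metric (silhouette) and an external one (adjusted Rand index) when ground-truth labels exist:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=4, linkage="average").fit_predict(X)

for name, labels in [("k-means", km_labels), ("hierarchical", hc_labels)]:
    print(name,
          "silhouette:", round(silhouette_score(X, labels), 3),
          "ARI:", round(adjusted_rand_score(y_true, labels), 3))
```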