ML Interview Q Series: Why is the distance approach used by k-Medoids often deemed superior to that of k-Means?
Comprehensive Explanation
k-Means and k-Medoids are both clustering methods that aim to partition a dataset into a predefined number of groups. While they share some conceptual similarities (like assigning data points to clusters and then updating cluster representatives), there is a fundamental difference in how they choose those representatives and how distances are calculated and minimized. k-Means seeks to minimize the sum of squared Euclidean distances to a centroid, whereas k-Medoids can optimize for a more general distance measure to a medoid (an actual data point within the cluster).
Sensitivity to Outliers
One notable benefit of k-Medoids is that it is more resilient against outliers. k-Means relies on the mean of the points within a cluster, so a single outlier can significantly shift the centroid. In contrast, k-Medoids selects an actual data point (the medoid) to represent each cluster. Because the representative must be an existing observation and the objective sums plain (unsquared) distances, extreme values exert far less pull on it, and the overall clustering arrangement tends to be more robust.
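To make this concrete, here is a small illustrative sketch (a toy one-dimensional example, not tied to any particular library) showing how a single outlier drags the mean of a cluster while the medoid stays put:

import numpy as np

# One-dimensional toy cluster with a single extreme outlier
points = np.array([1.0, 2.0, 2.0, 3.0, 100.0])

mean = points.mean()   # pulled toward the outlier: 21.6

# The medoid is the actual point with the smallest total distance to the others
dist_sums = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[np.argmin(dist_sums)]   # remains 2.0

print("mean:", mean, "medoid:", medoid)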
Use of General Distance Metrics
k-Means requires Euclidean distance because the mean is the optimal minimizer of squared Euclidean distance. By contrast, k-Medoids can use a wide variety of distance functions (Manhattan, Minkowski, Hamming, or any other metric). This flexibility makes k-Medoids suitable for data that does not naturally conform to an assumption of Euclidean geometry.
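For instance, several such metrics are available off the shelf in scipy.spatial.distance, and any of them can play the role of the distance d(., .) used by k-Medoids (a brief illustration):

from scipy.spatial.distance import cityblock, minkowski, hamming

a = [1, 0, 2, 3]
b = [2, 1, 2, 0]

print(cityblock(a, b))        # Manhattan (L1) distance: 5
print(minkowski(a, b, p=3))   # Minkowski distance with p = 3
print(hamming(a, b))          # fraction of mismatching positions: 0.75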
Objective Function for k-Means
k-Means attempts to minimize the following objective function, which sums the squared Euclidean distances from each data point to its assigned cluster centroid:

J = \sum_{i=1}^{N} \sum_{j=1}^{K} r_{ij} \, \lVert x_i - \mu_j \rVert^{2}

Here, N is the total number of data points, K is the number of clusters, r_{ij} is an indicator (1 if data point x_i belongs to cluster j and 0 otherwise), x_i is the i-th data point, and mu_j is the centroid of cluster j.
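As a minimal sketch (assuming hypothetical inputs: X is an (N, d) NumPy array, labels holds each point's cluster index, and centroids is a (K, d) array of cluster means), this objective can be computed as:

import numpy as np

def kmeans_objective(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    diffs = X - centroids[labels]   # x_i - mu_j for every point's own centroid
    return np.sum(diffs ** 2)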
Objective Function for k-Medoids
k-Medoids tries to minimize the following objective function, which sums the (generic) distance between each data point and its cluster's medoid:

J = \sum_{i=1}^{N} \sum_{j=1}^{K} r_{ij} \, d(x_i, m_j)

In this expression, N is the total number of data points, K is the number of clusters, r_{ij} is an indicator (1 if data point x_i belongs to cluster j and 0 otherwise), x_i is the i-th data point, m_j is the medoid (one of the actual data points in cluster j), and d(...) is any valid distance metric chosen by the practitioner.
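An analogous sketch for this objective (again with hypothetical inputs: labels gives each point's cluster index, medoid_indices maps each cluster to the row of X serving as its medoid, and dist is any distance function, Manhattan by default):

import numpy as np

def kmedoids_objective(X, labels, medoid_indices, dist=lambda a, b: np.abs(a - b).sum()):
    # Sum of (generic) distances from each point to its cluster's medoid
    return sum(dist(X[i], X[medoid_indices[labels[i]]]) for i in range(len(X)))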
Because k-Medoids can incorporate arbitrary distance measures, it has a flexibility advantage over k-Means, which is restricted to squared Euclidean distances. This benefit often means k-Medoids is a better choice for categorical data, complex or high-dimensional data where Euclidean distance may not be the most suitable metric, or in cases where outliers are extremely problematic.
Algorithmic Complexity
Although the distance measure used by k-Medoids can be more robust, the algorithm itself is typically more computationally expensive than k-Means. Finding the medoid that truly minimizes the total distance to all other points in a cluster requires far more pairwise comparisons than updating a mean. The classic Partitioning Around Medoids (PAM) algorithm pays this cost in full, and variants such as CLARA and CLARANS trade some exactness for scalability by sampling the data or restricting the search over candidate swaps.
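To see where the extra cost comes from, a naive medoid update must compare every candidate point against every other member of its cluster, which is quadratic in the cluster size (this is a sketch of the idea, not of any specific library's implementation):

import numpy as np

def update_medoid(cluster_points, dist=lambda a, b: np.abs(a - b).sum()):
    # Pick the member with the smallest total distance to all other members: O(n^2) distance calls
    costs = [sum(dist(p, q) for q in cluster_points) for p in cluster_points]
    return cluster_points[int(np.argmin(costs))]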
Implementation Example in Python
Below is an illustrative Python snippet using the KMedoids class from sklearn_extra.cluster to show how one might implement k-Medoids for a simple dataset. Note that you may need to install the scikit-learn-extra library:
from sklearn_extra.cluster import KMedoids
import numpy as np
# Example synthetic data
X = np.array([
[1, 2], [2, 3], [2, 2],
[8, 7], [7, 8], [8, 8]
])
# Initialize and fit KMedoids
kmed = KMedoids(n_clusters=2, metric='manhattan', random_state=0)
kmed.fit(X)
# Output results
print("Cluster labels:", kmed.labels_)
print("Medoids:", kmed.cluster_centers_)
This snippet highlights the typical usage of k-Medoids, where you can specify the preferred distance metric, such as manhattan. k-Means in scikit-learn, on the other hand, is built around Euclidean distance and does not accept alternative distance metrics.
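For comparison, the analogous call with scikit-learn's KMeans on the same X (which always works in Euclidean space and exposes no metric argument) might look like this:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=0, n_init=10)
km.fit(X)

print("Cluster labels:", km.labels_)
print("Centroids:", km.cluster_centers_)   # means, not necessarily actual data points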
How k-Medoids Handles Outliers Differently
k-Means cluster centers (means) can be pulled significantly by the presence of an outlier. Because k-Medoids chooses an actual data point as the cluster representative, a single extreme value has far less leverage over the medoid. This leads to more stable cluster assignments and often better real-world performance, especially on data that contains noise or anomalies.
Can k-Medoids Use Any Distance Metric?
Yes, one of the main appeals of k-Medoids is its ability to incorporate a variety of distance metrics. In many applications, Euclidean distance is not the most appropriate measure. For example, if you have a dataset involving sequences or strings, you might use edit distance. If you are working with geographical data, you might use Haversine distance. k-Medoids can be employed with all such metrics without fundamentally altering the algorithm.
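One practical way to plug in such a custom metric is to precompute the full pairwise distance matrix yourself and pass metric='precomputed' to KMedoids. The sketch below assumes this option and the medoid_indices_ attribute of scikit-learn-extra's KMedoids, and uses a deliberately simple character-mismatch distance as a stand-in for a true edit distance:

import numpy as np
from sklearn_extra.cluster import KMedoids

# Toy string data and a crude character-mismatch distance (stand-in for edit distance)
data = ["kitten", "mitten", "sitter", "basket", "casket", "gasket"]

def char_mismatch(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

D = np.array([[char_mismatch(a, b) for b in data] for a in data])   # pairwise distance matrix

kmed = KMedoids(n_clusters=2, metric='precomputed', random_state=0).fit(D)
print("Cluster labels:", kmed.labels_)
print("Medoid strings:", [data[i] for i in kmed.medoid_indices_])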
In Which Scenario Might k-Means Be Preferred?
Despite k-Medoids’ flexibility, k-Means can be preferred in large-scale scenarios where computational efficiency is paramount and where Euclidean distance is appropriate. k-Means usually converges faster and has fewer computational demands for each iteration. If the dataset is known to be relatively free of outliers, or if the underlying geometry is clearly Euclidean, k-Means can be an ideal choice for its speed and simplicity.
Are There Potential Pitfalls with k-Medoids?
One concern is the computational overhead. Because medoids must be chosen among the actual data points, k-Medoids can be slower, especially if the dataset is large. In addition, if you pick an unsuitable distance metric for your data, you might still end up with poor cluster quality. Proper feature engineering and an understanding of how to measure “closeness” in your domain are crucial regardless of whether you choose k-Means or k-Medoids.
What If the Data is Categorical?
k-Medoids is often a better choice than k-Means in this case. With categorical data, Euclidean distance may not make sense. Using a metric like Hamming distance (which counts how many features are different between two data points) might be more natural. Since k-Medoids permits arbitrary distance definitions, it aligns nicely with clustering of categorical or mixed-type data.
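A brief sketch of that idea, assuming the categories have been integer-encoded and that the 'hamming' metric string is simply passed through to the underlying pairwise-distance computation:

import numpy as np
from sklearn_extra.cluster import KMedoids

# Categorical features encoded as integer codes (e.g. color, shape, size)
X_cat = np.array([
    [0, 1, 2], [0, 1, 1], [0, 2, 2],
    [3, 0, 0], [3, 0, 1], [2, 0, 0],
])

# Hamming distance measures the fraction of features on which two rows disagree
kmed = KMedoids(n_clusters=2, metric='hamming', random_state=0).fit(X_cat)
print("Cluster labels:", kmed.labels_)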
Summary of Key Differences
k-Means is generally faster and more suitable when:
The underlying assumption of Euclidean geometry holds.
Outliers are minimal or not highly impactful.
The dataset is large, and efficiency is crucial.
k-Medoids is generally more robust and flexible when:
The dataset has outliers or noisy observations.
A specific or non-Euclidean distance metric is needed.
You want the cluster representative to always be an actual data point rather than an abstract mean.