ML Interview Q Series: Why does standard k-Means always converge in finite steps? Outline the proof and required assumptions.
Comprehensive Explanation
k-Means clustering groups N data points into k clusters by iteratively assigning each point to the closest cluster centroid and then recalculating each centroid as the mean of the points assigned to it. Here is a high-level outline of why k-Means converges in a finite number of steps.
Objective Function
k-Means is often described as attempting to minimize the sum of squared distances between data points and their respective cluster centroids. We can express this objective function as

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2

where N is the total number of data points, k is the number of clusters, x_i is a data point, C_j is the jth cluster, and mu_j is the centroid (mean) of the jth cluster. The algorithm proceeds by:

Assigning each point x_i to the cluster whose centroid is nearest in Euclidean distance.

Recomputing each mu_j as the average of all points currently assigned to cluster j.

Two-Step Iteration
k-Means executes two alternating steps (sometimes described as an Expectation (E) step and a Maximization (M) step in analogy to the EM algorithm).
Cluster Assignment Step: Each data point is assigned to the cluster that has the smallest distance between that point and the cluster’s current centroid. This step partitions the data into k sets: C_1, C_2, ..., C_k.

Centroid Update Step: Each centroid is updated to be the mean position of the points in that cluster.
Monotonic Decrease of the Objective
In the cluster assignment step, each point is assigned to whichever cluster centroid is currently closest. This can never increase the sum of squared distances: a point is reassigned only when another centroid is at least as close as its current one, so each point's contribution to the objective either decreases or stays the same.
In the centroid update step, the new centroid mu_j is chosen as the mean of the points in cluster j. This mean is the position that minimizes the sum of squared distances to the points in that cluster. As a result, the total sum of squared distances either decreases or remains the same after each centroid update.

Finite Set of Distinct Partitions
Another key insight is that there are only finitely many ways to partition N data points into k clusters. The assignment and update steps never increase the objective, and with a consistent tie-breaking rule the objective strictly decreases whenever the partition actually changes. A strictly decreasing objective means no partition can ever be revisited, so the algorithm can pass through at most a finite number of distinct partitions before it must stop. Once a partition is reached where neither the cluster assignment step nor the centroid update step changes anything, the algorithm is at a fixed point with no further updates.
Assumptions for Convergence
The most crucial assumption is that there are a finite number of data points and a finite number of clusters, which ensures the partition space is finite. We also assume the standard sum-of-squared-distances objective, that all centroids are computed as arithmetic means, and that each data point is placed into exactly one cluster in each iteration. Finally, a consistent tie-breaking rule (for example, a point keeps its current cluster when two centroids are equidistant) is needed to rule out cycling among equal-cost partitions.
Practical Note
Although k-Means is guaranteed to converge in a finite number of steps, it may converge to a local minimum of the objective function rather than a global minimum. In practice, multiple initializations are often tried to find a better (possibly global) minimum.
Potential Follow-Up Questions
Why does the convergence not necessarily yield the global minimum?
k-Means convergence is guaranteed in terms of stopping, but the stopping point could be a local minimum of the sum of squared distances. Because the algorithm is greedy—reassigning points in a way that locally decreases the objective—it can get stuck in partitions that are suboptimal overall. In practice, running k-Means multiple times with different random initial centroids is the usual strategy to mitigate this risk.
Could convergence take an arbitrarily large number of steps?
In principle, it can take many iterations if points keep shifting between clusters, but since the number of possible assignments of N points to k clusters is finite (at most k^N), and the objective strictly decreases whenever the partition changes, k-Means must converge in at most k^N steps (an extremely loose upper bound). In most real-world scenarios, the algorithm converges much faster than the theoretical worst case.
How does the centroid update guarantee a reduction in the objective?
When updating the centroid in cluster j, the chosen mean position is the point in space that minimizes the sum of squared distances to the points in that cluster. If the centroid were set to any other position, the objective for that cluster would be higher. This ensures the sum of squared distances does not increase.
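In symbols, setting the gradient of the within-cluster sum of squares with respect to mu_j to zero recovers exactly the arithmetic mean:

\frac{\partial}{\partial \mu_j} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2 = -2 \sum_{x_i \in C_j} (x_i - \mu_j) = 0 \quad\Longrightarrow\quad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i.

Because the within-cluster objective is a convex quadratic in mu_j, this stationary point is its unique minimizer.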
Do we need to worry about empty clusters during the convergence proof?
Empty clusters can occur if no points are assigned to one or more clusters. One way to handle that is to remove the empty cluster or re-initialize that centroid randomly. While this is a practical consideration, it does not affect the finite convergence argument: even if a cluster is reinitialized, the algorithm is still restricted to a finite set of partitions and will eventually reach a state with no further changes.
How would you implement k-Means in Python in a simple way?
Below is a concise example using NumPy. It randomly initializes centroids, then iteratively updates them:
import numpy as np

def kmeans(X, k, max_iters=100):
    # X is an array of shape (N, d)
    N, d = X.shape
    # Randomly pick k points as initial centroids
    centroids = X[np.random.choice(N, k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its closest centroid
        distances = np.array([np.linalg.norm(X - c, axis=1) for c in centroids])
        cluster_assignments = np.argmin(distances, axis=0)
        # Recompute centroids as the mean of assigned points
        new_centroids = []
        for i in range(k):
            points_in_cluster = X[cluster_assignments == i]
            if len(points_in_cluster) > 0:
                new_centroids.append(points_in_cluster.mean(axis=0))
            else:
                # If no points in cluster, reinitialize
                new_centroids.append(X[np.random.choice(N)])
        new_centroids = np.array(new_centroids)
        # Check for convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, cluster_assignments
When this algorithm finishes, the centroids and their corresponding cluster assignments remain fixed under subsequent iterations, reflecting the finite-step convergence principle we have discussed above.
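For illustration, here is a quick usage sketch on synthetic two-cluster data (the blob locations, spreads, and sizes are arbitrary choices made only for this example):

import numpy as np

# Two well-separated Gaussian blobs in 2D (illustrative data only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(X, k=2)
print(centroids)            # two centroids near (0, 0) and (5, 5)
print(np.bincount(labels))  # roughly 100 points per cluster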
Below are additional follow-up questions
How does feature scaling influence convergence and cluster quality?
In k-Means, distance-based comparisons drive both cluster assignments and centroid updates. Features on large numerical scales will dominate the distance calculations, potentially overshadowing other attributes. Consequently, clusters may form primarily based on features with higher magnitudes, ignoring those with smaller magnitudes. This can lead to suboptimal clusters and slower or biased convergence.
To mitigate these issues, practitioners commonly standardize or normalize features such that each has roughly comparable scale. Standardization typically transforms a feature by subtracting its mean and dividing by its standard deviation, ensuring zero mean and unit variance. Normalization often involves bringing all features into a fixed range, for example [0, 1]. By doing so, each feature has a more balanced influence, improving convergence behavior and producing more meaningful clusters.
A secondary benefit of scaling is numerical stability. Without scaling, large feature values may cause inflated squared distances, making the cost function large and potentially causing numerical overflow or underflow issues. Scaling ensures that the range of distance values is more manageable for floating-point computations.
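As a minimal sketch, standardizing features with NumPy before calling the kmeans function defined earlier (X is assumed to be the raw (N, d) data array, and k = 3 is an arbitrary choice):

import numpy as np

# Standardize each feature to zero mean and unit variance
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0           # guard against constant features
X_scaled = (X - mean) / std

centroids, labels = kmeans(X_scaled, k=3)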
What happens when there are outliers, and how do they affect k-Means?
Outliers are points that lie far outside the typical distribution of the data. Since k-Means uses squared Euclidean distance, an outlier exerts a disproportionately large effect on the objective function. This can lead to undesirable outcomes, such as a centroid being “pulled” toward an outlier, causing other cluster members to be poorly represented by the mean.
One common approach to mitigating outliers’ impact is to detect and remove them or to assign them to a dedicated “noise” cluster before running k-Means on the rest. Another strategy involves using a variant of k-Means that relies on more robust distance metrics or uses medians (like k-medians or k-medoids) instead of means. These modifications reduce sensitivity to extreme values, although they also come with additional computational overhead and potential changes in convergence speed.
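A tiny numerical illustration of why the mean is sensitive to an outlier while the median is not (the values below are made up for the example):

import numpy as np

cluster_values = np.array([1.0, 1.2, 0.9, 1.1, 50.0])  # last value is an outlier
print(np.mean(cluster_values))    # ~10.8, dragged toward the outlier
print(np.median(cluster_values))  # 1.1, essentially unaffected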
How do you handle missing data within k-Means clustering?
Missing data complicates the distance calculations that are central to k-Means. If an observation has one or more missing features, it can be unclear how to compute a meaningful distance. Several approaches exist:
Imputation Before Clustering: Replace missing values with mean, median, or some other imputed value before running k-Means. This is straightforward but might distort genuine data patterns.
Pairwise Distance Computation: Compute distance only over the dimensions where both points are non-missing, often combined with weighting to account for the number of missing features. This approach is more nuanced and must ensure that the partial distance is still comparable across different pairs of points.
Model-Based Approaches: Use an EM-style algorithm that simultaneously clusters the data while estimating missing entries. This can improve accuracy but is more complex and computationally expensive.
In any case, missing data can pose a serious practical pitfall if not handled thoughtfully. It can lead to clusters that do not accurately represent the underlying distribution of complete data.
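As a minimal sketch of the first option, column-mean imputation with NumPy (this assumes missing entries in X are encoded as NaN, and k = 3 is an arbitrary choice):

import numpy as np

# Replace NaN entries with the mean of their column, then cluster as usual
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

centroids, labels = kmeans(X_imputed, k=3)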
Is k-Means effective in high-dimensional spaces?
In very high-dimensional spaces, the notion of “distance” becomes less meaningful due to the curse of dimensionality: pairwise distances concentrate, so data points tend to appear nearly equidistant from one another and the contrast between a point’s nearest and farthest centroids shrinks. As a result, k-Means may converge slowly, produce weakly separated clusters, or yield unstable cluster assignments.
To address this issue, practitioners often reduce dimensionality prior to k-Means through methods like Principal Component Analysis (PCA) or autoencoders. By projecting data into a lower-dimensional representation, one can preserve most of the relevant variance while mitigating distance distortion issues. Even after dimensionality reduction, careful parameter tuning and multiple runs with different initializations are often necessary to obtain stable results.
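A rough sketch of PCA-style reduction via NumPy’s SVD followed by clustering (the number of components and k are arbitrary choices for illustration, and X and kmeans come from earlier):

import numpy as np

def pca_project(X, n_components):
    # Center the data, then project onto the top principal directions via SVD
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

X_reduced = pca_project(X, n_components=10)
centroids, labels = kmeans(X_reduced, k=5)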
What practical stopping criterion do you use if numerical precision is limited?
While the classical approach relies on detecting no change in cluster assignments or negligible movement in centroids, in practice, floating-point comparisons can be imprecise. For real-world datasets, one might adopt a convergence threshold such that if the maximum shift in any centroid is below a certain small epsilon, the algorithm terminates. Alternatively, one can compare the old and new objective function values and stop if their difference falls below a tolerance threshold.
Another practical approach is a maximum iteration limit to ensure the algorithm does not run indefinitely. As long as the dataset is finite and the partition space is finite, k-Means converges eventually, but bounding the number of iterations provides a safeguard against unusually slow convergence in pathological cases.
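One way to express such a criterion is a small helper that could stand in for the np.allclose check in the kmeans function above (the default tolerance is an arbitrary choice):

import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-4):
    # Stop once the largest single-centroid movement falls below tol
    max_shift = np.linalg.norm(new_centroids - old_centroids, axis=1).max()
    return max_shift < tol

# In the kmeans loop, replace `if np.allclose(centroids, new_centroids):`
# with `if has_converged(centroids, new_centroids):`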
Could a centroid ever lie outside the convex hull of its assigned points?
Under standard k-Means (using the mean for centroid calculation), the centroid must lie within the convex hull of the assigned points. The arithmetic mean of a set of points is a convex combination of those points (their center of mass), and any convex combination lies inside the convex hull of the set. The same mean is also the position that minimizes the sum of squared distances for that cluster, but the convex-hull property follows simply from the mean being a convex combination. If you observe a centroid that seems “out of bounds,” that usually indicates outliers stretching the hull, an empty-cluster reinitialization, or a numerical anomaly.
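In symbols, for a cluster C_j containing n_j points:

\mu_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i = \sum_{x_i \in C_j} w_i x_i, \qquad w_i = \frac{1}{n_j} \ge 0, \qquad \sum_{x_i \in C_j} w_i = 1,

so mu_j is a convex combination of the assigned points and therefore lies in their convex hull.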
How can you evaluate the quality of your k-Means clustering?
Evaluating clusters is often less straightforward than supervised learning tasks since there are no universal “labels.” Still, there are some internal and external indices one can use:
Internal Indices: Metrics that rely solely on the dataset’s features and the resulting clusters, such as the silhouette score, Davies–Bouldin index, or Calinski–Harabasz index. For example, the silhouette score compares intra-cluster cohesion and inter-cluster separation.
External Indices: When ground-truth labels are available (e.g., in a semi-supervised setting), one can measure clustering performance using metrics like accuracy or the adjusted Rand index.
Visual Inspection: For small dimensions or after dimensionality reduction, one can plot the clusters to see if they make intuitive sense. This is a subjective but often revealing approach.
Each of these measures provides different perspectives on cluster quality. It is sometimes beneficial to consider multiple metrics simultaneously to get a comprehensive view of the performance.
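For instance, a quick internal-index check with the silhouette score (this sketch assumes scikit-learn is installed and reuses the kmeans function and data array X from earlier; k = 3 is arbitrary):

from sklearn.metrics import silhouette_score

centroids, labels = kmeans(X, k=3)
score = silhouette_score(X, labels)  # ranges from -1 (poor) to +1 (well separated)
print(score)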
How do initialization strategies like k-Means++ help ensure faster convergence?
A pitfall of basic random initialization is that it can produce poor initial centroids, which can slow convergence or lead the algorithm into a low-quality local minimum. k-Means++ is a more sophisticated approach that spreads the initial centroids out: the first centroid is picked uniformly at random, and each subsequent centroid is chosen with probability proportional to the squared distance from a candidate point to its nearest already-selected centroid. This tends to produce a better initial partition, reducing the expected number of iterations and improving the quality of the final solution.
Despite its benefits, k-Means++ has slightly higher overhead in the initialization phase because it calculates distance distributions. However, this overhead is often negligible compared to the potential improvements in convergence speed and solution quality. In large datasets, even small improvements in iteration count can translate into significant computational savings overall.
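A compact sketch of the k-Means++ seeding step in NumPy (this would replace the random initialization line in the earlier kmeans function):

import numpy as np

def kmeans_pp_init(X, k, rng=None):
    # Choose initial centroids with probability proportional to the squared
    # distance to the nearest centroid already chosen (D^2 weighting)
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    centroids = [X[rng.integers(N)]]   # first centroid: uniform at random
    for _ in range(k - 1):
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[rng.choice(N, p=probs)])
    return np.array(centroids)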
What if the dataset is so large that you cannot load it all into memory at once?
When the dataset is extremely large, standard k-Means that requires repeated passes through the entire dataset might not be feasible. In these cases, one can use:
Mini-Batch k-Means: Processes the data in small batches, updating centroids incrementally. Although it does not strictly decrease the objective function with each batch in the same way as full-batch k-Means, it often converges more quickly and still yields good cluster quality.
Online k-Means: Updates centroids one point at a time. Each point modifies the centroid assignment based on a learning rate. Though it can be noisier and might require more hyperparameter tuning, it is memory-efficient and can handle streaming data.
Sampling or Approximation Techniques: Instead of clustering all data points, cluster a representative sample first, then assign the remaining data to the nearest centroids.
These strategies address memory constraints and allow for large-scale clustering, with a trade-off in convergence guarantees and potential cluster quality.
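As a rough sketch of the mini-batch idea (batch size and iteration count are arbitrary choices, and the per-centroid counts give the commonly used decaying learning rate; this is illustrative, not a definitive implementation):

import numpy as np

def minibatch_kmeans(X, k, batch_size=256, n_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    centroids = X[rng.choice(N, k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = X[rng.choice(N, min(batch_size, N), replace=False)]
        # Assign each batch point to its nearest centroid
        d = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid toward its batch points; the learning rate
        # shrinks as the centroid accumulates more assigned points
        for x, j in zip(batch, labels):
            counts[j] += 1
            lr = 1.0 / counts[j]
            centroids[j] = (1.0 - lr) * centroids[j] + lr * x
    return centroids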