ML Interview Q Series: How would you describe a consensus clustering method that uses k-Means as its foundation?
Comprehensive Explanation
k-Means based Consensus Clustering is a technique to create a more robust clustering solution by combining multiple runs (or multiple variations) of k-Means. Rather than relying on a single clustering result, consensus clustering harnesses various partitions to reduce the instability that can arise from random initialization, sampling variations, or parameter perturbations.
The general idea is to run k-Means several times, possibly with different random seeds, subsets of features, or even subsets of data. The outcomes of these multiple runs are aggregated into a “consensus” that seeks to capture stable groupings of samples. This consensus is often based on a co-association matrix or similarity matrix, which counts how frequently each pair of samples ends up in the same cluster across different runs. A final clustering step, which can again be k-Means or another algorithm, is then performed on this aggregated matrix to produce the final, consensus-based partition.
k-Means itself minimizes the within-cluster sum of squared distances from each data point to its assigned cluster centroid. The canonical objective function for k-Means can be written as:

$$\min_{S_1, \ldots, S_k} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2$$

where S_i is the set of points assigned to cluster i, mu_i is the centroid (mean) of the points in S_i, and k is the total number of clusters.
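For concreteness, this objective is easy to compute directly. Below is a minimal sketch (the function name is just illustrative); the value it returns is equivalent to the inertia_ attribute that scikit-learn's KMeans exposes after fitting:

import numpy as np

def kmeans_objective(data, labels, centroids):
    # Sum of squared distances from each point to its assigned centroid
    return np.sum((data - centroids[labels]) ** 2)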
In a consensus approach, you would repeat this procedure multiple times (or in multiple ways) to generate different partitions. After generating these partitions, you aggregate them by measuring how often points co-occur in the same cluster. Finally, you cluster this co-occurrence information to obtain a consensus partition.
Consensus clustering can often be more stable than a single run of k-Means, because it reduces the likelihood of basing decisions on a single initialization or local optimum. However, it adds computational overhead, since multiple runs are needed, and it does not automatically fix fundamental limitations of k-Means—such as sensitivity to outliers or difficulties with complex cluster shapes.
How to Construct the Co-association Matrix
A common method is to build an n x n matrix (where n is the number of data points) in which each entry (i, j) indicates how many times point i and point j appear in the same cluster across all runs. Normalizing this matrix by the number of runs transforms it into a similarity measure between 0 and 1. This matrix is then used by a clustering algorithm (sometimes k-Means itself, or a hierarchical approach) to determine the consensus clusters.
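As a minimal sketch of this construction, assuming the label arrays from all runs have already been collected into a Python list (label_runs below is only an illustrative name), the co-association matrix can be built with a vectorized comparison:

import numpy as np

def build_co_association(label_runs):
    n_samples = len(label_runs[0])
    co_assoc = np.zeros((n_samples, n_samples))
    for labels in label_runs:
        labels = np.asarray(labels)
        # Entry (i, j) gets +1 whenever points i and j share a cluster in this run
        co_assoc += (labels[:, None] == labels[None, :]).astype(float)
    # Normalize by the number of runs so entries fall in [0, 1]
    return co_assoc / len(label_runs)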
Practical Considerations
One practical factor in k-Means based Consensus Clustering is the selection of k, the number of clusters. Each repeated run of k-Means must use the same k for the clustering results to be aggregated in a straightforward manner. Another consideration is the similarity threshold you might use when forming a final partition from the co-occurrence matrix. You may choose a similarity cutoff that indicates which points are strongly or weakly connected.
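One simple way to act on such a cutoff is sketched below using SciPy's connected_components (the 0.5 threshold is only an illustrative default): keep edges between points that co-cluster in more than the chosen fraction of runs, and read off the connected components of the resulting graph as consensus clusters.

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def consensus_by_threshold(co_assoc, threshold=0.5):
    # Keep only pairs that co-cluster in more than `threshold` of the runs
    adjacency = csr_matrix(co_assoc > threshold)
    # Each connected component of the thresholded graph becomes one consensus cluster
    n_clusters, labels = connected_components(adjacency, directed=False)
    return labels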
It is also essential to be aware of computational costs. If k-Means is repeated many times on a large dataset, the overall runtime could become substantial. Techniques like mini-batch k-Means can help if the dataset is large.
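As a sketch of that substitution, each run in the ensemble can simply use scikit-learn's MiniBatchKMeans instead of the full-batch estimator (the batch size below is an arbitrary illustration):

from sklearn.cluster import MiniBatchKMeans

def minibatch_partition(data, k, seed):
    # One cheap partition for the ensemble; call repeatedly with different seeds
    mbk = MiniBatchKMeans(n_clusters=k, batch_size=1024, n_init=3, random_state=seed)
    return mbk.fit_predict(data)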
Potential Follow-up Questions
Why might we prefer consensus clustering over a single k-Means run?
Consensus clustering mitigates the randomness of k-Means initialization or sampling. A single run of k-Means can converge to different local minima if initialized differently. By ensembling multiple solutions, we can reduce the variance and potentially arrive at clusters that appear more consistently across many runs. This approach can yield a more reliable partition, especially in datasets with non-trivial cluster structures or noise.
How do we decide on the number of clusters when running k-Means based Consensus Clustering?
Choosing k usually involves methods like the elbow method, silhouette scores, or domain knowledge. In a consensus setting, it can be trickier because varying k across runs complicates the aggregation process. Often, the same k is used across all runs, and one might repeat the entire consensus process for multiple values of k to see which yields the most stable or interpretable result. Some practitioners also use gap statistics or other model selection criteria.
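A rough sketch of that outer loop is shown below, reusing the kmeans_consensus_clustering function given later in this answer and scoring each candidate k by the mean co-association among points that end up in the same final cluster (one of several reasonable stability measures):

import numpy as np

def select_k_by_stability(data, candidate_ks, num_runs=20):
    scores = {}
    for k in candidate_ks:
        labels, co_assoc = kmeans_consensus_clustering(data, k, num_runs)
        same_cluster = labels[:, None] == labels[None, :]
        np.fill_diagonal(same_cluster, False)
        # Higher mean co-association within the final clusters suggests a more stable k
        scores[k] = co_assoc[same_cluster].mean()
    return scores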
What if different runs generate contradictory clusterings?
Different runs may conflict about which points cluster together. In the co-association matrix, contradictory assignments would lower pairwise co-occurrence counts for certain points. As a result, such points may form their own small clusters or remain outliers in the final consensus. Some implementations might weight different runs (e.g., based on an internal validity metric) so that high-quality partitions have greater influence on the final result.
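A sketch of such weighting, using scikit-learn's silhouette_score as the internal validity metric (the shift to keep weights positive is just one plausible choice, and the function name is illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def weighted_co_association(data, k, num_runs=20):
    n_samples = data.shape[0]
    co_assoc = np.zeros((n_samples, n_samples))
    total_weight = 0.0
    for seed in range(num_runs):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(data)
        # Silhouette lies in [-1, 1]; shift it so better partitions get larger positive weights
        weight = silhouette_score(data, labels) + 1.0
        co_assoc += weight * (labels[:, None] == labels[None, :])
        total_weight += weight
    return co_assoc / total_weight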
Does consensus clustering fix limitations like sensitivity to outliers?
Consensus clustering by itself does not inherently solve the outlier sensitivity problem or limitations caused by the assumption of spherical, equally sized clusters. However, repeated runs can reveal which clusters are stable despite outliers. If an outlier strongly affects the centroid in only a few runs but not in the majority, its influence might be diluted in the final consensus result. Still, if the underlying data has distributions that severely violate k-Means assumptions, other algorithms might be more suitable.
How can we implement k-Means based Consensus Clustering?
Below is a rough Python outline illustrating multiple runs of k-Means, constructing a co-association matrix, and generating a final consensus clustering:
import numpy as np
from sklearn.cluster import KMeans

def kmeans_consensus_clustering(data, k, num_runs):
    n_samples = data.shape[0]
    co_assoc = np.zeros((n_samples, n_samples))

    # Perform multiple runs of KMeans; random_state is left unset so each run
    # starts from a different random initialization
    for _ in range(num_runs):
        kmeans = KMeans(n_clusters=k, init='random', n_init=1)  # single init for demonstration
        labels = kmeans.fit_predict(data)

        # Update co-association matrix: count pairs that share a cluster in this run
        for i in range(n_samples):
            for j in range(i + 1, n_samples):
                if labels[i] == labels[j]:
                    co_assoc[i, j] += 1
                    co_assoc[j, i] += 1

    # Normalize co-association matrix by num_runs so entries lie in [0, 1]
    co_assoc /= num_runs

    # Perform final clustering on the co-association matrix. Here each row of
    # co_assoc is treated as a feature vector (one naive approach); alternatively,
    # convert similarity to distance with 1 - co_assoc and use an algorithm that
    # accepts precomputed distances (e.g., hierarchical or spectral clustering).
    final_kmeans = KMeans(n_clusters=k)
    final_labels = final_kmeans.fit_predict(co_assoc)

    return final_labels, co_assoc
This example shows a simple conceptual path. In practice, you might employ more sophisticated approaches for the final clustering step (e.g., spectral clustering on the co-association matrix, or hierarchical clustering). The key idea is that repeated k-Means runs generate multiple partitions, which are aggregated into a co-association matrix, followed by a final clustering step that produces a more stable, consensus-based partition.
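For instance, a hierarchical final step can be sketched as follows, converting the co-association similarities to distances and feeding them to scikit-learn's AgglomerativeClustering with a precomputed metric (parameter names assume a recent scikit-learn release, where affinity was renamed to metric):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def consensus_from_co_association(co_assoc, k):
    # Convert similarity to distance; points that always co-cluster are at distance 0
    distance_matrix = 1.0 - co_assoc
    np.fill_diagonal(distance_matrix, 0.0)
    model = AgglomerativeClustering(n_clusters=k, metric="precomputed", linkage="average")
    return model.fit_predict(distance_matrix)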