ML Interview Q Series: What differentiates standard K-Means clustering from Spherical K-Means clustering?
Comprehensive Explanation
Classical K-Means clustering is an algorithm that aims to minimize the sum of squared Euclidean distances between each data point and the centroid of the cluster to which it is assigned. Its objective function is often written as

J = sum_{k=1..K} sum_{x_i in C_k} | x_i - mu_k |^2

where x_i is a data point in cluster C_k, and mu_k is the centroid of cluster k. In plain text:
x_i is a data point.
mu_k is the centroid (mean) of the cluster k.
| x_i - mu_k |^2 is the squared Euclidean distance between x_i and mu_k.
K is the total number of clusters.
C_k is the set of data points belonging to cluster k.
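To make the objective concrete, here is a minimal sketch (plain NumPy, with toy data and a fixed assignment invented purely for this example) that evaluates J directly:

import numpy as np

# Toy data: four 2-D points assigned to two clusters (labels are hypothetical)
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Centroids are the per-cluster arithmetic means, as in standard K-Means
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# J = sum of squared Euclidean distances from each point to its assigned centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(2))
print(centroids)  # rows: [1.5, 0.0] and [0.0, 3.5]
print(J)          # 1.0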
On the other hand, Spherical K-Means modifies the objective by focusing on the cosine similarity between normalized data vectors and their cluster centroids. It effectively groups data based on similarity in direction rather than absolute distance in Euclidean space. Spherical K-Means normalizes both the data points and the centroids to lie on the unit sphere, and then tries to maximize their dot products (or, equivalently, to minimize one minus their cosine similarity). One way to represent its objective is

J = sum_{k=1..K} sum_{x_i in C_k} ( 1 - (x_i dot mu_k) / ( norm(x_i) * norm(mu_k) ) )

In plain text:
x_i is a data point.
mu_k is the cluster centroid vector.
Each vector is normalized to unit length, so their dot product x_i dot mu_k / ( norm(x_i) * norm(mu_k) ) is just the cosine similarity.
The term 1 - (x_i dot mu_k / ( norm(x_i) * norm(mu_k) )) can be seen as a form of spherical distance (angle-based measure).
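For concreteness, this minimal sketch (toy data and a hypothetical assignment, analogous to the earlier example) evaluates the spherical objective; because every vector is scaled to unit length first, rescaling a data point leaves its contribution unchanged:

import numpy as np

X = np.array([[1.0, 0.0], [4.0, 0.0], [0.0, 2.0], [0.0, 5.0]])  # toy points with mixed magnitudes
labels = np.array([0, 0, 1, 1])                                  # hypothetical assignment

# Normalize the points, average each cluster, then re-normalize the centroids
X_u = X / np.linalg.norm(X, axis=1, keepdims=True)
centroids = np.array([X_u[labels == k].mean(axis=0) for k in range(2)])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# J = sum of (1 - cosine similarity) between each point and its assigned centroid
J = sum((1.0 - X_u[labels == k] @ centroids[k]).sum() for k in range(2))
print(J)  # ~0.0 -- each cluster's points share a single direction, whatever their lengths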
Key Differences
Distance Metric and Normalization: Classical K-Means uses Euclidean distance and centers that minimize squared errors in this Euclidean space. Spherical K-Means, by contrast, typically uses cosine similarity (an angular measure) and normalizes all vectors to unit length before clustering; the sketch after this list shows how the two measures relate on the unit sphere.
Centroid Computation: In standard K-Means, the centroid is the mean (arithmetic average) of all points in the cluster. In spherical K-Means, the cluster centroid is also computed as the mean of the (normalized) points, but it is then re-normalized to unit length to keep it on the unit sphere.
Use Cases: Classical K-Means is suitable when absolute distances in the feature space are meaningful. Spherical K-Means is beneficial when direction matters more than magnitude, such as in text processing, where TF-IDF vectors are common and the data is often high-dimensional and sparse.
Behavior with High-Dimensional Data: In high-dimensional, sparse data, classical K-Means can be strongly influenced by the magnitude of feature counts. Spherical K-Means is more robust when the overall direction of the feature vector matters more than the exact magnitudes, which is common in natural language processing tasks (e.g., clustering documents by topic).
Interpretation of Clusters: Classical K-Means clusters can be interpreted in terms of point proximity. Spherical K-Means groups points that share a similar direction or pattern of non-zero features, even if they are far apart in Euclidean distance.
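The sketch below (a toy NumPy example with made-up vectors) illustrates two of these points: for unit-length vectors, squared Euclidean distance and cosine similarity carry the same information ( norm(a - b)^2 = 2 * (1 - cos(a, b)) ), and two vectors can be far apart in Euclidean space while pointing in exactly the same direction.

import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([10.0, 20.0, 0.0])   # same direction as a, 10x the magnitude

# Large Euclidean distance, yet perfect cosine similarity
print(np.linalg.norm(a - b))                            # ~20.12
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0

# After unit normalization, squared Euclidean distance equals 2 * (1 - cosine similarity)
a_u = a / np.linalg.norm(a)
c_u = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)          # a different direction, unit length
cos_sim = a_u @ c_u
print(np.linalg.norm(a_u - c_u) ** 2)                   # ~0.735
print(2 * (1 - cos_sim))                                # ~0.735

With these two properties in mind, the implementation below simply alternates a cosine-based assignment step with a mean-then-renormalize centroid update.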
# A minimal example of how one might implement spherical k-means in Python
import numpy as np
from sklearn.preprocessing import normalize

def spherical_k_means(X, k, max_iter=100):
    # X is assumed to be an array of shape (n_samples, n_features)
    # Step 1: Normalize X to unit vectors
    X_normed = normalize(X, axis=1)

    # Step 2: Randomly initialize centroids
    # Usually pick random samples or random vectors
    np.random.seed(42)
    indices = np.random.choice(X_normed.shape[0], k, replace=False)
    centroids = X_normed[indices]

    for _ in range(max_iter):
        # Assign samples to the nearest centroid (in terms of cosine similarity).
        # cos_sim between x_i and mu_k is x_i dot mu_k / (norm(x_i) * norm(mu_k)),
        # but X_normed and centroids are already unit vectors => the dot product is enough
        similarity = np.dot(X_normed, centroids.T)
        labels = np.argmax(similarity, axis=1)

        # Recompute the centroids
        new_centroids = []
        for cluster_idx in range(k):
            cluster_points = X_normed[labels == cluster_idx]
            if len(cluster_points) == 0:
                # If no points belong to this cluster, keep the previous centroid
                new_centroids.append(centroids[cluster_idx])
            else:
                # Average all points in the cluster, then re-normalize
                centroid_avg = np.mean(cluster_points, axis=0)
                norm = np.linalg.norm(centroid_avg)
                if norm == 0:
                    new_centroids.append(centroids[cluster_idx])
                else:
                    new_centroids.append(centroid_avg / norm)
        new_centroids = np.array(new_centroids)

        # Check for convergence, otherwise update
        if np.allclose(centroids, new_centroids, atol=1e-6):
            break
        centroids = new_centroids

    return labels, centroids
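As a quick usage sketch (the toy documents below are invented for illustration), the function can be run on a small, dense TF-IDF matrix; real corpora are usually kept sparse, in which case the dot products inside the loop would need a sparse-aware implementation:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "a cat and a kitten sat on the sofa",
    "stock markets fell sharply today",
    "markets and stock prices fell",
]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()  # dense only for this tiny example
labels, centroids = spherical_k_means(tfidf, k=2)
print(labels)  # e.g. [0 0 1 1] -- the two topics typically separate; exact label ids depend on initialization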
What happens if data is not normalized for Spherical K-Means?
If you do not normalize your data vectors before running spherical K-Means, then the algorithm is no longer purely measuring direction. Instead, vectors with higher magnitude might dominate the computation, preventing the algorithm from working as intended. The essence of spherical K-Means is to cluster on the basis of direction, so failing to normalize means you lose that property.
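A tiny sketch of this effect (vectors invented for illustration): if the assignment step uses raw dot products on unnormalized vectors, a longer but misaligned centroid can outscore a perfectly aligned one.

import numpy as np

x  = np.array([1.0, 0.0])
c1 = np.array([2.0, 0.0])   # same direction as x, modest magnitude
c2 = np.array([3.0, 4.0])   # different direction, larger magnitude

# Raw dot products: the longer centroid wins despite pointing the wrong way
print(x @ c1, x @ c2)  # 2.0 3.0

# Cosine similarities (normalize both sides): the aligned centroid wins
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(x, c1), cos(x, c2))  # 1.0 0.6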
When is Classical K-Means preferred over Spherical K-Means?
Classical K-Means is often chosen when the magnitude of features is relevant, and a smaller Euclidean distance truly represents similarity. Typical scenarios include numerical feature vectors where the absolute values of features matter (for instance, in some sensor measurements, image color data, or continuous-valued regression-like features).
Can Spherical K-Means converge to different solutions based on initialization?
Yes, just like standard K-Means, spherical K-Means is sensitive to centroid initialization. Different initial points on the unit sphere can lead to different local optima. Common approaches like k-means++ can be adapted for spherical K-Means to produce more robust initial centroids, reducing the chance of poor local minima.
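As a rough illustration, here is a minimal sketch of one way to adapt k-means++-style seeding to the unit sphere, using 1 - cosine similarity as the distance; the function and variable names are my own, and classic k-means++ would square the distances before sampling:

import numpy as np
from sklearn.preprocessing import normalize

def spherical_kmeans_pp_init(X, k, seed=None):
    # Pick each new seed with probability proportional to its angular distance
    # from the nearest already-chosen seed (a simple k-means++-style variant)
    rng = np.random.default_rng(seed)
    X_u = normalize(X, axis=1)
    centers = [X_u[rng.integers(X_u.shape[0])]]
    for _ in range(1, k):
        sims = np.max(X_u @ np.array(centers).T, axis=1)  # best similarity to any chosen seed
        dists = np.clip(1.0 - sims, 0.0, None)
        if dists.sum() == 0:
            probs = np.full(X_u.shape[0], 1.0 / X_u.shape[0])
        else:
            probs = dists / dists.sum()
        centers.append(X_u[rng.choice(X_u.shape[0], p=probs)])
    return np.array(centers)

These seeds could then replace the random-sample initialization in the implementation above.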
Are there any computational or performance differences?
The computational complexity is quite similar for both variants, as they both follow an iterative assign-and-update procedure. However, if the data is very high dimensional and you rely on sparse representations (common in text data), spherical K-Means can sometimes be more efficient because dot products with sparse vectors can be computed quickly. Meanwhile, in classical K-Means, you have to perform a Euclidean distance calculation which may require computing squared norms. In practice, the difference might be modest, and both are generally O(n * k * d * iterations), where n is the number of data points, k is the number of clusters, d is the dimensionality, and iterations is the number of iterations until convergence.
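As a small sketch of the sparse case (dimensions and density chosen arbitrarily), the assignment step is a single sparse-times-dense product whose cost scales with the number of non-zero entries rather than with n * d:

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(10_000, 50_000, density=0.001, format="csr", random_state=0)  # TF-IDF-like matrix
centroids = rng.standard_normal((20, 50_000))  # stand-in for unit-length centroids

sims = np.asarray(X @ centroids.T)  # shape (10000, 20)
labels = sims.argmax(axis=1)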
What are common pitfalls when using Spherical K-Means?
One pitfall is forgetting to normalize your data. Another is using spherical K-Means on data where the direction is not really the key factor in clustering. If the magnitude of features is genuinely important for your application, spherical K-Means may give misleading results. Additionally, as with classical K-Means, you need to choose k carefully, handle empty clusters, and consider the possibility of local minima.
Is there a way to interpret cluster centroids in Spherical K-Means?
Yes. The centroids in spherical K-Means are typically normalized vectors. You can look at the relative magnitude of each feature in the centroid vector to understand which features (for instance, which words in text data) characterize that cluster’s “direction.” This can be particularly helpful for text clustering to interpret which keywords are most relevant for each topic.
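For example, with TF-IDF features one can map the largest centroid weights back to vocabulary terms. This is a small sketch assuming a fitted scikit-learn vectorizer and the centroids returned by the implementation above; the helper name is my own:

import numpy as np

def top_terms_per_cluster(centroids, vectorizer, n_terms=10):
    # The largest weights in a (unit-length) centroid mark the terms that define its direction
    terms = vectorizer.get_feature_names_out()
    for k, centroid in enumerate(centroids):
        top = np.argsort(centroid)[::-1][:n_terms]
        print(f"cluster {k}: " + ", ".join(terms[i] for i in top))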