ML Interview Q Series: How can Mahalanobis distance serve as an approach for detecting anomalies in a dataset?
Comprehensive Explanation
Mahalanobis distance is a generalized measure of distance that considers correlations between features. It is frequently used to spot outliers, since it takes into account the scale and correlation structure of the dataset more accurately than simple Euclidean distance. In essence, points that deviate substantially from the majority of the data with respect to the covariance structure are identified as potential anomalies.
Key Formula for Mahalanobis Distance
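d^2 = (x - mu)^T Sigma^{-1} (x - mu)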
Here, x represents the data point of interest (an n-dimensional column vector). The term mu is the mean vector of the distribution, also in n-dimensional space. The symbol Sigma denotes the covariance matrix of the distribution, and Sigma^{-1} is its inverse. The expression (x - mu) captures the difference between the data point x and the mean mu, and the resulting distance d^2 is a scalar that reflects how far the point is from the mean in terms of the dataset’s covariance structure.
In an outlier detection setting, one often calculates this Mahalanobis distance for each point and then flags points that exceed a threshold as outliers. Because it incorporates the inverse covariance matrix, the distance penalizes deviations along directions of less variance more heavily and is more permissive along directions of greater variance.
Practical Steps for Outlier Detection
One commonly used strategy is:
Estimate the mean mu of your dataset.
Estimate the covariance matrix Sigma of your dataset.
Compute Sigma^{-1} if invertible.
For each sample x_i, calculate its Mahalanobis distance.
Determine a threshold. For instance, under the assumption of multivariate normality, the squared distance d^2 follows a Chi-square distribution with n degrees of freedom. One can pick a high quantile of this distribution (such as the 0.99 or 0.999 quantile) to define a cutoff. Any point with a distance beyond this cutoff is labeled as an outlier.
Challenges and Considerations
Large dimensionality can lead to a near-singular covariance matrix, making Sigma^{-1} difficult to compute.
If the data does not follow a multivariate Gaussian distribution, the assumption of Chi-square thresholds might not be strictly valid, although it can still provide a practical heuristic.
If correlations in features are not accurately captured, the measure may fail to discriminate true outliers from the rest.
Example Implementation in Python
Below is a brief code snippet to illustrate how one might compute Mahalanobis distance for outlier detection:
import numpy as np
from scipy.stats import chi2

def mahalanobis_outlier_detection(X, alpha=0.99):
    """
    X: data array of shape (num_samples, num_features)
    alpha: confidence level to determine outlier threshold
    """
    mu = np.mean(X, axis=0)           # mean vector
    cov = np.cov(X, rowvar=False)     # covariance matrix
    cov_inv = np.linalg.inv(cov)      # inverse covariance matrix

    distances = []
    for x in X:
        diff = x - mu
        d_sq = np.dot(np.dot(diff, cov_inv), diff.T)
        distances.append(d_sq)

    # Chi-square threshold for outlier detection
    threshold = chi2.ppf(alpha, X.shape[1])

    # Label points as outliers if distance > threshold
    outlier_labels = [1 if d_sq > threshold else 0 for d_sq in distances]

    return np.array(distances), np.array(outlier_labels), threshold
This code calculates the squared Mahalanobis distance for each point in X, then uses a Chi-square distribution to derive a cutoff. Observations whose distance exceeds this threshold are identified as outliers.
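As a quick usage sketch (hypothetical data; it assumes the mahalanobis_outlier_detection function above is in scope), one might generate Gaussian samples, inject a few distant points, and check how many get flagged:

import numpy as np

rng = np.random.default_rng(42)
X_normal = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=500)
X_far = rng.multivariate_normal(mean=[8, 8, 8], cov=np.eye(3), size=5)   # injected outliers
X = np.vstack([X_normal, X_far])

distances, labels, threshold = mahalanobis_outlier_detection(X, alpha=0.99)
print("threshold:", threshold)
print("points flagged:", labels.sum())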
How It Differs from Euclidean Distance
Euclidean distance measures the straight-line distance between a point and the mean, treating every dimension equally; it does not account for differing feature scales or correlations among features. Consequently, it may inflate or understate the actual deviation when features are highly correlated or measured on different scales. Mahalanobis distance addresses this by incorporating the covariance structure, highlighting anomalies more effectively.
Potential Instability in High Dimensions
When the number of features is large relative to the number of samples, the sample covariance matrix can become singular or near-singular. This leads to numerical instability when inverting the covariance matrix. Possible solutions involve:
Regularizing the covariance matrix (e.g., adding a small value lambda to the diagonal).
Using dimensionality reduction techniques such as PCA before computing the distance.
Employing robust covariance estimators that are less sensitive to outliers during the estimation process.
The Significance of the Threshold
A typical approach sets the threshold based on a theoretical Chi-square distribution if the data is approximately Gaussian. However, real-world data may deviate from Gaussian assumptions. Practitioners often choose thresholds empirically by cross-validation or by investigating the distribution of distances.
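A minimal sketch of an empirical, percentile-based cutoff (synthetic data; the 99th percentile is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d_sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances

# Flag the most extreme 1% of points instead of relying on a Chi-square quantile.
threshold = np.percentile(d_sq, 99)
outliers = d_sq > threshold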
What if Data is Multi-modal?
If data arises from multiple underlying distributions, a single global covariance matrix and mean might not capture the true structure. One can adopt:
A mixture model approach, fitting multiple means and covariance matrices (a minimal sketch follows this list).
Local outlier detection methods where covariance matrices are computed for local neighborhoods.
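A hedged sketch of the mixture-model option, using scikit-learn's GaussianMixture and scoring each point by its squared Mahalanobis distance to the closest fitted component (the two-cluster data here is purely illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 3)),
               rng.normal(5, 1, size=(200, 3))])   # two modes

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Squared Mahalanobis distance of every point to every component,
# using that component's own mean and precision (inverse covariance).
d_sq = np.empty((X.shape[0], gmm.n_components))
for k in range(gmm.n_components):
    diff = X - gmm.means_[k]
    d_sq[:, k] = np.einsum("ij,jk,ik->i", diff, gmm.precisions_[k], diff)

min_d_sq = d_sq.min(axis=1)   # score each point by its nearest component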
Addressing Data Without a Gaussian Assumption
Mahalanobis distance can still be applied in non-Gaussian contexts, but the direct Chi-square threshold might be less reliable. One might rely on percentile-based cutoffs from the empirical distribution of distances or use other robust methods.
Additional Questions
Could there be numerical issues with large covariance matrices?
Yes. In high-dimensional settings or with limited data, the sample covariance matrix might be singular or nearly singular. Regularization is usually done by adding a small identity term to the covariance matrix to ensure invertibility. Dimensionality reduction can also help mitigate these issues.
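A minimal sketch of diagonal regularization (the ridge value lambda here is an arbitrary illustrative choice; scikit-learn's LedoitWolf estimator is a more principled shrinkage alternative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))            # fewer samples than features: singular sample covariance

cov = np.cov(X, rowvar=False)
lam = 1e-3                               # small ridge term, problem-dependent
cov_reg = cov + lam * np.eye(cov.shape[0])
cov_reg_inv = np.linalg.inv(cov_reg)     # invertible after regularization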
How do we handle missing values when calculating Mahalanobis distance?
One approach is to impute missing data before computing the distance. Alternatively, methods that directly handle missing data in the covariance estimation step can be used. Care should be taken that imputation does not distort the covariance structure.
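A minimal sketch of the imputation route (mean imputation via scikit-learn's SimpleImputer; note that this shrinks variances and can distort the covariance structure, as cautioned above):

import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [2.0, np.nan],
                      [1.5, 2.5]])

X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
# Mahalanobis distances can then be computed on X_filled as usual.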
What happens if the features have very different scales?
Mahalanobis distance naturally accounts for scaling through the covariance matrix. When a feature has a large variance, the distance penalizes anomalies less in that dimension. If features vary by several orders of magnitude, standard normalization or standardization is still recommended before covariance estimation to stabilize the computations.
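A short sketch of standardizing before covariance estimation (the scales here are made up to illustrate the point):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# One feature in small units, one spanning millions.
X = np.column_stack([rng.normal(0, 1, 500),
                     rng.normal(0, 1e6, 500)])

X_std = StandardScaler().fit_transform(X)
cov = np.cov(X_std, rowvar=False)   # numerically better-conditioned than on raw scales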
How can we validate the chosen threshold?
One possibility is to use a validation dataset with known outliers (if available). Another is to use domain knowledge. Practitioners might analyze the distribution of computed distances and choose a cutoff that best isolates unusual points, iterating and checking the results for reasonableness.
Below are additional follow-up questions
How do we handle situations where outliers themselves strongly affect the estimated mean and covariance matrix?
When using Mahalanobis distance for outlier detection, the estimates for mean (mu) and covariance (Sigma) are often derived from the entire dataset. If there are extreme outliers, they can exert a disproportionately large influence on both mu and Sigma, thus skewing the results. A robust approach to mitigate this effect is to use robust estimators:
Robust Mean and Covariance Estimation: Instead of a simple arithmetic mean, methods like the median or other robust statistics can be used. Similarly, robust covariance estimation techniques (e.g., Minimum Covariance Determinant) help downweight the impact of outliers; a brief sketch appears below.
Iterative Outlier Removal: One iterative approach is to compute an initial estimate of mean and covariance, detect outliers, remove them, and then recalculate mean and covariance on the reduced data. Care must be taken not to remove true data points incorrectly in the early stages of iteration.
Regularization: Adding a small constant to the diagonal of the covariance matrix can reduce the effect of a few extreme points but does not fully solve the problem of misestimated means.
The key pitfall is that naive estimation can cause the normal points to appear more outlier-like or outliers to be missed when the distribution is heavily tailed or contains many extreme points.
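A brief sketch of the robust-estimation route, using scikit-learn's MinCovDet (Minimum Covariance Determinant); the synthetic data is illustrative:

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 4)),
               rng.normal(10, 1, size=(10, 4))])   # a few extreme points

mcd = MinCovDet(random_state=0).fit(X)       # robust location and scatter
robust_d_sq = mcd.mahalanobis(X)             # squared distances under the robust fit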
How do we adapt Mahalanobis distance for data streams or dynamically changing distributions?
In streaming scenarios or non-stationary environments, recalculating the full covariance matrix and its inverse every time new data arrives can be computationally expensive and may not reflect the most recent underlying distribution:
Incremental Mean and Covariance Updates: Algorithms like the online update of mean and covariance can be used. This involves updating estimates on-the-fly without reprocessing the entire dataset (see the sketch after this list).
Forgetting Factor: A forgetting factor (0 < factor < 1) can discount older data over time. This helps track a distribution that shifts and prevents the model from being overly influenced by outdated observations.
Sliding Window: Another approach is to maintain a fixed-size window of recent data for mean/covariance estimation. As new data comes in, old data is discarded. This helps adapt to changes but can cause abrupt shifts if the window size is too small.
An important pitfall here is choosing a suitable strategy for discarding or discounting old data. Too aggressive an approach can lead to volatile estimates, while being too conservative will lag behind the real-time changes.
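A minimal sketch of an incremental (Welford-style) mean/covariance update; a forgetting factor or sliding window would be layered on top of this basic recursion:

import numpy as np

class OnlineMeanCov:
    """Incrementally track the mean and covariance of a data stream."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.M2 = np.zeros((dim, dim))   # running sum of outer products of deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)

    def covariance(self):
        return self.M2 / (self.n - 1) if self.n > 1 else None

# Feed points one at a time; periodically invert covariance() to score new arrivals.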
How can we apply Mahalanobis distance if some features are categorical?
Mahalanobis distance relies on covariance matrices, which assume numerical (continuous) features:
Encoding Categorical Variables: One method is to convert categorical features to numerical (e.g., one-hot encoding or embeddings). However, one-hot encoding can increase dimensionality, leading to potential rank-deficiency in the covariance matrix.
Distance Definitions for Mixed Data: There are specialized methods to handle mixed data types. One approach is to compute separate distance measures for continuous and categorical parts and then combine them. Alternatively, advanced statistical methods (e.g., Gower distance) can incorporate both numeric and categorical features more directly.
Feature Engineering: Sometimes, categorical features can be transformed into meaningful continuous statistics (e.g., frequency encodings or target encodings). This transformation, while powerful, can introduce bias or leakage if done naively.
A hidden pitfall is that naive encoding of many category levels can inflate dimensionality and degrade the stability of covariance matrix inversion.
Can Mahalanobis distance identify multi-dimensional outliers that are not extreme along any single dimension but are unusual in combination?
Yes. Mahalanobis distance is well suited to spotting points that appear normal along each dimension individually yet form an unusual combination when considered jointly:
Correlation-Aware Detection: Because it uses the covariance matrix, Mahalanobis distance can catch points that deviate along correlated directions.
Example: Suppose two features are strongly correlated (e.g., as one goes up, so does the other). If a point lies in a region where one feature is high while the other is low, it might be flagged as an outlier even if neither dimension individually is extremely large or small (a numeric sketch appears below).
A potential pitfall is that if the correlation structure is not estimated accurately (due to insufficient data or the presence of existing outliers), the advantage of multi-dimensional detection can be undermined.
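A small numeric sketch: with two strongly correlated features, a point that breaks the correlation pattern gets a much larger Mahalanobis distance than a point at the same Euclidean distance that follows the pattern.

import numpy as np

mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])      # strong positive correlation
cov_inv = np.linalg.inv(cov)

def d_sq(p):
    diff = p - mu
    return diff @ cov_inv @ diff

x = np.array([2.0, -2.0])   # breaks the correlation: high on one axis, low on the other
y = np.array([2.0, 2.0])    # same Euclidean norm, but consistent with the correlation

print(d_sq(x))   # 80.0  -> flagged
print(d_sq(y))   # ~4.2  -> unremarkable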
How do we ensure that the covariance matrix remains invertible or well-conditioned in low-sample or high-feature scenarios?
When the sample size is smaller than or comparable to the number of features, the sample covariance matrix may become singular or ill-conditioned:
Dimensionality Reduction: Techniques such as PCA can be applied prior to computing the covariance, reducing the feature space to a stable dimension (a sketch appears below).
Regularization: Adding a small term lambda to the diagonal of the covariance matrix (i.e., cov + lambda * I) ensures the matrix is invertible.
Sparsity-Inducing Methods: Some estimators encourage a sparse covariance matrix by penalizing off-diagonal elements, beneficial for high-dimensional data.
A subtle challenge is how to choose the dimension to which we reduce the data or the amount of regularization. Over-reducing or over-regularizing can strip away important signals, while under-regularizing leaves us with an ill-conditioned problem.
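A sketch of the PCA route: with whiten=True, the retained components have unit variance, so the squared Euclidean norm in the projected space approximates the Mahalanobis distance restricted to those directions (the component count here is an arbitrary choice):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))      # many features relative to the sample size

pca = PCA(n_components=10, whiten=True)
Z = pca.fit_transform(X)             # centered, decorrelated, unit-variance components
approx_d_sq = np.sum(Z**2, axis=1)   # approximate squared Mahalanobis distance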
Is it feasible to use Mahalanobis distance when data is highly non-Gaussian or heavy-tailed?
Mahalanobis distance is most directly linked with the assumption of (at least approximately) multivariate Gaussian data, because thresholds are typically based on the Chi-square distribution. However, real-world data often exhibit heavy tails or are far from Gaussian:
Empirical Thresholding: Instead of relying on Chi-square quantiles, one can compute the empirical distribution of distances and use a percentile-based cutoff.
Transformations: Sometimes, transforming the data (e.g., using log transforms for skewed data) can bring it closer to normality, making the Gaussian-based assumptions more reasonable.
Robust Estimators for Non-Gaussian Data: Methods like M-estimators and other robust covariance estimates help account for heavier tails. They can still output a covariance matrix for use in Mahalanobis distance, but the threshold might be chosen differently.
One pitfall is ignoring the distribution’s shape entirely and applying a Chi-square-based threshold, leading to either too many or too few outliers flagged.
How does one interpret or visualize Mahalanobis distance results for high-dimensional data?
Interpreting outliers in high-dimensional space can be challenging:
Distance Histogram or QQ-Plot: Plot the distribution of Mahalanobis distances for all points. If the data is approximately Gaussian, distances should align with the theoretical Chi-square quantiles in a QQ-Plot.
Dimensionality Reduction for Visualization: Applying t-SNE, UMAP, or PCA to project data to 2D/3D can help visualize outlier points flagged by Mahalanobis distance.
Feature Contribution: By examining each dimension’s contribution to the quadratic form (x - mu)^T Sigma^{-1} (x - mu), one can glean which features contribute the most. This can be done by partial derivatives or by local approximations of the distance measure (a small helper is sketched below).
A key pitfall is relying solely on raw Mahalanobis distance numbers without investigating which features are driving an observation to be labeled as an outlier, leading to limited explainability.
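A small helper, as one possible attribution scheme: split the quadratic form into per-feature terms diff[j] * (Sigma^{-1} diff)[j], which sum exactly to d^2 (individual terms can be negative because of correlations):

import numpy as np

def feature_contributions(x, mu, cov_inv):
    """Additive per-feature decomposition of the squared Mahalanobis distance."""
    diff = x - mu
    contrib = diff * (cov_inv @ diff)   # contrib.sum() equals the squared distance
    return contrib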
How can we handle real-time alerts or feedback loops in practical systems?
In real-world applications, once an outlier is detected, the system often triggers alerts or decisions:
Human-in-the-Loop: Analysts or domain experts can provide feedback, confirming whether flagged points are truly anomalous. Over time, this feedback can refine or retrain the outlier detection model.
Adaptive Thresholds: After a pattern of false positives or missed outliers is observed, thresholds or the estimation procedure can be tuned.
Drift Detection: If outliers become frequent in a particular region of the feature space, it may indicate that the distribution itself has shifted, and the model must be retrained or updated.
A subtlety arises when the system’s behavior changes in response to outliers. For example, if an outlier triggers a process that modifies future data collection, the distribution can shift further, creating a feedback loop that must be accounted for.
Does weighting samples differently impact Mahalanobis distance?
In certain scenarios, not all points are equally important. For example, recent samples might be more relevant, or some classes might need stronger emphasis:
Weighted Covariance: One can compute a weighted covariance matrix where each sample contributes in proportion to its assigned weight. The formula for mean and covariance estimation is adjusted accordingly (a sketch follows this list).
Practical Use Cases: Assigning higher weight to recent samples in a time series or giving lower weight to uncertain data.
Impact on Inversion: Weighted covariance can be better-conditioned if carefully handled, but if extremely large or small weights are used, the matrix might still be difficult to invert.
A potential pitfall is incorrectly applying weights in the covariance matrix calculation, leading to inconsistent estimates. Ensuring the sum of weights is appropriately normalized and that no single point dominates is crucial.
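A sketch of a weighted estimate using numpy's built-in observation weights (the exponential-decay weights are a hypothetical choice emphasizing recent samples):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

w = 0.99 ** np.arange(X.shape[0])[::-1]   # newer samples (later rows) get larger weights
w /= w.sum()                              # normalize so no single weight dominates

mu_w = np.average(X, axis=0, weights=w)
cov_w = np.cov(X, rowvar=False, aweights=w)   # weighted covariance
cov_w_inv = np.linalg.inv(cov_w)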