ML Interview Q Series: How would you distinguish between out-of-distribution detection and anomaly detection?
Comprehensive Explanation
Out-of-distribution (OOD) detection and anomaly detection each deal with identifying inputs that deviate from what a model considers “normal.” However, these two tasks differ primarily in the nature of what constitutes “normal”:
Out-of-Distribution Detection
This process checks whether an incoming sample belongs to the same overall distribution from which the model’s training data were drawn. If the new sample is likely not from that training distribution, it is classified as out-of-distribution. The goal is to guard against scenarios in which the model encounters data fundamentally different from anything it saw during training (for example, a classifier trained on images of cars, trucks, and bikes suddenly being fed images of fruit).
Common approaches for out-of-distribution detection include:
Likelihood-based methods, which use a density estimator or probabilistic model to check whether the estimated probability density p(x) of the new sample falls below a certain threshold.
Confidence-based methods, in which the model’s predictive confidence for an input is compared to a threshold to decide whether that input is foreign to the training distribution (a minimal sketch of this idea appears after this list).
Feature representation methods, which embed inputs into a latent space and detect whether those embeddings deviate significantly from the distribution of embeddings of known classes.
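As an illustration of the confidence-based idea, here is a minimal sketch that thresholds the maximum softmax probability of a classifier. The logits are made-up values standing in for a real model’s outputs, and the 0.7 cutoff is purely illustrative.
import numpy as np
# Hypothetical logits from a trained classifier for three inputs (illustrative values only)
logits = np.array([[8.2, 0.1, -1.3],   # confidently one class
                   [0.4, 0.3, 0.2],    # low confidence -> possibly OOD
                   [5.0, 4.9, -2.0]])  # ambiguous between two classes
# Convert logits to class probabilities with a numerically stable softmax
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
# Use the maximum softmax probability as a simple confidence score
confidence = probs.max(axis=1)
# Flag inputs whose confidence falls below an illustrative cutoff
confidence_threshold = 0.7
for c in confidence:
    print(f"confidence={c:.2f} ->", "possibly OOD" if c < confidence_threshold else "in-distribution")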
Anomaly (or Outlier) Detection
Anomaly detection focuses on spotting data points that do not conform to the expected patterns or behaviors in the dataset in which the model operates. Here, the assumption is that anomalies are rare but still part of the broader domain. In other words, an anomaly is typically an extreme or rare event within the known distribution, rather than a sample coming from a separate distribution altogether.
Common methods for anomaly detection include:
Statistical approaches (z-scores, robust statistics) to isolate points that deviate significantly from the bulk of the data (a brief sketch follows this list).
Density-based clustering methods (like DBSCAN) to identify points in low-density regions as anomalies.
Autoencoder-based reconstruction error, in which anomalies are detected by checking whether the reconstruction loss is abnormally high.
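As a minimal sketch of the z-score approach, on synthetic one-dimensional data and with an illustrative cutoff of 3 standard deviations:
import numpy as np
# Synthetic data: mostly typical values plus a couple of extremes
values = np.concatenate([np.random.normal(loc=50.0, scale=5.0, size=500), [95.0, 5.0]])
# Standard z-scores relative to the sample mean and standard deviation
z_scores = (values - values.mean()) / values.std()
# Points more than 3 standard deviations from the mean are flagged as anomalies
anomalies = values[np.abs(z_scores) > 3]
print("Flagged values:", anomalies)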
A key difference is that out-of-distribution detection tends to emphasize the distribution mismatch (“Is this data even from the same universe as my training set?”), whereas anomaly detection focuses on the rarities or unusual patterns within the known domain (“Is this data still from my distribution but just highly unusual?”).
Example of a Probability-Based Criterion
One way to detect OOD samples or anomalies is to learn a probability distribution p(x) of the training data (for instance, using a variational autoencoder or a normalizing flow) and set a threshold epsilon. The decision rule is then to flag any sample x for which p(x) < epsilon; depending on the context, the flag is interpreted as out-of-distribution or anomalous.
Here p(x) is the estimated probability density of input x under the training data distribution, and epsilon is a user-defined cutoff, typically chosen based on the desired trade-off between false positives and false negatives. A sample whose p(x) falls below epsilon is flagged because it is judged not to be well explained by the trained distribution.
Practical Python Example
Below is a simplified code snippet using a density estimation approach for anomaly detection. One could adapt it for OOD detection by changing how the density model is trained (perhaps using additional known in-distribution data).
import numpy as np
from sklearn.mixture import GaussianMixture
# Suppose we have training data representing the normal distribution
X_train = np.random.normal(loc=0.0, scale=1.0, size=(1000, 2))
# Fit a simple Gaussian Mixture Model
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(X_train)
# Suppose we have new data to test
X_test = np.array([[10, 10], [0.5, -0.1], [2, 2]])
# Compute log-likelihood of the test samples
scores = gmm.score_samples(X_test)
# Define a threshold
threshold = -10
# Flag data points that fall below the threshold
results = ["Anomaly/Out-of-Distribution" if s < threshold else "In-Distribution" for s in scores]
for x, r in zip(X_test, results):
    print(f"Data point {x} is {r}")
In this code, GaussianMixture is used to model the distribution of the training data. For each test sample, a log-likelihood score is computed, and values below threshold are flagged as “Anomaly/Out-of-Distribution.” By adjusting threshold, or by changing how the model is fit, the same mechanism can be applied to either OOD detection (if we have a well-defined training distribution and want to reject anything not from that distribution) or anomaly detection (if we treat the training data as mostly normal with rare anomalies).
Common Follow-up Questions
How do you choose an appropriate threshold for distinguishing between in-distribution vs. out-of-distribution?
Threshold selection often depends on the trade-off between false positives (flagging something as OOD when it is in-distribution) and false negatives (failing to flag something as OOD when it truly is). Techniques include:
Using validation data from a known in-distribution set to estimate typical probability scores (a short sketch of this follows the list).
Introducing a small sample of data from distributions you want to reject, if possible, to determine typical out-of-distribution scores.
Employing metrics such as the Receiver Operating Characteristic (ROC) curve or Precision-Recall curves to find a threshold that meets operational requirements.
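For example, a rough sketch of the validation-based approach, assuming you already have log-likelihood scores for held-out in-distribution data (simulated here), places the cutoff at a chosen false-positive rate:
import numpy as np
# Simulated log-likelihood scores of held-out in-distribution validation samples
val_scores = np.random.normal(loc=-2.0, scale=1.0, size=1000)
# Place the cutoff at the 5th percentile, so roughly 5% of in-distribution data
# would be flagged (an approximate 5% false-positive rate on the validation set)
threshold = np.percentile(val_scores, 5)
print(f"Chosen threshold: {threshold:.2f}")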
What if the training data itself contains anomalies?
If anomalies exist in the training data, the anomaly detection system’s notion of “normal” becomes corrupted. Steps to address this include:
Using robust statistics or an iterative process to identify and remove significant outliers before final model training.
Employing semi-supervised approaches if labels for normal vs. abnormal data are partially available.
Monitoring internal metrics, such as reconstruction error (for autoencoders) or mixture model component responsibilities, to detect and exclude outliers in the training phase itself.
How do real-world domain shifts affect out-of-distribution detection?
In practice, domain shifts often happen gradually rather than abruptly. For example, a camera feed may slowly change lighting conditions over time. If the shift is slow, a strict OOD approach might repeatedly flag new data as out-of-distribution. Solutions include:
Periodic model re-training or fine-tuning on newly collected data to update the in-distribution representation.
Incorporating drift detection methods that distinguish between incremental shifts and abrupt distribution changes.
Can the same model be used for anomaly detection and OOD detection?
Yes, in many cases the underlying algorithm for computing a “degree of normality” is similar. For anomaly detection, the emphasis is on rare events within the training distribution. For OOD detection, the emphasis is on identifying distributions that differ entirely from training. By adjusting how the model is trained and how you sample your training data, you can shift the model’s focus from anomalies within a distribution to unknown distributions beyond the training data.
How would you design an experiment to measure the performance of an OOD detector?
You typically gather:
Data from the in-distribution (the same distribution used to train the model).
Data from one or more external distributions that the model should reject.
Steps:
Train your OOD detector on the in-distribution only (or a combination of in-distribution and known “near OOD” data).
Evaluate the True Positive Rate of detecting OOD samples across multiple external distributions and the False Positive Rate on the in-distribution data.
Use metrics like AUROC (Area Under the ROC Curve) to summarize performance (a brief scoring sketch appears at the end of this answer).
This approach helps demonstrate how robustly your model rejects truly out-of-distribution examples while not over-flagging valid samples.
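A compact sketch of this evaluation, assuming you already have detector scores for in-distribution and OOD test sets (simulated here), can summarize separability with scikit-learn’s roc_auc_score:
import numpy as np
from sklearn.metrics import roc_auc_score
# Simulated detector scores: higher means "more in-distribution"
id_scores = np.random.normal(loc=0.0, scale=1.0, size=500)    # in-distribution test data
ood_scores = np.random.normal(loc=-3.0, scale=1.0, size=500)  # out-of-distribution test data
# Label in-distribution as 1 and OOD as 0, then compute AUROC over the combined scores
labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
scores = np.concatenate([id_scores, ood_scores])
print("AUROC:", roc_auc_score(labels, scores))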
Why might a model assign a high likelihood to out-of-distribution samples?
Some models, particularly certain deep generative models (like VAEs trained on complex datasets), can assign high likelihood scores even to samples quite different from their training distribution. This can occur due to:
The mismatch between how the likelihood is computed and how humans perceive similarity.
Particular biases in the generative model architecture or loss function.
Mitigation strategies:
Incorporate more sophisticated scoring functions (for example, combining reconstruction error and latent space activation statistics).
Add explicit density-ratio techniques (for example, comparing the likelihood under the trained model with that under a background model) to reduce miscalibration, as sketched below.
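To make the density-ratio idea concrete, here is a rough sketch under strong assumptions: a foreground Gaussian mixture is fit on the in-distribution data, a background mixture is fit on a noise-perturbed copy, and the difference of log-likelihoods serves as the score. This is only an illustration, not a specific published method.
import numpy as np
from sklearn.mixture import GaussianMixture
# In-distribution training data and a heavily perturbed "background" version of it
X_train = np.random.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_background = X_train + np.random.normal(loc=0.0, scale=2.0, size=X_train.shape)
# Fit a foreground model on the clean data and a background model on the perturbed data
fg = GaussianMixture(n_components=2, random_state=0).fit(X_train)
bg = GaussianMixture(n_components=2, random_state=0).fit(X_background)
# Likelihood-ratio style score: how much better the foreground model explains a sample
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
ratio_scores = fg.score_samples(X_test) - bg.score_samples(X_test)
print(ratio_scores)  # low values suggest the sample is not specific to the in-distribution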
These discussions deepen the understanding of how OOD detection and anomaly detection relate, how they differ, and the best practices for deploying them in real-world scenarios.
Below are additional follow-up questions
How do you handle partial or uncertain labels in anomaly detection?
Sometimes you have a small number of labeled normal samples and few (or no) labeled anomalies. In such cases, you can adopt semi-supervised approaches that incorporate partial label information about what is definitely normal. One tactic involves training a model on confidently “normal” data to learn an approximate representation of expected behaviors and then identifying anomalies as deviations from this representation. Another practical solution is active learning, where the model flags high-uncertainty points for human annotation. Over multiple iterations, this refines the labeled dataset and reduces misclassification of borderline cases.
A typical pitfall arises when the few labeled anomalies fail to represent all possible abnormal scenarios. The model might overfit to known anomaly types and overlook unseen anomalies that differ significantly. To mitigate this, you can:
Regularly augment your training set with newly discovered anomalies once they are confirmed as truly anomalous.
Employ methods like one-class SVMs or deep one-class classification that do not rely heavily on anomaly labels, but rather learn a boundary around normal data.
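A brief sketch of the one-class SVM option, with synthetic data standing in for the confidently “normal” samples and an illustrative nu value:
import numpy as np
from sklearn.svm import OneClassSVM
# Train only on data assumed to be normal
X_normal = np.random.normal(loc=0.0, scale=1.0, size=(500, 2))
oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_normal)
# predict() returns +1 for inliers and -1 for outliers
X_test = np.array([[0.2, 0.1], [4.5, -4.0]])
print(oc_svm.predict(X_test))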
How do multi-modal distributions complicate out-of-distribution or anomaly detection?
Multi-modal data distributions contain multiple clusters or “modes” within the same domain. This can make it challenging for models that assume unimodality (like a single Gaussian). If the model focuses on one dominant cluster, it may incorrectly label valid data from a different cluster as anomalous or out-of-distribution.
To address these challenges:
You can use mixture models or autoencoders that can capture complex, multi-modal behavior in the latent space.
You can cluster the data (for example, with K-Means or DBSCAN) to identify multiple modes and train separate detectors, each specialized for a specific mode. By comparing a new sample’s similarity to each mode, you can decide whether it fits well with at least one cluster (see the sketch after this list).
Consider domain knowledge on how many distinct modes are expected (such as multiple operating conditions for industrial machinery). This provides guardrails on modeling and helps calibrate the approach.
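As a small sketch of the per-mode idea, assuming domain knowledge suggests two modes, one can fit K-Means and flag only points that are far from every cluster center (the 99th-percentile radius below is an illustrative choice):
import numpy as np
from sklearn.cluster import KMeans
# Synthetic two-mode data, e.g., two distinct operating conditions
mode_a = np.random.normal(loc=0.0, scale=1.0, size=(500, 2))
mode_b = np.random.normal(loc=8.0, scale=1.0, size=(500, 2))
X_train = np.vstack([mode_a, mode_b])
# Assume we expect two modes
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
# Distance of each training point to its nearest centroid defines a "normal" radius
train_dist = np.min(kmeans.transform(X_train), axis=1)
radius = np.percentile(train_dist, 99)
# A test point is flagged only if it is far from every mode
X_test = np.array([[0.5, -0.3], [4.0, 4.0], [8.2, 7.9]])
test_dist = np.min(kmeans.transform(X_test), axis=1)
print(test_dist > radius)  # True means the point fits no known mode well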
How do you handle real-time or online out-of-distribution detection?
Many practical applications, like fraud detection or sensor monitoring, require real-time alerts as data arrives continuously. Online OOD detection or anomaly detection differs from the offline setting in that data must be processed sequentially with minimal latency and possibly with concept drift over time.
Typical methods:
Incrementally updating models (for example, updating a Gaussian Mixture Model’s parameters or using online clustering algorithms) so they stay current with the latest distribution.
Sliding window approaches, where you maintain a window of recent data points to continuously re-estimate normal behavior (a minimal sketch appears at the end of this answer).
Drift detection methods (like ADWIN or DDM) that signal a significant distribution shift. Once detected, you can either retrain from scratch or adapt the existing model.
A major pitfall is that real-time detection systems may quickly become outdated if the data distribution shifts faster than the model can adapt. Balancing stability (not constantly retraining on fleeting outliers) with plasticity (updating in response to genuine shifts) is crucial.
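A minimal hand-rolled sketch of the sliding-window idea (not a library drift detector) keeps a fixed-size window of recent values and scores each new point against it:
import numpy as np
from collections import deque
window = deque(maxlen=200)  # the most recent 200 observations define "normal"
def is_anomalous(value, z_cutoff=3.0):
    """Flag a new value if it deviates strongly from the recent window, then add it to the window."""
    flagged = False
    if len(window) >= 30:  # wait until the window holds enough history
        mean, std = np.mean(window), np.std(window)
        if std > 0 and abs(value - mean) / std > z_cutoff:
            flagged = True
    window.append(value)
    return flagged
# Simulated stream: stable readings followed by a sudden spike
stream = list(np.random.normal(loc=10.0, scale=1.0, size=300)) + [25.0]
flags = [is_anomalous(v) for v in stream]
print("Anomalies at positions:", [i for i, f in enumerate(flags) if f])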
How do you evaluate anomaly detection systems beyond simple metrics like accuracy?
Accuracy can be misleading in anomaly detection if anomalies represent a tiny fraction of all data. A system might score high accuracy simply by labeling almost everything as normal. More informative metrics include:
Precision-Recall for anomalies, emphasizing how many flagged events are truly anomalous (precision) and how many true anomalies are detected (recall).
Confusion matrix analysis focusing on false positives (costly if you frequently investigate false alarms) and false negatives (dangerous if anomalies go unnoticed).
Area Under the Precision-Recall Curve (AUPRC) or Area Under the ROC Curve (AUROC), which track performance across varying thresholds (a short scoring sketch appears at the end of this answer).
When anomalies have varying severity (for example, mild vs. critical anomalies), you can incorporate domain-specific cost functions. In such a scenario, mislabeling a severe anomaly can be assigned a higher penalty. A hidden pitfall is ignoring the domain context and business consequences: a single missed catastrophic anomaly can be worse than many false alarms.
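A short sketch of precision-recall style evaluation, assuming binary ground-truth labels and anomaly scores are available (both simulated here), using scikit-learn:
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score
# Simulated ground truth (1 = anomaly, deliberately rare) and anomaly scores (higher = more anomalous)
y_true = np.concatenate([np.zeros(980), np.ones(20)])
scores = np.concatenate([np.random.normal(loc=0.0, scale=1.0, size=980),
                         np.random.normal(loc=3.0, scale=1.0, size=20)])
# Precision and recall across all thresholds, summarized by average precision (AUPRC)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print("AUPRC (average precision):", average_precision_score(y_true, scores))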
How can domain knowledge be integrated effectively into out-of-distribution or anomaly detection?
Domain knowledge helps refine which features or representations are most relevant, how to set thresholds, and which patterns are truly critical. For instance, in medical imaging, clinicians might know typical ranges for certain biomarkers, enabling more targeted anomaly detection.
Practical integration strategies:
Feature engineering guided by domain expertise, ensuring the model sees variables most indicative of normal vs. anomalous states.
Custom loss functions where domain insights dictate that certain anomalies carry higher costs, thus emphasizing detection.
Post-processing steps to filter or re-rank flagged anomalies based on known constraints or business rules.
A subtle pitfall is over-reliance on domain knowledge that might be incomplete or outdated. Overly prescriptive domain rules can blind the algorithm to emerging or previously unknown patterns, reducing its ability to catch novel anomalies.
How do you handle high-dimensional data where anomalies may be subtle?
When dealing with high-dimensional data (e.g., large-scale IoT sensor arrays, gene expression profiles, or unstructured text), anomalies might be localized to small subspaces. Traditional distance-based approaches can suffer from the “curse of dimensionality,” making distances less meaningful.
Potential strategies:
Dimensionality reduction via autoencoders or principal component analysis to capture core data structure. Anomalies often have a higher reconstruction error or fall outside the learned manifold in latent space.
Local methods such as Local Outlier Factor, which measure density relative to neighbors in a local region (a brief sketch appears at the end of this answer).
Deep learning approaches leveraging specialized architectures (like CNNs for images or RNNs for time-series) to automatically learn feature embeddings, with subsequent detection of deviations in that lower-dimensional representation.
A common pitfall is incorrectly choosing a single global dimension reduction technique for data that might have multiple distinct sub-manifolds. If your dimensionality reduction merges these manifolds, you may inadvertently label valid but less common points as anomalies. Testing on comprehensive, well-sampled data from each sub-manifold helps mitigate this.
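As a brief sketch of the local-density route, Local Outlier Factor in novelty mode can score unseen points relative to the density of the training data (the dimensionality and neighbor count here are illustrative):
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Moderately high-dimensional synthetic "normal" data
X_train = np.random.normal(loc=0.0, scale=1.0, size=(1000, 20))
# novelty=True lets the fitted model score unseen points
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
# Three typical points and one shifted point
X_test = np.vstack([np.random.normal(loc=0.0, scale=1.0, size=(3, 20)),
                    np.random.normal(loc=5.0, scale=1.0, size=(1, 20))])
print(lof.predict(X_test))  # +1 for inliers, -1 for outliers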
How do you handle label noise or inconsistent labeling in the training set for anomaly detection?
When labels are noisy, the model’s notion of normal or anomalous behavior can become compromised. This could happen if anomalies are mislabeled as normal or vice versa, especially when the labeling process is manual or relies on heuristics.
Possible solutions:
Use robust training objectives, such as a noise-tolerant loss function, or ensemble methods where multiple detectors vote on whether a point is anomalous (a small disagreement-based sketch appears at the end of this answer).
Implement a data-cleaning phase in which suspiciously labeled samples are re-examined. For example, high disagreement between multiple models can signal that a label should be rechecked.
Use active learning to query human experts for the most dubious cases, iteratively refining the dataset.
One subtlety is that not all “noise” is truly incorrect labeling. Sometimes borderline points do not clearly belong to normal or anomalous classes. Properly managing these edge cases often involves threshold tuning or adopting probabilistic anomaly scores rather than binary decisions.
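One small sketch of the disagreement idea uses two detectors with different assumptions, neither of which needs anomaly labels, and surfaces the training points on which they disagree so their labels can be rechecked:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
# Training data whose labels may be noisy or inconsistent
X_train = np.random.normal(loc=0.0, scale=1.0, size=(500, 2))
# Two detectors with different assumptions; each returns +1 (inlier) or -1 (outlier)
iso = IsolationForest(random_state=0).fit(X_train)
env = EllipticEnvelope(random_state=0).fit(X_train)
pred_iso = iso.predict(X_train)
pred_env = env.predict(X_train)
# Points where the detectors disagree are candidates for label re-examination
disagreement = np.where(pred_iso != pred_env)[0]
print("Indices to recheck:", disagreement[:10])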