ML Interview Q Series: How do outlier detection and novelty detection differ, and how are they each typically applied in real-world scenarios?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
Outlier detection is primarily concerned with identifying unusual samples within data that was already available when building a model. These data points deviate significantly from the majority of observations but belong to the same overall distribution as the rest of the training data. Novelty detection, by contrast, focuses on recognizing completely new or unseen patterns that were not part of the training set’s distribution at all. Below is a more detailed discussion of how they differ and when each might be used.
Outlier Detection
Outlier detection usually assumes that all training data (including potential anomalies) come from the same time period or the same generation process. The goal is to single out points that appear inconsistent or too extreme relative to the bulk of observations. This approach is commonly applied to historical datasets where anomalies might exist, and you want to catch them (for instance, identifying fraudulent transactions in a large dataset of past user transactions).
Key aspects:
The training set is assumed to contain outliers.
Methods often rely on the idea that most data points come from a single distribution, and outliers are rare and deviant from it.
Techniques can be distance-based (e.g., Mahalanobis distance, k-nearest neighbors distance) or density-based (e.g., Local Outlier Factor), or even isolation-based approaches (e.g., Isolation Forest).
Evaluation might leverage metrics like precision, recall, or area under the ROC curve, considering that these methods typically produce an outlier score for each data point.
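As a minimal illustration of the isolation-based and density-based techniques listed above, the sketch below fits scikit-learn's IsolationForest and LocalOutlierFactor on a small synthetic, deliberately contaminated dataset. The data, contamination level, and hyperparameters are illustrative assumptions rather than a recipe:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)

# Mostly "normal" points plus a few injected extremes (a contaminated dataset)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = rng.uniform(low=6, high=8, size=(5, 2))
X = np.vstack([normal, extremes])

# Isolation-based: anomalies tend to be isolated in fewer random splits
iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
iso_labels = iso.predict(X)        # -1 = outlier, +1 = inlier

# Density-based: compares the local density of a point to that of its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)    # -1 = outlier, +1 = inlier

print("Isolation Forest flagged:", np.sum(iso_labels == -1))
print("LOF flagged:", np.sum(lof_labels == -1))
```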
Novelty Detection
Novelty detection deals with identifying points that were not present in the training data. In these scenarios, the training set is generally considered to be “clean” and free from anomalies. The focus is on building an accurate model that captures the normal data’s distribution so that if something novel appears at inference time, the model can recognize it as being different.
Key aspects:
The training set is assumed to have no outliers or anomalous instances.
The objective is to detect a fundamentally different type of data point (something that doesn't fit the normal patterns).
Applied to real-time monitoring tasks, such as fault detection in manufacturing systems or detecting new types of cyber attacks.
Methods often model the structure of normal data (e.g., via one-class SVM, autoencoders that reconstruct normal data well but not anomalies).
Differences in Data Assumptions
One crucial difference is the assumption regarding data distribution. Outlier detection inherently deals with a contaminated dataset, which includes both normal and “outlier” samples. Novelty detection assumes a clean training set representing only the “normal” distribution. Hence, the modeling strategy for novelty detection is more about learning the normal distribution boundary, whereas outlier detection tries to find abnormal points in a dataset that may already be noisy or contaminated.
Practical Illustrations
Outlier Detection Example: You have thousands of credit card transactions with a small fraction of fraudulent ones mixed in. You suspect that outliers might be fraud. You train a model on the entire dataset, aiming to isolate suspicious transactions.
Novelty Detection Example: You manufacture parts that are always supposed to meet some tolerance. You collect data from normal operations only (no defective parts). Later, when a new type of defect arises, you want your model to flag it because it doesn’t fit the learned definition of “normal.”
Typical Techniques
Outlier Detection:
Isolation Forest: Randomly partitions the feature space; because outliers are few and different, they tend to be isolated in fewer random splits than normal points, which translates into higher anomaly scores.
Local Outlier Factor (LOF): Measures local density deviation of a given data point from its neighbors.
Mahalanobis Distance: Particularly useful if data is assumed to be Gaussian. If a point is at a large Mahalanobis distance from the mean, it may be an outlier.
The Mahalanobis distance measures how far a point lies from the mean of a multivariate distribution while accounting for the covariance structure. It is computed as:

$$D_M(x) = \sqrt{(x - \mu)^{T} \, \Sigma^{-1} \, (x - \mu)}$$

Here $x$ is the data point in question, $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and $(x - \mu)^{T}$ is the transpose of $(x - \mu)$. A larger distance typically indicates a higher likelihood that $x$ is an outlier.
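A minimal sketch of this computation in NumPy, assuming the mean and covariance are estimated from the (possibly contaminated) data itself; in practice a robust covariance estimate (e.g., scikit-learn's MinCovDet or EllipticEnvelope) is often preferred when the data may contain outliers:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))      # illustrative data, assumed roughly Gaussian

mu = X.mean(axis=0)                # mean vector
Sigma = np.cov(X, rowvar=False)    # covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Distance of a single point x from the mean under covariance Sigma."""
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

point = np.array([4.0, -4.0, 3.5])  # a deliberately extreme point
print("Mahalanobis distance:", mahalanobis(point, mu, Sigma_inv))
```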
Novelty Detection:
One-Class SVM: Learns a boundary around normal data; points lying outside this boundary at inference time are deemed novel.
Autoencoders: Neural networks that learn to reconstruct normal samples, resulting in a high reconstruction error for novel or anomalous points.
Example Implementation in Python
Below is a short Python code snippet illustrating a basic one-class SVM approach for novelty detection:
```python
import numpy as np
from sklearn.svm import OneClassSVM

# Suppose we have 'normal_data' only (no anomalies in the training set)
normal_data = np.random.rand(100, 2)  # 100 samples, 2 features

# Train a one-class SVM on the normal data
model = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1)
model.fit(normal_data)

# Now test on new data (which might be normal or novel)
test_data = np.random.rand(10, 2)
predictions = model.predict(test_data)

# -1 indicates outliers/novel points, +1 indicates inliers
print("Predictions on test data:", predictions)
```
The above code trains a one-class SVM on a “clean” normal_data dataset. During inference (testing), any new sample that doesn’t fit within the learned boundary is flagged (prediction output = -1).
Common Follow-Up Questions
How would you handle a scenario where you are unsure if your training dataset is entirely clean or not?
When the dataset may contain outliers, you have to balance the approach of outlier detection (which can discover anomalies in a contaminated dataset) with the stricter requirements of novelty detection (which typically needs a clean dataset). In some real-life scenarios, you might attempt a two-step process:
Apply outlier detection on the initial data to remove extreme anomalies.
Retrain a novelty detection model on the presumably cleaned data.
Alternatively, you might rely on robust methods that can handle moderate amounts of contamination without drastically distorting the boundary of normal data. A One-Class SVM (whose nu parameter bounds the fraction of training points treated as outliers) or a robustly trained autoencoder can sometimes tolerate minor contamination.
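A rough sketch of that two-step idea, using an Isolation Forest to filter suspected contaminants and then fitting a one-class SVM on the retained points; the thresholds and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(size=(300, 4)),         # mostly normal
               rng.uniform(5, 9, size=(10, 4))])  # a few contaminants

# Step 1: outlier detection to remove extreme anomalies from the training set
filter_model = IsolationForest(contamination=0.05, random_state=1).fit(X)
keep_mask = filter_model.predict(X) == 1          # keep only presumed inliers
X_clean = X[keep_mask]

# Step 2: novelty detection trained on the (presumably) cleaned data
novelty_model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X_clean)

new_batch = rng.normal(size=(5, 4))
print(novelty_model.predict(new_batch))           # -1 = novel, +1 = normal
```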
What performance metrics are best suited for comparing outlier detection vs. novelty detection?
Both tasks often use similar evaluation strategies, but the distribution of anomalies or novel points might differ:
For outlier detection, you typically measure precision and recall on labeled data with known anomalies.
For novelty detection, you might introduce synthetic anomalies or new unseen patterns, then measure how reliably the system flags them. Metrics such as F1-score, AUPRC (Area Under the Precision-Recall Curve), or AUROC (Area Under the Receiver Operating Characteristic Curve) are common.
In novelty detection, the normal training data can be well-characterized, so you often measure how well the model generalizes to unseen normal data while detecting genuine novelties.
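Assuming you have some labeled (or synthetically injected) anomalies to evaluate against, the usual scikit-learn metric functions can be applied directly to a model's continuous anomaly scores. A minimal sketch:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.RandomState(7)
X = np.vstack([rng.normal(size=(500, 3)),
               rng.uniform(5, 8, size=(25, 3))])
y_true = np.array([0] * 500 + [1] * 25)   # 1 = anomaly (labels used for evaluation only)

model = IsolationForest(random_state=7).fit(X)

# score_samples is higher for more "normal" points, so negate it to get an anomaly score
anomaly_score = -model.score_samples(X)

print("AUROC:", roc_auc_score(y_true, anomaly_score))
print("AUPRC:", average_precision_score(y_true, anomaly_score))
```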
Could you combine outlier detection and novelty detection in a single pipeline?
Yes, in many practical cases you want both. A staged approach might help:
Stage 1: Outlier detection on historical data to filter out contaminants.
Stage 2: Novelty detection on newly incoming data to detect events that haven’t been seen before in any historical data.
This combination is particularly useful when you have large amounts of historical data that may already contain some anomalies, yet still want to detect genuinely new events in the future.
How do you choose between distance-based or density-based outlier detection methods?
It often depends on:
Dimensionality of the data: Distance-based methods can struggle with high-dimensional data due to the curse of dimensionality.
Data distribution: If you suspect clusters of varying densities, density-based methods (e.g., DBSCAN, LOF) might capture local differences better.
Computational constraints: Some methods are more expensive than others. Distance-based methods can scale poorly with large datasets, unless well-optimized (e.g., approximate nearest-neighbor search).
By carefully analyzing the data’s size, dimensionality, and distribution, you decide which approach is most likely to yield accurate outlier detection.
When would an autoencoder be favored for novelty detection over a one-class SVM?
Autoencoders are popular for tasks where normal data follows complex, potentially nonlinear distributions. They learn to compress and reconstruct normal data. If the distribution is high-dimensional or complicated (e.g., images, sensor readings, or other unstructured data), an autoencoder can capture those patterns more effectively than a simpler kernel-based approach.
One-class SVM, on the other hand, might be sufficient (and simpler) for moderately sized feature vectors if the data distribution can be well-separated with a proper kernel.
In practice, the choice can also be influenced by resource availability (autoencoders can be more computationally intensive to train) and data volume. With enough clean training data, an autoencoder might provide a very powerful boundary between normal and novel samples.
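To make the autoencoder idea concrete, here is a toy sketch that uses scikit-learn's MLPRegressor as a stand-in autoencoder trained to reconstruct its own input (a deep-learning framework such as PyTorch would be the usual choice for images or other high-dimensional data). The architecture and the 99th-percentile threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(3)
normal_train = rng.normal(size=(1000, 10))      # clean "normal" data only

# A small autoencoder-style network: input -> bottleneck -> reconstruction
ae = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                  max_iter=2000, random_state=3)
ae.fit(normal_train, normal_train)              # learn to reconstruct normal data

def reconstruction_error(model, X):
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Threshold taken from the training distribution of reconstruction errors
threshold = np.percentile(reconstruction_error(ae, normal_train), 99)

new_points = np.vstack([rng.normal(size=(3, 10)),          # likely normal
                        rng.uniform(4, 6, size=(2, 10))])  # likely novel
is_novel = reconstruction_error(ae, new_points) > threshold
print("Flagged as novel:", is_novel)
```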
Below are additional follow-up questions
How would you handle concept drift or changing data distributions over time in outlier detection or novelty detection tasks?
Concept drift happens when the data distribution evolves in ways not originally captured by the training process. This can occur due to shifts in user behavior, seasonality, or even hardware changes in sensor systems. When detecting outliers or novelties, concept drift makes it challenging to distinguish between legitimate changes in the “normal” distribution and true anomalies or novel patterns.
One practical strategy to handle concept drift is to adopt incremental or online learning approaches. In an online setting, the model is continually updated with new observations to reflect the most recent state of normal data. For instance, you can maintain a sliding window of recent data points, re-estimating statistical parameters (e.g., mean, covariance) or retraining the model. This method ensures that when the normal distribution slowly shifts, the model’s internal representation of normal data adjusts in parallel, reducing the chance of erroneously labeling new but legitimate patterns as anomalies.
A potential pitfall occurs if you cannot confidently distinguish between gradual changes (which reflect genuine new normals) and abrupt changes that might correspond to real anomalies. You may inadvertently incorporate genuine anomalies into the model’s notion of normal if your update rate is too aggressive. On the other hand, if you update too slowly, your model might miss significant distributional shifts. Finding the right balance often involves trial and error, domain knowledge, or specialized drift detection algorithms (e.g., ADWIN for data streams).
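A minimal sketch of the sliding-window idea, periodically re-fitting a simple novelty detector on only the most recent observations; the window size, refit frequency, and drift rate are illustrative assumptions that would need tuning against the real drift behavior:

```python
import numpy as np
from collections import deque
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(5)
window = deque(maxlen=500)   # sliding window of the most recent presumed-normal points
model = None
flagged = 0

for t in range(5000):
    drift = 0.001 * t                              # slow, legitimate shift of the normal data
    x = rng.normal(loc=drift, scale=1.0, size=(1, 2))

    if model is not None and model.predict(x)[0] == -1:
        flagged += 1                               # treated as anomalous; not added to the window
    else:
        window.append(x[0])                        # presumed normal: let it update the window

    if t % 250 == 0 and len(window) >= 100:
        # Periodic re-fit on recent data so the notion of "normal" tracks the drift
        model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(np.array(window))

print("Points flagged over the stream:", flagged)
```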
If your outlier detection approach is producing too many false positives, how would you address that scenario?
Excessive false positives often arise because the model is too sensitive and flags borderline samples. Several strategies can mitigate this problem:
Adjust the Sensitivity Threshold: Most outlier detection algorithms produce a continuous outlier score. If you’re seeing too many false alarms, increasing the threshold for classification as an outlier can reduce false positives. However, the trade-off is an increased risk of missing genuine outliers (higher false negatives).
Refine Feature Selection: Irrelevant or noisy features can amplify random fluctuations in the data, causing benign points to appear anomalous. Through domain knowledge or techniques like principal component analysis, you can select features that better reflect true anomalies while reducing noise.
Incorporate Domain Constraints: Sometimes domain constraints clarify the boundary between normal and abnormal. For instance, if certain anomalies are only physically possible in certain ranges of sensor readings, leveraging these bounds can reduce false positives.
Use Ensemble Methods: Sometimes one algorithm alone can be too rigid. Combining multiple outlier detection methods (e.g., Isolation Forest + Local Outlier Factor) and flagging a point as anomalous only if multiple methods agree can reduce the likelihood of spurious alerts.
A critical pitfall is to keep tuning the model purely to fix false positives without validating whether you are inadvertently ignoring genuine anomalies. Balancing false positives and false negatives often requires close collaboration with domain experts, along with careful use of validation data that reflect real distributions and anomalies.
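As a concrete illustration of the threshold-adjustment point, most scikit-learn detectors expose a continuous score (score_samples or decision_function) on which you can place your own, stricter cutoff instead of relying on the default contamination-based one. The percentiles below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(11)
X = np.vstack([rng.normal(size=(1000, 3)),
               rng.uniform(6, 9, size=(10, 3))])

model = IsolationForest(random_state=11).fit(X)
scores = -model.score_samples(X)                 # higher = more anomalous

# A loose cutoff vs. a stricter one that flags only the most extreme points
loose_threshold = np.percentile(scores, 95)      # flags roughly 5% of points
strict_threshold = np.percentile(scores, 99.5)   # flags roughly 0.5% of points

print("Flags at loose threshold:", np.sum(scores > loose_threshold))
print("Flags at strict threshold:", np.sum(scores > strict_threshold))
```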
Can domain knowledge be integrated to enhance outlier or novelty detection? If so, how?
Domain knowledge can be crucial. A purely data-driven anomaly detection system might not capture subtle indicators or contextual factors specific to your domain. Domain expertise can guide feature engineering, helping you focus on measurements most relevant to anomalies. For instance, in a manufacturing process, knowledge that certain temperature readings deviate beyond physically plausible limits immediately flags a part, even if the statistical model might not deem it an outlier in a purely data-driven sense.
You might also integrate domain-based rules with model outputs. One approach is a hybrid system where the model provides an outlier or novelty score, and then domain-specific rules override or refine those scores under certain conditions. Another possibility is weighting features according to their known importance. For example, if domain experts believe changes in certain signals strongly correlate with failure modes, those signals might be emphasized more heavily during model training.
A pitfall in integrating domain knowledge is that some intuitive rules might contradict or overshadow the statistical patterns detected by the model, especially in complex, high-dimensional data. Ensuring synergy between domain-driven heuristics and machine learning methods requires iterative testing and feedback from experts.
How do you scale these methods to very large data sets in real-time or streaming contexts?
Scaling outlier or novelty detection typically involves balancing computational efficiency with detection efficacy. Some methods (like nearest-neighbor or naive density-based approaches) can become prohibitively slow for large datasets because they require pairwise distance computations or complex density estimations.
Approaches to handle scaling include:
Approximate Nearest Neighbor (ANN) Structures: KD-trees, ball trees, or specialized data structures help find neighbors more efficiently than brute-force search. For high-dimensional data, approximate methods like Locality Sensitive Hashing (LSH) can be used.
Online Algorithms: Instead of batch methods that require processing the entire dataset at once, online algorithms update model parameters as data arrives. This is crucial for streaming contexts where storage of all historical points might be infeasible. Isolation Forest, for example, has variants tailored for partial fitting on streaming data.
Distributed Computing: Techniques such as MapReduce or Spark-based implementations partition the dataset across multiple machines, enabling parallel computations. Randomization (as in random forests) can also reduce the overhead by sampling subsets of data.
A possible issue arises if you rely too heavily on approximate methods that inadvertently ignore small but crucial anomalies due to sampling or hashing collisions. You may also encounter challenges maintaining a consistent threshold for anomaly decisions across distributed systems, requiring synchronization or aggregated statistics.
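As one example of the online style, recent scikit-learn versions include a linear, SGD-based one-class SVM (SGDOneClassSVM) that can be updated incrementally with partial_fit rather than refit on the full history. The batch size and nu value below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDOneClassSVM

rng = np.random.RandomState(13)
model = SGDOneClassSVM(nu=0.05, random_state=13)

# Simulate a stream arriving in mini-batches; only the current batch is held in memory
for batch_idx in range(100):
    batch = rng.normal(size=(256, 5))
    model.partial_fit(batch)          # incremental update, no full-history refit

# Score newly arriving points with the continuously updated model
new_points = np.vstack([rng.normal(size=(3, 5)),
                        rng.uniform(5, 7, size=(2, 5))])
print(model.predict(new_points))      # -1 = anomalous, +1 = inlier
```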
Would you approach time-series data differently for outlier or novelty detection? If so, how?
Time-series data has temporal dependencies that static methods often overlook. While you can apply standard outlier detection to time-series data by treating each time step as an independent data point, you may lose the temporal context that can reveal important patterns.
Methods specific to time-series often involve:
Modeling the temporal structure with techniques like ARIMA or LSTM-based autoencoders to capture normal dynamic behavior.
Using sliding windows or sequence embedding. For example, you could transform each window of consecutive points into a feature vector and then apply outlier detection on those feature vectors.
Investigating seasonality and trends. If your data exhibit regular cyclicity (like daily or weekly periodic behavior), not accounting for that can lead to many false positives.
A subtle pitfall is that an anomaly in a single time step might not mean a real outlier unless it’s sustained over a certain duration or if it disrupts the sequence patterns in a meaningful way. Conversely, slow-developing anomalies (like gradual drift) can be missed by methods that only look for sudden spikes. Carefully choosing the time window size and step size for analysis is key, but it remains a challenging aspect because picking too short a window might miss context, whereas picking too long a window could dilute local anomalies.
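A minimal sketch of the sliding-window transformation mentioned above: each window of consecutive readings becomes one feature vector, and a standard detector is applied to those vectors. The window length, injected fault, and detector settings are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(17)

# A synthetic series: periodic behavior plus noise, with a short injected fault
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 100) + 0.1 * rng.normal(size=t.size)
series[1500:1510] += 3.0                       # anomalous burst

window = 50
# Stack overlapping windows as rows: shape (n_windows, window)
X = np.array([series[i:i + window] for i in range(len(series) - window)])

model = IsolationForest(contamination=0.01, random_state=17).fit(X)
labels = model.predict(X)                      # -1 = anomalous window

anomalous_starts = np.where(labels == -1)[0]
print("Window start indices flagged:", anomalous_starts[:10])
```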
Are there any specific considerations for interpretability and explainability when building outlier detection or novelty detection models?
Interpretability can be more complicated in anomaly detection because these models often rely on complex distance or density estimations. Stakeholders, especially in fields like finance or healthcare, frequently require justifications for why a point is flagged as anomalous.
Possible approaches to improve interpretability:
Feature Contribution Analysis: In methods like Isolation Forest or random forest–based outlier detection, you can analyze which features contribute most to splitting that isolates a data point.
Reconstruction Error Visualization: For autoencoders, you can plot reconstruction errors across different features or highlight which parts of an input (like certain pixels in an image) incur the largest error.
Local Explainers: Tools such as LIME or SHAP can be adapted for outlier detection methods by approximating a local decision boundary around the instance to clarify which features drive the anomaly classification.
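For example, with an autoencoder-style detector you can inspect which features drive a flagged point's reconstruction error, which often gives stakeholders a more concrete story than a bare anomaly score. The toy model and the hypothetical sensor names below are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(19)
feature_names = ["temp", "pressure", "vibration", "current"]  # hypothetical sensors

normal_train = rng.normal(size=(2000, 4))
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=19)
ae.fit(normal_train, normal_train)             # learn to reconstruct normal data

# A flagged point that is extreme only in the "vibration" feature
flagged_point = np.array([[0.1, -0.2, 5.0, 0.3]])
per_feature_error = (ae.predict(flagged_point) - flagged_point) ** 2

# Rank features by how much each one contributes to the reconstruction error
for name, err in sorted(zip(feature_names, per_feature_error[0]),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {err:.3f}")
```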
A typical pitfall is that while local explanations might suffice for single data points, they may not provide holistic transparency into how the model identifies outliers across large, diverse datasets. Moreover, with novelty detection, providing rationales can be tricky because by definition, the model sees data that is fundamentally different from what it has trained on. Stakeholders could be dissatisfied with an explanation that only confirms something is “unlike anything seen before” without deeper reasoning tied to domain-relevant patterns.