ML Interview Q Series: In an anomaly detection context, how would you describe the concept of change detection and its significance?
Comprehensive Explanation
Change detection, in the realm of anomaly detection, refers to identifying points in time (or sequences of data) when the underlying behavior, distribution, or pattern of the data shifts in a meaningful way. Unlike looking for a single unusual event or outlier, change detection focuses on whether the process generating the data has transitioned from one stable regime to another. The idea is to spot if a system has changed its behavior – for instance, a manufacturing process might go from producing normal output to producing defective items, or a network might shift its traffic patterns due to a new type of user behavior.
This problem is particularly common when dealing with streaming data or time series. Rather than simply flagging individual samples as anomalies, the goal is to determine whether the system's statistical properties have altered. Many methods can also indicate whether the change is abrupt or whether it evolves gradually over time.
There are statistical approaches such as CUSUM (Cumulative Sum), Bayesian online change-point detection, and distribution-based methods that look at divergences (like KL divergence) between recent data windows and historical data windows. One popular technique, the CUSUM algorithm, tracks how data points deviate from an expected baseline. When the deviation accumulates beyond a threshold, the algorithm signals a change.
Below is a core formula often used in the context of CUSUM:

S_{n} = max(0, S_{n-1} + x_{n} - mu - k)

where:
S_{n} is the cumulative sum of deviations up to time n (reset to 0 if it becomes negative).
x_{n} is the observed value at time n.
mu is the expected mean of x_{n} under a normal (pre-change) condition.
k is a reference value (often related to allowable drift in the mean).
In many real-world scenarios, we choose mu to be the process mean before the change, and k is set to control how sensitive the detection should be. If x_{n} starts drifting above mu, S_{n} grows until it surpasses a threshold, indicating a change; in practice, a two-sided CUSUM maintains a second, mirrored statistic so that downward shifts are caught as well.
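As a concrete sketch of this update rule (the threshold h and the function name are illustrative choices, not part of any standard library):

```python
import numpy as np

def cusum_upper(x, mu, k, h):
    """One-sided (upward) CUSUM: signal when the cumulative positive
    deviation from the pre-change mean mu exceeds the threshold h."""
    s = 0.0
    alarms = []
    for n, x_n in enumerate(x):
        # Accumulate deviations above mu + k; reset to 0 if negative.
        s = max(0.0, s + x_n - mu - k)
        if s > h:
            alarms.append(n)
            s = 0.0  # restart monitoring after an alarm
    return alarms

# Example: mean shifts from 0 to 1.5 at index 100
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1.5, 1, 100)])
print(cusum_upper(x, mu=0.0, k=0.5, h=5.0))
```

A two-sided version simply runs a mirrored statistic on mu - x_{n} in parallel.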
This is in contrast to one-time outlier detection, where one typically checks if a single data point or a small batch is anomalous relative to the rest. Change detection instead focuses on discerning a long-term or systematic shift. It is critical in applications like fraud detection for continuous transactions, network intrusion monitoring, predictive maintenance in industrial systems, and anywhere data streams might evolve.
Incremental or online algorithms can handle data as it arrives, adjusting their estimate of the data’s distribution on the fly. Change detection can also be combined with machine learning models: for example, a neural network might periodically re-check its loss distribution to see if it has significantly shifted. This can indicate the need to retrain the model or adapt to new behavior in the data.
What is the relationship between change detection and concept drift?
Concept drift is a term often used in machine learning, especially in streaming or online settings, referring to the phenomenon where the statistical properties of the target variable or input features change over time. Change detection is closely related because concept drift can be detected by identifying when the data distribution shifts from one regime to another. If concept drift is abrupt, it is almost exactly what we call a “change point.” If it is gradual, specialized drift detection methods can still be considered variations of change detection techniques.
An important point is that some concept drift might not reflect a genuinely abrupt transition but a slower evolution. In these cases, standard abrupt change detection methods might produce too many false positives or delayed responses, so more sophisticated algorithms might be required.
How do we differentiate between outlier detection and change detection?
Outlier detection tries to spot data instances that deviate from expected behavior on a sample-by-sample basis. It usually involves modeling what is “normal” for a single point (or a local group of points) and determining if a new observation falls outside that boundary.
Change detection focuses on whether the overall process is the same as it used to be. In other words, even if each individual point might look normal, if the mean or distribution subtly shifts over time, change detection can catch it. Conversely, an isolated outlier might not be flagged by a change detector if it doesn’t significantly alter the underlying distribution.
In practice, both can be used together. Some scenarios benefit from quickly spotting stray anomalous points while also flagging broader changes in patterns that indicate deeper issues in the data or system.
How can we implement a simple change detection approach in Python?
A common minimal approach is to use a rolling window of data points and compare the distribution of the new window to a reference distribution. One might use a two-sample test (like the Kolmogorov–Smirnov test) to determine if the distribution has shifted sufficiently. Below is an example:
```python
import numpy as np
from scipy.stats import ks_2samp

def detect_change(data_stream, window_size=50, alpha=0.05):
    changes = []
    ref_window = data_stream[:window_size]  # reference (pre-change) window
    for i in range(window_size, len(data_stream) - window_size):
        current_window = data_stream[i:i + window_size]
        # Two-sample KS test between the reference window and the current window
        stat, p_value = ks_2samp(ref_window, current_window)
        if p_value < alpha:
            changes.append(i)
            ref_window = current_window  # reset the baseline to the new regime
    return changes

# Example usage:
data = np.random.normal(0, 1, 200)        # stable distribution
data[100:] = np.random.normal(2, 1, 100)  # shift the mean after index 100

change_points = detect_change(data, window_size=20, alpha=0.01)
print("Detected change points:", change_points)
```
In this snippet, we maintain a reference window and slide a current window through the data stream. We use the Kolmogorov–Smirnov test to see if there is a statistically significant difference between the distributions of the two windows. Upon detecting a significant p-value below alpha, we mark a change point and reset the reference window to the current segment. This approach is simple but can work well in scenarios where a distribution shift is abrupt and the data is not excessively large.
More advanced algorithms such as Bayesian online changepoint detection or specialized libraries offer more sophisticated handling of gradually evolving data or multiple potential change points.
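For example, if the third-party ruptures package is available (an assumption; install it via pip and note that the exact API can vary across versions), an offline changepoint search over a recorded series takes only a few lines:

```python
import numpy as np
import ruptures as rpt  # assumes `pip install ruptures`

rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])

# PELT searches for an unknown number of change points, penalizing segmentation complexity.
algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)  # segment end indices, including len(signal)
print(breakpoints)
```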
Potential pitfalls include choosing the right window size, dealing with very noisy data, and ensuring that the test used is appropriate for the type of distribution shift (e.g., mean shift, variance shift, or more general changes in distribution shape).
What are common real-world challenges with change detection?
One challenge is noise and variance in data. If data naturally exhibits high variability, standard threshold-based methods might trigger false alarms. Another issue is the possibility of repeated small fluctuations that do not necessarily represent genuine changes but might be flagged by very sensitive algorithms. Parameter tuning (such as the threshold in the CUSUM method or alpha in a statistical test) is crucial to balance missed detections against false positives.
Another challenge is distinguishing between short-lived anomalies (which might revert to normal) and genuine distribution shifts. If the data returns to its original pattern, it might be an outlier event rather than a true change point. Real-world systems might show partial changes, cyclical behavior, or drifting trends that evolve slowly, requiring more nuanced detection strategies.
Could neural network-based approaches be used for change detection?
Neural network-based approaches can be employed in a few ways. One way is to train an autoencoder or other reconstruction-based model on a historical dataset and monitor the reconstruction error over time. A sustained change in the reconstruction error distribution can suggest that the model no longer fits the data, pointing to a drift or a shift in the generating process.
Another approach uses online learning methods where the network is continuously updated with new data. By monitoring the discrepancy between predictions and observed outcomes, one can detect sudden changes in this discrepancy. Some practitioners also use embedding techniques, applying dimensionality reduction or feature extraction, and then applying standard statistical or distance-based change detection algorithms on those latent representations. In all these cases, hyperparameter tuning and robust continuous validation are critical to avoid under- or over-detecting changes.
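Here is a hedged, model-agnostic sketch of the reconstruction-error idea; `errors` stands in for the per-sample reconstruction errors produced by whatever autoencoder is in use, and the baseline sizing is illustrative:

```python
import numpy as np

def error_drift_alarm(errors, baseline_size=200, window=50, z=3.0):
    """Flag indices where the rolling mean of reconstruction error drifts
    far above the baseline error level (baseline mean + z * baseline std)."""
    baseline = np.asarray(errors[:baseline_size], dtype=float)
    limit = baseline.mean() + z * baseline.std()
    alarms = []
    for i in range(baseline_size, len(errors) - window):
        if np.mean(errors[i:i + window]) > limit:
            alarms.append(i)
    return alarms
```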
How do we handle abrupt changes vs gradual changes?
Abrupt changes occur sharply and are often easier to detect with threshold-based methods like CUSUM or a direct statistical test. Gradual changes are more subtle and can fool an abrupt change detector by never crossing the threshold in a single jump. For gradual changes, one strategy is to use adaptive windowing or a forgetting factor that weighs newer data more heavily. This allows the model or statistic to slowly adapt, but still enables us to detect if the new data is diverging significantly from older patterns.
Methods that rely on a fixed reference window, like a static baseline distribution, might struggle with gradual drift because the difference between the reference window and current window becomes significant only after many small changes accumulate. Adaptive or online algorithms that are designed to track drift on a continuous basis tend to be more effective.
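One simple realization of the forgetting-factor idea is an exponentially weighted moving average (EWMA) chart; the smoothing factor lam and control limit L below are illustrative choices, not prescribed values:

```python
import numpy as np

def ewma_detector(x, mu0, sigma0, lam=0.1, L=3.0):
    """EWMA chart: recent points are weighted more heavily, so slow drifts
    accumulate in z even when no single observation looks unusual."""
    z = mu0
    # Asymptotic standard deviation of the EWMA statistic
    sigma_z = sigma0 * np.sqrt(lam / (2 - lam))
    alarms = []
    for n, x_n in enumerate(x):
        z = lam * x_n + (1 - lam) * z
        if abs(z - mu0) > L * sigma_z:
            alarms.append(n)
    return alarms
```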
How do we choose thresholds or hyperparameters for change detection?
Thresholds and hyperparameters typically need to be chosen based on domain knowledge and by analyzing trade-offs between false alarms and missed detections. In some contexts, false positives might be acceptable if missing a real change is very costly (for example, critical manufacturing processes). In other contexts, repeated false alerts might cause alarm fatigue. Practitioners often use validation data with known change points, or they employ offline simulations, to calibrate parameters such as alpha in statistical tests or the threshold in CUSUM. Cross-validation strategies, where we artificially inject known distribution shifts into the training or validation data, can help tune these parameters more systematically.
When domain knowledge is limited, more adaptive methods can be used, but tuning remains essential. For instance, in the CUSUM formula, the choice of mu and k is based on assumptions about how big a deviation from the process mean is considered significant. In statistical tests like the Kolmogorov–Smirnov test, alpha can be adjusted to set the detection sensitivity.
How does one handle multivariate or high-dimensional data when performing change detection?
Multivariate settings make change detection more challenging because the shift might occur in a combination of features or sub-manifolds of the data space. Simple univariate tests or thresholds might fail to capture correlated shifts. Techniques include:
Using dimensionality reduction or representation learning to project data into a more tractable space, and then applying standard univariate or low-dimensional change detection.
Using distance-based metrics (like Mahalanobis distance) to measure how far the current data distribution is from the historical one in the original high-dimensional space.
Using specialized multivariate statistical tests or machine learning models that can capture joint distributions across multiple variables. For example, a deep neural network approach might produce embeddings that summarize correlations among features, which can then be monitored for distributional shifts.
These methods often require more computation and more data to reliably detect changes, and they can be prone to overfitting or spurious detections if the number of features is very large compared to the sample size.
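A lightweight sketch of the distance-based option above: compare the mean of a recent window to a historical reference using the Mahalanobis distance (the small regularization eps is added here for numerical stability):

```python
import numpy as np

def mahalanobis_shift(reference, current, eps=1e-6):
    """Mahalanobis distance between the current window's mean and the
    reference distribution (mean and covariance from historical data)."""
    ref_mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False) + eps * np.eye(reference.shape[1])
    diff = current.mean(axis=0) - ref_mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Example: 5-dimensional data whose mean shifts in two of the features
rng = np.random.default_rng(2)
ref = rng.normal(0, 1, size=(500, 5))
cur = rng.normal(0, 1, size=(100, 5))
cur[:, :2] += 1.0
print(mahalanobis_shift(ref, cur))
```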
How can we be sure a detected change is genuinely meaningful?
Statistical significance tests (like the KS test or a likelihood ratio test) can assign p-values or confidence intervals to potential change points. This helps ensure that changes are not random fluctuations. Additionally, domain knowledge is critical for interpreting the nature of the change. For instance, a detected change in a streaming sensor might be due to routine maintenance, temperature fluctuations, or a legitimate breakdown. A successful detection framework usually incorporates feedback from subject matter experts to confirm that changes flagged by the system correspond to actual system events or issues.
Combining multiple signals can further increase confidence in a detected change. For instance, in a network monitoring scenario, a distribution shift in one type of traffic might not be enough, but if that shift is correlated with abnormal firewall logs and suspicious user behavior, the combined evidence strongly suggests a genuine change in network usage patterns.
Is there a connection between change detection and model retraining?
In many real-time or online machine learning pipelines, detecting a change can serve as a trigger to retrain or update the model. If a fundamental shift in the data distribution has occurred, continuing to use the old model may degrade performance. Trigger-based retraining can be more resource-efficient than continuously retraining on every incoming data point, but it relies on reliable detection. Once a change is detected, a pipeline might perform the following steps:
Gather sufficient new data in the post-change regime.
Re-train (or fine-tune) the model on this updated dataset.
Validate that performance metrics are restored to acceptable levels.
In some cases, a system might keep a library of models, each specialized to a different regime. When a change is detected, the system either activates a more appropriate model or trains a new one to handle the new distribution.
What if we suspect multiple sequential changes?
In streams with multiple changes over time, the system can pass through several regimes. Each time a change is detected, you can reset the baseline distribution or training set to the most recent data and continue. However, if changes are too frequent or if the data is highly non-stationary, repeated resetting might cause the model or statistical baseline to chase noise instead of capturing meaningful stable states. Techniques like hierarchical change detection or more advanced Bayesian online inference methods can handle the possibility of multiple changes by maintaining beliefs about how many changes have occurred and where they are likely located. This allows a system to better segment the data stream into intervals of consistent behavior.
Could you provide an example of a tricky scenario in change detection?
A tricky situation is one in which the data distribution changes temporarily and then returns to its original state. This might happen if there is a brief special event (e.g., a promotional sale in an e-commerce platform). The question is whether to interpret this as a legitimate distribution change or a transient anomaly. If a standard algorithm triggers a change detection and subsequently re-establishes a baseline on the new regime, it might misinterpret normal data as a drift afterward. Careful consideration of domain context is needed to handle transient shifts. One might designate known cyclical or seasonal phenomena as “normal” to avoid repeated false alarms.
Another challenge arises if the data dimension is large and only a small subset of features actually undergoes a change. Simple global statistics might not see the difference, or might dilute the effect across many unchanged dimensions. This scenario highlights the importance of feature-level analysis or more sophisticated distribution comparisons that can capture localized shifts in specific feature subsets.
What are some advanced methods for handling streaming data with changes?
Advanced methods include Bayesian online change detection, which continuously updates a posterior distribution of possible change points, and methods that integrate adaptive learning of model parameters with a forgetting factor or smoothing approach. There are also ensemble-based methods that maintain multiple candidate models; when performance degrades, the system shifts its weight to a better-performing model. These techniques balance stability and plasticity, ensuring that they do not overreact to minor fluctuations but can still adapt to genuine changes in distribution.
Deep learning approaches might involve training a model that outputs uncertainty estimates, and if uncertainty surpasses a threshold consistently, this indicates a distribution shift. Another approach is to maintain a running summary of latent representations (e.g., from the last layer of a neural network) and periodically check for cluster or density changes in that representation space. Although these advanced methods offer more flexibility, they also require more computational resources and tuning.
Below are additional follow-up questions
How do we measure the performance of a change detection algorithm in real-world scenarios?
Measuring performance involves comparing the times at which the algorithm flags a change with the “true” change points (often annotated by domain experts). In practice, an algorithm might flag a change slightly earlier or later than the actual transition. Common metrics include detection delay (how long it takes to detect a true shift), false positive rate (how often the algorithm mistakenly signals a change), and missed detection rate (how often the algorithm fails to catch an actual shift).
One pitfall is that real data may not have perfectly labeled change points. It can be subjective to define the exact moment a distribution started to shift, especially in gradual scenarios. Practitioners often allow for a tolerance window around the labeled change, saying, for instance, that if detection occurs within +/- a certain range of the true change time, it counts as a successful detection. Another subtlety is the possibility of multiple minor changes vs. one major shift; an algorithm might fire several times if it’s highly sensitive, producing many “partial” detections. Balancing sensitivity with specificity is critical.
When the cost of missing a change is extremely high (for instance, in critical infrastructure), one might prioritize a low missed detection rate, even if this comes with more false positives. Conversely, in applications where each false alarm is costly, more conservative thresholds are chosen. Ultimately, the decision on which metric to prioritize often depends on domain requirements and the severity of errors.
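These metrics can be computed with a small helper; `tolerance` below is the +/- window around each labeled change within which a detection counts as correct, and negative delays simply mean an early detection (both conventions are assumptions of this sketch):

```python
def evaluate_detections(true_points, detected_points, tolerance):
    """Match detections to labeled change points within +/- tolerance and
    report misses, false positives, and the mean (signed) detection delay."""
    matched, delays = set(), []
    false_positives = 0
    for d in sorted(detected_points):
        hits = [t for t in true_points if t not in matched and abs(d - t) <= tolerance]
        if hits:
            t = min(hits)
            matched.add(t)
            delays.append(d - t)
        else:
            false_positives += 1
    return {
        "missed": len(true_points) - len(matched),
        "false_positives": false_positives,
        "mean_delay": sum(delays) / len(delays) if delays else None,
    }

print(evaluate_detections([100, 300], [104, 150, 310], tolerance=20))
```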
How do we deal with seasonality or cyclic behavior in the data when performing change detection?
Seasonality or cyclic patterns can cause naive change detection methods to flag changes repeatedly, since the data distribution can shift dramatically at regular intervals (daily, weekly, monthly, etc.). If we do not account for these expected periodic variations, the algorithm might misinterpret normal cycles as true changes.
A typical strategy is to de-seasonalize or remove the known periodic components from the data before applying a change detection algorithm. For example, in a daily cycle, one can subtract a baseline that reflects the known daily pattern, leaving residuals that ideally exhibit stationarity if no true change has occurred. If a significant anomaly arises in the residuals, it’s more likely to represent a genuine shift rather than normal cyclic fluctuation.
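A minimal sketch of this de-seasonalization step, assuming hourly data aligned so that index 0 corresponds to hour 0 of a day; the residuals, rather than the raw series, would then be fed to a detector such as the KS-window approach shown earlier:

```python
import numpy as np

def deseasonalize_daily(values, period=24, train_days=14):
    """Remove a known daily cycle by subtracting the per-hour-of-day mean
    estimated on an initial training span, leaving residuals to monitor."""
    values = np.asarray(values, dtype=float)
    train = values[: period * train_days]
    hourly_mean = train.reshape(-1, period).mean(axis=0)  # mean profile per hour of day
    residuals = values - np.tile(hourly_mean, len(values) // period + 1)[: len(values)]
    return residuals
```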
Another method is to use a time-series forecasting model (like ARIMA, SARIMA, or deep learning models) that explicitly captures seasonality. You can compare the model’s predictions to actual observations and analyze the deviations for sustained shifts. If the model’s error remains low and consistent, there’s likely no major change. But if the error distribution shifts significantly, it may signal a true change beyond normal seasonal patterns. The challenge lies in updating or recalibrating these models if the seasonal pattern evolves over time.
How do we handle simultaneous shifts in distribution and correlation among multiple features?
In many real-world applications, change detection is not only about a univariate shift in mean or variance but can involve changes in correlations among multiple features. For instance, in financial markets, certain assets might start moving together when they previously did not, or in sensor data, temperature and pressure readings might develop new correlations.
One approach is to monitor covariance matrices or correlation structures. If the covariance matrix changes significantly (e.g., measured by some distance metric like the Frobenius norm or more sophisticated divergences), it can indicate a shift. However, high dimensionality complicates this, as the covariance matrix becomes large and can be noisy. Dimensionality reduction techniques (PCA, autoencoders, etc.) can help by projecting the data onto fewer dimensions, capturing the strongest correlation patterns, which can be monitored for shifts.
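As an illustrative sketch of the correlation-monitoring idea, one can track the Frobenius norm of the difference between the reference and current window covariance matrices and alarm when it exceeds a calibrated threshold:

```python
import numpy as np

def covariance_shift_score(reference, current):
    """Frobenius norm of the difference between two windows' covariance
    matrices; larger values suggest a change in correlation structure."""
    cov_ref = np.cov(reference, rowvar=False)
    cov_cur = np.cov(current, rowvar=False)
    return float(np.linalg.norm(cov_cur - cov_ref, ord="fro"))

# Example: two features become strongly correlated in the current window
rng = np.random.default_rng(3)
ref = rng.normal(size=(500, 3))
shared = rng.normal(size=500)
cur = np.column_stack([shared, shared + 0.1 * rng.normal(size=500), rng.normal(size=500)])
print(covariance_shift_score(ref, cur))
```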
Another approach is to apply a dedicated multivariate change detection method. Some algorithms (like multivariate CUSUM or Bayesian methods that handle covariance changes) can directly assess the joint distribution. Pitfalls arise when a small subset of features is primarily responsible for the shift, but their effect is diluted by many unchanged features. Vigilance about which dimensions are actually relevant is key. Domain knowledge often helps to focus on subsets of features most likely to co-vary or shift together.
How can we incorporate external or contextual factors into a change detection pipeline?
External factors, like economic indicators, weather changes, or marketing campaigns, can drastically affect the data and might be mistaken as distribution shifts if not accounted for. If these external influences are known and measurable, one approach is to explicitly include them in the modeling. For instance, you might build a regression model or an ML pipeline that takes both the primary data and these external signals as inputs, generating an expected output. Any large, sustained deviation of actual output from the expected output—after controlling for external factors—can be flagged as a potential change.
Alternatively, a system might adaptively adjust its threshold or baseline depending on external context. For example, e-commerce traffic might spike every year on Black Friday. The model could incorporate a “Black Friday factor” so that a sudden jump in traffic on that day isn’t flagged as a change. A pitfall is overfitting to external factors that are not truly relevant or that change in ways not captured by historical data. Continuous monitoring and periodic re-validation of the importance of these external factors is crucial.
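A hedged sketch of the "model the expected output, then monitor the residuals" idea, using a simple linear model on external covariates; X_external and y are placeholders for whatever signals and target are actually available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residual_stream(X_external, y, train_size):
    """Fit a simple model of y from external factors on an initial span,
    then return residuals on the rest; a sustained shift in these residuals
    suggests a change not explained by the external factors."""
    X_external, y = np.asarray(X_external), np.asarray(y, dtype=float)
    model = LinearRegression().fit(X_external[:train_size], y[:train_size])
    return y[train_size:] - model.predict(X_external[train_size:])
```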
How do we decide between offline and online approaches for change detection?
An offline (batch) approach assumes access to all or large chunks of data, allowing algorithms to process the entire sequence at once and potentially find optimal change points retrospectively. This is appropriate for historical analysis or scenarios where immediate detection is not critical. Offline methods, such as offline Bayesian changepoint detection or segment-based approaches, can produce globally optimal solutions but are not suitable for real-time applications.
Online methods process the data one sample (or batch) at a time and must make decisions as they go. They typically update their statistics incrementally, signaling a change as soon as it is recognized. This is essential for real-time systems that need immediate alerts. The trade-off is that online methods may not always detect changes at the exact moment they occur and can have more difficulty refining boundary estimations of change points. They also risk higher false positives if thresholds aren’t well tuned.
When data must be acted upon quickly—like in fraud detection, industrial sensor monitoring, or dynamic resource allocation—online methods are a must. In some domains, a hybrid approach is used: an online detector flags potential changes in real time, and an offline post-processing method confirms or refines these detections for final decisions or record-keeping.
How can we handle extremely high-speed data streams where latency constraints are strict?
In high-velocity data scenarios—like network traffic monitoring at scale, IoT sensor streams, or high-frequency trading—algorithms must be efficient in both time and memory. Computing complex statistical tests or large matrix operations for each incoming data point can be infeasible.
Some strategies involve:
Maintaining rolling or exponentially weighted summary statistics (e.g., running means, variances, correlation estimates). Instead of storing raw data, store aggregates that update with O(1) or O(d^2) complexity, where d is the dimension.
Using streaming-friendly algorithms like incremental CUSUM or other incremental detectors that update a small state variable rather than reprocessing the entire dataset.
Downsampling the data if the time resolution is higher than needed. This is a trade-off: too much downsampling can miss quick changes, so domain expertise is crucial.
Utilizing specialized hardware or distributed frameworks (e.g., Apache Flink, Apache Spark Streaming) to parallelize computations. Even then, the algorithm must be designed to handle partial or delayed data, and to merge partial statistics from distributed nodes.
In these setups, a major pitfall is data backlog or data bursts. If the system gets temporarily overwhelmed, detection might be delayed or partial. Robust architecture design with streaming queues and fault tolerance helps maintain real-time capabilities under fluctuating loads.
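For the first strategy above (summary statistics instead of raw data), Welford's online update keeps a running mean and variance in constant memory per stream:

```python
class RunningStats:
    """Welford's algorithm: constant-memory running mean/variance updates,
    suitable for high-speed streams where raw data cannot be stored."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```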
How do we distinguish between short-lived or temporary changes and long-term regime shifts?
Sometimes a distribution changes momentarily but then returns to its original form. If the change detection system immediately resets its baseline, it may incorrectly adapt to a temporary anomaly. On the other hand, if the system is too cautious, it might fail to capture genuine long-term shifts.
One way is to define a “persistence criterion”—that is, a change is only confirmed if the new behavior persists for a certain number of observations or a certain duration. This can be implemented by waiting until the post-change distribution stabilizes. However, this delays detection slightly and may be problematic in mission-critical settings that demand immediate alerts.
An alternative is to track how quickly or slowly the new distribution diverges from the old. If the data quickly reverts, the cumulative deviation remains below the change threshold. Another practical method is to use a second-level detection step: once an algorithm flags a change, a short grace period is monitored to confirm if the distribution continues in the same new regime. If it does, the change is marked final. If it reverts, the system treats it as a transient event and resets to the old baseline.
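A minimal sketch of the persistence criterion: a candidate alarm from any underlying detector score is only confirmed if the score stays above the threshold for `persistence` consecutive observations (both names are illustrative):

```python
def confirm_persistent_change(scores, threshold, persistence=10):
    """Confirm a change only after the detector score exceeds the threshold
    for `persistence` consecutive steps; brief excursions are ignored."""
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s > threshold else 0
        if run >= persistence:
            return i - persistence + 1  # index where the sustained excursion began
    return None
```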
Could ensemble techniques enhance reliability in change detection?
Yes. Ensemble techniques can combine multiple different detectors (e.g., a CUSUM-based detector, a KS-test-based detector, and a machine learning-based detector) and aggregate their signals. When most or all detectors concur on a change, confidence in the detection is higher. When only one detector flags a change while others remain stable, the system can wait for further evidence or weigh the reliability of each detector differently.
Pitfalls include higher computational overhead and the complexity of managing different detectors’ outputs. Some detectors might be well-tuned for abrupt changes but fail on gradual drifts, while others might excel at picking up slow shifts but have a high false positive rate for rapid transitions. The challenge is deciding how to weigh each detector’s voice. If done well, ensembles can reduce both false positives and missed detections by leveraging complementary strengths.
How do we handle partially labeled data for validating or training a change detection algorithm?
In many real-time scenarios, you might only have partial labels for when changes occurred, or you might suspect changes without definite timestamps. One approach is semi-supervised or weakly supervised learning. You can label certain segments with high confidence while leaving the rest unlabeled. The model then tries to learn a representation of normal vs. changed segments from these partial labels.
Another approach is to artificially inject synthetic shifts into unlabeled data. For instance, you can splice two distinct subsets from different distributions and place them sequentially, thus creating a known artificial change point. Then, you can see if your method detects that shift. This technique has limitations because synthetic changes might not fully reflect real-world transitions.
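A small sketch of this synthetic-injection idea, reusing the detect_change function from the earlier KS-window example: splice two distributions at a known index and check whether the detector fires near it:

```python
import numpy as np

rng = np.random.default_rng(4)
# Known artificial change point at index 300: mean shifts and variance doubles
synthetic = np.concatenate([rng.normal(0, 1, 300), rng.normal(1, 2, 300)])

# detect_change is the KS-window function defined in the earlier example
detections = detect_change(synthetic, window_size=50, alpha=0.01)
hit = any(abs(d - 300) <= 50 for d in detections)
print("Detected near the injected change point:", hit)
```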
Active learning can also help. The detector flags potential change segments, and then you query domain experts to confirm or refute these changes. Over time, you build a more robust labeled dataset of actual changes. The main pitfall is the time and cost of expert labeling, which might be substantial if changes occur frequently or data volume is large.
What steps can we take after detecting a change to diagnose its root cause?
Once you detect a change, deeper investigation is often required to understand why it occurred. Common steps include:
Checking which features contributed most strongly to the shift. For instance, in a manufacturing setting, maybe a particular sensor started reading differently due to calibration drift.
Inspecting time ranges just before and after the change for anomalies or correlated events (e.g., system updates, new software releases, policy changes, or external environmental factors).
Running domain-specific diagnostic tests. For example, if it’s a network anomaly, you might analyze logs for suspicious IP addresses or traffic patterns. If it’s in e-commerce data, you might look at user demographics or promotional events.
Using interpretable machine learning techniques (like feature importance, SHAP values, or LIME) to reveal which features in a data-driven approach had the greatest impact on detection. This can highlight the root cause.
A potential pitfall is misattributing correlation as causation. Just because a feature changed around the same time doesn’t necessarily imply it caused the shift. Multiple correlated events might coincide, so thorough domain knowledge and possibly controlled experiments are necessary to confirm actual causes.