ML Interview Q Series: How would you handle missing values in a large payment transaction dataset for fraud prediction?
📚 Browse the full ML Interview series here.
Short Compact Solution
One practical approach is to first explore what is missing, how much is missing, and why it might be missing (e.g. is it missing completely at random or is there a systematic reason?). Then, decide on a baseline model to see if ignoring the missing data altogether yields acceptable results. If needed, you can impute the missing data using methods such as mean or median for numerical columns, or use algorithms that infer the best replacement values based on the rest of the features (e.g. nearest neighbors). After imputation, you should measure whether model performance actually improves; if it does not, dropping the missing data might be simpler. Finally, it can also help to consider external datasets that fill in some missing information, or use models and techniques that can directly handle missing values without explicit imputation.
Comprehensive Explanation
Clarify the Missing Data
Dealing with missing data begins by understanding its nature and scope. It is helpful to ask questions like:
How many features have missing values?
Is the missingness spread uniformly across features or is it concentrated in a few columns?
Are the missing values numerical or categorical, and do they occur under specific conditions (e.g. certain types of transactions)?
Is there a business or domain-related reason why certain features are missing?
These questions help in forming hypotheses around whether the data is Missing Completely At Random (MCAR), Missing At Random (MAR), or Not Missing At Random (NMAR). For instance, if high-value transactions are more frequently missing location details, that suggests the missingness is not random and may require special handling.
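As a quick, hedged sketch of this diagnostic step, assuming the transactions live in a pandas DataFrame named df with a binary label column is_fraud and a transaction_location column (all placeholder names), a short audit can reveal how much is missing and whether missingness varies with the target:
# df is assumed to be an existing pandas DataFrame of transactions (placeholder)
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate.head(10))  # columns with the highest share of missing values
# Compare missingness of a suspect column across the fraud label;
# a large gap suggests the values are not missing completely at random
print(df['transaction_location'].isna().groupby(df['is_fraud']).mean())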
Establish a Baseline
Sometimes, missing data is not truly an issue for your predictive goal. You can do the following before applying complex imputation methods:
Build a simple baseline model (e.g. a basic logistic regression or decision tree) that either drops rows with missing data or ignores the columns that are incomplete. Check if it meets the required business metrics.
Investigate whether the features with missing values have substantial predictive power. If a feature like a user’s middle name is often missing and does not correlate with fraud, it may not matter much for final performance.
Evaluate cost-benefit: if the model performs adequately without any advanced approach to missing data, you might save time and complexity by leaving those features out or simply dropping missing rows.
Impute Missing Data
If the baseline indicates that the missing data is relevant and is hurting model performance, various imputation techniques can be tried:
Mean / Median / Mode Imputation: A simple approach for numerical data is replacing missing values with the mean or median of the non-missing values. For categorical data, you might use the most frequent category. This is fast but does not leverage relationships between features.
Nearest Neighbors Imputation: Search for rows with similar values in the other features and use them to fill in the missing data. This can work well when features are strongly correlated but is more computationally expensive (a short sketch follows the code example below).
Model-Based Imputation: Train a regression or classification model to predict missing values of a particular feature from the other features. This can handle complex relationships but needs additional care to avoid leakage and to handle time/causality constraints (for instance, if certain features would not be known at the time of prediction).
Iterative Imputation (MICE): More advanced approaches (e.g. Multiple Imputation by Chained Equations) iteratively impute one feature at a time using the relationships with other features. This often provides more accurate estimates but can be slow for large datasets.
Below is a basic illustration of median and most-frequent imputation in Python:
import pandas as pd
from sklearn.impute import SimpleImputer
# Suppose df is your transactions DataFrame
# For numerical columns, we use median imputation
median_imputer = SimpleImputer(strategy='median')
df[['some_numeric_feature']] = median_imputer.fit_transform(df[['some_numeric_feature']])
# For categorical columns, we might use most-frequent imputation
freq_imputer = SimpleImputer(strategy='most_frequent')
df[['some_categorical_feature']] = freq_imputer.fit_transform(df[['some_categorical_feature']])
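For the nearest-neighbors imputation mentioned above, scikit-learn's KNNImputer fills each gap from the rows that are closest on the remaining numeric features. A minimal sketch, assuming numeric_cols is a placeholder list of the numeric column names in df:
from sklearn.impute import KNNImputer
# KNNImputer works on a purely numeric matrix; distances are computed on the observed features
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])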
Check Performance with Imputed Data
Once you have imputed the data, retrain your fraud detection model and compare results (e.g. via cross-validation) against the baseline:
If performance improves significantly, it indicates the imputed feature has predictive value.
If performance does not improve or only marginally improves, it might not be worth retaining that feature or performing complicated imputation.
Sometimes it is simpler and equally effective to remove columns with too many missing values or rows that are incomplete if they are a small fraction of the dataset.
Other Approaches for Missing Data
Leverage External Data: If a certain column (e.g. type of business for a transaction) is missing, you might integrate a third-party source (such as a public business registry) to fill in that feature.
Models that Handle Missing Data Automatically: Some implementations (like certain tree-based methods) can handle missing values by assigning them to special branches or treating missingness as a separate category. This approach requires careful experimentation and performance checks.
Treat Missingness as a Feature: Sometimes the fact that a value is missing is itself predictive. For example, if certain suspicious transactions are often missing location fields, encoding “missing” as a separate category or adding a “missing indicator” feature can help the model learn that pattern.
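A minimal sketch combining the last two ideas, assuming X is a numeric feature matrix containing NaNs and y is the fraud label (placeholder names): scikit-learn's MissingIndicator appends binary "was missing" columns, and HistGradientBoostingClassifier accepts the remaining NaNs natively.
import numpy as np
from sklearn.impute import MissingIndicator
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
# Add one binary indicator column per feature, marking where values were missing
indicator = MissingIndicator(features='all')
X_aug = np.hstack([X, indicator.fit_transform(X)])
# The gradient-boosted trees route NaNs to whichever branch reduces the loss, so no imputation is needed
clf = HistGradientBoostingClassifier(max_iter=200)
print(cross_val_score(clf, X_aug, y, cv=5, scoring='roc_auc').mean())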
Practical Considerations
Data Leakage: When imputing data, make sure to separate your train and test sets (or cross-validation folds). Fit the imputation model only on training data, then apply it to the test set. This avoids leaking future information into the training process.
Resource & Time Constraints: Imputation can be computationally expensive on very large datasets. Be mindful of the trade-off between improved performance and increased training time.
Business Risk: In fraud detection, false negatives (failing to flag a fraudulent transaction) can be very costly. Evaluate whether missing data might cause an underestimation of risk.
Summary
The overall strategy is to first diagnose the cause, extent, and potential impact of missing data. Then, based on how strongly the missing columns correlate with the target variable and how large the missing portion is, you either ignore, drop, or impute. Always validate through experiments and cross-validation whether dealing with these missing entries actually improves your fraud detection performance.
Follow-Up Questions
How would you decide whether to drop rows, drop columns, or impute?
One must consider the relative predictive importance of each feature and the percentage of missing entries. If a feature is critical (e.g., IP geolocation for a potential transaction) and has modest missingness, imputation is worth attempting because losing it might significantly reduce model performance. If a column is rarely populated or is only weakly correlated with the target, dropping that column might be simpler. If only a small subset of rows are missing crucial values, dropping those rows could be acceptable. Always test the performance implications—empirical validation in your domain is key.
How do you handle a situation where missing data itself might be a strong indicator of fraud?
Sometimes missing data is not random; for instance, fraudsters might intentionally leave out certain details. In such cases, treat missingness as a feature. You can create a binary indicator variable (e.g., “is_field_X_missing”) that explicitly signals when a feature is not available. This approach can help the model learn a pattern of missingness correlated with fraud. You may combine this with imputation methods or let the model handle the default “missing” category.
What if the dataset is extremely large and conventional imputation methods are too slow?
Scaling imputation strategies is challenging for large datasets. Approaches include:
Approximate Nearest Neighbors: Using approximate methods like annoy or faiss to find similar rows more efficiently than a standard k-NN approach.
Batching: Split the dataset and impute in batches, then combine results carefully.
Distributed Computing: Tools like Spark can help distribute imputation tasks across multiple machines.
Heuristic Approaches: Using simpler imputations like mean, median, or creating a smaller, representative sample to learn advanced imputers, then applying them to the full dataset.
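As a rough sketch of the batching and sampling ideas above, assuming df holds millions of transactions and numeric_cols is a placeholder list of numeric columns: fit an expensive imputer on a random sample of rows, then apply it to the full data in chunks to bound memory usage.
import numpy as np
from sklearn.impute import KNNImputer
# Learn the imputer on a manageable random sample
sample = df[numeric_cols].sample(n=100_000, random_state=0)
imputer = KNNImputer(n_neighbors=5).fit(sample)
# Transform the full dataset batch by batch
parts = [imputer.transform(chunk) for chunk in np.array_split(df[numeric_cols], 20)]
df[numeric_cols] = np.vstack(parts)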
How do you guard against data leakage with advanced imputation techniques?
If you train a model to impute missing values across your entire dataset (including the test set), you risk leaking future information into the training process. To prevent this:
Split your data into training and test sets (or folds for cross-validation) before imputation.
Fit imputation models (e.g., mean, median, or model-based) only on the training split.
Apply the imputation transformations to the test split using parameters learned from training data (e.g., the mean or median from the training set).
This ensures that the imputer’s parameters are not influenced by the test data.
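A minimal sketch, assuming X (features containing NaNs) and y (fraud labels) are placeholders: wrapping the imputer and classifier in a scikit-learn Pipeline guarantees the imputer is refit on the training portion of every cross-validation fold, so no test-fold statistics leak in.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The median used for imputation is recomputed from each training fold only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('clf', LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean())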
Can deep learning models help with missing data?
Some deep architectures can learn to handle partial inputs directly, especially if you design them with missingness in mind. For instance:
Variational Autoencoders (VAEs) can be adapted for imputing missing data by reconstructing incomplete feature sets.
Masking Strategies: You can supply a mask tensor indicating which positions are missing, allowing the network to learn patterns in the masked input.
Denoising Autoencoders: They are often used for data reconstruction tasks, including imputation scenarios.
However, these methods require careful hyperparameter tuning and robust validation, especially in a high-stakes domain like fraud detection.
Below are additional follow-up questions
How do you handle a situation where the pattern of missing data changes over time (concept drift)?
When the pattern of missing data evolves, it can mean that features that used to be consistently populated no longer are, or new features gradually have rising rates of missingness. This phenomenon is an example of concept drift, specifically in the distribution of missingness rather than (or in addition to) the label distribution.
A common approach is to monitor missing data rates and distributions in production, and set up alerts if the proportion of missingness exceeds certain thresholds. If a shift is detected—say a feature that used to be fully populated is now often missing—several steps might be taken:
Reassess the source: Check if there was a change in the upstream data pipeline. Sometimes a new data collection method or a broken API causes a spike in missingness.
Dynamic imputation: If the missingness becomes more frequent or systematically different, you might need a retrained imputation model that reflects new conditions (e.g. using rolling windows of recent data to update imputation parameters).
Adaptive models: Consider online or incremental learning methods that can adapt to changing data distributions in real time. These models can potentially learn new patterns of missingness on the fly.
Fallback rules: In a fraud detection context, you might implement fallback rules (like rule-based checks) for transactions missing certain critical features if your standard model no longer applies reliably.
A key pitfall here is ignoring the drift until performance deteriorates significantly. Another subtle issue is that you might incorrectly attribute performance drops to the model itself, rather than investigating whether changes in how data is collected are the real culprit.
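A minimal monitoring sketch, assuming recent_batch is a DataFrame of recent production transactions and baseline_rates holds the per-column missing rates measured on the training data (both placeholders); the 10-percentage-point threshold is purely illustrative:
# Compare current missingness against the training-time baseline
current_rates = recent_batch.isna().mean()
drifted = current_rates[(current_rates - baseline_rates).abs() > 0.10]
if not drifted.empty:
    print("Missingness drift detected in:", list(drifted.index))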
What strategies work best when multiple critical features are missing in the same row?
Some rows might have multiple features missing—this can happen when a particular data collection stage fails or an entire segment of transactions never supplies certain details. In such a case:
Evaluate row-level importance: If the row is missing too many critical features (for example, 70% of them), it may not be salvageable, and dropping that row could be more prudent—especially if such rows are a tiny fraction of the data.
Separate imputation for each feature: If the missingness is not too severe, you could apply iterative or model-based imputation, for instance Multiple Imputation by Chained Equations (MICE), where each feature is iteratively predicted from all the other available features (see the sketch at the end of this answer).
Domain-specific rules: In fraud detection, certain missing features might disqualify the row from standard processing. For example, if a user’s country of origin and IP address are both missing, it might trigger extra verification steps or manual checks before completion of the transaction, rather than an automated classification.
Hybrid approach: Sometimes, you can first try simpler imputation for less important features (e.g., filling in average transaction amount) and see if the row still has enough predictive features. If extremely critical features (like user ID) are missing, that row might require a fallback rule-based approach.
A subtle pitfall is performing imputation on multiple features simultaneously without accounting for interdependencies. For instance, if “Shipping Address” and “Billing Address” are missing together, but they are usually correlated, you risk imputing inconsistent data if you treat them as independent.
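A minimal MICE-style sketch using scikit-learn's experimental IterativeImputer, which models each feature from the others and therefore respects interdependencies such as the address example above (df and numeric_cols are placeholders):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- enables the experimental estimator
from sklearn.impute import IterativeImputer
# Each feature with missing values is regressed on the other features, cycling until convergence
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = mice_imputer.fit_transform(df[numeric_cols])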
Can you discuss specialized evaluation metrics or methods for comparing imputation strategies in a fraud detection context?
Measuring which imputation strategy works best involves both standard machine learning metrics and specialized analyses for missing data:
Baseline Model vs. Imputed Model: First, compare classification metrics such as AUC (Area Under the ROC Curve), precision, recall, F1-score, or business-specific metrics like financial loss from fraud. You want to see if imputation yields a significant lift over a simpler baseline (like dropping missing rows).
Data Utility Metrics: Some practitioners estimate "imputation error" by withholding a portion of the known data as if it were missing, imputing it, and then measuring how close the imputed values are to the real values (see the sketch at the end of this answer). Although this check is more common in academic settings, it still gives useful insight into the imputation's accuracy.
Cost-based Metrics: In fraud detection, you might measure how often an imputed dataset triggers false positives or false negatives. A mismatch might translate into either additional friction for legitimate users or missed fraud attempts—both of which have associated costs.
Timing Metrics: Certain imputation methods might add latency in a real-time fraud detection pipeline. If the more complex strategies improve accuracy but make real-time scoring infeasible, that trade-off must be evaluated.
A subtle issue is that the “best” imputation method from a purely predictive standpoint might not be the best from a cost or operational complexity perspective. Another pitfall is ignoring the business context: a slight improvement in AUC might not justify a major increase in computational overhead if real-time decisions are critical.
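A rough sketch of the data-utility check mentioned above, assuming X_complete is a fully observed numeric NumPy array (placeholder): hide a random slice of known entries, impute them, and measure the reconstruction error.
import numpy as np
from sklearn.impute import SimpleImputer
rng = np.random.default_rng(0)
X_masked = X_complete.astype(float).copy()
mask = rng.random(X_masked.shape) < 0.10      # artificially hide ~10% of known entries
X_masked[mask] = np.nan
imputed = SimpleImputer(strategy='median').fit_transform(X_masked)
rmse = np.sqrt(np.mean((imputed[mask] - X_complete[mask]) ** 2))
print("Imputation RMSE on held-out entries:", rmse)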
What if different columns have missing values that are heavily correlated with each other?
When multiple features tend to be missing at the same time (highly correlated missingness), it can reflect deeper patterns or data collection issues. For instance, certain user segments may fail to provide both their billing address and phone number. Approaches to handle this scenario:
Joint Imputation Models: Instead of imputing each feature independently, use a joint approach such as MICE, which iteratively uses other features (including those also imputed) to arrive at consistent estimates. This method captures the dependency structure.
Create a Missingness Profile: Treat the specific pattern of missingness as a separate category or indicator. If (billing address missing, phone number missing) is a common pattern in suspicious transactions, the model can learn that combination as a strong signal.
Investigate the Source: Often, correlated missingness reveals an upstream process failure (e.g., a certain data provider or form field is never populated for a specific user group). Fixing it at the source can be more effective and reduce the need for post-hoc solutions.
A subtle pitfall is to assume that correlation in missingness necessarily implies correlation in the actual data values. You may need domain knowledge to confirm that the correlation in missingness is not purely an artifact of how data is collected.
How do you ensure that imputation does not eliminate outliers that might be important signals for fraud?
Outliers can be crucial markers of unusual or suspicious behavior. Traditional imputation methods like mean or median replacement fill every gap with a typical value, so an entry that would have been extreme is replaced by an ordinary one, erasing potentially vital signals:
Flag Potential Outliers Before Imputation: One strategy is to detect outliers in the non-missing data before deciding how to handle missing points. If an extreme value is truly indicative of fraud, you do not want to lose that signal.
Include an Indicator Feature: If the original (now missing) value could have been an extreme outlier, adding a binary feature (“was_imputed”) helps the model learn that the filled-in value might be suspect.
Robust Imputation Methods: For numerical features, consider robust statistics (like median rather than mean), or distribution-based approaches that preserve the overall shape of the data, so that potential outlier ranges remain feasible.
Quantile-Based Imputation: Instead of a single mean or median, you might condition on a distribution for imputation, sampling values from high or low tails in proportion to how frequently outliers occur in the original data.
A subtlety is that not all extreme values are indicative of fraud—some legitimate large transactions happen. Overzealous outlier handling can lead to false alarms or penalize high-value, legitimate customers.
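A small sketch of the distribution-preserving idea, assuming df and its 'amount' column are placeholders: rather than imputing a single central value, draw replacements from the empirical distribution of observed amounts so that tail values remain possible.
import numpy as np
observed = df['amount'].dropna().to_numpy()
missing_idx = df.index[df['amount'].isna()]
# Sample imputed values from the observed empirical distribution, tails included
rng = np.random.default_rng(0)
df.loc[missing_idx, 'amount'] = rng.choice(observed, size=len(missing_idx))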
How can you handle missing values in a streaming or real-time fraud detection system?
In real-time systems, you often have limited time to make a decision. Traditional offline imputation or complex iterative methods may be too slow:
Online Imputation: Maintain rolling estimates of means, medians, or other statistics that are updated as new data arrives, so missing values can be filled quickly from the latest window (see the sketch below).
Fallback Logic: If a feature needed for a real-time decision is missing, you might have a predefined fallback rule (e.g., “if location is missing, assign a higher risk score”).
Adaptive Models: Some online learning algorithms can naturally handle missing data by ignoring or weighting incomplete features during training and prediction.
Pipeline Resiliency: Build resilience in your data pipeline. If data from one source is delayed or temporarily offline, you have logic to handle it gracefully rather than stalling the entire system.
A real-world pitfall is the sudden spike of missing data for a critical feature during peak transaction times (e.g., a holiday season). Ensuring the system can gracefully degrade is vital, or you risk introducing large latencies or systematically missing suspicious transactions.
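A minimal sketch of online imputation, assuming each transaction arrives as a dict with an 'amount' field (placeholder schema): keep a rolling window of recently observed values and fill gaps from its median.
from collections import deque
import statistics
window = deque(maxlen=10_000)  # rolling window of recently observed amounts
def fill_amount(txn):
    # If the amount is missing, fill it from the rolling median; otherwise record it
    if txn.get('amount') is None:
        txn['amount'] = statistics.median(window) if window else 0.0
    else:
        window.append(txn['amount'])
    return txn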
If you have a model that does not natively handle missing data, how can you integrate partial-information rows without losing them?
Some algorithms (like certain linear models or deep networks) generally require complete feature vectors. Strategies to incorporate incomplete data:
Impute at Training: Perform imputation on the training set so each training example is fully specified. At inference time, apply the same imputation rules or model to fill missing features for new transactions.
Feature Subset Models: In some specialized setups, you can train multiple models that handle different subsets of features. For example, if you notice certain subsets of features are typically available together, each model can handle that subset. Then you combine or ensemble them. However, this is more complex to maintain.
Use a Specialized Missingness Module: For neural networks, you might create a mask input that indicates which features are present. The network can learn to rely less on missing features during inference.
A subtle pitfall is inconsistent application of imputation at training vs. inference time. If the distribution of missingness in production is different than in training, your model might end up mis-calibrated.
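One lightweight version of the mask idea, assuming X is a numeric matrix with NaNs and y the fraud labels (placeholders): concatenate a binary presence mask to the zero-filled features so a standard network receives complete vectors yet can learn to discount absent inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier
mask = (~np.isnan(X)).astype(float)    # 1 where a feature is present, 0 where it is missing
X_filled = np.nan_to_num(X, nan=0.0)   # zero-fill so every input vector is complete
X_with_mask = np.hstack([X_filled, mask])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(X_with_mask, y)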
How do you decide when to bring in domain experts or business analysts for guidance on missing data?
While data scientists can detect patterns of missingness, domain experts often know the context that explains why certain fields are incomplete:
Critical or Complex Features: If the missing feature is central to the decision-making process (e.g., a user’s device info or transaction geolocation in fraud detection), domain experts can provide insight into how that data is typically captured or why it might be absent.
High-Stakes Decisions: If a model’s performance has major business impact (like preventing multi-million-dollar fraud), you want to confirm the imputation approach with subject matter experts to avoid hidden biases or unintended consequences.
Opaque or Legacy Systems: In older data pipelines, certain columns might be placeholders with no real meaning, or might have changed their meaning over time. Domain experts can clarify these nuances.
Unexpected Missing Patterns: Sometimes, an unusual pattern might be a clue that a new product feature is generating incomplete data in a specific region or for certain device types. A domain expert or product manager can confirm or deny these hypotheses.
A subtle edge case arises when domain knowledge conflicts with purely data-driven analysis. For example, experts may claim a particular feature shouldn’t be missing for certain customers, yet your data might show that it is. In such conflicts, investigating the pipeline thoroughly is essential; there might be a bug or an undocumented process change.