ML Interview Q Series: Cost-Sensitive Classification for Loan Fraud: Balancing Financial Error Costs.
31. Assume we have a classifier that produces a score between 0 and 1 for the probability of a particular loan application being fraudulent. a) What are false positives? b) What are false negatives? c) What are the dollar trade-offs between them, and how should the model be weighted accordingly?
This problem was asked by Affirm.
Understanding the Basic Terminology
In the context of a fraud detection classifier for loan applications, we are dealing with a binary classification problem. We label loan applications as either “fraudulent” or “legitimate.” The classifier outputs a probability (between 0 and 1) that an application is fraudulent, and we choose a threshold to decide whether to classify it as fraudulent or not.
When we talk about “positives,” we mean the instances (loan applications) classified as “fraudulent.” When we talk about “negatives,” we mean the instances classified as “legitimate.” Whether these classifications are correct or incorrect leads us to four terms: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
(a) What Are False Positives?
A false positive in this context is when the classifier labels a loan application as fraudulent, but in reality, it is not fraudulent. This means the model predicted “fraudulent” but the true state was “legitimate.”
In a practical banking scenario, a false positive would lead to subjecting a perfectly honest loan applicant to additional scrutiny or possibly rejecting their loan application altogether. This can have multiple consequences:
The customer might experience frustration or inconvenience. The bank might lose revenue from legitimate borrowers who walk away in frustration because of additional red tape or outright rejection.
(b) What Are False Negatives?
A false negative is when the classifier labels a loan application as legitimate, but in reality, it is fraudulent. The model predicted “legitimate” but the truth is “fraudulent.”
From a practical standpoint, a false negative is potentially more damaging if the bank disburses money to a fraudulent applicant. This can lead to:
Direct financial loss to the bank. Reputational damage because the bank failed to catch the fraud.
(c) What Are the Dollar Trade-Offs and How Should the Model Be Weighted?
False positives and false negatives do not necessarily carry the same financial cost. In a fraud detection setting:
A false negative (letting a fraudulent application slip through) can lead to a high financial loss. Suppose each fraudulent loan causes a large dollar loss (for example, a significant portion of the loan amount is unrecoverable). In that case, the cost of a single false negative might be very high.
A false positive (flagging or rejecting a legitimate application as fraudulent) also has a cost, but it is usually the lost profit or opportunity for the bank plus customer dissatisfaction. If the lost profit or reputational harm from incorrectly rejecting a legitimate loan is smaller than the direct loss from granting a fraudulent loan, then a false negative is considered more costly.
Because of these differing monetary impacts, the classifier's decision threshold usually should not sit at the default 0.5 but should be set asymmetrically. If letting a fraudulent loan slip through is extremely costly, you want fewer false negatives, which usually means lowering the threshold so the model classifies a loan as fraudulent more readily. However, this increases the number of false positives, which carry their own cost.
In practice, banks often weigh these losses with a cost matrix or a direct expected cost function. A simplified version of such a cost function could be represented as follows:

Total Cost = (Cost_FP × FP) + (Cost_FN × FN)

where:
FP is the number of false positives,
FN is the number of false negatives,
Cost_FP is the cost of a single false positive,
Cost_FN is the cost of a single false negative.
If Cost_FN is much higher than Cost_FP (which is generally the case with fraud), the model should be tuned to reduce false negatives. This might be done by lowering the probability threshold for classifying a loan as fraudulent, ensuring more suspicious loans are flagged. The trade-off is that as you lower the threshold, you raise the number of false positives. Balancing this is a business decision: you want to minimize overall cost, and this often means carefully quantifying both types of errors in monetary terms.
Deeper Explanation of the Cost Trade-Off
To understand how to weight the model accordingly, data scientists and product owners typically:
Estimate the cost (or potential loss) associated with each fraudulent loan that gets approved (false negative). Estimate the cost (or lost revenue / customer churn) associated with each incorrectly rejected or flagged legitimate loan (false positive). Perform scenario analyses by varying the decision threshold and seeing how many FPs and FNs result.
In real-world systems, you might simulate or cross-validate different thresholds to compute an estimated total cost under each threshold, then pick the threshold that gives you the lowest total cost or best cost-benefit ratio.
Potential Implementation Details
One approach is to train a model (for instance, a logistic regression, random forest, or neural network) and obtain a probability output. Then you can do:
import numpy as np
from sklearn.metrics import confusion_matrix

def find_optimal_threshold(probabilities, labels, cost_fp, cost_fn):
    # Scan 101 candidate thresholds between 0 and 1 and keep the cheapest one.
    thresholds = np.linspace(0, 1, 101)
    min_cost = float('inf')
    best_threshold = 0.0
    for t in thresholds:
        # Flag an application as fraudulent when its score reaches the threshold.
        preds = (probabilities >= t).astype(int)
        # labels=[0, 1] guarantees a 2x2 matrix even if preds contain only one class.
        tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
        # Total dollar cost at this threshold: false positives plus false negatives.
        cost = cost_fp * fp + cost_fn * fn
        if cost < min_cost:
            min_cost = cost
            best_threshold = t
    return best_threshold, min_cost
You can then pick the threshold that yields the lowest total cost according to your chosen cost_fp and cost_fn values. This is a simplistic example that assumes constant costs. In reality, cost could depend on loan size, applicant’s profile, or other factors that might make a certain loan approval riskier or more lucrative.
When the financial impact of fraud (false negatives) is extremely high, the best threshold is typically closer to the lower end. That means you are more aggressive in flagging loans as fraudulent. However, if the cost_fp is also large—maybe each flagged application requires significant manual overhead and potential reputation damage if too many customers are annoyed—then the threshold might not be shifted too drastically. It is always about finding a balanced point of minimal expected cost.
Potential Pitfalls in Real-World Scenarios
Model drift and changing fraud patterns can mean that costs associated with false negatives and false positives vary over time. A threshold that was optimal at one point may not remain optimal, so recalibration is essential. Loans are often high-value transactions with extremely skewed class distributions: genuine applications far outnumber fraudulent ones. This can lead to poor training if the model is not carefully set up to handle class imbalance. Over-reliance on manual reviews (in the case of flagged applications) can become extremely expensive if false positive rates are too high.
Possible Follow-up Question 1: How do we determine the exact cost of a false positive and a false negative in real-world systems?
In many FANG-level interviews, the hiring panel wants to ensure the candidate understands that cost determination is a collaborative, cross-functional exercise, involving stakeholders from finance, operations, and data science.
In a real-world banking environment, a “false negative” cost might involve: The loan principal that can never be recovered. Investigation cost and legal fees. Reputational loss if fraud is not caught systematically.
A “false positive” cost might involve: Lost interest revenue from a legitimate loan that the bank rejected. Operational overhead due to time spent reviewing flagged cases. Customer churn and the intangible reputational damage from incorrectly flagging legitimate customers.
Organizations often create approximate numerical estimates by examining historical data (e.g., the typical loss per incident of fraud, the average revenue from a typical approved loan, etc.). Then they incorporate those values as cost_fn and cost_fp. These values can evolve over time as market conditions or fraud patterns change.
Possible Follow-up Question 2: How do we handle extreme class imbalance when fraud rates are very low, and how does that affect false positives and false negatives?
In many credit fraud settings, fraudulent loans represent a tiny fraction of all applications, often well under 1%. This highly imbalanced data can affect both training dynamics and threshold selection.
Key considerations: Training data must be managed carefully. Standard metrics like accuracy can be misleading if the data are heavily imbalanced. Additional metrics such as Precision, Recall (especially for the fraudulent class), and F1 score become more relevant. Often the Recall for the fraudulent class is critical, because missing a fraudulent loan (false negative) is extremely costly. The number of false positives can be large in absolute terms when the threshold is lowered to catch more fraud. Even though you might improve recall for fraud detection, you might end up flagging a large number of legitimate loans. Hence, the cost trade-off analysis becomes even more essential, because the absolute amount of false positives can overwhelm manual review processes or annoy a large pool of genuine customers.
One way to address this is to use specialized metrics or to apply oversampling / undersampling / synthetic data generation for the minority class to help the model learn fraudulent patterns more effectively. However, even after robust training, the threshold selection is still done by balancing the cost of missing fraud (false negative) against the cost of over-flagging legitimate applicants (false positive).
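As a concrete but non-authoritative sketch, cost sensitivity can also be pushed into training itself via class weights; the 25:1 weight, the synthetic data, and the choice of logistic regression below are illustrative assumptions rather than recommended settings:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data standing in for real loan applications.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight tilts the training loss toward the rare fraud class; the 25:1 ratio
# is a placeholder that would normally be derived from Cost_FN / Cost_FP.
model = LogisticRegression(class_weight={0: 1, 1: 25}, max_iter=1000)
model.fit(X_train, y_train)
fraud_probabilities = model.predict_proba(X_val)[:, 1]   # feed into threshold selection as before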
Possible Follow-up Question 3: How do we decide which metrics to optimize beyond a standard accuracy measure?
It’s usually vital to optimize a metric that captures the financial priorities. Accuracy alone can be misleading with highly imbalanced data. The following metrics are frequently used:
Precision for the fraudulent class (Of all applications flagged as fraudulent, how many are actually fraudulent?). This helps measure how many false positives we have. Recall for the fraudulent class (Of all actual fraudulent applications, how many did we catch?). This helps measure how many false negatives we have.
Combining Precision and Recall: The F1 score is the harmonic mean of precision and recall. It’s often used when you want to find a balance between false positives and false negatives, but it does not directly incorporate the dollar cost difference between these two types of mistakes.
When the cost of false negatives is significantly higher, you may decide to optimize for a metric that prioritizes recall, and then perform a final check that false positives do not exceed an acceptable cost. You can incorporate cost-sensitive methods in training, or you can choose a threshold based on a cost matrix or cost function, as discussed previously.
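To make the metric discussion concrete, here is a minimal sketch of computing fraud-class precision, recall, and F1 with scikit-learn; the tiny label arrays are placeholder values for illustration only:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Tiny illustrative arrays: 1 = fraudulent, 0 = legitimate (placeholder values).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1, 0, 0])

precision = precision_score(y_true, y_pred, pos_label=1)  # flagged loans that were truly fraud
recall = recall_score(y_true, y_pred, pos_label=1)        # actual frauds that we caught
f1 = f1_score(y_true, y_pred, pos_label=1)
print(f"Fraud-class precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")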
Possible Follow-up Question 4: How might we incorporate dynamic or loan-size-based weighting?
Not all loans are for the same amount, so a missed fraudulent application for $100,000 is far worse than a missed one for $1,000. Similarly, a high-value legitimate loan might generate substantial profit, so rejecting it by mistake (a false positive) is more detrimental.
This scenario can be handled by building a more advanced cost function at the instance level, where each instance (loan application) has its own cost. You might do:
Predict the probability of fraud for each loan.
Multiply that probability by the potential financial impact of a fraudulent outcome.
Use dynamic thresholding or custom objective functions that reflect the cost proportional to the loan size.
During training, you might incorporate these instance-level weights into the loss function. Many gradient boosting frameworks (and even neural network frameworks) allow for sample weights to be specified. This way, you penalize errors on large loans more than errors on small loans.
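A hedged sketch of this idea using scikit-learn's sample_weight, where the synthetic features, labels, and loan amounts are placeholders and the weighting scheme (dollar exposure divided by its mean) is only one reasonable choice:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data: features, labels (~2% fraud), and a dollar amount per application.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 10))
y_train = rng.binomial(1, 0.02, size=5000)
loan_amounts = rng.uniform(1_000, 100_000, size=5000)

# Weight each application by its dollar exposure so errors on large loans are
# penalized more heavily; dividing by the mean keeps weights on a sensible scale.
sample_weights = loan_amounts / loan_amounts.mean()

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train, sample_weight=sample_weights)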
Possible Follow-up Question 5: How do we ensure the model remains fair and not biased against certain demographic groups if we tighten thresholds to reduce false negatives?
In many regulated industries, including lending, fairness is crucial. You do not want the model to systematically generate more false positives for certain protected groups (e.g., certain demographics, regions, or income brackets). When you shift thresholds to reduce false negatives, you might inadvertently skew false positives toward groups that historically have lower credit or certain patterns in their applications.
To address this:
Monitor fairness metrics such as disparate impact, false positive rate parity, or false negative rate parity across demographic groups. Incorporate fairness-aware training or post-processing strategies that adjust thresholds per subgroup in a manner consistent with compliance requirements. Work closely with legal and compliance teams to ensure that the model does not violate regulations regarding credit and lending.
The overarching principle is that while you manage the cost of fraud, you also maintain compliance, fairness, and minimize discriminatory outcomes.
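To make the first monitoring step above concrete, here is a minimal sketch that compares false positive rates across groups on a hypothetical labeled sample; the group labels and values are invented for illustration:

import pandas as pd

# Hypothetical labeled sample with a (made-up) demographic group column.
df = pd.DataFrame({
    "y_true": [0, 0, 1, 0, 0, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 1, 0, 0, 0],
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# False positive rate per group: share of legitimate applications that were flagged.
legit = df[df["y_true"] == 0]
fpr_by_group = legit.groupby("group")["y_pred"].mean()
print(fpr_by_group)   # large gaps between groups warrant a closer fairness review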
Summary of Reasoning Behind Each Answer
The underlying logic always revolves around understanding the confusion matrix, distinguishing between the cost (financial, reputational, operational) of false positives versus false negatives, and making informed decisions on how to set or adjust model thresholds. Real-world constraints such as changing fraud patterns, high class imbalance, dynamic loan amounts, and fairness considerations make it essential to continually monitor and recalibrate both the model and its decision boundary.
All these factors are crucial for FANG-level interview discussions, because they illustrate an in-depth understanding of practical machine learning deployment in a financially sensitive context.
Below are additional follow-up questions
Possible Follow-up Question 6: What strategies can be used to prevent overfitting when building a fraud detection model, and how do we identify if the model is indeed overfitting?
When training a fraud detection classifier, overfitting is a common pitfall. Overfitting occurs when the model fits the noise or peculiarities of the training data too closely and fails to generalize to new, unseen data. This is especially critical in fraud detection because the cost of deploying an overfit model can be high, potentially leading to unexpected surges in both false positives and false negatives in real-world usage.
To prevent overfitting, we can deploy various strategies:
Regularization Techniques One major strategy is to add a regularization term to the loss function. In a neural network, this might be done using weight decay (L2 regularization) or by constraining the norm of the weight vector. In tree-based models like Gradient Boosted Decision Trees (GBDT), you can limit tree depth or increase the minimum number of samples required to split a node to reduce variance. The concept behind regularization is to penalize the magnitude of parameters or the complexity of the model, nudging it to learn generalizable patterns instead of memorizing anomalies in the training set.
Early Stopping In iterative algorithms like gradient boosting or neural network training, monitoring the validation error can help stop training when performance on a hold-out set stops improving (or begins to degrade). Early stopping ensures the model doesn’t continue learning the noise present in the training data. This involves splitting the dataset into training and validation sets. The training set is used to update the model, while the validation set monitors if generalization is improving or not.
Cross-Validation Using k-fold cross-validation helps identify if performance metrics (e.g., AUC, precision, recall) are consistent across multiple folds. If the model performs extremely well on one fold but poorly on others, this is a signal that the model might be overfitting to specific patterns in the training subset. Cross-validation also ensures the training data is used effectively, particularly in fraud detection where fraudulent cases can be rare and data is precious.
Data Augmentation and Imbalanced Handling In fraud detection, class imbalance can exacerbate overfitting if the model only sees a limited number of fraud examples. Techniques like SMOTE (Synthetic Minority Over-sampling Technique) or carefully designed oversampling can expand the fraudulent class in the training set. This helps the model generalize better and not overfit to the limited fraud examples.
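A brief sketch of SMOTE with the imbalanced-learn package (assumed to be installed); the synthetic dataset and the default 1:1 resampling target are illustrative choices, not recommendations:

from imblearn.over_sampling import SMOTE          # imbalanced-learn package, assumed installed
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the fraud class on the training split only; the test split stays untouched
# so evaluation still reflects the true class balance.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(y_train.mean(), y_resampled.mean())          # minority share rises toward 0.5 by default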
Observing Model Generalization in Real-Time or on a Hold-Out Test Set A robust sign of overfitting is when the model exhibits excellent metrics on the training set but performs poorly on a separate test set or real-world data. Hence, we typically reserve an untouched hold-out set to get an unbiased assessment of model performance before deploying it.
Potential Pitfalls and Edge Cases If the distribution of fraud changes over time, a model that seemed well-generalized might suddenly degrade in performance. This phenomenon is sometimes referred to as concept drift. Even if the model was not originally overfitting, a drastic shift in fraud patterns can make it appear as though the model is poorly generalized. A small fraud dataset can lead to potential “overfitting” to the few fraudulent examples available. Ensuring that the training set is representative or augmented responsibly is crucial.
Identifying Overfitting A sharp rise in the gap between training metrics (accuracy, recall, or precision) and validation/test metrics is one of the biggest red flags of overfitting. Monitoring learning curves (plotting the training and validation score over successive epochs or training iterations) can give insight into when overfitting begins. If the training score continues improving while the validation score stagnates or falls, overfitting is likely happening.
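One hedged way to inspect this in code is to compare train and validation AUC as boosting proceeds, using gradient boosting's staged predictions on a synthetic dataset; the data and model settings below are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

# Track train vs. validation AUC as boosting proceeds; a widening gap is the classic
# overfitting signature described above.
for i, (p_train, p_val) in enumerate(zip(model.staged_predict_proba(X_train),
                                         model.staged_predict_proba(X_val))):
    if (i + 1) % 50 == 0:
        print(i + 1,
              round(roc_auc_score(y_train, p_train[:, 1]), 3),
              round(roc_auc_score(y_val, p_val[:, 1]), 3))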
Possible Follow-up Question 7: How do we handle real-time scoring or streaming data, and what additional considerations arise when deploying a fraud detection model in a production environment?
Real-Time vs. Batch Scoring In real-time settings, when a new loan application arrives, the bank might want an immediate fraud risk assessment. This necessitates a model that is both fast and accurate. You typically must store the trained model in memory so that incoming data can be passed through quickly with minimal latency. Batch scoring is common when there is less urgency (e.g., a nightly job that processes all new applications in a queue). The trade-off is that in batch mode, you have more time for complex computations, but you might not catch fraud quickly enough for certain business needs.
Infrastructure and Deployment Deploying a fraud model typically requires an infrastructure that can handle spikes in incoming traffic (e.g., certain times of day when loan applications surge). Load balancing and microservices architecture can ensure the inference process remains efficient under high loads. Continuous monitoring is crucial. If you detect unusual swings in model output, that may signal model drift or a new type of fraudulent behavior.
Low Latency Requirements Sometimes, a manual review might not be feasible due to the volume or speed of transactions. In these scenarios, you cannot rely purely on a human-in-the-loop approach, so the real-time model’s accuracy and reliability become paramount. If it’s too conservative (many false positives), you risk overwhelming any human review team. If it’s too lenient (many false negatives), you suffer large financial losses.
Retraining and Model Updates When the model is deployed in real-time, you need an automated pipeline that captures newly labeled data, updates the training set, and retrains or fine-tunes the model at regular intervals if concept drift is detected. This also includes version control of the model so you can rollback if a new model performs worse than expected. Shadow deployment or A/B testing can be used for critical transitions. You can run the new model in parallel with the old model to evaluate performance in real-time without impacting actual business decisions until you’re confident in its performance.
Potential Pitfalls and Edge Cases A big risk in real-time systems is that fraudulent actors might quickly adapt to the model’s weaknesses if they can probe the system and test small changes to an application’s data. This adaptive nature of fraud is why consistent monitoring and quick retraining are crucial. Data engineering aspects, such as data pipeline latency and feature engineering steps (especially if they rely on external services), can cause bottlenecks or even incorrect feature values, leading to suboptimal detection.
Possible Follow-up Question 8: How do we address adversarial behavior where fraudsters deliberately try to circumvent the detection system?
Adaptive Fraud Tactics Fraudsters may observe what triggers the system to flag an application as “fraudulent.” Over time, they may evolve strategies—altering application attributes in subtle ways—to reduce the suspicion score. This scenario is akin to an adversarial game, where both sides (the detection system and the fraudsters) are continually adapting.
Techniques to Mitigate Adversarial Attacks Ensemble methods: By combining multiple models, it becomes harder for adversaries to find a single predictable weakness to exploit. Regular updates: Continuously retraining the model on the latest fraud patterns ensures it remains up-to-date against new tactics. Feature randomization or encryption: Some systems randomize certain features or how features are aggregated so that the exact logic is harder to reverse-engineer. Active learning: The system can be designed to proactively investigate “borderline” cases where the model is uncertain, thus quickly learning new fraud patterns.
Monitoring and Counter-Measures Institutions often have specialized teams monitoring large, suspicious shifts in the distribution of model inputs and outputs. These can be detected if many applications suddenly cluster in unusual ways or if there is a spike in loan defaults that were previously unflagged. Rule-based overrides or layered defenses can supplement the ML model. For example, if an application is flagged for suspicious IP or device signals, it might trigger additional checks even if the ML model’s probability is below the threshold.
Potential Pitfalls and Edge Cases Balancing transparency and robustness: Regulators or customers may demand explanations of how loans are scored, which means you might have to reveal aspects of the model. This can inadvertently help fraudsters reverse-engineer the system. Striking a balance between interpretable decision-making and security is not trivial. Fraud rings: Coordinated fraud attempts can circumvent simplistic anomaly detection if members of the ring share stolen identities or systematically manipulate variables that the model relies upon.
Possible Follow-up Question 9: What role does feature engineering play in improving fraud detection, and how do we decide which features are most important?
Importance of Feature Engineering In fraud detection, raw application data often requires transformations or the generation of new features. These can include time-based features (e.g., how many credit applications in the past 24 hours from the same device), external data sources (e.g., credit bureau checks), or derived ratios (e.g., requested loan amount relative to the applicant’s stated income).
Identifying Key Features Data scientists often use domain knowledge, correlational analyses, or feature importance methods (like SHAP values for tree-based models) to surface which features carry the most predictive power. For instance, an unusual IP geolocation might be highly indicative of fraud. Correlation analyses with historical fraud outcomes can reveal patterns, such as certain job titles, email domains, or phone number patterns showing higher fraud rates. However, caution is needed to avoid introducing bias or spurious correlations.
Automated Feature Engineering Tools like automated feature selection or transformation libraries can systematically generate candidate features, such as polynomial interactions or grouping categories in novel ways. These can be beneficial but risk overfitting if the candidate feature space becomes too large relative to the data size.
Feature Drift and Maintenance Fraudsters might shift their behavior to circumvent specific signals once they realize a certain feature is commonly used. This phenomenon implies that certain features may “burn out” over time and lose effectiveness. Ongoing monitoring of feature importance and performance is key. Additionally, if a data feed (e.g., a third-party identity verification API) changes its structure or starts introducing delays, it can break features in production.
Potential Pitfalls and Edge Cases Leakage: Some features might inadvertently leak future knowledge or outcomes that wouldn’t be available at the time of the real application decision. For example, if your dataset includes a post-approval field indicating actual loan repayment events, you can’t use that directly for real-time fraud prediction. Complex transformations: Excessive or overly complex transformations can hamper interpretability and cause deployment challenges, especially if real-time feature computation is cumbersome or expensive.
Possible Follow-up Question 10: How do we determine if the model needs recalibration, and what methods exist for calibrating the output probabilities?
Why Calibration Matters When a model outputs a probability of fraud, we often assume that a score of 0.8 means an 80% chance of fraud. However, many binary classifiers (like random forests or boosted trees) are not perfectly “calibrated” by default. This can cause problems if you rely on these probabilities to make threshold-based or cost-based decisions.
Common Calibration Methods Platt Scaling: Typically used for SVM or logistic outputs, though it can be applied more generally. It fits a logistic regression to the model’s output scores to produce well-calibrated probabilities. Isotonic Regression: A non-parametric approach to map raw output scores to probabilities. It can handle more flexible, monotonic relationships. Beta Calibration and Temperature Scaling: Beta calibration is a parametric method designed for binary classifiers’ scores, while temperature scaling rescales a neural network’s logits before the softmax; both aim to make the predicted probabilities align better with observed frequencies.
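A minimal sketch of post-hoc calibration with scikit-learn's CalibratedClassifierCV, assuming a random forest base model and synthetic data; method="sigmoid" corresponds to Platt scaling and "isotonic" to isotonic regression:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Wrap an uncalibrated base model; "sigmoid" = Platt scaling, "isotonic" = isotonic
# regression (which typically needs more data to avoid overfitting the calibration map).
base_model = RandomForestClassifier(n_estimators=200, random_state=0)
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
calibrated_probabilities = calibrated.predict_proba(X_test)[:, 1]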
When to Recalibrate If you observe that the model’s predicted probabilities do not match actual observed fraud rates, this discrepancy can lead you to recalibrate. For instance, you might discover that cases your model flags with a 70% fraud probability only turn out to be fraudulent 40% of the time in practice. A shift in data distribution can invalidate previously good calibration. Any sign of concept drift or new fraud patterns requires not just model retraining but also a potential recalibration step.
Potential Pitfalls and Edge Cases Overfitting in calibration: If you have limited data, isotonic regression can overfit, flattening or distorting the probability outputs if it tries to fit too closely to small sample quirks. Separate calibration sets: You need a separate calibration set or use cross-validation to calibrate and validate that calibration. Relying on the same data used for initial training can degrade reliability.
Possible Follow-up Question 11: How do we handle partial labels or uncertain labels when training a fraud detection model?
Nature of Partial or Uncertain Labels Sometimes you might not have perfect labels indicating whether a loan was truly fraudulent or not. For example, if the loan is still outstanding, you might suspect fraud but not have final confirmation. Or perhaps an investigation is underway. Similarly, the model might flag suspicious activity, but the bank never receives final evidence confirming it was fraud. These unknowns can introduce label noise into the dataset.
Techniques for Managing Uncertainty Semi-Supervised Learning: When only a small fraction of data is definitively labeled (fraud or not fraud), semi-supervised methods can utilize the larger unlabeled portion. For instance, methods like self-training or label propagation attempt to infer potential labels for unlabeled data, but must be used carefully to avoid compounding errors. Probabilistic Labeling: Instead of treating uncertain instances as strictly fraudulent or legitimate, you can assign a probability label to them. For example, “50% chance of being fraud” can then be incorporated into the training objective as a weighted target. Active Learning: If certain applications are more suspicious and can be labeled via manual investigation, an active learning approach focuses the labeling effort on the most informative samples. This helps gradually reduce uncertainty in the labeled dataset.
Potential Pitfalls and Edge Cases Overconfidence in partial labels: If you assume an unconfirmed case is definitely negative (non-fraud) merely because it hasn’t been confirmed as fraud yet, you risk introducing erroneous negative labels into the dataset. Introduction of bias: If the unlabeled set primarily consists of certain demographic or loan types, your model might learn incorrect correlations or systematically overlook new fraud patterns.
Possible Follow-up Question 12: In what ways can interpretability and explainability be incorporated into fraud detection models, and why is this crucial?
Importance of Explainability in Fraud Detection Banks and other lending institutions face regulatory requirements that often demand explanations for adverse actions (e.g., rejecting a loan). Explaining why the model flagged an application can help comply with these regulations. Building trust with customers is also important. If a legitimate applicant is flagged, providing a clear and comprehensible reason helps reassure them that the institution is not making arbitrary decisions.
Methods for Explainability Feature Importance: Techniques like Gini importance in random forests, or SHAP (SHapley Additive exPlanations) values for tree-based or neural network models, can highlight which features were influential in the decision. Local Surrogate Models: Tools like LIME (Local Interpretable Model-agnostic Explanations) create a simple interpretable model (like a linear model) locally around the prediction to show what drove the outcome. This is particularly useful if the global model is a complex ensemble or deep neural network. Rule Extraction: Some frameworks can approximate a complex model with a set of if-then rules. These rules are more interpretable, although they may lose some predictive power.
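As a small illustration of the first option, here is a sketch that prints impurity-based (Gini) importances from a random forest trained on synthetic data with hypothetical feature names; per-prediction attributions would instead come from the SHAP or LIME packages, which follow a similar workflow:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=8, weights=[0.98, 0.02], random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]   # hypothetical names

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based (Gini) importances; higher values mean the feature contributed more
# to the trees' splits overall, not to any single prediction.
for name, importance in sorted(zip(feature_names, model.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")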
Challenges and Pitfalls Balancing interpretability with performance: The most accurate models are often ensembles or deep neural networks, which can be “black boxes.” Gaining interpretability might require additional steps or specialized tools. Sensitivity of certain features: Some highly predictive features might be sensitive attributes. If explaining the model’s decision reveals these features, the institution must ensure they are not inadvertently breaching privacy or fairness guidelines.
Possible Follow-up Question 13: How might the ROI of a more sophisticated model compare to a simpler rule-based approach, and when might a simpler approach still be preferable?
Assessing ROI A sophisticated machine learning model (e.g., a large ensemble or deep neural net) can potentially reduce fraud losses significantly if it identifies complex patterns that a simpler rule-based system would miss. The return on investment (ROI) for such a system can be high if the cost of implementation and maintenance is outweighed by fraud savings. However, building and maintaining advanced models can be expensive. It requires skilled data scientists, robust data engineering infrastructure, and ongoing monitoring. If the fraud volume is not large enough, or the cost of false negatives is not extremely high, a simpler approach may be more cost-effective.
Reasons Simpler Approaches Might Be Preferable Easier to interpret and explain to regulators. Lower development and maintenance cost. If the data does not support a complex model (e.g., too few fraud samples or poor data quality), simpler rules or logistic regression might suffice. In some very small or specialized markets, the number of applications might be so low that a fully automated system is unnecessary; human reviews or rule-based triggers can be adequate.
Pitfalls and Edge Cases Complacency: Relying on rules alone can be dangerous if fraud patterns shift rapidly or new types of fraud are not captured. Inflexibility: A rule-based system can quickly become outdated, requiring manual updates of rule thresholds or patterns. Hidden complexities: Even seemingly simple conditions might not scale well if the business expands, leading to a maintenance nightmare.
Possible Follow-up Question 14: How do we address situations where fraudulent applicants repeatedly modify their information in a single session to evade detection?
Real-Time Tracking of Changes Sometimes, a fraudster might fill out an application form, receive some feedback or suspicion that they’ll be flagged, and then alter certain fields—like phone numbers, addresses, or job titles—hoping to skirt the detection logic. Systems can log incremental changes to the application fields. For instance, if the IP address is the same but the name and address keep changing within a short time span, that’s a strong indicator of potential fraud.
Modeling State Transitions If you have a system that sees each field’s change event in near real-time, you can build a secondary model or rule set that detects “high-velocity changes.” For example, “any applicant that changes their income more than three times in one session might be flagged as high-risk.” Such meta-features are especially relevant in real-time web or mobile form submissions.
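A toy sketch of such a meta-feature, counting per-session field edits in pandas; the event log, the field names, and the three-edit cut-off are all assumptions for illustration:

import pandas as pd

# Hypothetical event log: one row per field edit during an application session.
events = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "field":      ["income", "income", "income", "income", "address", "income"],
})

# Count how many times each field was changed within a session; flag sessions where
# any single field (e.g. income) was edited more than three times.
edit_counts = events.groupby(["session_id", "field"]).size().reset_index(name="n_edits")
flagged_sessions = edit_counts.loc[edit_counts["n_edits"] > 3, "session_id"].unique()
print(flagged_sessions)   # ['s1'] under these placeholder values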
Pitfalls and Edge Cases False alarms: Some legitimate users might correct mistakes (e.g., they typed their address incorrectly) multiple times, or might be uncertain of how to fill the form, causing multiple field edits. Balancing the threshold for “suspicious repeated changes” is non-trivial. Technical complexity: Tracking these changes in real-time requires a robust session management system and advanced event tracking architecture that can parse and store each field modification efficiently.
Possible Follow-up Question 15: How might we use unsupervised learning or anomaly detection methods to complement the supervised fraud detection model?
Unsupervised / Anomaly Detection Motivation Supervised models require labeled examples of fraud. But what if novel fraud patterns emerge that have never been labeled or encountered? Unsupervised anomaly detection methods can spot unusual data points or transactions that deviate from the norm, potentially flagging new fraud types early.
Common Techniques Clustering-based methods: Group applications into clusters; instances that fall far from any cluster centroid or in a very small, outlier cluster might be suspicious. Autoencoders: Train a neural network to reconstruct “normal” loan applications. If the reconstruction error is high, the application might be anomalous. One-Class SVM: Learns a boundary around “normal” instances. Anything that falls outside this boundary is flagged as an outlier.
Integration with the Supervised Model One approach is to run the unsupervised anomaly detection model in parallel. If the supervised model does not flag the application, but the anomaly detector assigns a high anomaly score, the application might be routed for secondary review. This layered approach can catch novel or emerging fraud patterns that the supervised model was not explicitly trained on.
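A hedged sketch of this parallel arrangement, pairing an IsolationForest with a supervised model on synthetic data; the 0.5 score cut-off and the 1% contamination setting are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, stratify=y, random_state=0)

supervised = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
anomaly_detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)

fraud_prob = supervised.predict_proba(X_new)[:, 1]
is_outlier = anomaly_detector.predict(X_new) == -1     # -1 marks anomalous applications

# Route to secondary review when the supervised score alone would not flag the case
# but the application still looks unusual; the 0.5 cut-off is a placeholder.
needs_review = (fraud_prob < 0.5) & is_outlier
print(int(needs_review.sum()), "applications routed for secondary review")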
Pitfalls and Edge Cases High false alarm rates: Anomaly detection, by design, can produce a lot of false positives because unusual behavior is not always fraudulent. Fine-tuning how anomalies are prioritized or integrated with the main pipeline is important. Maintaining unsupervised models: You still need to periodically retrain or update the distribution of “normal” data as legitimate patterns shift over time. If the population distribution evolves, the anomaly detector might start labeling valid applications as anomalies.
Possible Follow-up Question 16: What unique considerations arise when we integrate third-party data sources (e.g., credit bureaus, identity verification APIs) into our fraud detection model?
Enriching Features with Third-Party Data Incorporating credit score data, identity verification flags, or public record details can significantly improve fraud detection accuracy. These external signals may reveal discrepancies not apparent from the bank’s internal data alone.
Reliability and Availability Third-party services can experience downtime or latency spikes. If real-time decisions rely on these data feeds, any unavailability can degrade the system. You might need fallback mechanisms, such as using cached data or proceeding with a partial risk score. Data quality can vary; if the third-party API occasionally returns incomplete or incorrect data, the model can be misled.
Security and Privacy Regulations Sharing personal identifiers with an external provider must comply with regulatory frameworks. Some jurisdictions have strict data protection laws that might limit the type or granularity of data you can send or store. Agreements with vendors or data providers must ensure data is used only for legitimate fraud detection purposes and adheres to privacy guidelines.
Model and Pipeline Adjustments Feature engineering steps might need to incorporate asynchronous data fetches. If the model is time-sensitive, you must design a pipeline that gracefully handles partial or late-arriving data. Depending on the cost of each API call, you might want a multi-tiered approach—only calling expensive APIs when the base model’s score is in a “gray zone.”
Pitfalls and Edge Cases Mismatch or stale data: The external source might have outdated personal information. Relying too heavily on it could lead to false positives if legitimate applicants recently moved or changed phone numbers. Complex error handling: If the external service returns an error code or timed out, you must decide how to proceed in your inference pipeline without that data.
Possible Follow-up Question 17: What steps would you take to prepare for auditing or regulatory inspections of your fraud detection system?
Comprehensive Documentation Maintaining a detailed record of the model’s design, training process, hyperparameter settings, and intended usage is key. Regulators might ask for explanations of how decisions are made and whether certain data fields are permissible. Version control of both the code and the model’s parameters (e.g., Git or a model registry) is crucial. Auditors may want to know which model version was live at a particular point in time and how it was trained.
Transparent Model Outputs Many regulations require organizations to provide adverse action notices or explain how certain decisions were reached. Having interpretable outputs or a post-hoc explanation method (e.g., SHAP) helps address these requirements without exposing proprietary model internals. Logs and traceability: The system should log the input features, the model’s output score, and the final decision for each application. This helps in forensic analysis if there’s a dispute or incident.
Compliance with Fair Lending and Anti-Discrimination Laws Regulators will scrutinize whether protected classes (e.g., defined by race, gender, age) are being treated unfairly. Even if the model does not explicitly use these attributes, it might use correlated attributes leading to disparate impact. Regularly testing for disparate impact or performing bias audits on your data and model outputs can help proactively address or mitigate discriminatory outcomes.
Pitfalls and Edge Cases Regulatory changes: Laws and industry standards evolve, so a model that was compliant last year might need updating. Jurisdictional differences: An international bank might have to comply with multiple regulatory regimes with different definitions of sensitive data and acceptable modeling practices.
Possible Follow-up Question 18: In a scenario where loan approvals are rare events, but the monetary impact is huge, how do we ensure robust evaluation of the fraud detection system?
Challenges with Low Volume of Approved Loans If approvals are rare—perhaps most applications are screened out early—a standard cross-validation approach might not capture the real-world distribution of truly approved loans. Also, the bank’s highest risk might come from the small fraction that gets approved and later defaults. Because data on those genuinely approved loans is limited, the model might lack sufficient examples of legitimate vs. fraudulent behaviors specifically among approved applications.
Data Acquisition and Simulation Backward-looking data: Often banks have historical logs of which loans were approved, which were rejected, and subsequent outcomes. However, policy changes over time might introduce selection bias. Counterfactual analysis: You might attempt to simulate what would happen if certain previously rejected loans had been accepted. This can be done in collaboration with the risk team, but it’s challenging because you lack real outcomes. Synthetic data generation: Carefully generating synthetic examples resembling the statistical properties of real loans can help the model learn patterns, though this introduces the risk of drifting away from real-world complexities.
Specialized Metrics and Validation Strategies You might segment your evaluation to focus specifically on the subset of data representing truly approved loans. This ensures the performance metrics accurately reflect how well the system catches fraud in the approvals pipeline. A multi-stage approach to modeling (first filtering out obviously fraudulent or non-qualifying applicants, then applying a second, more detailed model to the borderline cases) can help concentrate on the portion of the pipeline with the highest risk.
Pitfalls and Edge Cases Extrapolation beyond known data: The model might see legitimate loan requests that are quite different from those historically approved, leading to inaccurate predictions. Policy changes or economic fluctuations might shift the type of applicants or the reasons for default, requiring continuous recalibration or additional socioeconomic features to maintain predictive accuracy.
Possible Follow-up Question 19: How do we manage the scenario where the cost of investigating false positives is itself not a fixed amount, but scales with the number of daily flagged applications?
Variable Investigation Costs When the number of flagged applications is moderate, each flagged case might get a detailed manual review. However, if the false positive rate spikes, the investigation queue can become backlogged, requiring additional manpower or automated triage systems. This means the cost of false positives is not simply “cost_fp × number of FPs.” Instead, it could increase if you have to bring in more staff or if each investigation becomes slower and more prone to errors due to reviewer fatigue.
Model Threshold Adjustment When potential investigation costs are dynamic, you might incorporate a function that accounts for the diminishing returns of investigating large volumes of cases. One possibility is modeling the relationship between daily flagged volume and the average cost per flagged application (e.g., a piecewise or nonlinear function that grows as the queue expands). Adjusting thresholds might be done in real-time if the system detects an unusual spike in flagged applications, temporarily increasing the threshold to keep manual reviews at a manageable level.
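A rough sketch of such a volume-dependent cost inside a threshold search; the review capacity of 200, the overflow multiplier, and the toy scores are assumed values, and every flagged case (true or false positive) is treated as consuming review effort:

import numpy as np

def daily_investigation_cost(num_flagged, base_cost=50.0, capacity=200, overflow_multiplier=3.0):
    # Assumed piecewise cost: cases within review capacity cost base_cost each;
    # overflow cases cost overflow_multiplier times more (overtime, fatigue, triage).
    within = min(num_flagged, capacity)
    overflow = max(num_flagged - capacity, 0)
    return within * base_cost + overflow * base_cost * overflow_multiplier

def total_daily_cost(probabilities, labels, threshold, cost_fn=10_000.0):
    preds = (probabilities >= threshold).astype(int)
    false_negatives = int(np.sum((preds == 0) & (labels == 1)))
    # Every flagged case consumes review effort; missed fraud adds a fixed loss each.
    return daily_investigation_cost(int(preds.sum())) + false_negatives * cost_fn

# Toy scores: fraud cases score high, legitimate cases score low.
rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.02, size=5000)
probabilities = np.clip(labels * 0.6 + rng.uniform(0, 0.5, size=5000), 0, 1)

best_threshold = min(np.linspace(0, 1, 101),
                     key=lambda t: total_daily_cost(probabilities, labels, t))
print("best threshold:", round(float(best_threshold), 2))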
Operational Constraints Budget or resource constraints may limit how many suspicious cases can be reviewed within a given time frame. This operational limit can guide the threshold selection so that the daily queue of flagged applications does not exceed review capacity. If flagged cases exceed the capacity, some might be automatically approved or handled via a less thorough check, which could then increase the risk of letting fraud slip through.
Potential Pitfalls and Edge Cases Complex interplay of cost and risk: If you set the threshold higher to reduce false positives, you might end up increasing false negatives, which can be far more expensive. Finding this delicate balance requires sophisticated cost modeling. Real-time overload: A sudden wave of suspicious applications (e.g., a coordinated fraud attack) might suddenly spike the false positive queue if the model is not adaptive. This can overwhelm investigators.
Possible Follow-up Question 20: If we integrate a credit risk model and a fraud detection model into a single pipeline, how do we prevent unintended interactions or inconsistent decisions?
Separation of Business Logic Fraud detection focuses on identifying malicious intent or suspicious behavior. Credit risk scoring focuses on the likelihood of repayment. Integrating these two can yield confusion if not carefully designed, as a customer might be a good credit risk but suspicious from a fraud perspective, or vice versa. Often, institutions maintain two distinct scores: a Fraud Score and a Credit Risk Score. The final decision might be based on rules that combine both (e.g., if Fraud Score is very high, deny automatically, otherwise factor in the Credit Risk Score).
Model Explainability and Consistency If a single combined model tries to learn both tasks simultaneously, it might conflate inability to repay with fraudulent behavior, which are not necessarily the same. A customer with poor repayment capability is not necessarily committing fraud. To handle this, you might opt for a multi-output architecture or keep the models separate and then fuse their outputs. This ensures clarity in the results: “High-risk for default, but not flagged for fraud,” or “Good credit risk, but flagged for suspicious identity signals.”
Implementation and Maintenance Because credit risk and fraud detection are different but related domains, you might have different teams or different data pipelines generating the features. Coordinating these pipelines in production must be carefully planned to ensure data synchronization. Regular calibration and cross-functional reviews can avoid a situation where changes in the credit risk model inadvertently affect the fraud detection pipeline.
Potential Pitfalls and Edge Cases Regulatory scrutiny: If a combined model inadvertently uses protected attributes or correlations that lead to discriminatory credit approvals, the institution faces heightened compliance risk. Overcomplication: A single monolithic model might become difficult to interpret or maintain. Splitting tasks into specialized models can be simpler operationally and legally.
Possible Follow-up Question 21: How might human-in-the-loop systems be leveraged for borderline cases, and what are the drawbacks or complexities of such an approach?
Human-in-the-Loop Setup In fraud detection, a model might automatically approve low-risk applications and automatically reject extremely high-risk applications. For the “gray zone” in the middle, human underwriters or investigators perform manual reviews. This is human-in-the-loop decision-making. This strategy optimizes resource usage, ensuring investigators focus on applications that truly require nuanced judgment.
Improved Accuracy Through Feedback The human reviewers’ judgments can be fed back into the system as labeled data, improving the model over time. If the model sees consistent patterns in borderline fraudulent applications that humans catch, it learns to catch those more effectively in the future. Likewise, if the human review reveals that certain flags were false alarms, the model can adapt or calibrate thresholds to reduce such mistakes.
Drawbacks and Complexities Throughput and consistency: Human reviewers might vary in their expertise or attention to detail. Ensuring consistent application of criteria is a challenge. Scalability: If the number of borderline cases grows significantly, the review team might become a bottleneck. This can slow down loan processing times and frustrate legitimate customers. Bias: Humans are not immune to biases. If they harbor unconscious biases against certain groups, these may get reinforced in the feedback loop, reflecting in subsequent model updates.
Edge Cases Parallel or distributed review tasks: If multiple reviewers are assigned different subsets of the borderline cases, you must ensure a process for reconciling conflicting judgments or addressing random variance in their decisions. Training data mismatch: The model sees borderline cases labeled by humans, but might have an incomplete view of extremely low-risk or high-risk segments because those were auto-decided by the system. Potentially, this can skew future model retraining.
Possible Follow-up Question 22: What are the unique challenges with data privacy and security when dealing with sensitive personal information in fraud detection systems?
Storing and Handling Personally Identifiable Information (PII) Loan applications typically contain highly sensitive information (e.g., Social Security numbers, addresses, birth dates). Ensuring that the database storing this information is encrypted and access is tightly controlled is paramount. Role-based access control should be enforced so that only authorized personnel and systems can view or process specific data fields.
Regulatory Compliance Regulations such as GDPR (in the EU) or CCPA (in California) can grant individuals the right to be forgotten or to request data access. A fraud detection system must be designed to accommodate such requests, which can be technically complex if data is scattered across multiple pipelines. Some jurisdictions forbid storing certain data fields indefinitely. You must design the system so that data used for training older models is either anonymized or removed after a specified retention period.
Anonymization and Aggregation One approach is to anonymize or pseudonymize sensitive fields to reduce risk if there is a data breach. You might store hashed versions of phone numbers or addresses. The trade-off is that it might degrade the model’s ability to detect certain forms of fraud if raw data is needed for pattern recognition. Aggregated or token-based lookups can preserve some level of detection capability while reducing the risk associated with storing raw PII.
Pitfalls and Edge Cases De-anonymization risk: Even aggregated or hashed data can sometimes be “re-identified” if an attacker has external data or if the hashing strategy is weak. Data sharing with third parties: If you send data to external identity verification services, you must ensure the data transfer is secure and that the third party also abides by privacy regulations. Model explainability vs. privacy: Detailed explanations might inadvertently reveal sensitive data or inferences about an individual’s identity or habits. Striking a balance is not trivial.
Possible Follow-up Question 23: How do we measure the long-term effectiveness of a fraud detection model, and what metrics or strategies help monitor performance over extended periods?
Long-Term Monitoring vs. Snapshot Evaluations A one-time test set evaluation might not reflect how well the model will do in six months or a year, given changes in fraud tactics and economic conditions. Continuous or periodic monitoring helps track trends in both false positives and false negatives over time.
Key Long-Term Metrics Roll-rate or default rate for flagged vs. unflagged loans: This helps measure if the model is effectively capturing truly risky behavior. Population stability index (PSI) or similar measures: Evaluate if the input distributions to the model are shifting. Large distribution shifts may signal model drift and degrade performance. False negative “tail events”: Because the cost of a false negative can be disproportionately large, monitoring and analyzing each actual fraud case (that was initially classified as legitimate) is critical for iterative improvements.
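A small sketch of a PSI computation on model scores, with decile bins defined on the baseline distribution; the beta-distributed toy scores and the usual 0.1 / 0.25 rules of thumb are illustrative only:

import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    # Bin edges come from the baseline ("expected") score distribution; a small
    # epsilon keeps the log well defined when a bin is empty.
    eps = 1e-6
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    # Current scores outside the baseline range are clipped into the end bins.
    actual_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual) + eps
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Toy example: score distribution at deployment time vs. six months later.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 8, size=10000)
current_scores = rng.beta(3, 7, size=10000)
print(round(population_stability_index(baseline_scores, current_scores), 3))
# A common rule of thumb: below 0.1 is stable, 0.1 to 0.25 a moderate shift, above 0.25 a major shift.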
Strategies for Ongoing Evaluation Periodic auditing of flagged and non-flagged cases: Manual sampling of certain segments can reveal if the model is missing subtle new fraud patterns. Feedback loops with other departments: The collections or investigations team might provide insights on recently discovered fraud cases, which can be fed back into the training data for retraining or fine-tuning. Canary or champion-challenger models: Running an old model (champion) in parallel with a new model (challenger) to compare their performance on live data can provide a safety net. If the new model underperforms, you can revert to the champion.
Pitfalls and Edge Cases Seasonal or cyclical changes: Fraud rates might spike during holiday seasons or economic downturns, causing previously stable performance metrics to shift. The monitoring system needs to account for these seasonal patterns. Retaining historical data for trending: Ensuring you keep enough historical labeled data to analyze year-over-year trends can be challenging with storage and privacy constraints.
Possible Follow-up Question 24: How can partial automation, rules, and machine learning co-exist in a large organization’s fraud detection strategy?
Layered Defense Concept Many organizations use a layered approach: Some basic rules or heuristics filter out obviously fraudulent applications (e.g., suspicious phone numbers known from a negative list), while the machine learning model handles more nuanced decisions. Higher-level rule triggers might route certain cases to specialized investigation teams. This layered approach can provide efficiency: cheap rule checks can quickly discard the most blatant attempts, freeing the ML model to focus on more complex patterns.
Advantages of Combining Methods Speed: Simple deterministic rules can be executed extremely quickly and might capture a chunk of fraud attempts with minimal overhead. Interpretability: Rules can be easily explained to auditors or management. “If the IP address is from a known blacklisted range, deny automatically.” Flexibility: The ML model can adapt to new, sophisticated attacks that rules alone might miss.
Risk of Over-Reliance on Rules Over time, a large collection of rules can become unwieldy, leading to contradictions or duplication. This can lead to maintenance challenges and possibly conflicting or overlapping logic. If fraudsters discover specific rule thresholds, they can easily tailor applications to dodge them.
Pitfalls and Edge Cases Decision collisions: An application might be flagged by conflicting rules or pass rules but be flagged by the ML model. The organization must define precedence clearly. Rules might degrade over time, requiring an annual or even quarterly review to ensure they remain relevant. A stale rule set could incorrectly block legitimate customers or fail to catch novel fraud patterns.
Possible Follow-up Question 25: How might we evaluate or benchmark our fraud detection system against competitors’ solutions or industry standards?
Motivation for Benchmarking Executive teams or stakeholders might ask, “Are we doing better or worse than the industry average at detecting fraud?” Competitors might be using different techniques, data sources, or heuristics. Benchmarking provides context and can justify further investments in data science.
Data Sharing Challenges Confidentiality: Banks rarely share raw data on their fraud cases due to privacy and competitiveness. Even anonymized data sets might reveal insights into transaction volumes or other sensitive information. Third-party benchmarking services: Sometimes consultants or specialized vendors collect data from multiple sources and provide aggregate benchmarks. You might compare your false positive rate, recall, or cost savings to an industry baseline.
Standardized Metrics Common metrics, such as area under the ROC curve (AUC), area under the Precision-Recall curve (AUPRC), or a cost-based measure, can be used for a fair comparison. However, each institution’s data distribution (loan size, demographic, product type) can differ significantly, which means the same metric might not be directly comparable across organizations.
Pitfalls and Edge Cases
Over-focusing on external benchmarks: A system highly optimized for a universal benchmark might neglect specific local nuances in your organization’s application flow, leading to suboptimal real-world performance. Differences in the definition of “fraud”: One institution might define certain borderline behaviors as “fraud” while another might not, so direct comparisons can be misleading if the definitions are not aligned.
Possible Follow-up Question 26: How do we allocate engineering and data science resources to balance speed of deployment vs. depth of model experimentation for fraud detection?
Resource Trade-Off
On one hand, the business might push for quick deployment to reduce ongoing fraud losses. On the other hand, data scientists may want more time for model experimentation (tuning hyperparameters, engineering features, and exploring advanced architectures) to maximize detection performance. In a large organization, product managers or stakeholders often need an MVP (Minimum Viable Product) to prove the concept; after that, iterative improvements can be rolled out.
Incremental Development
A phased approach can be used where a simpler model or rule-based system is deployed first to gain immediate fraud reduction. Simultaneously, a data science team works on a more sophisticated system in the background. Once it’s validated, you can do an A/B test or champion-challenger rollout. This allows you to capture some cost savings early while still developing a state-of-the-art solution.
Pitfalls and Edge Cases
Technical debt: Deploying a hasty solution accumulates technical debt if it is never refactored or replaced; over time, patching an MVP might become costlier than building a robust system from scratch. Team communication: Data scientists and engineers must remain closely aligned so that the handoff from experimentation to production is smooth and the model’s theoretical improvements translate into operational benefits.
Possible Follow-up Question 27: What are best practices for labeling and verifying fraud cases post hoc, and how does this process impact future model updates?
Post Hoc Labeling
Often, true fraud labels become available weeks or months after loan origination if the customer defaults in a suspicious manner or if internal investigations confirm fraudulent documents. Maintaining a well-organized pipeline that automatically updates these labels in the model training dataset is vital. Additionally, disputed or pending cases need a system of record that updates the label once the dispute is resolved.
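A minimal sketch of such a label-refresh step, assuming pandas DataFrames keyed by a hypothetical application_id with illustrative column names:

```python
# Post hoc label refresh: overwrite provisional labels with confirmed
# investigation outcomes. Table and column names are illustrative assumptions.
import pandas as pd

training_data = pd.DataFrame(
    {"application_id": [101, 102, 103], "is_fraud": [0, 0, 0]}
)
confirmed_cases = pd.DataFrame(
    {"application_id": [102], "confirmed_label": [1], "confirmed_at": ["2024-03-01"]}
)

refreshed = training_data.merge(confirmed_cases, on="application_id", how="left")
refreshed["is_fraud"] = (
    refreshed["confirmed_label"].fillna(refreshed["is_fraud"]).astype(int)
)
refreshed = refreshed.drop(columns=["confirmed_label", "confirmed_at"])
print(refreshed)  # application 102 now carries its confirmed fraud label
```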
Importance of Accurate Labels
The model’s performance depends entirely on the correctness of the fraud labels. Mislabeling a case as fraudulent when it isn’t can skew the model’s understanding, potentially causing it to overweight certain suspicious-looking features. Conversely, if a large portion of true fraud cases is never correctly labeled, the model might systematically underestimate risk, and this under-labeling leads to more false negatives.
Continuous Improvement
Once new labels are in, you can retrain the model on a rolling basis. This ensures that the latest confirmed fraud patterns are integrated. For instance, scheduling monthly or quarterly updates might strike a balance between stability and responsiveness to new data.
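As a sketch, a monthly or quarterly job might refit on a rolling window of confirmed labels; the 12-month window, feature columns, and choice of logistic regression below are assumptions rather than recommendations.

```python
# Rolling-window retraining sketch. The window length, feature columns,
# and model class are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def retrain_on_recent_window(df, feature_cols, label_col="is_fraud", months=12):
    """Refit on the most recent `months` of applications with confirmed labels."""
    cutoff = df["originated_at"].max() - pd.DateOffset(months=months)
    recent = df[df["originated_at"] >= cutoff]
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(recent[feature_cols], recent[label_col])
    return model
```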
Pitfalls and Edge Cases
Lag in label availability: The time lag between the loan disbursement and detection of fraud can cause the model to be trained on incomplete data if you don’t systematically refresh it. Label drift: The definition of fraud might change slightly over time if the institution refines its policies or if the industry’s notion of fraudulent behavior evolves. A versioned approach to labels is needed to maintain consistency.
Possible Follow-up Question 28: Could you discuss the trade-offs between building an in-house fraud detection solution vs. purchasing a third-party solution?
In-House Development
Pros: Full control over the data, model architecture, and updates. Customization to the organization’s unique lending products and workflows. Ownership of intellectual property and the ability to integrate deeply with internal systems.
Cons: Requires specialized data science talent, engineering resources, and ongoing maintenance. Longer time to market if the internal team is not already set up or lacks domain expertise.
Third-Party Solutions
Pros: Faster initial deployment, as vendors typically provide out-of-the-box functionality. Benefit from a vendor’s accumulated domain experience and potentially large aggregated data sources (though aggregated data might not always be shared).
Cons: Less flexibility; custom feature requests might be slow or costly to implement. Vendor lock-in risk, plus data privacy concerns if sensitive information is shared. Cost can be significant in the form of licensing or usage-based fees.
Pitfalls and Edge Cases
A hybrid approach may be necessary, where a core vendor solution is integrated with in-house rules or smaller models to address the organization’s specific niche needs. If the vendor solution’s performance or data access is suboptimal, the bank might be stuck unless it invests significantly in switching to another provider or building a fully in-house system.
Possible Follow-up Question 29: How do you incorporate a feedback loop from investigators or underwriters into model updates to ensure the system learns from edge cases?
Structuring the Feedback Loop
When investigators or underwriters review a flagged application, their determination (fraudulent vs. legitimate) should be recorded in a centralized system. This label becomes ground truth once verified. You may also gather additional notes about why they decided it was fraudulent, what signals were crucial, or if any new suspicious patterns were observed. These qualitative insights can guide feature engineering or rule creation.
Integration into Retraining
The newly verified fraud or non-fraud applications get appended to the training dataset. After a sufficient volume of new labels accumulates, the model can be retrained or fine-tuned. This ensures the model is up-to-date with the latest real-world feedback. If an especially novel fraud pattern emerges, you might expedite this cycle, quickly retraining a model or adjusting thresholds to catch the pattern more effectively.
Pitfalls and Edge Cases
Data pipeline reliability: Ensuring every manual review label consistently flows back into the data warehouse is critical. Missing labels or misalignment between the labeling system and the training pipeline can degrade performance. Conflicting feedback: Different investigators might have differing opinions. Some might be more conservative (label borderline cases as fraudulent), others more lenient. A consensus or adjudication process may be necessary to finalize the labels.
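One simple way to handle conflicting reviewer decisions is a majority vote with ties escalated to a senior adjudicator. The sketch below assumes binary 0/1 investigator labels; the tie-handling policy is an illustrative choice, not the only option.

```python
# Consensus labeling sketch: majority vote over investigator decisions,
# with ties escalated rather than guessed. Policy details are assumptions.
from collections import Counter

def resolve_label(investigator_labels):
    """investigator_labels: list of 0/1 decisions from different reviewers."""
    counts = Counter(investigator_labels)
    if counts[1] > counts[0]:
        return 1          # consensus: fraudulent
    if counts[0] > counts[1]:
        return 0          # consensus: legitimate
    return None           # tie: route to a senior adjudicator

print(resolve_label([1, 1, 0]))  # -> 1
print(resolve_label([1, 0]))     # -> None (escalate)
```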
Possible Follow-up Question 30: How can we test the robustness of the fraud detection system under stress scenarios, such as a sudden surge in application volume or novel fraud attempts?
Load Testing
Simulate a large number of applications in a short timeframe to ensure the system’s inference service does not buckle under heavy load. If the system becomes unresponsive or extremely slow, fraud checks might time out and legitimate applicants might see a degraded experience. You can use standard load-testing tools to measure average and peak latency, ensuring it meets the business’s service-level agreements (SLAs).
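Dedicated tools such as Locust or k6 are the usual choice in practice; purely as an illustration, the bare-bones probe below fires concurrent requests at a hypothetical scoring endpoint and reports median and tail latency.

```python
# Bare-bones latency probe against a hypothetical endpoint (URL, payloads,
# and concurrency level are illustrative assumptions).
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SCORING_URL = "http://localhost:8080/score"  # hypothetical inference endpoint

def one_request(payload):
    start = time.perf_counter()
    requests.post(SCORING_URL, json=payload, timeout=5)
    return time.perf_counter() - start

def run_load_test(payloads, concurrency=50):
    """Fire requests with a thread pool and return ~p50 and ~p99 latency in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, payloads))
    cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
    return cuts[49], cuts[98]                       # compare against the latency SLA
```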
Adversarial Testing
Create synthetic or semi-synthetic datasets that mimic potential new fraud tactics. Introduce anomalies or pattern changes to see how quickly the model’s performance drops or if it triggers the appropriate alerts. A red team exercise can also be conducted, where internal or external experts try to beat the system by crafting fraudulent applications.
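A lightweight version of this is a synthetic-shift test: take known fraud cases, perturb a feature to mimic a tactic change, and measure how far recall falls. The feature index, shift size, and threshold below are illustrative assumptions.

```python
# Synthetic-shift test sketch: how much does recall on known fraud drop when a
# feature is perturbed to mimic a new tactic? All parameters are assumptions.
import numpy as np
from sklearn.metrics import recall_score

def recall_under_shift(model, X_fraud, feature_idx, shift, threshold=0.5):
    """X_fraud: feature matrix of confirmed fraud cases (true label is 1 for every row)."""
    y_true = np.ones(len(X_fraud), dtype=int)

    baseline_pred = (model.predict_proba(X_fraud)[:, 1] >= threshold).astype(int)
    X_shifted = X_fraud.copy()
    X_shifted[:, feature_idx] += shift              # simulated change in fraud behavior
    shifted_pred = (model.predict_proba(X_shifted)[:, 1] >= threshold).astype(int)

    return recall_score(y_true, baseline_pred), recall_score(y_true, shifted_pred)
```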
Fail-Over Scenarios
What if the main model deployment fails or the feature extraction service experiences outages? Ensure you have fallback approaches, such as a simpler model or rules, so the detection pipeline does not collapse entirely. Monitor system health and set up alerts so that operational teams can respond immediately to anomalies in throughput or accuracy.
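A minimal fallback wrapper might look like the sketch below; the feature-service interface, the rule used as a backstop, and the logging hook are all hypothetical.

```python
# Fail-over sketch: if the primary model or feature service errors out, fall back
# to a crude rule-based score so the pipeline keeps producing decisions.
# `feature_service.build_features` and the fallback rule are hypothetical.
import logging

logger = logging.getLogger("fraud_scoring")

def rule_based_fallback(application):
    """Backstop rule: flag only the most blatant cases (illustrative threshold)."""
    return 0.9 if application.get("requested_amount", 0) > 50_000 else 0.1

def score_with_failover(application, primary_model, feature_service):
    try:
        features = feature_service.build_features(application)  # may time out or fail
        return primary_model.predict_proba([features])[0][1]
    except Exception:
        logger.exception("Primary scoring failed; using rule-based fallback")
        return rule_based_fallback(application)
```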
Pitfalls and Edge Cases
False positives in high-load conditions: Sometimes, memory or concurrency issues can cause partial feature computation or incorrect data merges, leading to a spike in false positives. Unforeseen data format changes: If the input data pipeline changes during a surge, it might lead to mismapped features (e.g., columns shifted), causing unpredictable model behavior.