ML Case-study Interview Question: Semi-Supervised Deviation Networks for Food Delivery Fraud Detection
Case-Study Question
You are tasked with reducing fraudulent claims on a large hyperlocal food delivery platform. The platform has noticed users raising false refund requests, causing heavy financial losses. You have access to a small percentage of labeled data (fraud vs. non-fraud) and a vast corpus of unlabeled claims. Propose a machine learning approach to detect fraud in real time with high precision, ensuring legitimate users are not wrongly flagged. Design your solution pipeline, including data preparation, model selection, training, and a real-time deployment strategy. Suggest ways to handle contaminated unlabeled data. Present details on how to measure success using standard fraud detection metrics.
Detailed Solution
Data Preparation

Start by gathering all historical claims data. Split the data into labeled and unlabeled sets, and keep a reserved portion of the labeled data for validation. Check for class imbalance, because fraud ratios are typically low. Create numeric and categorical features such as user history, claim type, and refund amount. Convert categorical variables via one-hot encoding or embeddings. Perform standard cleaning steps such as outlier removal and missing-value imputation.
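As a brief sketch of this step (the column names here are hypothetical, not from the original case study), a scikit-learn pipeline can combine imputation, scaling, and one-hot encoding:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists; a real pipeline would follow the platform's schema.
numeric_features = ["refund_amount", "claims_last_90d", "account_age_days"]
categorical_features = ["claim_type", "payment_method"]

preprocessor = ColumnTransformer([
    # Numerics: impute missing values with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Categoricals: impute with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# claims_df is an assumed raw claims DataFrame with the columns above.
X_processed = preprocessor.fit_transform(claims_df)
```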
Semi-Supervised Framework

Use a semi-supervised anomaly detection method. Rely on the small set of labeled anomalies (fraud) and normal data (non-fraud) to guide model training. Address contamination in the unlabeled data with a robust mechanism that reduces reliance on suspect samples.
Deviation-Based Model (DevNet)

Implement a model that learns an anomaly score for each claim, pushing fraud scores to be distinctly higher than non-fraud scores. The original DevNet approach uses a Z-score-based deviation measure to push anomalies away from a reference mean and to keep normal samples within the reference distribution.
Let $\phi(x;\Theta)$ denote the learned anomaly score function with parameters $\Theta$. Let $\mu_R$ and $\sigma_R$ denote the mean and standard deviation of a reference set of anomaly scores (the original DevNet samples these reference scores from a Gaussian prior).
Contrastive Deviation Loss

Define a contrastive loss that encourages high anomaly scores for fraud and keeps normal scores within a tight range.
$$
L_{dev} = \begin{cases}
\max\Big(0,\; a - \frac{\phi(x;\Theta)-\mu_R}{\sigma_R}\Big) & \text{if } y=1 \\[6pt]
\max\Big(0,\; \frac{\phi(x;\Theta)-\mu_R}{\sigma_R} - a\Big) & \text{if } y=0
\end{cases}
$$
Here $y=1$ denotes fraud, $y=0$ denotes normal, and $a$ controls the margin of separation.
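A minimal PyTorch sketch of this loss, assuming $\mu_R$ and $\sigma_R$ have been estimated beforehand and using a margin of 5 as an illustrative default. It returns per-sample losses so the scaling factor introduced in the next section can be applied before averaging:

```python
import torch

def deviation_loss(scores, labels, mu_r, sigma_r, margin=5.0):
    # scores: (N,) anomaly scores phi(x; Theta); labels: (N,) 1 = fraud, 0 = normal.
    dev = (scores - mu_r) / sigma_r                    # Z-score deviation
    fraud_loss = torch.clamp(margin - dev, min=0.0)    # push fraud above the margin
    normal_loss = torch.clamp(dev - margin, min=0.0)   # keep normal below the margin
    return torch.where(labels == 1, fraud_loss, normal_loss)
```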
KNN-Based Variational Scaling

Incorporate a scaling factor for unlabeled samples to reflect their similarity to labeled fraud or labeled non-fraud. Use a K Nearest Neighbor model on the labeled set to assign each unlabeled sample to a fraud or a non-fraud cluster. Compute the average distance of each unlabeled point to points in its assigned cluster.
Use $d_{avg}(p)$ for the average distance of unlabeled point $p$ to points in its assigned cluster, and $\max(d_{avg})$ for the maximum such distance among all unlabeled points. Define the scaling factor

$$ \gamma_p = 1 - \frac{d_{avg}(p)}{\max(d_{avg})} $$

This $\gamma_p$ is higher the closer $p$ lies to its assigned cluster, so a sample that sits very close to the non-fraud cluster receives a large factor. Multiply this scaling factor into the DevNet loss for unlabeled data so that misclassification penalties are stronger when a sample is very similar to known non-fraud.
Implementation Example (Python)

The training loop below shows how the scaling factor is computed and applied:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import torch
import torch.optim as optim

# Assume we have:
# X_labeled, y_labeled -> labeled claims (1 = fraud, 0 = non-fraud)
# X_unlabeled          -> unlabeled claims
# model                -> neural network that outputs an anomaly score per claim
# dataloader           -> yields mixed batches of labeled and unlabeled samples
# num_epochs           -> number of training epochs

# Fit KNN on the labeled set and assign each unlabeled sample
# to the fraud or the non-fraud cluster.
knn = KNeighborsClassifier(n_neighbors=51)
knn.fit(X_labeled, y_labeled)
cluster_assignment = knn.predict(X_unlabeled)

def avg_distance(point, cluster_points):
    # Mean Euclidean distance from `point` to all points in a cluster.
    return np.mean(np.sqrt(np.sum((cluster_points - point) ** 2, axis=1)))

# Average distance of each unlabeled point to its assigned cluster.
distances = []
for i, x_unlab in enumerate(X_unlabeled):
    assigned_label = cluster_assignment[i]
    cluster_data = X_labeled[y_labeled == assigned_label]
    distances.append(avg_distance(x_unlab, cluster_data))

# gamma_p = 1 - d_avg(p) / max(d_avg): near 1 for points close to their
# assigned cluster, near 0 for points far from it.
max_dist = max(distances)
scaling_factors = [1 - (dist / max_dist) for dist in distances]

# DevNet-style training loop with the scaling factor applied.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(num_epochs):
    for batch in dataloader:
        # Each batch mixes labeled and unlabeled samples; labeled samples
        # carry gamma = 1, unlabeled samples carry their KNN-based gamma.
        x_batch, y_batch, gamma_batch = prepare_batch(batch, scaling_factors)
        anomaly_scores = model(x_batch)
        loss = custom_devnet_loss(anomaly_scores, y_batch, gamma_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Here prepare_batch merges a fraction of labeled data with the unlabeled data (attaching the gamma factors, with labeled samples weighted at 1), and custom_devnet_loss implements the deviation loss with the scaling factor, as sketched below.
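A hedged sketch of these two helpers, reusing the per-sample deviation_loss defined earlier. The batch layout and the reference values are illustrative assumptions, not a fixed API:

```python
import torch

# Reference statistics estimated beforehand; with a standard Gaussian prior
# over reference scores, mu_R = 0 and sigma_R = 1 (illustrative values).
MU_R, SIGMA_R, MARGIN = 0.0, 1.0, 5.0

def prepare_batch(batch, scaling_factors):
    # Assumed layout: x (N, d) features; y (N,) labels, with the KNN cluster
    # assignment standing in as the label for unlabeled rows; is_labeled (N,)
    # bool mask; unlab_idx (N,) long index into scaling_factors
    # (ignored for labeled rows).
    x, y, is_labeled, unlab_idx = batch
    gamma = torch.ones(len(x))                       # labeled samples get gamma = 1
    gamma_lookup = torch.as_tensor(scaling_factors, dtype=torch.float32)
    mask = ~is_labeled
    gamma[mask] = gamma_lookup[unlab_idx[mask]]      # unlabeled samples get gamma_p
    return x, y, gamma

def custom_devnet_loss(scores, y, gamma):
    # Per-sample deviation loss, weighted by gamma so unlabeled samples far
    # from their assigned cluster contribute less, then averaged.
    per_sample = deviation_loss(scores, y, MU_R, SIGMA_R, MARGIN)
    return (gamma * per_sample).mean()
```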
Real-Time Deployment

Use the trained model to score each incoming claim. Set a threshold that triggers investigative action when the score is too high. Monitor precision to avoid banning genuine users. Retrain periodically with fresh labeled examples. Adjust the KNN parameter k if more labeled data becomes available.
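A minimal scoring sketch for the serving path; the threshold value here is a hypothetical placeholder tuned on the validation set:

```python
import torch

THRESHOLD = 3.0  # hypothetical; tuned on validation data for the target precision

@torch.no_grad()
def score_claim(model, features):
    # features: 1-D tensor of preprocessed features for one incoming claim.
    model.eval()
    score = model(features.unsqueeze(0)).item()  # add a batch dimension
    return score, score > THRESHOLD              # (anomaly score, flag for review)
```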
Success Metrics

Track precision, because false positives are costly. Track recall to ensure enough fraud is flagged. Track block rate to assess the fraction of flagged claims. Adjust thresholds to find an optimal trade-off. Compare the approach to older methods and measure improvements in real business metrics.
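These metrics can be computed on the held-out labeled validation set; val_scores and val_labels below are assumed arrays of model scores and ground-truth labels:

```python
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Assume: val_scores -> anomaly scores on validation claims (np.ndarray)
#         val_labels -> ground truth, 1 = fraud, 0 = normal (np.ndarray)
threshold = 3.0  # hypothetical operating point
flagged = val_scores > threshold

precision = precision_score(val_labels, flagged)   # cost of false positives
recall = recall_score(val_labels, flagged)         # fraction of fraud caught
block_rate = flagged.mean()                        # fraction of all claims flagged

# Sweep thresholds to choose the precision/recall trade-off.
prec, rec, thresholds = precision_recall_curve(val_labels, val_scores)
```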
Follow-Up Question 1
How would you select the number of neighbors k in the KNN module?
Answer

Pick k by balancing cluster assignment accuracy against over-smoothing. Use a portion of labeled data to evaluate different k values. Check whether the assigned cluster for unlabeled points aligns with the actual label in a small labeled subset. If k is too large, distant neighbors might introduce noise. If k is too small, cluster assignments become unstable. Evaluate performance metrics (precision, recall, block rate) on a validation set and pick the k that delivers the best overall trade-off, as in the sketch below.
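One way to run this evaluation, assuming a labeled train/validation split (X_train_lab, y_train_lab, X_val, y_val are hypothetical names):

```python
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

best_k, best_f1 = None, -1.0
for k in [11, 21, 51, 101, 201]:  # candidate values; odd k avoids voting ties
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_lab, y_train_lab)
    f1 = f1_score(y_val, knn.predict(X_val))
    if f1 > best_f1:
        best_k, best_f1 = k, f1
print(f"best k = {best_k}, validation F1 = {best_f1:.3f}")
```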
Follow-Up Question 2
What if the data is highly imbalanced and some labeled classes are rare?
Answer

Apply sampling or weighting strategies. Use class weights in the KNN step or oversample the minority class. Reflect the class imbalance in the training loop by drawing more samples from the minority fraud class, as in the sketch below. Keep the same approach for generating scaling factors, but ensure the labeled set has a sufficiently balanced representation.
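A sketch of the oversampling idea with PyTorch's WeightedRandomSampler; train_dataset and labels are assumed to describe the labeled training set:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample inversely to its class frequency so the rare
# fraud class is drawn more often per batch.
class_counts = np.bincount(labels)        # labels: 0 = normal, 1 = fraud
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
loader = DataLoader(train_dataset, batch_size=256, sampler=sampler)
```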
Follow-Up Question 3
How would you handle concept drift if fraud patterns change over time?
Answer

Re-train the model at regular intervals with fresh data. Keep track of newly identified fraud examples. Update the KNN model to incorporate new labeled points. Deploy an online or incremental learning version of the pipeline if possible. Monitor performance metrics continuously and flag sudden drops in precision or recall as an indication of drift.
Follow-Up Question 4
What if the KNN-based distance metric is too slow on large datasets?
Answer

Use approximate nearest neighbor techniques such as Faiss or Annoy, as in the sketch below. Reduce dimensionality with PCA or autoencoders to speed up searches. Use partition-based search methods if real-time lookups are required. Parallelize distance computations across multiple workers or GPUs.
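A minimal Faiss sketch of the approximate search; the index type and the nlist/nprobe values are assumptions to tune:

```python
import numpy as np
import faiss

xb = np.ascontiguousarray(X_labeled, dtype="float32")   # labeled points to index
d = xb.shape[1]                                         # feature dimensionality

# IVF index: partition vectors into nlist cells, search only nprobe of them.
nlist = 100
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)
index.add(xb)
index.nprobe = 10   # cells visited per query: the speed/recall knob

xq = np.ascontiguousarray(X_unlabeled, dtype="float32")
D, I = index.search(xq, 51)   # distances and ids of the 51 nearest labeled points
```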
Follow-Up Question 5
Why not treat all unlabeled data as normal for anomaly detection?
Answer

That assumption creates contamination. Many unlabeled samples might be fraudulent, so the model would learn to treat them as "normal," which reduces the separation between fraud and non-fraud scores. The scaling factor helps offset this by weighting how closely each unlabeled sample aligns with known fraud or non-fraud, improving resistance to contamination.
Follow-Up Question 6
How would you handle evolving feature sets over time?
Answer

Perform feature engineering that adapts to new signals. If new features emerge (such as user location changes), add them to the pipeline. Retrain or fine-tune the KNN module and the main model with the updated features. Use a robust pipeline structure that can accommodate feature-level changes without entirely resetting the model.
Follow-Up Question 7
If your approach flags fewer fraudulent claims than desired, how would you fix it?
Answer

Relax the detection threshold, which raises recall but can lower precision. Introduce additional features to better separate fraud from normal claims. Increase the presence of labeled fraud examples in the training set. Use an ensemble approach by combining the anomaly score with other heuristics. Track how each change affects real-world metrics.
Follow-Up Question 8
How do you ensure these techniques can scale in production?
Answer

Integrate streaming-based data ingestion and scoring. Use mini-batching or microservices that handle parallel inference. Distribute the KNN step with a search index. Use efficient data pipelines for real-time model updates. Monitor latency to confirm timely responses to user requests.
Follow-Up Question 9
What if certain features have missing values?
Answer

Apply imputation or specialized transformations, and train the model on realistic data distributions. Consider dropping a feature if it is missing too often. Evaluate the effect of missing values on the scaling factor, and ensure missingness does not bias cluster assignments.
Follow-Up Question 10
How can you measure the business impact of your model’s decisions?
Answer

Compare the financial losses from fraud before and after deployment. Track user churn caused by false positives. Assess how many fraudulent cases are blocked versus how many good users are incorrectly flagged. Quantify net savings or revenue improvements by contrasting refunded amounts with system detections.