ML Case-study Interview Question: Explainable Unsupervised Anomaly Detection for Evolving Marketplace Fraud
Case-Study Question
A marketplace platform faces an expanding catalog of fraud types. Payment fraud, phantom deliveries, and GPS spoofing disrupt the ecosystem of service providers and customers. The platform’s current system relies on supervised learning with labeled data, but new fraud tactics emerge quickly. Design an unsupervised anomaly detection solution to identify suspicious users and entities for manual review. Outline how you would generate features from multiple event streams, train and tune the anomaly model at scale, and explain flagged anomalies so that risk agents can confidently take action.
Detailed Solution
Overview of the Proposed Anomaly Detection Platform
A specialized platform can house the unsupervised models. It should ingest events, generate entity-level features automatically, perform time-based anomaly detection on those features, highlight suspicious entities, and serve explanations for manual review.
Entity Feature Generation
Each event involves multiple entities. For a trip request, there might be a customer, driver, payment instrument, device, etc. A pipeline should generate aggregated metrics across time windows for each entity. For example, number_of_trips in the past 7 days, or total_refunds in the past 30 days. A generic entity feature generation engine simplifies this step by letting data scientists define a base set of metrics (mean, sum, count) that get computed automatically across all time windows, events, and entity types. This yields a large feature set, but speeds up modeling since the features are already available.
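The exact feature engine is internal to the platform, but a minimal pandas sketch of the idea follows, assuming a hypothetical events table with entity_id, event_time, and amount columns; the window and metric names are illustrative.

```python
import pandas as pd

# Hypothetical raw event log: one row per event, keyed by entity.
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u1", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-04", "2024-01-20", "2024-01-25"]),
    "amount": [12.0, 30.0, 5.0, 8.0, 100.0],
})

AGGS = ["count", "sum", "mean"]      # base metrics defined once by the data scientist
WINDOWS = {"7d": 7, "30d": 30}       # time windows applied to every metric
AS_OF = pd.Timestamp("2024-01-31")   # feature snapshot date

frames = []
for name, days in WINDOWS.items():
    recent = events[events["event_time"] >= AS_OF - pd.Timedelta(days=days)]
    agg = recent.groupby("entity_id")["amount"].agg(AGGS)
    agg.columns = [f"amount_{a}_{name}" for a in AGGS]
    frames.append(agg)

# One wide feature row per entity; entities with no events in a window get zeros.
features = pd.concat(frames, axis=1).fillna(0.0)
print(features)
```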
Handling High Variance
Fraud-related data can vary widely. A legitimate user might have very large usage in a short span. Another might have minimal usage yet still be normal. Normalizing entity behaviors against historical baselines helps reduce false positives. A time-series approach can capture how each entity’s usage shifts over time. These transformations make outliers more conspicuous in a high-dimensional feature space.
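One simple way to normalize against an entity's own history is a robust z-score over its recent daily activity; the sketch below assumes a hypothetical daily trip-count table and uses median and MAD as the baseline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily trip counts: u1 is a light user, u2 a heavy but normal one.
daily = pd.DataFrame({
    "entity_id": ["u1"] * 30 + ["u2"] * 30,
    "trips": np.concatenate([rng.poisson(5, 30), rng.poisson(50, 30)]),
})

# Robust per-entity baseline: median and MAD over each entity's own history.
stats = daily.groupby("entity_id")["trips"].agg(
    median="median",
    mad=lambda s: (s - s.median()).abs().median(),
).reset_index()

# Normalize today's value against the entity's own baseline, so a burst stands
# out relative to that entity rather than relative to the whole population.
today = pd.DataFrame({"entity_id": ["u1", "u2"], "trips": [40, 55]})
merged = today.merge(stats, on="entity_id")
merged["trips_robust_z"] = (merged["trips"] - merged["median"]) / (merged["mad"] + 1e-9)
print(merged)
```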
Model Algorithms
Tree-based or neural-network-based anomaly detection algorithms can learn what typical behavior looks like, so entities that do not fit any established pattern get flagged. The platform can offer multiple algorithms so that data scientists can pick the best fit. Models should be trained on large historical datasets consisting mostly of normal behavior, with known suspicious samples removed if they would contaminate the reference population.
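Isolation Forest is one common tree-based choice for this kind of unsupervised detection (the platform's actual algorithm menu is not specified here); a minimal scikit-learn sketch on synthetic entity features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical entity-level feature matrix: mostly normal rows plus a few outliers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(1000, 5)),
    rng.normal(loc=6.0, scale=1.0, size=(10, 5)),
])

# Fit on (mostly) normal historical data; contamination sets the expected
# outlier fraction, which drives the decision threshold.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
model.fit(X)

scores = model.decision_function(X)   # lower score = more anomalous
flags = model.predict(X)              # -1 = anomaly, 1 = normal
print("flagged entities:", int((flags == -1).sum()))
```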
HAIFA Explanation Mechanism
Flagged anomalies must be explainable to risk agents.
One core idea is to compute, for each feature, a fine-grained histogram that covers all “normal” observations. Each bucket has a count of how many normal points fall in that bucket’s range. A flagged entity is inspected to see which features map to a very small bucket. The features whose values land in buckets with very few normal counterparts indicate how this entity is unusual.
A threshold T is defined as the smallest bucket proportion such that every flagged entity has at least one feature falling in a bucket below T. A binary search can find T automatically instead of requiring manual tuning.
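The published description of HAIFA is high level, so the sketch below only illustrates the histogram-and-threshold idea on synthetic data; all function names and the bucket count are hypothetical.

```python
import numpy as np

def feature_histograms(normal_X, bins=50):
    """Per-feature histograms (bucket proportions and edges) over normal data."""
    hists = []
    for j in range(normal_X.shape[1]):
        counts, edges = np.histogram(normal_X[:, j], bins=bins)
        hists.append((counts / counts.sum(), edges))
    return hists

def bucket_proportions(x, hists):
    """Proportion of normal points sharing each feature's bucket with entity x."""
    props = []
    for j, (prop, edges) in enumerate(hists):
        b = np.clip(np.searchsorted(edges, x[j], side="right") - 1, 0, len(prop) - 1)
        props.append(prop[b])
    return np.array(props)

def explain(x, hists, T):
    """Features whose bucket proportion falls below threshold T, rarest first."""
    props = bucket_proportions(x, hists)
    return [(int(j), float(props[j])) for j in np.argsort(props) if props[j] < T]

def find_T(anomalies, hists, tol=1e-4):
    """Binary-search the smallest T such that every flagged entity has at least
    one feature in a bucket whose proportion is below T."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if all(bucket_proportions(a, hists).min() < mid for a in anomalies):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(1)
normal_X = rng.normal(size=(5000, 4))
anomalies = [np.array([0.1, 8.0, -0.2, 0.3]),   # unusual only on feature 1
             np.array([5.0, 0.0, 0.1, -0.4])]   # unusual only on feature 0
hists = feature_histograms(normal_X)
T = find_T(anomalies, hists)
for a in anomalies:
    print(f"T={T:.4f}", "explanation:", explain(a, hists, T))
```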
Tuning and Noise Reduction
An anomaly model can surface many false positives. Engineers must examine which features trigger most anomalies. If these features are noisy or subject to data quality issues, they should be removed, transformed, or fixed upstream. Explanation outputs guide this cleanup cycle. By iterating, the noise level decreases until the flagged anomalies make business sense.
Operational Workflow
A data scientist writes a short Python configuration script (a hypothetical sketch follows these workflow steps).
The pipeline generates entity features from raw events stored in Apache Hive (or other data stores).
The anomaly detection model runs on a chosen algorithm.
HAIFA or a similar mechanism provides feature-level explanations for each flag.
Fraud agents review suspicious entities and decide if action is warranted.
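The real configuration interface is internal; the following is a hypothetical sketch of what such a Python configuration script might declare, with every class and field name illustrative only.

```python
from dataclasses import dataclass

# Hypothetical job configuration; the real platform's API is internal, so every
# class and field name here is illustrative only.
@dataclass
class AnomalyJobConfig:
    event_tables: list          # Hive tables to read raw events from
    entity_keys: list           # entity types to aggregate by
    metrics: list               # base metrics computed for each numeric field
    windows_days: list          # look-back windows for the aggregations
    algorithm: str = "isolation_forest"
    explainer: str = "histogram"    # HAIFA-style feature-level explanations
    schedule: str = "weekly"

config = AnomalyJobConfig(
    event_tables=["trips", "payments", "refunds"],
    entity_keys=["customer_id", "driver_id", "device_id"],
    metrics=["count", "sum", "mean"],
    windows_days=[7, 30],
)
print(config)
```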
Follow-up Questions
How do you determine the right number of features to include in the model?
Feature redundancy can cause overfitting or inflate compute costs. The pipeline might produce thousands of aggregated features, so filtering is necessary. The typical approach (the first two steps are sketched after the list) is to:
Eliminate features with near-zero variance.
Investigate correlation among features. Remove highly correlated ones.
Look at feature importance from a test run of the model. Remove features that do not contribute signal.
Such methods reduce dimensionality without sacrificing performance.
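A minimal sketch of the first two filtering steps (near-zero variance and high correlation), assuming a hypothetical generated feature matrix; the feature-importance pass is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical generated feature matrix (columns = aggregated entity features).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "trips_count_7d": rng.poisson(5, 500),
    "trips_count_30d": rng.poisson(20, 500),
    "refund_sum_30d": rng.exponential(2.0, 500),
    "constant_flag": np.zeros(500),              # near-zero variance
})
X["trips_count_7d_copy"] = X["trips_count_7d"] * 1.01   # highly correlated

# Step 1: drop near-zero-variance features.
vt = VarianceThreshold(threshold=1e-6)
X = X[X.columns[vt.fit(X).get_support()]]

# Step 2: drop one feature of each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)
print("remaining features:", list(X.columns))
```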
How do you handle concept drift for fraud detection?
Fraud tactics change over time. A once-rare fraud pattern can become common. Unsupervised models may fail to adapt if the data shifts too drastically. A rolling retrain schedule is common. For example, train a new model monthly on a new window of data. Compare the new model’s performance against the previous model. Automated triggers can flag large changes in feature distributions, prompting immediate retraining or manual checks.
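One common drift trigger is the Population Stability Index (PSI) between the training window and the most recent window of a feature; the sketch below uses the commonly cited 0.1 / 0.25 rules of thumb, which would need tuning in practice.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-window) sample and a recent sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_prop = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_prop = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_prop - e_prop) * np.log(a_prop / e_prop)))

rng = np.random.default_rng(0)
last_month = rng.normal(0.0, 1.0, 10_000)     # distribution the model was trained on
this_month = rng.normal(0.8, 1.2, 10_000)     # shifted distribution observed now

psi = population_stability_index(last_month, this_month)
# Commonly cited rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 large drift.
if psi > 0.25:
    print(f"PSI={psi:.2f}: trigger retraining or a manual check")
```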
How would you deploy this solution at scale?
A robust pipeline must:
Periodically update features from a high-throughput event source (e.g., streaming or scheduled jobs).
Distribute the training workload across a cluster, possibly using an in-house ML platform or open-source libraries supporting distributed computing.
Save the trained model to a serving system that can quickly evaluate new entities in near real time.
Provide real-time explanations for each anomaly.
How do you prevent generating too many false positives for manual review?
Excessive false positives waste reviewer time. Three approaches help (a toy sketch of the first and third follows the list):
Curate and filter out known benign anomalies (e.g., big spenders with legitimate receipts).
Tune anomaly score thresholds by measuring the acceptance rate from a pilot test with risk agents.
Build a prioritization logic. Rank flagged entities by severity and confidence so agents see the highest-risk ones first.
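A toy version of the allow-list filter and the prioritization logic, with hypothetical column names and a deliberately simple score-times-exposure ranking:

```python
import pandas as pd

# Hypothetical review queue: anomaly score from the model plus an estimate of
# financial exposure for each flagged entity, and a curated allow-list flag.
flags = pd.DataFrame({
    "entity_id": ["u1", "u2", "u3", "u4"],
    "anomaly_score": [0.92, 0.97, 0.71, 0.88],
    "exposure_usd": [40.0, 900.0, 5000.0, 120.0],
    "known_benign": [False, False, True, False],
})

queue = flags[~flags["known_benign"]].copy()          # filter curated benign cases
queue["priority"] = queue["anomaly_score"] * queue["exposure_usd"]   # simple ranking
print(queue.sort_values("priority", ascending=False))  # agents work this top-down
```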
How do you monitor and improve the system post-deployment?
An anomaly system needs continuous feedback. Fraud agents might confirm or reject flagged cases. Use those confirmations to guide further refinements. Track key performance metrics such as the following (a toy computation follows the list):
Precision: fraction of flagged entities that are fraudulent.
Recall: fraction of fraudulent entities caught.
Reviewer load: volume of cases each reviewer handles daily.
When metrics degrade, investigate data drift, feature distribution shifts, or new fraud patterns not captured by current features.
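A toy computation of these metrics from a hypothetical reviewer feedback log; in practice recall can only be approximated against externally known fraud cases.

```python
import pandas as pd

# Hypothetical reviewer feedback log: one row per flagged entity with the
# agent's decision; recall is approximated against externally known fraud cases.
feedback = pd.DataFrame({
    "entity_id": ["u1", "u2", "u3", "u4", "u5"],
    "confirmed_fraud": [True, False, True, False, False],
    "reviewer": ["a", "a", "b", "b", "b"],
})
known_fraud_total = 4    # fraudulent entities known from all sources this period

precision = feedback["confirmed_fraud"].mean()
recall = feedback["confirmed_fraud"].sum() / known_fraud_total
reviewer_load = feedback.groupby("reviewer").size().mean()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"avg cases per reviewer={reviewer_load:.1f}")
```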
Would you combine supervised and unsupervised approaches?
A hybrid method can be powerful. Supervised models learn known fraud patterns precisely. Unsupervised models detect novel behaviors. Both outputs can feed a rule-based system that merges results. This allows maximum coverage of known fraud while still catching emerging risks.
Are there privacy or compliance issues to consider?
Fraud detection often uses personal or payment data. Privacy regulations may require minimal data access or certain retention limits. Features must be carefully chosen to meet legal guidelines. Review processes must align with internal policies and applicable laws to ensure that flagged users have appropriate notices or procedures.
What if an anomaly detection algorithm marks a prominent, valid customer as fraudulent?
Manual review gates the system. Agents see the anomaly score and the top anomaly features. If they confirm the activity is legitimate, they can override the flag. The platform can store this override for future training. The system improves its understanding of what normal usage looks like for that type of customer profile.
Could you integrate advanced neural approaches like autoencoders or graph embeddings for anomaly detection?
Yes. Autoencoders can capture normal behaviors in a compressed embedding. High reconstruction error signals anomalies. Graph embeddings can capture relationships among riders, drivers, devices, etc. Entities forming unusual graph structures can be flagged. Both methods integrate well with an entity-level approach.
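A minimal PyTorch autoencoder sketch of the reconstruction-error idea; the architecture, training data, and threshold are all illustrative.

```python
import torch
from torch import nn

# Minimal fully connected autoencoder over entity feature vectors.
class AutoEncoder(nn.Module):
    def __init__(self, dim, hidden=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

torch.manual_seed(0)
X = torch.randn(2000, 16)                   # stand-in for "normal" entity features
model = AutoEncoder(dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(200):                        # train to reconstruct normal behavior
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

# At scoring time, high reconstruction error marks an entity as anomalous.
with torch.no_grad():
    candidate = torch.randn(1, 16) + 5.0    # far from the training manifold
    err = loss_fn(model(candidate), candidate).item()
threshold = 1.5                             # hypothetical, tuned on a holdout set
print("anomalous" if err > threshold else "normal", f"(reconstruction error={err:.2f})")
```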
What would you do if the data is too sparse for certain features?
Sparse data can be an issue, especially in rare user-activity segments. Common strategies (illustrated briefly after the list):
Reduce dimensionality by combining categories.
Apply smoothing or bucket-based grouping for continuous features.
Use flexible feature transformations like log-scaling or robust encodings.
If a feature remains extremely sparse, removing or carefully aggregating it can improve stability.
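A brief illustration of log-scaling and bucket-based grouping on a hypothetical sparse refund feature:

```python
import numpy as np
import pandas as pd

# Hypothetical sparse feature: most entities have zero refunds.
refunds = pd.Series([0, 0, 0, 0, 0, 1, 0, 0, 250, 0, 3, 0], name="refund_sum_30d")

# Log-scaling compresses the long tail while keeping zeros at zero.
log_refunds = np.log1p(refunds)

# Bucket-based grouping collapses the sparse continuous range into a few
# coarse categories the anomaly model can handle more stably.
buckets = pd.cut(refunds, bins=[-0.1, 0, 10, 100, np.inf],
                 labels=["none", "low", "medium", "high"])

print(pd.DataFrame({"raw": refunds, "log1p": log_refunds.round(2), "bucket": buckets}))
```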
How do you show managers that your solution has an acceptable return on investment?
Track how many confirmed fraud cases the system catches and estimate the monetary value saved. Also measure operational costs, including engineering and manual review time. If the net benefit outweighs the ongoing costs, the return on investment is demonstrable. Over time, optimize the pipeline to reduce overhead while catching more fraud.
How do you test this system with historical data that might be incomplete or biased?
Historical data might not reflect the latest fraud. Back-testing is still necessary. Train on older windows, test on new data with known outcomes. Tag suspicious transactions flagged by prior supervised systems. Measure how many known fraudulent entities this unsupervised system catches. Validate the model’s stability by observing how it handles random segments of data with varying patterns.
How would you handle real-time vs. batch scoring?
For batch scoring, the system can train on a weekly or monthly cadence and flag outliers. For real-time, you might score an entity each time it triggers an event. Real-time detection adds complexity around concurrency and data latency. Caching partial aggregates is critical so that the platform can quickly update entity features and anomaly scores.
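A toy sketch of caching partial aggregates so features can be refreshed per event without re-scanning history; the cache here is an in-process dict, whereas production would use a shared low-latency store.

```python
from collections import defaultdict

# In-process cache of partial aggregates per entity so features can be updated
# per event instead of re-scanning history; production would use a shared,
# low-latency store instead of a local dict.
cache = defaultdict(lambda: {"count": 0, "sum": 0.0})

def on_event(entity_id, amount):
    """Update cached aggregates and recompute the entity's features."""
    agg = cache[entity_id]
    agg["count"] += 1
    agg["sum"] += amount
    return {
        "event_count": agg["count"],
        "amount_sum": agg["sum"],
        "amount_mean": agg["sum"] / agg["count"],
    }   # in production these features would feed the anomaly-scoring service

print(on_event("u1", 20.0))
print(on_event("u1", 5.0))   # the second event updates the same cached aggregates
```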
How do you incorporate domain expertise from risk analysts?
Risk analysts might know typical usage patterns for certain markets or have rules of thumb about suspicious behaviors. They should suggest initial feature transformations or known suspicious signals. The unsupervised system can combine these domain-based insights with automatically derived features. Analysts can then review anomalies and provide feedback to refine the pipeline.
What if the marketplace expands to new regions with different user behavior?
New regions might bring different operating norms. The system should be region-aware in its features. A separate model for each major region or a combined model with region as a feature might be considered. The pipeline’s time-series normalizations should adapt quickly as data from that region grows.
How do you ensure the model does not discriminate or produce biased flags?
Bias can arise if certain user groups are overrepresented in anomalies. Ongoing fairness checks can measure whether flagged anomalies concentrate disproportionately in particular demographic groups. If so, investigate root causes in the features and their distributions. In some cases, adjusting or removing sensitive features can reduce unintended biases.
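A very simple fairness check is to compare flag rates across segments; the segments, counts, and the idea of a ratio cutoff below are all hypothetical.

```python
import pandas as pd

# Hypothetical fairness audit: compare flag rates across a sensitive segment.
audit = pd.DataFrame({
    "segment": ["A"] * 800 + ["B"] * 200,
    "flagged": [1] * 40 + [0] * 760 + [1] * 25 + [0] * 175,
})

rates = audit.groupby("segment")["flagged"].mean()
ratio = rates.max() / rates.min()
print(rates)
print(f"flag-rate ratio = {ratio:.2f}")   # a large ratio warrants investigation
```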
How do you approach iterative model refinement?
After each deployment, gather agent feedback on flagged cases. Study false positives and false negatives. Update features or thresholds. Retrain and test. This cycle repeats until the system stabilizes at acceptable error levels. Regular re-tuning addresses evolving fraud patterns and distribution shifts.
Could online learning techniques help?
Yes. In streaming scenarios, partial fits of incremental algorithms can adapt. If the data arrives at high velocity, batch training may lag behind new fraud patterns. Online learning or micro-batch updates can maintain model freshness, but also demand robust infrastructure and careful resource management.
How do you handle scaling challenges as the user base grows?
A large user base means more entities and events, so the feature generation pipeline and the anomaly detection training must handle distributed processing. Spark or an in-house ML platform can do large-scale distributed transformations. A well-structured feature store ensures data consistency. The final model inference should also be distributed or at least parallelized so new events can be scored in near real time.
How do you plan to maintain explainability when migrating to more complex models?
If the core anomaly detection model becomes a black box, a separate explainer (like HAIFA) can still run on the final embeddings or output scores. You can also use SHAP or other feature attribution methods. For each flagged entity, the system highlights which features or embeddings deviate most, preserving interpretability for risk analysts.
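A sketch of model-agnostic attribution with SHAP's KernelExplainer on an Isolation Forest score; KernelExplainer only needs the scoring function, so it works even when the model is a black box. The background sample size and the flagged example are illustrative.

```python
import numpy as np
import shap
from sklearn.ensemble import IsolationForest

# Model-agnostic attribution: explain one flagged entity's anomaly score with
# KernelExplainer, which only needs the scoring function itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
model = IsolationForest(random_state=0).fit(X)

background = shap.sample(X, 50)                       # reference population sample
explainer = shap.KernelExplainer(model.decision_function, background)

flagged = np.array([[0.1, 6.0, -0.2, 0.0, 0.3]])      # unusual only on feature 1
shap_values = explainer.shap_values(flagged)
print("per-feature contribution to the anomaly score:", np.round(shap_values, 3))
```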