ML Case-study Interview Question: Dual-Model ML Pipeline for High-Risk Vendor Detection with Limited Labels
Case-Study question
A global organization needs to identify high-risk intermediary vendors, referred to as “Agents,” that transact on behalf of the organization in various countries. Traditional rule-based methods could not capture the nuanced behaviors of these Agents, especially given that the available labeled data is both minimal and possibly biased. The organization wants to build a robust ML-based pipeline to detect these Agents. Using the limited labeled set of confirmed Agents and a broader set of Purchase Order (PO) and vendor-level features, how would you architect an end-to-end system to solve this classification problem? How would you handle the limited and unbalanced data, ensure interpretability, and design the pipeline so it can be periodically retrained as more data arrives?
Provide a detailed approach covering data ingestion, data cleaning, feature engineering, model design, model evaluation, and deployment steps. Include how you would address the need to detect vendors (rather than just single transactions), ensure a recall-oriented approach if false negatives are risky, and manage speed-performance trade-offs in a production environment.
Proposed Solution
A dual-model architecture addresses the requirement to use both PO-level data and vendor-level data. A first-level model flags suspicious transactions based on transaction-level features. An aggregation step converts those transaction predictions into vendor-level features. A second-level model then classifies each vendor as “Agent” or “Non-Agent.” This mitigates the sparse labeling issue by maximizing the use of all available transaction data while still considering vendor-level profiles.
Data Ingestion and Preprocessing
Raw data originates from a spend management platform. Each row represents a transactional record with key features (transaction type, PO descriptions, amount, currency, department, etc.). Minimal labeled data exists: a small subset of vendors confirmed as Agents.
Convert relevant categorical features (e.g., currency, department) into dummy variables. Even if 300 features result, pass them all to the model when they might hold meaningful signal, since aggressive dimensionality reduction may discard important variance.
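The encoding step above can be sketched with pandas. This is a minimal illustration; the column names (`currency`, `department`, and so on) are hypothetical stand-ins for the real spend-management schema.

```python
import pandas as pd

# Hypothetical PO-level records; column names are illustrative only.
df = pd.DataFrame({
    "vendor_id": ["V1", "V1", "V2"],
    "amount": [1200.0, 80.5, 430.0],
    "currency": ["USD", "EUR", "USD"],
    "department": ["Ops", "Ops", "Legal"],
})

# One-hot encode the categoricals while leaving numeric columns untouched.
df_encoded = pd.get_dummies(df, columns=["currency", "department"])
print(sorted(df_encoded.columns))
# vendor_id and amount survive; each category becomes its own indicator column.
```

In practice the same encoding must be fitted on training data and reapplied consistently at inference time, for example by storing the full column list and reindexing new batches against it.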
First-Level Model (Transaction-Level)
A tree-based classifier (e.g., Random Forest or Gradient Boosted Decision Trees) trains on PO-level samples. Targets are derived by marking transactions from known Agents as positive examples. All other transactions remain unlabeled or negative by default. Because of severe class imbalance, focus on maximizing recall to avoid missing potential Agents. Use techniques such as grid search or random hyperparameter search for tuning. Generate PO-level predictions indicating the probability or direct class label of whether each transaction is suspicious.
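A recall-oriented hyperparameter search can be expressed directly through scikit-learn's `scoring` parameter. The snippet below is a sketch on synthetic imbalanced data standing in for real PO-level records; the grid values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: ~5% positives, mimicking the rare Agent class.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95, 0.05], random_state=42)

# Score candidates on recall so that false negatives drive model selection.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid, scoring="recall", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping `scoring="recall"` for `"f1"` or a custom scorer changes the selection criterion without touching the rest of the pipeline.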
Aggregation to Vendor-Level
Combine all transaction predictions for each vendor. Calculate metrics such as mean predicted-suspicious value, the count of high-risk transaction types, and the number of distinct subsidiaries engaged. This step enriches the vendor’s profile with aggregated features based on the first-level model’s signals.
Second-Level Model (Vendor-Level)
Train a Support Vector Machine or other suitable classifier to predict “Agent” or “Non-Agent” at the vendor level. Incorporate vendor-level metadata, including the aggregated statistics from the first-level predictions. Optimize for balanced accuracy or F1-score. For extremely imbalanced labels, stratified train-test splits help ensure minority classes appear in each fold.
Model Evaluation and Monitoring
Compare each candidate pipeline against null accuracy (always predicting the majority class). Because the labeled portion is small, evaluate recall, TP / (TP + FN), to ensure minimal false negatives. Examine confusion matrices: TP is the true-positive count and FN the false-negative count, and high recall guards against missing potential Agents.
Perform an 80–20 stratified split, then inspect confusion matrices on both training and validation sets. Because the sample size is small, interpret results with caution. Once the final model is chosen, deploy it to production and refresh it regularly with new transactions and updated labels.
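The evaluation described above can be computed in a few lines with scikit-learn. The labels and predictions below are toy values chosen only to show how the confusion-matrix counts and recall relate.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy vendor-level ground truth and predictions; values are illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # equals tp / (tp + fn)
print(tn, fp, fn, tp, recall)  # 3 1 1 3 0.75
```

In this toy case one true Agent is missed (FN = 1), so recall is 3 / (3 + 1) = 0.75; in production, that single miss is exactly the kind of error the recall-oriented design is meant to surface.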
Production Considerations
Random Forests typically retrain more quickly than gradient-boosted frameworks tuned over large hyperparameter grids. In an environment with frequent model updates, faster training cycles may be more practical. When new confirmed Agents appear, fold them into the labeled dataset and retrain both the first-level and second-level models. Over time, model performance stabilizes as more diverse labeled data arrives.
Example Code Snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Assume df_transactions has columns: ['vendor_id', 'amount', 'currency_dummy_...', 'dept_dummy_...', 'label']

# Stratified split keeps the rare positive class present in both partitions
X = df_transactions.drop(['label'], axis=1)
y = df_transactions['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# First-level model (Random Forest) on PO-level features only
rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_model.fit(X_train.drop('vendor_id', axis=1), y_train)

# PO-level predictions (copy first to avoid chained-assignment warnings)
X_test = X_test.copy()
X_test['po_pred'] = rf_model.predict(X_test.drop('vendor_id', axis=1))

# Aggregate transaction-level predictions up to the vendor level
df_agg = X_test.groupby('vendor_id').agg({
    'po_pred': 'mean',
    # add more aggregates...
}).reset_index()

# Merge vendor-level metadata if needed
# e.g., df_vendor_meta with columns like 'vendor_id', 'commodity_count', etc.
df_vendor_level = pd.merge(df_agg, df_vendor_meta, on='vendor_id', how='left')

vendor_X = df_vendor_level.drop(['vendor_id', 'true_label_if_available'], axis=1)
vendor_y = df_vendor_level['true_label_if_available']  # vendor-level label, if known

# Second-level model (SVM)
svm_model = SVC(kernel='rbf', class_weight='balanced')
svm_model.fit(vendor_X, vendor_y)
Potential Follow-Up Questions
How do you handle an extremely small set of confirmed positive labels while many negative labels remain uncertain?
Use a recall-oriented approach and a robust evaluation strategy. Random oversampling, SMOTE, or carefully designed hyperparameter searches can help. Calibrate the model with a threshold that prioritizes catching true positives. If negative labels are unverified, treat them with caution. Active learning can also prompt domain experts to label additional vendors.
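Threshold calibration can be demonstrated concretely: instead of the default 0.5 cutoff on `predict_proba`, a lower threshold flags more candidates for review. The data and threshold value below are illustrative, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for vendor-level features.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Lowering the decision threshold below 0.5 trades precision for recall.
proba = clf.predict_proba(X)[:, 1]
default_preds = (proba >= 0.5).astype(int)
recall_preds = (proba >= 0.2).astype(int)
print(default_preds.sum(), recall_preds.sum())
```

Because every sample flagged at the 0.5 cutoff is also flagged at 0.2, the lower threshold can only add positives; in production, the exact threshold would be chosen on a validation set against an acceptable false-positive budget.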
Why not train only on vendor-level data?
PO-level data holds granular information about individual transactions that might get lost if you only rely on vendor-level aggregation. The dual-model pipeline inherits the fine detail from PO predictions and rolls it into a vendor-level profile. A single pass on vendor-level data alone cannot incorporate each transaction’s distinct risk signals.
Why did you choose Random Forest and SVM over alternatives?
A Random Forest quickly handles sparse and high-dimensional data and provides stable results with minimal tuning. SVM with appropriate kernels can effectively separate classes in imbalanced datasets once vendor-level features are aggregated. Naive Bayes was avoided because the prior probabilities are unknown and the small sample of Agents may not be representative of the true proportion.
How do you ensure interpretability for auditors or internal stakeholders who need assurance about the results?
Tree-based methods can generate feature importances, so highlight top contributors to the suspicious prediction. Show aggregated transaction patterns (e.g., repeated high-risk transaction types). SVM is less transparent by default, but combined with the first-level Random Forest feature rankings, you can give a sense of which signals drive the final classification.
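A ranked feature-importance summary of the kind auditors can review is straightforward to produce from a fitted Random Forest. The feature names here are hypothetical placeholders for the real PO-level columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; real ones would come from the PO dataset.
feature_names = [f"feat_{i}" for i in range(8)]
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

# Impurity-based importances sum to 1, giving a relative ranking of signals.
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(3))
```

Impurity-based importances can be biased toward high-cardinality features, so permutation importance on held-out data is a useful cross-check when the ranking will be shown to stakeholders.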
Would you ever employ dimensionality reduction techniques here?
Dimensionality reduction (e.g., PCA) can simplify training times. But if there is a risk that relevant transactional details get lost, it may be better to keep most features. Monitoring performance through cross-validation helps decide if the small gains in speed outweigh the loss in predictive power.
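The speed-versus-accuracy decision described above can be settled empirically by cross-validating the full feature set against a PCA-reduced pipeline. The dataset and component count below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic wide data standing in for the dummy-encoded PO features.
X, y = make_classification(n_samples=300, n_features=30, random_state=7)

# Compare the full feature set against a PCA-reduced pipeline.
full = RandomForestClassifier(random_state=7)
reduced = make_pipeline(PCA(n_components=10),
                        RandomForestClassifier(random_state=7))

full_score = cross_val_score(full, X, y, cv=3).mean()
reduced_score = cross_val_score(reduced, X, y, cv=3).mean()
print(round(full_score, 3), round(reduced_score, 3))
```

Placing PCA inside the pipeline ensures the projection is refit on each training fold, so the comparison is free of leakage; if the reduced score is materially lower, the speed gain is not worth it.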