ML Case-Study Interview Question: Real-Time Business Shopper Identification on eCommerce Platforms using XGBoost
Case-Study Question
You have been tasked with identifying business shoppers among a large eCommerce platform's customers. The platform has a separate program for business users that offers special benefits, yet many real businesses shop on the normal consumer site without signing up for it. You must design a real-time system that predicts the probability that a shopper is a business. High-probability shoppers should receive additional outreach from a dedicated sales team. The dataset has incomplete labels because many business buyers shop on the consumer site and never self-identify. You are required to propose an approach for obtaining ground-truth labels, deciding on model architecture, handling real-time feature generation, and selecting precision or recall thresholds for specific outreach campaigns.
The interview panel wants a concrete plan. Explain your labeling strategy, describe how you would design a real-time pipeline under strict sub-second constraints, justify your choice of machine learning framework, and show how you would select thresholds for different outreach channels.
Provide details about practical implementation, including how to stream and store features, how to maintain large amounts of historical information for each user in real time, and how to ensure that phone outreach is triggered only for high-precision leads.
Present your design choices and the reasoning behind them. Include any relevant formula for the threshold-based classification.
Proposed Detailed Solution
The goal is to flag probable business shoppers in real time. The design has two main components: a labeling pipeline and an inference pipeline.
Labeling Strategy
Build a labeled dataset from multiple sources. Known business shoppers become positive labels. Customers who explicitly opt out of business marketing become negative labels. Increase coverage with a record linkage algorithm that matches addresses against known business databases to find more business customers. Fill in the remaining unlabeled customers with an offline neural network, assigning synthetic positive labels above a high confidence cutoff and synthetic negative labels below a low one; customers in between stay unlabeled and are excluded from training. This hybrid labeling approach yields adequate training data volume while preserving label accuracy.
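A minimal sketch of the confidence-cutoff step, assuming the offline network's scores are already computed; the 0.9 and 0.1 cutoffs are illustrative, not values from the source:

import numpy as np

POS_CUTOFF = 0.9  # assumed high-confidence cutoff for synthetic positives
NEG_CUTOFF = 0.1  # assumed low-confidence cutoff for synthetic negatives

def assign_synthetic_labels(scores: np.ndarray) -> np.ndarray:
    # Map offline-model confidences to 1 (business), 0 (consumer),
    # or -1 (still unlabeled; excluded from training).
    labels = np.full(scores.shape, -1, dtype=int)
    labels[scores >= POS_CUTOFF] = 1
    labels[scores <= NEG_CUTOFF] = 0
    return labels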
Real-Time Pipeline
A sub-second pipeline is essential for immediate outreach. On each customer order event, fetch both real-time streaming features (like the current order details) and precomputed features from historical data (like lifetime order count). Real-time data is stored in a high-throughput key-value store. Historical data is precomputed in a feature store at regular intervals. Both sets of features are merged in memory. Then the model runs inference and returns the probability that the customer is a business.
Chosen Model Framework
XGBoost is selected for its speed and strong performance on tabular data. It handles large feature sets and supports fast inference within sub-second constraints. It is trained on a large set of historical labeled data, with model selection driven by the area under the precision-recall curve (PR AUC), which handles heavy class imbalance better than accuracy does.
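A training sketch under stated assumptions: the synthetic data, hyperparameters, and the aucpr evaluation metric are illustrative stand-ins for the production setup.

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled dataset; 2% positives mimics the
# heavy class imbalance of real business shoppers.
X, y = make_classification(n_samples=100_000, n_features=40,
                           weights=[0.98], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pos, neg = int((y_train == 1).sum()), int((y_train == 0).sum())
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    eval_metric="aucpr",         # track area under the precision-recall curve
    scale_pos_weight=neg / pos,  # reweight the rare business class
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
model.save_model("hamlet_model.xgb")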
Threshold Selection
Different outreach channels (like phone calls vs. on-site prompts) have different costs. For expensive phone outreach, pick a high threshold that favors precision, so the sales team calls mostly true businesses. For low-cost channels such as an on-site banner, select a lower threshold that favors recall.
The decision rule for a given channel is:

flag(customer) = 1 if P(isB2B) >= theta, else 0

Here P(isB2B) is the model's predicted probability and theta is the channel's threshold, chosen to meet the desired trade-off between false positives and false negatives. A lower theta captures more true business shoppers (higher recall) but admits more false positives; a higher theta yields fewer, higher-confidence leads (higher precision). The phone channel therefore uses a higher theta than the on-site channel.
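One way to pick channel thresholds on a held-out validation set, sketched below; the 0.90 precision and 0.80 recall targets are illustrative, and model, X_valid, and y_valid are reused from the training sketch above.

import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, min_precision):
    # Among thresholds meeting the precision floor, take the one with
    # maximum recall (assumes the target precision is reachable).
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    ok = precision[:-1] >= min_precision  # last PR point has no threshold
    idx = recall[:-1][ok].argmax()
    return thresholds[ok][idx]

def threshold_for_recall(y_true, y_score, min_recall):
    # Among thresholds meeting the recall floor, take the one with
    # maximum precision.
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    ok = recall[:-1] >= min_recall
    idx = precision[:-1][ok].argmax()
    return thresholds[ok][idx]

valid_scores = model.predict_proba(X_valid)[:, 1]
phone_threshold = threshold_for_precision(y_valid, valid_scores, 0.90)
onsite_threshold = threshold_for_recall(y_valid, valid_scores, 0.80)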
Practical Implementation Details
Store streaming features (e.g. last search query, cart contents) in an in-memory database with low latency reads. Maintain historical aggregates (e.g. total lifetime spend, average order size) in a feature store that updates every 12 hours. At inference time, query both stores, merge features, and run xgboost. Coordinate all feature retrieval in a real-time API that responds within the tight SLA. Cache predictions to avoid repeated inference if multiple messages or triggers happen shortly after the same order event.
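A minimal in-process TTL cache for the prediction-reuse point above; in production a shared store such as Redis would play this role, and the 60-second TTL is an assumption.

import time

_CACHE = {}              # customer_id -> (expires_at, score)
CACHE_TTL_SECONDS = 60.0

def cached_score(customer_id, score_fn):
    # Return a fresh cached prediction if one exists; otherwise run real
    # inference and remember the result for repeated triggers.
    now = time.time()
    hit = _CACHE.get(customer_id)
    if hit is not None and hit[0] > now:
        return hit[1]
    score = score_fn(customer_id)
    _CACHE[customer_id] = (now + CACHE_TTL_SECONDS, score)
    return score

# Usage: cached_score(customer_id, predict_business_probability)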
Code Snippet (Example Sketch in Python)
import time

import numpy as np
import xgboost as xgb

# Channel thresholds; in practice these come from the precision-recall
# analysis described above. The values here are illustrative.
PHONE_THRESHOLD = 0.90   # high precision: phone calls are expensive
ONSITE_THRESHOLD = 0.50  # higher recall: banners are cheap

# Feature schema the model was trained on (illustrative names).
FEATURE_ORDER = ["order_total", "item_count", "lifetime_orders", "lifetime_spend"]

model = xgb.XGBClassifier()
model.load_model("hamlet_model.xgb")

def get_real_time_features(customer_id):
    # Query the fast key-value cache for features of the current order event.
    return {"order_total": 129.99, "item_count": 4}  # stubbed example values

def get_historical_features(customer_id):
    # Query the feature store for precomputed aggregates (refreshed every 12h).
    return {"lifetime_orders": 37, "lifetime_spend": 5120.0}  # stubbed values

def prepare_feature_vector(features):
    # Order the merged features to match the training-time schema.
    return [features[name] for name in FEATURE_ORDER]

def trigger_phone_outreach(customer_id):
    pass  # stub: enqueue a lead for the dedicated sales team

def show_onsite_banner(customer_id):
    pass  # stub: surface the business-program banner on the site

def predict_business_probability(customer_id):
    rt_features = get_real_time_features(customer_id)
    hist_features = get_historical_features(customer_id)
    merged_features = {**rt_features, **hist_features}
    feature_vector = prepare_feature_vector(merged_features)
    return model.predict_proba(np.asarray([feature_vector], dtype=float))[0][1]

def handle_order_confirmation(customer_id):
    start_time = time.time()
    score = predict_business_probability(customer_id)
    elapsed_time = time.time() - start_time
    if elapsed_time < 1.0:  # act only when the sub-second SLA is met
        if score > PHONE_THRESHOLD:
            trigger_phone_outreach(customer_id)
        elif score > ONSITE_THRESHOLD:
            show_onsite_banner(customer_id)
Possible Follow-Up Questions
How would you ensure robust labeling without mislabeling customers?
Combine multiple sources of truth. Rely on strong signals (like self-identified businesses) for direct labels. Use record linkage to confirm known business addresses. Ensure the neural network that assigns synthetic labels is trained on examples with reliable labels. Validate labeling precision by manual inspection of a random sample. Retune the confidence threshold if error rates are too high.
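A small worked example of the manual-audit check; the audit counts below are hypothetical.

import numpy as np

# 200 synthetic positives audited by hand; 1 = reviewer confirmed business.
audited = np.array([1] * 184 + [0] * 16)
p_hat = audited.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(audited))
print(f"label precision ~ {p_hat:.2f} +/- {1.96 * se:.2f} (95% normal approx.)")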
How do you validate your model’s performance offline before deployment?
Hold out a test set with known business labels. Calculate standard metrics such as PR AUC, precision at different recall levels, and recall at different precision levels. Measure the accuracy of synthetic labels on a subset cross-checked by known ground-truth. Conduct ablation tests to see which features drive performance gains. Check inference latency in a staging environment with production-like data volumes.
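A sketch of the offline metric computation; it reuses model and the validation split from the training sketch as a stand-in for a proper held-out test set.

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def precision_at_recall(y_true, y_score, target_recall):
    # Best precision achievable while keeping recall at or above the target.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return precision[recall >= target_recall].max()

test_scores = model.predict_proba(X_valid)[:, 1]
print(f"PR AUC: {average_precision_score(y_valid, test_scores):.3f}")
print(f"precision@recall=0.5: {precision_at_recall(y_valid, test_scores, 0.5):.3f}")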
Why is PR AUC chosen over ROC AUC for model selection?
Business classification is heavily imbalanced, so overall accuracy and ROC AUC can be misleading. Precision and recall better reflect business objectives: you care about capturing true businesses (recall) while minimizing false positives (precision). PR AUC summarizes performance across multiple operating points. ROC AUC may appear high even if the model fails at finding a small positive class among a massive negative class.
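A toy demonstration of the gap, using synthetic scores with roughly a 0.1% positive rate and deliberately mediocre separation:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(200_000) < 0.001).astype(int)          # ~0.1% positives
scores = rng.normal(loc=1.5 * y, scale=1.0)            # weak separation
print("ROC AUC:", round(roc_auc_score(y, scores), 3))  # looks strong
print("PR AUC :", round(average_precision_score(y, scores), 3))  # stays low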
Why is XGBoost used instead of deep neural networks?
XGBoost handles structured data with many categorical and numeric features efficiently. It is simpler to debug, faster to train, and typically excels in tabular contexts. It also offers robust hyperparameter tuning and feature importance analysis. Real-time inference is simpler with XGBoost's lightweight model.
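A quick feature-importance readout, assuming the fitted model from the training sketch; the feature names here are placeholders.

import numpy as np

names = [f"f{i}" for i in range(model.n_features_in_)]
top = np.argsort(model.feature_importances_)[::-1][:10]
for i in top:
    print(names[i], round(float(model.feature_importances_[i]), 4))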
How do you handle model drift?
Business shoppers’ behavior evolves. Periodically retrain with the latest data. Keep track of changes in feature distributions. Monitor precision and recall over time. If performance metrics degrade, schedule or trigger retraining. Use online or incremental learning if the volume of new data is high and distribution shifts often.
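One common distribution-shift monitor is the Population Stability Index; a minimal sketch with synthetic data follows, and the 0.2 alert level is a rule of thumb rather than a value from the source.

import numpy as np

def psi(train_values, live_values, n_bins=10, eps=1e-6):
    # Bin edges come from the training distribution; live values are
    # clipped into range so every observation lands in a bin.
    edges = np.quantile(train_values, np.linspace(0.0, 1.0, n_bins + 1))
    expected = np.histogram(train_values, bins=edges)[0] / len(train_values)
    live_clipped = np.clip(live_values, edges[0], edges[-1])
    actual = np.histogram(live_clipped, bins=edges)[0] / len(live_values)
    expected, actual = expected + eps, actual + eps
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
train_feature = rng.normal(size=50_000)
live_feature = rng.normal(loc=0.3, size=50_000)  # simulated shift
if psi(train_feature, live_feature) > 0.2:       # common alert level
    print("feature drift detected; schedule retraining")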
How do you scale this design for millions of users?
Deploy a horizontally scalable real-time service that can handle high throughput. Shard data for both streaming and historical feature stores. Use a distributed architecture for the feature store (like partitioned SQL or NoSQL stores) and replicate frequently accessed data in in-memory caches. Scale xgboost inference by hosting the model in a load-balanced microservice.
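A minimal serving sketch; FastAPI is an assumed framework choice, and predict_business_probability refers to the code sketch above. Replicas of this process behind a load balancer provide horizontal scale.

from fastapi import FastAPI

app = FastAPI()

@app.get("/score/{customer_id}")
def score(customer_id: str) -> dict:
    # Each replica holds its own in-memory copy of the lightweight model.
    proba = predict_business_probability(customer_id)
    return {"customer_id": customer_id, "p_is_b2b": float(proba)}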