ML Interview Q Series: How would you design an ML system to reduce missing or incorrect orders at DoorDash?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
A key challenge in the online food-delivery domain is preventing mistakes such as missing items or an entirely inaccurate order from being delivered to a customer. The fundamental question is how to use machine learning to lower the frequency of these errors. A possible strategy is to create a supervised learning pipeline that flags orders at higher risk of mistakes and prompts interventions, either by human review, additional verification, or automatic notifications to the restaurant or delivery personnel.
Identifying the Learning Problem
The nature of the problem is typically a classification task, where the model attempts to distinguish between orders that are likely to be correct and those that are likely to be erroneous. From a data science perspective, each order can be accompanied by a label indicating whether it was fulfilled correctly or had an issue (an incorrect or missing item). Over time, these outcomes can be collected from customer complaints, support tickets, or refund requests.
Data Collection and Feature Engineering
Building the right dataset begins with collecting a broad range of attributes for each order. Examples of these attributes include:
Order details such as the list of items, their quantities, and special instructions.
Historical patterns of similar orders from the same restaurant, time of day, or location.
Restaurant’s historical accuracy performance.
Delivery agent’s history (if any correlation exists with missing or incorrect deliveries).
Time-based factors such as surge hours or staff turnover schedules.
Feature engineering might involve capturing how often a specific item is reported missing, how complex the order is (number of special instructions), or the typical accuracy record of a restaurant during busy hours. Additional features can come from user behavior: how often does a particular user request complex modifications, and what is the usual error rate for those modifications?
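As a concrete illustration, the sketch below derives a few such aggregate features with pandas. The DataFrame and column names (restaurant_id, hour, had_issue, special_instructions) are hypothetical placeholders rather than an actual schema, and the smoothing constant is an arbitrary choice.
import pandas as pd
# historical_orders is a hypothetical table of past orders with an outcome
# column 'had_issue' (1 = missing/wrong item, 0 = correct).
historical_orders = pd.read_csv('historical_orders.csv')
# Restaurant-level error rate, smoothed toward the global rate so that
# restaurants with very few orders do not receive extreme estimates.
global_rate = historical_orders['had_issue'].mean()
stats = historical_orders.groupby('restaurant_id')['had_issue'].agg(['sum', 'count'])
stats['restaurant_error_rate'] = (stats['sum'] + 10 * global_rate) / (stats['count'] + 10)
# Error rate per (restaurant, hour of day) to capture busy-hour effects.
hourly_rates = (historical_orders
                .groupby(['restaurant_id', 'hour'])['had_issue']
                .mean()
                .rename('restaurant_hourly_error_rate'))
# Order-complexity feature: a rough count of special instructions,
# assuming instructions are separated by semicolons.
historical_orders['num_special_instructions'] = (
    historical_orders['special_instructions']
    .fillna('')
    .apply(lambda s: len([part for part in s.split(';') if part.strip()])))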
Model Selection and Training
A wide range of models could potentially be suitable for this kind of classification. These might include gradient-boosted decision trees, random forests, or deep neural networks if the data volume is substantial. For a typical mid-scale system, tree-based ensembles often perform well because of their ability to handle mixed data types and complex feature interactions.
A straightforward supervised training approach is to define a binary classification label. Suppose y_i is 1 if the order ended up having an issue (missing or wrong item) and 0 if the order was correct. A standard cross-entropy loss can be used to train the model:
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
where:
L is the total loss computed over the N training examples.
y_i is the true label for the i-th training example (1 if the order was erroneous, 0 if it was correct).
hat{y}_i is the predicted probability that the i-th order is erroneous.
The goal is to find model parameters that minimize L across all training examples.
By minimizing this loss, the model learns to assign high probability to orders it believes to be at risk of errors and low probability otherwise.
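For concreteness, a minimal NumPy version of this loss, with made-up labels and predictions purely for illustration:
import numpy as np
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
y_true = np.array([1, 0, 0, 1])          # 1 = erroneous order, 0 = correct
y_pred = np.array([0.9, 0.2, 0.1, 0.6])  # predicted error probabilities
print(binary_cross_entropy(y_true, y_pred))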
Model Evaluation and Metrics
Traditional classification metrics help measure performance. Precision indicates how many of the flagged orders truly had issues. Recall measures what fraction of the truly problematic orders get flagged. A balanced measure such as the F1 score can be used if precision and recall should be weighted equally. However, certain business considerations may favor recall over precision: if the cost of an unflagged wrong order is higher than that of an incorrectly flagged order, the system can be tuned to prioritize recall, possibly at the expense of precision.
In real-world scenarios, an imbalanced dataset is common, as only a relatively small fraction of orders may be incorrect. Techniques such as class weighting, oversampling of the minority class, or undersampling of the majority class can be applied to ensure that the model does not always predict “no error.”
Real-Time Inference and System Architecture
In a production setting, the model might run during the order checkout process. A real-time pipeline can be established, where newly placed orders are scored by the model. Orders with a high predicted risk of mistakes might trigger alerts or auto-verification steps:
Automated verification: The system can ask the restaurant to confirm each item if it detects a suspiciously large or complex order.
Human review: If the probability of an error surpasses a certain threshold, a customer service representative might manually verify details with the customer or the restaurant.
Customer confirmation: For extremely high-risk cases, the system could prompt a final reconfirmation from the user before order placement is finalized.
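A minimal sketch of how a predicted risk score could be mapped to these interventions; the thresholds and action names below are illustrative assumptions, not production values.
def choose_intervention(error_probability):
    # Thresholds are hypothetical and would be tuned from business costs.
    if error_probability >= 0.8:
        return "customer_confirmation"   # highest risk: reconfirm with the user
    if error_probability >= 0.5:
        return "human_review"            # route to a support agent
    if error_probability >= 0.3:
        return "restaurant_auto_verify"  # ask the restaurant to confirm items
    return "no_action"
# Example usage at checkout time (model and features as defined elsewhere):
# risk = model.predict_proba(order_features)[:, 1][0]
# action = choose_intervention(risk)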
Monitoring, Feedback, and Iterative Refinement
After deploying the model, it is crucial to track its performance over time and compare outcomes with real-world data (e.g., support tickets, refunds). If the model’s accuracy declines, this might indicate data drift, changing restaurant behaviors, or seasonal variations in ordering patterns. Regular model retraining and hyperparameter tuning can keep performance optimal.
Example Code Snippet in Python
Below is a simplified example showing how one might use a random forest classifier in Python (using scikit-learn) to train a model that flags problematic orders. This is purely illustrative:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Assume df has columns like 'items_count', 'restaurant_error_rate',
# 'time_of_day', 'historical_user_modifications', 'label' etc.
df = pd.read_csv('orders.csv')
# Extract features and label
X = df.drop('label', axis=1)
y = df['label']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
# Create and train model
model = RandomForestClassifier(n_estimators=100,
                               max_depth=10,
                               class_weight='balanced',
                               random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
This simplistic snippet demonstrates the basic data split, training, and evaluation steps. In production, there would be additional complexities, including advanced feature engineering, pipeline orchestration, real-time inference endpoints, and continuous model monitoring.
Potential Follow-up Question: Data Imbalance
How can we deal with the situation where only a small fraction of orders end up having errors, creating a highly imbalanced dataset?
One approach is to apply class-weighting in the loss function, which instructs the model to pay relatively more attention to the minority class (orders with errors) than to the majority class. Oversampling techniques like SMOTE can generate synthetic samples to augment the minority class. Undersampling the majority class is also possible, though it can lead to loss of potentially informative data. Another strategy is to adjust decision thresholds after the model predicts probabilities. By setting a lower threshold for classifying an order as “high-risk,” recall is increased at the possible expense of precision.
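As an illustration, the snippet below (reusing the train/test split from the earlier example) combines class weighting with a lowered decision threshold; the 0.3 cutoff is an arbitrary example that would normally come from a precision-recall analysis, and SMOTE from the imbalanced-learn package is shown only as an optional alternative.
from sklearn.ensemble import RandomForestClassifier
# Class weighting: penalize mistakes on the rare "error" class more heavily.
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
# Threshold adjustment: flag an order as high-risk above 0.3 instead of 0.5,
# trading some precision for higher recall.
probs = model.predict_proba(X_test)[:, 1]
y_pred = (probs >= 0.3).astype(int)
# Optional oversampling with SMOTE (requires the imbalanced-learn package):
# from imblearn.over_sampling import SMOTE
# X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)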
Potential Follow-up Question: False Positives vs. False Negatives
Which type of misclassification is more costly, and how would you incorporate that into the model design?
In many scenarios, a false negative (failing to flag an actually erroneous order) can be worse because it leads directly to a poor customer experience and potential refunds. A false positive (flagging a correct order as suspicious) might be less damaging because it only adds a bit of friction (e.g., an extra verification step). One can integrate a cost-sensitive approach where the model’s objective is weighted based on the relative cost of each error type. Alternatively, the decision threshold can be tuned so that the system errs on the side of flagging suspicious orders more often, boosting recall at the expense of precision.
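One simple way to encode this asymmetry is to choose the decision threshold that minimizes expected cost on a held-out set. The cost values below (a false negative assumed to cost ten times a false positive) are illustrative assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix
FN_COST = 10.0  # assumed cost of missing an erroneous order (refund, churn risk)
FP_COST = 1.0   # assumed cost of an unnecessary verification step
probs = model.predict_proba(X_test)[:, 1]
best_threshold, best_cost = 0.5, float('inf')
for threshold in np.linspace(0.05, 0.95, 19):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    cost = FP_COST * fp + FN_COST * fn
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost
print(best_threshold, best_cost)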
Potential Follow-up Question: Model Interpretability
How can we ensure stakeholders trust the model’s decisions and maintain interpretability?
This system might have serious business implications, so stakeholders may want explanations when the model flags certain orders. Methods like SHAP (SHapley Additive exPlanations) or feature importance plots can help clarify which factors most contributed to a particular prediction. If a model is more complex (e.g., a large ensemble or deep neural network), interpretability tools become even more important. Ensuring that restaurant owners, delivery personnel, and customer service teams understand the basics of why some orders get flagged can foster trust and cooperation.
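A brief sketch using the shap library with the tree-based model trained earlier; note that the exact shape of the returned SHAP values varies between shap versions and model types.
import shap
# TreeExplainer supports tree ensembles such as random forests and GBMs.
explainer = shap.TreeExplainer(model)
# Explain a sample of recent orders; for binary classifiers the result may be
# a list with one array per class, depending on the shap version.
sample = X_test.iloc[:100]
shap_values = explainer.shap_values(sample)
# Global view of which features most influence the "erroneous order" prediction.
shap.summary_plot(shap_values, sample)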
Potential Follow-up Question: Continuous Improvement and A/B Testing
How do you maintain and improve the system over time?
Regular A/B testing can measure if changes in model configuration or feature sets improve performance. By exposing a small percentage of orders to an updated model and comparing outcomes—like the rate of valid complaints or the number of times the model accurately predicts an issue—against orders handled by the existing model, you can quantify performance gains or declines. This process should be repeated whenever significant changes or retraining are performed, especially if new data sources or major feature engineering updates are introduced.
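For instance, a two-proportion z-test (here via statsmodels) can check whether the complaint rate under the updated model is significantly lower than under the existing one; the counts below are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest
# Hypothetical A/B test outcomes.
complaints = [480, 530]    # orders with reported issues: [treatment, control]
orders = [50000, 50000]    # total orders served by each variant
# One-sided test: is the treatment complaint rate lower than the control's?
stat, p_value = proportions_ztest(count=complaints, nobs=orders, alternative='smaller')
print(stat, p_value)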
Potential Follow-up Question: Edge Cases
What are some edge cases that could break the system?
Unusually large or complex orders with special instructions that do not resemble anything in the training data can cause the model to be uncertain. A restaurant’s sudden drop in quality control due to unforeseen events (staff shortage, supply chain issues) might not be captured in historical data. Rapid changes in user behavior, such as a surge of new users or changes in ordering habits during major events, can also lead to distribution shifts. Designing a robust monitoring system that tracks incoming data distributions compared to training data distributions can help detect these shifts early.
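A lightweight way to implement such monitoring is a population stability index (PSI) check per feature, comparing live feature distributions against the training distribution. The implementation below is a simple sketch, and the 0.2 alert level is a common rule of thumb rather than a hard rule.
import numpy as np
def population_stability_index(expected, actual, bins=10):
    # Bin edges come from the training ("expected") distribution.
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    # Clip live values into the training range so outliers fall in the outer bins.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids division by zero and log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
# Example: psi = population_stability_index(train_df['items_count'], recent_df['items_count'])
# A PSI above roughly 0.2 is often treated as a signal to investigate or retrain.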
Below are additional follow-up questions
How do you handle ephemeral or rapidly changing menus when your model relies on historical data about specific dishes?
Sudden menu changes—such as limited-time offers or seasonal items—can invalidate certain features in your dataset if the item no longer exists or if it appears for the first time without historical statistics. This situation poses data distribution shifts that your model may not have encountered during training.
Potential Pitfalls and Real-World Issues
Missing Historical Data: A brand-new menu item may appear frequently, but because there is no existing information about the restaurant’s error rate for that item, the model might be uncertain about its risk level.
Model Overreaction: If new items are not recognized, the model might inappropriately flag them as high-risk by default, increasing false positives.
Data Shift: When an entire section of the menu changes suddenly (for instance, a restaurant revamps its offerings), the model’s underlying assumptions may no longer hold.
Possible Strategies
Adaptive Feature Engineering: Use item-level embeddings or vector representations where the system can generalize from “similar” items, so that new additions to the menu are not treated completely blindly.
Real-Time Updates: If your platform can detect new menu entries, you might immediately adjust features (e.g., set a default vector or baseline risk), and then update as soon as real user feedback or order outcomes become available.
Cold-Start Mechanisms: Employ a fallback heuristic for brand-new items, such as a mid-range default risk estimate, and refine the model’s estimates over time once enough data is collected.
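A small sketch of such a fallback chain, assuming hypothetical lookup tables item_error_rates and restaurant_error_rates plus a global average as the last resort:
def item_risk_estimate(item_id, restaurant_id,
                       item_error_rates, restaurant_error_rates, global_error_rate):
    # Prefer item-level history when it exists.
    if item_id in item_error_rates:
        return item_error_rates[item_id]
    # Cold start: fall back to the restaurant's overall rate, then the global rate.
    return restaurant_error_rates.get(restaurant_id, global_error_rate)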
How can you address multi-lingual issues if restaurants or customers provide order details or special instructions in various languages?
When user or restaurant-provided data arrives in multiple languages, the feature pipeline might fail to parse important clues indicating the complexity of an order. For instance, a special instruction in Spanish might be misread if the system was trained only on English data.
Potential Pitfalls and Real-World Issues
Misinterpretation or Loss of Text Features: If text-based features are used to identify risky orders (e.g., “no onions, gluten-free bun”), a mismatch in language can lead to losing that information.
Skewed Performance: The model may underperform for non-English data, creating potential fairness issues among different customer demographics.
Possible Strategies
Language Detection: Automatically detect the language of the text and route it to the appropriate language model or translation system before feature extraction.
Multi-lingual NLP Models: Leverage pre-trained multi-lingual embeddings (e.g., from HuggingFace or multilingual BERT) to generate consistent representations across languages (a small sketch follows this list).
Localization in the Application: Encourage standardized note-taking formats or maintain a curated dictionary of common modifications in multiple languages, so the system can better capture critical information.
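One hedged sketch uses the sentence-transformers package with a multilingual model, so that special instructions written in different languages land in a shared embedding space; the model name below is one publicly available option, not a requirement.
from sentence_transformers import SentenceTransformer
# A multilingual sentence-embedding model: English, Spanish, etc. map to the
# same vector space, so downstream features stay comparable across languages.
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
instructions = [
    "no onions, gluten-free bun",
    "sin cebolla, pan sin gluten",   # Spanish version of the line above
]
embeddings = encoder.encode(instructions)  # shape: (2, embedding_dim)
# These vectors can then feed the risk model as text-derived features.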
What considerations arise around data privacy and compliance when collecting order data and user information for model training?
Any system storing user information and order details for machine learning must handle regulatory issues around personal data. Laws like GDPR and CCPA require companies to allow users to opt out of data collection or to request data deletion.
Potential Pitfalls and Real-World Issues
Compliance Violations: Storing personal data without user consent or in unencrypted form may violate privacy regulations.
Model Retraining After Deletion Requests: If a user requests the removal of their data, you must ensure that the model can either be retrained without that data or guarantee that the user’s data does not remain embedded in the model weights.
Over-Collecting Data: Gathering excessive personal details not strictly needed for error detection can raise ethical and legal concerns.
Possible Strategies
Anonymization and Aggregation: Replace personal identifiers (names, addresses) with abstracted or hashed identifiers to protect user privacy while retaining the model’s utility (see the sketch after this list).
Differential Privacy: Implement techniques that add controlled noise, ensuring no single user’s data can be uniquely re-identified from the model’s outputs.
Privacy-Aware Architecture: Keep sensitive PII (Personally Identifiable Information) separate from the data pipeline that feeds into the modeling process, so the model operates only on minimal necessary features.
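For example, a salted one-way hash can replace raw user identifiers before data reaches the feature pipeline. This is a minimal sketch; the environment-variable salt is an assumption about how the secret might be managed.
import hashlib
import os
def pseudonymize(user_id, salt):
    # One-way, salted hash so raw identifiers never enter the feature store.
    return hashlib.sha256((salt + user_id).encode('utf-8')).hexdigest()
salt = os.environ.get('PSEUDONYMIZATION_SALT', 'example-salt')  # assumed secret management
print(pseudonymize('user_12345', salt))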
How do you ensure that your offline metrics (e.g., F1 score, precision, recall on a validation set) translate effectively into real-world outcomes once deployed?
Models might show strong performance on validation or cross-validation metrics but fail to deliver the same improvements in production due to a mismatch between offline and online conditions.
Potential Pitfalls and Real-World Issues
Dataset Shift: The production environment may include new customer behavior, restaurant changes, or fluctuations in ordering patterns that were not reflected in the training/validation data.
Proxy Labels: Offline labels (e.g., user complaints) might not capture every scenario of an incorrect order, causing the model to underrepresent or overrepresent certain error patterns.
User Behavior Changes: Once the system begins flagging orders, it might alter how restaurants package items or how customers submit orders (users might be more careful with their modifications), leading to new data distributions.
Possible Strategies
A/B Testing: Continuously test the model in a controlled manner, measuring the real-world reduction in complaints or refunds compared to a control group.
Online Feedback Loops: Integrate near-real-time feedback from user complaints or support tickets to adjust or retrain the model.
Regular Calibration Checks: Monitor predicted probabilities vs. actual outcomes. If calibration drifts, re-calibrate or retrain with updated data.
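A quick calibration check with scikit-learn, comparing predicted probabilities against observed error rates on recently labeled orders; X_recent and y_recent are hypothetical stand-ins for that data.
from sklearn.calibration import calibration_curve
probs = model.predict_proba(X_recent)[:, 1]   # scores on recently labeled orders
frac_positive, mean_predicted = calibration_curve(y_recent, probs, n_bins=10)
# If predicted probabilities drift away from observed error rates,
# recalibrate (e.g., Platt scaling or isotonic regression) or retrain.
for predicted, observed in zip(mean_predicted, frac_positive):
    print(f"predicted={predicted:.2f}  observed={observed:.2f}")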
How could you incorporate an unsupervised or semi-supervised approach if you don’t have explicit labels for every single order?
Not all incorrect orders are reported, and some orders that are labeled correct might be unknown false negatives. A purely supervised approach can be hampered if only a small fraction of errors is clearly labeled.
Potential Pitfalls and Real-World Issues
Incomplete Labels: Many “correct” orders might actually have minor errors that went unreported, distorting the training signal.
High Labeling Costs: Reliance on user complaints or manual review is expensive and still might miss many cases.
Possible Strategies
Anomaly Detection: Use an unsupervised anomaly detection method to highlight orders that deviate sharply from typical patterns (e.g., extremely large item count, unusual combination of items).
Self-Training: Leverage a large pool of unlabeled examples and label them with a current model’s predictions; then use those pseudo-labels to retrain or refine the model.
Active Learning: Have the model request explicit verification for only the most uncertain or “borderline” orders, thus improving the labeling strategy without needing to label everything.
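A minimal sketch of uncertainty-based sampling, where the orders whose predicted probabilities lie closest to 0.5 are routed for manual verification; the batch size of 200 is an arbitrary example.
import numpy as np
probs = model.predict_proba(X_unlabeled)[:, 1]   # X_unlabeled: orders without outcome labels
uncertainty = np.abs(probs - 0.5)                # smallest values = least certain predictions
query_indices = np.argsort(uncertainty)[:200]    # send ~200 borderline orders for labeling
# Once these orders are manually labeled, they are appended to the training
# set and the model is retrained, closing the active-learning loop.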
How can you manage system performance and latency constraints if you want real-time scoring for every single order placed on the platform?
Real-time inference might be necessary to flag errors before the order is confirmed, but large or complex models can have high latency, potentially slowing the user checkout experience.
Potential Pitfalls and Real-World Issues
High Throughput: During peak ordering times, you might process thousands of orders per second. A computationally expensive model can create bottlenecks.
Memory Footprint: Storing large model parameters in memory for quick lookups can be challenging if you deploy many models or large ensemble methods.
User Experience: Any noticeable delay in the ordering funnel can hurt the customer’s overall experience and cause checkout abandonment.
Possible Strategies
Model Optimization: Techniques such as model quantization, distillation to smaller neural nets, or more efficient tree-based models with optimized inference paths can reduce latency.
Edge Caching: Deploy the model closer to the user or restaurant’s region, reducing network latency and allowing faster response times.
Hybrid Approaches: Use a simpler model for immediate triage (fast inference) and only trigger a more complex model for borderline or suspicious cases if needed.
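A sketch of the hybrid idea: a cheap first-stage model scores every order, and only scores inside an uncertain band invoke the heavier model. The band limits and the model objects are placeholders.
def score_order(order_features, fast_model, heavy_model, low=0.2, high=0.6):
    # Stage 1: a lightweight model (e.g., logistic regression) scores everything.
    p = fast_model.predict_proba(order_features)[:, 1][0]
    if p < low or p > high:
        return p                         # confident enough; skip the expensive model
    # Stage 2: the heavier model runs only for the uncertain band.
    return heavy_model.predict_proba(order_features)[:, 1][0]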
How do you handle the trade-off between minimizing errors (missing or wrong items) and optimizing other objectives, such as delivery speed or cost?
In practice, the business might not focus solely on reducing errors if that significantly slows down order processing or increases operational costs. There is often a multi-objective optimization at play.
Potential Pitfalls and Real-World Issues
Excessive Verification: If you try too hard to eliminate errors, you might slow down the entire process with frequent manual checks.
Conflicting KPIs: Departments responsible for user satisfaction push for fewer errors, while logistics teams push for faster or cheaper deliveries.
Possible Strategies
Weighted Objective Functions: In your decision-making pipeline, incorporate separate reward terms for accuracy, speed, and cost. You then choose trade-off parameters that reflect business priorities.
Tiered Risk Thresholds: Set different thresholds for flagging orders based on time constraints or other operational metrics. Low-latency scenarios might use a higher threshold, while less time-sensitive scenarios can afford a more thorough check.
Cost-Sensitive Learning: Build a cost matrix that includes the penalty for a wrong item vs. the penalty for the extra time or resources used to verify an order. Adjust model predictions accordingly.
Can you utilize external signals—like restaurant inventory data or driver feedback—to further improve accuracy?
Sometimes restaurants run out of certain ingredients, or drivers notice mistakes at pick-up. These external signals can provide valuable real-time feedback loops.
Potential Pitfalls and Real-World Issues
Reliability of External Data: Restaurant inventory systems might not be accurate if employees forget to update them. Driver feedback might be sparse or subjective.
Data Integration Complexity: Merging multiple data sources in real time can introduce engineering challenges, especially if they come in different formats or at different times.
Possible Strategies
Multimodal Data Integration: Combine textual inventory logs, scanning receipts, or mobile app driver feedback to refine predictions. For example, if an ingredient is nearly out of stock, the risk of a missing item might rise.
Rewarding Timely Feedback: Encourage or incentivize drivers to mark incomplete or obviously incorrect orders when they pick them up, which can be fed back into a near real-time system for quicker intervention.
Monitoring Data Quality: Track the accuracy of each external data source. If a particular restaurant rarely updates its inventory, weigh its real-time signals less.
What steps can you take if you discover that the model’s predictions vary significantly across different geographical regions or restaurant types?
Performance can differ because of cultural ordering habits, local menu variations, or differences in restaurant staffing levels. A one-size-fits-all approach may cause high variance in error rates across regions.
Potential Pitfalls and Real-World Issues
Excessive Complexity: Building a separate model for each region or restaurant type might become unmanageable if there are too many segments.
Data Fragmentation: Segmentation can split your dataset into smaller subsets, making training less robust if certain regions have limited data.
Possible Strategies
Hierarchical Modeling: Train a global model for all orders, but include region- or restaurant-specific embeddings or features that capture local nuances. This way, the model can still generalize while accounting for local differences.
Transfer Learning: For new regions with insufficient data, start with a global model as the base and fine-tune it with region-specific data.
Performance Monitoring: Maintain region-level or segment-level dashboards. If a specific segment’s performance lags behind, drill down into possible reasons (e.g., data quality, unique menu items, or a different labeling process).