ML Case-study Interview Question: Multi-Task Deep Learning for Scalable, Probabilistic Food Delivery ETA Predictions.
Case-Study Question
A large-scale food delivery platform processes billions of orders yearly and wants to improve its ETA predictions for different types of deliveries (prepared meals, groceries, consumer pickups). The firm's existing tree-based ML models hit performance ceilings and do not handle uncertainty well. The firm seeks a two-layer system that provides probabilistic forecasts as a base layer and a decision-making layer that uses those forecasts for various business objectives. How would you design a robust multi-task deep learning model to produce accurate ETAs at scale, handle uncertain factors in real-time deliveries, and evaluate both the calibration and accuracy of these probabilistic predictions?
Detailed Solution
Multi-task modeling consolidates different ETA tasks into a single system. A shared deep learning foundation extracts general features across all deliveries. Lightweight "task heads" then fine-tune predictions for each ETA type. This avoids training multiple specialized models and enables transfer learning from high-frequency tasks to less frequent ones.
Training a multi-task network involves designing a large shared backbone (such as a multi-layer neural network) that ingests broad features: geographic data, historical delivery performance, real-time traffic, and merchant characteristics. Each task head uses the backbone's learned representations and tailors the output layer for its specific ETA objective.
A two-layer architecture separates the unbiased base prediction layer from a decision layer. The base layer predicts a probability distribution for the delivery time. This distribution accounts for uncertainties (like preparation time, driver availability, and parking constraints). The decision layer incorporates business objectives such as minimizing lateness or reducing large ETA fluctuations that might confuse customers.
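As an illustrative sketch (assuming the base layer outputs a Normal distribution parameterized by a mean and sigma, both hypothetical here), the decision layer can convert a distribution into a single quoted ETA by choosing the quantile that matches the business's lateness tolerance:

import torch

def decision_layer_eta(mean, sigma, target_quantile=0.75):
    """Pick the displayed ETA from the predicted Normal distribution.

    The 0.75 quantile is an illustrative business choice: quoting a higher
    quantile trades slightly padded ETAs for fewer late deliveries.
    """
    dist = torch.distributions.Normal(mean, sigma)
    return dist.icdf(torch.tensor(target_quantile))

mean, sigma = torch.tensor([30.0]), torch.tensor([6.0])
print(decision_layer_eta(mean, sigma))  # roughly 34 minutes at the 0.75 quantile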
Deep learning replaces tree-based models to capture complex feature interactions and generalize to new conditions. High-dimensional embedding layers handle categorical variables (for example, merchant IDs or region codes). Deeper network layers learn patterns from the embeddings. The system can train on vast historical data and adapt to changing conditions with incremental retraining.
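A minimal sketch of the embedding approach, with hypothetical vocabulary sizes and feature dimensions chosen so the concatenated vector matches the backbone's input:

import torch
import torch.nn as nn

# Hypothetical cardinalities; real vocabulary sizes come from the feature pipeline.
merchant_emb = nn.Embedding(num_embeddings=50_000, embedding_dim=32)
region_emb = nn.Embedding(num_embeddings=2_000, embedding_dim=8)

merchant_ids = torch.tensor([17, 42, 99])
region_ids = torch.tensor([3, 3, 7])

# Concatenate learned embeddings with dense numeric features before the backbone.
dense = torch.randn(3, 60)
x = torch.cat([merchant_emb(merchant_ids), region_emb(region_ids), dense], dim=1)  # (3, 100)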
Probabilistic predictions require specialized evaluation. Traditional point-wise metrics like mean absolute error measure how close predictions are on average but say nothing about the distribution's spread. Probabilistic calibration measures whether predicted quantiles match actual frequencies: a Probability Integral Transform (PIT) histogram tests whether observed deliveries fall uniformly across the predicted quantile buckets.
The continuous ranked probability score (CRPS) evaluates the full predicted distribution against the observed outcome:

CRPS(F(.|X), x) = integral over y of (F(y|X) - 1{y >= x})^2 dy

Here F(y|X) is the predicted cumulative distribution function of the delivery time given features X, x is the observed delivery time, and 1{y >= x} is the indicator function for y >= x. CRPS penalizes large deviations between the predicted distribution and the actual outcome; lower CRPS indicates better probabilistic alignment.
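In practice, CRPS is often estimated from samples using the identity CRPS(F, x) = E|Y - x| - 0.5 * E|Y - Y'|, with Y and Y' drawn independently from the predicted distribution F. A small sketch, with placeholder distribution parameters:

import torch

def crps_from_samples(samples, x):
    """Monte Carlo CRPS estimate: E|Y - x| - 0.5 * E|Y - Y'|, Y, Y' ~ F.

    samples: (n,) draws from the predicted distribution for one delivery.
    x: the observed delivery time.
    """
    term1 = (samples - x).abs().mean()
    term2 = (samples.unsqueeze(0) - samples.unsqueeze(1)).abs().mean()
    return term1 - 0.5 * term2

# Example: samples from a predicted Normal, observed time of 30 minutes.
dist = torch.distributions.Normal(28.0, 5.0)
print(crps_from_samples(dist.sample((1000,)), torch.tensor(30.0)))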
In production, the model must respond with low latency. Engineering optimizations (like parameter pruning, optimized hardware, caching partial inferences) ensure real-time predictions for tasks such as homepage ETAs. Model fine-tuning strategies (for instance, sequentially training on each specific task head) help the shared backbone converge without overfitting to one task at the expense of others.
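One hedged illustration of caching partial inferences: embeddings of static features (such as merchant IDs) can be precomputed offline so that request-time inference only processes dynamic signals. The names and sizes below are assumptions:

import torch
import torch.nn as nn

merchant_emb = nn.Embedding(50_000, 32)  # hypothetical static-feature embedding

# Precompute embeddings for all merchants offline; serve-time requests then
# skip the embedding layer entirely.
with torch.no_grad():
    merchant_cache = merchant_emb(torch.arange(50_000))  # (50_000, 32)

def serve_features(merchant_id: int, dynamic_features: torch.Tensor) -> torch.Tensor:
    # Concatenate the cached static embedding with real-time features.
    return torch.cat([merchant_cache[merchant_id], dynamic_features])

x = serve_features(42, torch.randn(60))  # ready for the backbone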
A sample PyTorch implementation illustrates the shared-backbone, multi-head architecture:
import torch
import torch.nn as nn
import torch.optim as optim

class SharedBackbone(nn.Module):
    """Shared layers that learn general delivery features across all tasks."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.fc1(x))
        out = self.relu(self.fc2(out))
        return out

class TaskHead(nn.Module):
    """Lightweight head emitting distribution parameters for one ETA task."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)  # e.g., mean and log-scale of delivery time

    def forward(self, features):
        return self.fc(features)

def distribution_loss(preds, target):
    # Gaussian negative log-likelihood as a stand-in for a CRPS- or
    # calibration-oriented objective.
    mean, log_sigma = preds[:, 0], preds[:, 1]
    return (log_sigma + 0.5 * ((target - mean) / log_sigma.exp()) ** 2).mean()

shared_backbone = SharedBackbone(input_dim=100, hidden_dim=256)

# One head per ETA task; new delivery modes get new heads.
task_heads = {
    "prepared_meals": TaskHead(256),
    "groceries": TaskHead(256),
    "consumer_pickup": TaskHead(256),
}

optimizer = optim.Adam(
    list(shared_backbone.parameters())
    + [p for head in task_heads.values() for p in head.parameters()],
    lr=0.001,
)

# Synthetic stand-in for a dataloader yielding (features, target, task_type) batches.
dataloader = [
    (torch.randn(32, 100), torch.rand(32) * 60, "prepared_meals"),
    (torch.randn(32, 100), torch.rand(32) * 60, "groceries"),
]

# Multi-task training loop: route each batch through its task's head.
for features, target, task_type in dataloader:
    backbone_out = shared_backbone(features)
    preds = task_heads[task_type](backbone_out)
    loss = distribution_loss(preds, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Multi-task training sets the stage for flexible optimization. Each training step updates shared parameters so knowledge learned from one delivery type can help others. The firm can attach new heads to handle newly introduced delivery modes.
Designing Probabilistic Evaluations
Calibration testing checks if predicted intervals align with real outcomes. A uniform PIT histogram means the distribution is neither over- nor under-dispersed. Under-dispersion presents a U-shape: the predicted distribution is too narrow, so observations pile up in the extreme quantiles. Over-dispersion presents an inverted U-shape: the distribution is too wide, so observations cluster near the median and the tails are underused.
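A sketch of a PIT check, assuming Normal predictive distributions: compute F(x) for each observed delivery time and histogram the results; roughly equal bucket counts indicate good calibration.

import torch

def pit_values(means, sigmas, observed):
    """PIT value F(x) for each delivery under its predicted Normal distribution.

    A well-calibrated model yields PIT values uniform on [0, 1].
    """
    return torch.distributions.Normal(means, sigmas).cdf(observed)

# Illustrative check with synthetic data drawn from the model's own distribution.
means, sigmas = torch.full((5000,), 30.0), torch.full((5000,), 5.0)
observed = torch.distributions.Normal(30.0, 5.0).sample((5000,))
counts = torch.histc(pit_values(means, sigmas, observed), bins=10, min=0.0, max=1.0)
print(counts)  # roughly 500 per bucket when calibration holds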
Continuous ranked probability score (CRPS) extends mean absolute error to the entire distribution. Minimizing CRPS ensures both the central tendency and the tails match reality. High CRPS would imply the distribution is skewed or too narrow, leading to frequent large misses.
Follow-up Question 1: How would you handle extreme outliers or long-tail deliveries?
Use appropriate distributions with heavier tails (like Weibull or certain mixture models) in the base layer. Estimate parameters that accommodate high-variance events. Increase the penalty on extremely late deliveries by customizing the loss function to emphasize those tail deviations. Fine-tune the deep model with relevant negative examples (very late deliveries). Evaluate CRPS specifically at higher quantiles to ensure correct tail modeling.
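One way to encode the heavier penalty on late-side misses is a quantile (pinball) loss evaluated at a high quantile; the tau value below is an illustrative choice, not a prescribed setting:

import torch

def pinball_loss(pred_quantile, target, tau=0.95):
    """Quantile (pinball) loss at tau; a high tau penalizes under-prediction
    of delivery time far more than over-prediction, sharpening tail estimates."""
    diff = target - pred_quantile
    return torch.maximum(tau * diff, (tau - 1) * diff).mean()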
Follow-up Question 2: How would you prevent large fluctuations in ETAs when new information arrives?
Use a decision layer that includes a consistency regularization term. This term penalizes drastic changes in ETA unless justified by strong signals. Implement a multi-objective approach in the decision layer, balancing on-time accuracy with stable updates. Incorporate a time-decay strategy: as the order nears completion, updates can become less volatile to avoid confusing the user.
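A hypothetical form of the consistency term: penalize the squared change from the previously shown ETA, scaled down when strong new evidence justifies the revision. Both the evidence weighting scheme and lam are assumptions for illustration:

import torch

def consistency_penalty(new_eta, prev_eta, evidence_weight, lam=0.1):
    """Regularizer discouraging unjustified ETA revisions.

    evidence_weight in [0, 1]: near 1 when fresh signals (e.g., a courier delay)
    justify a change, shrinking the penalty; lam scales the regularizer.
    """
    return lam * (1.0 - evidence_weight) * (new_eta - prev_eta).pow(2).mean()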
Follow-up Question 3: How would you address data imbalance, such as rare grocery orders vs. frequent restaurant deliveries?
Multi-task modeling shares the backbone. The high-volume tasks help train robust feature representations. The smaller task heads fine-tune on grocery data. This approach transfers knowledge of fundamental delivery dynamics from frequent tasks. Augment grocery data through targeted sampling or domain adaptation. The shared backbone benefits from broad patterns learned at scale.
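Targeted sampling can be as simple as inverse-frequency example weights fed to a WeightedRandomSampler; the class labels and counts below are synthetic:

import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-example task labels: 0 = restaurant (frequent), 1 = grocery (rare).
task_labels = torch.cat([torch.zeros(9000), torch.ones(1000)]).long()

# Inverse-frequency weights oversample the rare grocery orders.
class_counts = torch.bincount(task_labels).float()
weights = (1.0 / class_counts)[task_labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# Pass sampler=sampler to a DataLoader to rebalance each epoch.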
Follow-up Question 4: How do you tune your distribution outputs for real-time use at large scale?
Limit the number of parameters in your neural network or prune unneeded layers to reduce inference time. Use model distillation to compress the model for production. Pre-calculate partial embeddings for static features. Employ efficient caching for repeated calls. Parallelize or batch queries through a shared service layer. Continuous monitoring of latency ensures the architecture meets production requirements.
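As one concrete compression option among the pruning and distillation strategies above, PyTorch's dynamic quantization converts Linear layers to int8 for faster CPU inference; a minimal sketch:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamic int8 quantization of Linear layers shrinks the model and reduces
# inference latency on CPU with no retraining required.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)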
Follow-up Question 5: Why did you switch from tree-based methods to deep learning when your data size grew?
Tree-based methods plateaued in accuracy and struggled with rare conditions. Deep networks capture complex feature interactions with embeddings. They adapt better to new scenarios with minimal re-engineering. Large volumes of data suit deep architectures, letting them discover hidden patterns. Neural networks can predict continuous distributions and support multi-task learning more naturally.