ML Case-study Interview Question: Transformer Post-Processing for Accurate Ride & Delivery ETAs
Case-Study Question
You are tasked with designing an Estimated Time of Arrival (ETA) prediction system for a global company that handles both rides and deliveries. A legacy routing engine provides route-based ETAs, but these estimates can be inaccurate due to real-world conditions such as traffic congestion or route deviations. You must build a post-processing machine learning system to produce more accurate ETAs on top of the routing engine’s outputs. The system must serve requests in real time with minimal latency. The goal is to support multiple lines of business around the globe, each of which might have different data distributions and different error tolerances.
Proposed Solution (Detailed)
System Overview
A routing engine provides raw ETAs by summing segment-wise traversal times along the recommended path. A post-processing model then predicts a residual correction. The final ETA is the sum of route_ETA and the predicted residual, floored at zero:
y_hat = max(route_ETA + residual, 0)
y_hat is the predicted estimated time of arrival.
route_ETA is the initial ETA from the routing engine.
residual is the correction value predicted by the machine learning model.
The max function ensures the final ETA is not negative.
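For concreteness, the post-processing step reduces to a single clamped addition. The sketch below is illustrative; the variable names are hypothetical:

```python
def final_eta(route_eta_seconds: float, predicted_residual_seconds: float) -> float:
    """Combine the routing engine's ETA with the model's residual correction.

    The residual may be negative (route ETA too pessimistic) or positive
    (too optimistic); the max(...) keeps the final ETA non-negative.
    """
    return max(route_eta_seconds + predicted_residual_seconds, 0.0)

# Example: route engine says 600 s, model predicts the trip runs 45 s long.
eta = final_eta(600.0, 45.0)  # 645.0 seconds
```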
This approach allows rapid iteration without modifying the routing engine itself. The post-processing model ingests features like origin, destination, timestamps, request type, and real-time signals. The serving latency must be a few milliseconds. The mean absolute error should improve over tree-based baselines, and the system must generalize across services worldwide.
Encoder-Decoder Transformer with Self-Attention
An encoder-decoder transformer architecture is used to capture complex feature interactions. The self-attention operation reweights every feature in the context of all the others. Suppose there are K input features, each embedded in dimension d. The standard transformer forms a scaled dot-product attention matrix of size K*K, which is computationally expensive. A linear transformer variant reduces the complexity from O(K^2 d) to O(K d^2), which pays off when K > d, for example K = 40 features each embedded in a small dimension d.
Self-attention clarifies how each feature (for instance, origin, destination, time of day, request type, traffic) influences the others. Each feature vector becomes a weighted sum of all feature vectors. This can dramatically improve predictions, especially for geospatial and temporal interactions.
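The following is a minimal sketch of the linear-attention idea in PyTorch, using the elu(x) + 1 feature map common in the linear-transformer literature; the tensor shapes and dimensions are illustrative, not the production configuration:

```python
import torch

def linear_self_attention(q, k, v, eps=1e-6):
    """Kernelized (linear) self-attention over K feature embeddings.

    q, k, v: tensors of shape (batch, K, d). Instead of materializing the
    K x K attention matrix (O(K^2 d)), we aggregate phi(K)^T V first, which
    costs O(K d^2) -- cheaper when the number of features K exceeds d.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1.0          # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bkd,bke->bde", k, v)                   # (batch, d, d) summary
    z = 1.0 / (torch.einsum("bkd,bd->bk", q, k.sum(dim=1)) + eps)  # per-feature normalizer
    return torch.einsum("bkd,bde,bk->bke", q, kv, z)          # (batch, K, d)

# Example: batch of 2 requests, K = 40 embedded features of dimension d = 8.
x = torch.randn(2, 40, 8)
out = linear_self_attention(x, x, x)   # same shape as the input
```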
Feature Embeddings and Bucketization
All continuous features are bucketized into quantile-based bins and then mapped into embeddings. Bucketizing reduces the work that the network must do to partition the input space. Embedding tables store discrete representations of each bucket. Categorical features also map to learned embeddings. Most parameters reside in these lookup tables.
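A minimal sketch of the bucketize-then-embed step, assuming PyTorch and illustrative bucket boundaries (in production the boundaries come from offline quantile computation over training data):

```python
import torch
import torch.nn as nn

class BucketizedEmbedding(nn.Module):
    """Quantile-bucketize a continuous feature, then look up a learned embedding."""

    def __init__(self, boundaries: torch.Tensor, embed_dim: int):
        super().__init__()
        # Quantile boundaries are precomputed offline from training data.
        self.register_buffer("boundaries", boundaries)
        self.table = nn.Embedding(len(boundaries) + 1, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bucket_ids = torch.bucketize(x, self.boundaries)  # integer bin per value
        return self.table(bucket_ids)                     # (batch, embed_dim)

# Example: embed a "trip distance in km" feature with illustrative boundaries.
emb = BucketizedEmbedding(torch.tensor([1.0, 2.5, 5.0, 10.0, 25.0]), embed_dim=8)
vectors = emb(torch.tensor([0.7, 12.3, 4.0]))  # shape (3, 8)
```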
Geospatial locations (latitude, longitude) receive special multi-resolution hashing. Multiple grid resolutions preserve both coarse-grained and fine-grained location details. Each grid cell gets hashed to a smaller embedding table. Multiple independent hashes reduce collisions while controlling overall memory usage.
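One way to sketch the multi-resolution idea is shown below; the grid cell sizes, table size, and hash function are illustrative placeholders, not the production scheme:

```python
import torch
import torch.nn as nn

class MultiResolutionGeoEmbedding(nn.Module):
    """Hash a (lat, lon) pair at several grid resolutions into small embedding tables.

    Each resolution has its own table; a collision at one resolution is offset by
    the others, and the resulting vectors are concatenated.
    """

    def __init__(self, cell_sizes_deg=(1.0, 0.1, 0.01), table_size=4096, dim=8):
        super().__init__()
        self.cell_sizes = cell_sizes_deg
        self.table_size = table_size
        self.tables = nn.ModuleList(
            [nn.Embedding(table_size, dim) for _ in cell_sizes_deg]
        )

    def forward(self, lat: torch.Tensor, lon: torch.Tensor) -> torch.Tensor:
        parts = []
        for cell, table in zip(self.cell_sizes, self.tables):
            row = torch.floor(lat / cell).long()
            col = torch.floor(lon / cell).long()
            idx = (row * 73856093 + col * 19349663) % self.table_size  # simple hash
            parts.append(table(idx))
        return torch.cat(parts, dim=-1)  # (batch, dim * num_resolutions)

# Example: embed two pickup locations.
geo = MultiResolutionGeoEmbedding()
vec = geo(torch.tensor([37.77, 40.71]), torch.tensor([-122.42, -74.00]))  # (2, 24)
```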
Bias Adjustment and Final Output
A bias adjustment layer in the decoder tailors the network output by adjusting for large differences across business segments or regions (for example, short or long trips, rides or deliveries). A fully connected decoder then predicts the residual, which is added to the routing engine’s output. A ReLU ensures positivity. Extreme outputs may be clamped to limit their impact.
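A sketch of the decoder path under these assumptions (a learned bias vector per business segment, a small fully connected head, and clamping of extreme residuals); the layer sizes and clamp value are illustrative:

```python
import torch
import torch.nn as nn

class ResidualDecoder(nn.Module):
    """Bias-adjusted decoder head that predicts the ETA residual."""

    def __init__(self, feat_dim: int, num_segments: int, max_residual: float = 1800.0):
        super().__init__()
        # One learned bias vector per business segment / region / trip-length bucket.
        self.segment_bias = nn.Embedding(num_segments, feat_dim)
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.max_residual = max_residual

    def forward(self, encoded: torch.Tensor, segment_id: torch.Tensor,
                route_eta: torch.Tensor) -> torch.Tensor:
        adjusted = encoded + self.segment_bias(segment_id)                # bias adjustment
        residual = self.head(adjusted).squeeze(-1)
        residual = residual.clamp(-self.max_residual, self.max_residual)  # limit outliers
        return torch.relu(route_eta + residual)                           # non-negative ETA
```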
Asymmetric Huber Loss for Training
The system needs to handle outliers differently depending on the line of business. An asymmetric Huber loss with parameters delta and omega is used:
$$L_{\text{asym}}(y, \hat{y}) = \begin{cases}
0.5 \cdot \omega \cdot (y - \hat{y})^{2}, & \text{if } |y - \hat{y}| < \delta \\
\delta \cdot \omega \cdot \bigl(|y - \hat{y}| - 0.5 \cdot \delta\bigr), & \text{otherwise}
\end{cases}$$
y is the ground truth, and y_hat is the model prediction. delta controls the transition between quadratic and linear regions. omega skews the loss to penalize overprediction or underprediction differently. This flexibility allows the same model to serve use cases needing robust means, quantiles, or other point estimates.
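A minimal sketch of this loss in PyTorch. One way to realize the asymmetry is to give omega a different value depending on the sign of the error, as assumed below; the exact parameterization used in production may differ:

```python
import torch

def asymmetric_huber(y, y_hat, delta: float, omega_under: float, omega_over: float):
    """Huber loss whose weight depends on the sign of the error.

    delta       -- transition point between the quadratic and linear regions.
    omega_under -- weight when the model underpredicts (y_hat < y).
    omega_over  -- weight when the model overpredicts (y_hat > y).
    """
    err = y - y_hat
    omega = torch.where(err > 0,
                        torch.as_tensor(omega_under),
                        torch.as_tensor(omega_over))
    quadratic = 0.5 * omega * err ** 2
    linear = delta * omega * (err.abs() - 0.5 * delta)
    return torch.where(err.abs() < delta, quadratic, linear).mean()

# Example: penalize overprediction twice as heavily as underprediction.
y = torch.tensor([600.0, 900.0])
y_hat = torch.tensor([630.0, 700.0])
loss = asymmetric_huber(y, y_hat, delta=60.0, omega_under=1.0, omega_over=2.0)
```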
Training Workflow
Offline training uses large-scale data from historical trips. Spark pipelines handle feature transformation, model training, assembly, validation, and deployment. Embeddings are learned through standard data-parallel stochastic gradient descent. Once validated, the model is pushed to the serving system, where it processes requests in real time.
Online Inference
A high-throughput prediction service hosts the trained model. The service takes the route engine’s ETAs, along with the relevant feature set (origin, destination, timestamp, request type), then executes the model’s sparse lookups in the embedding tables, followed by the shallow transformer layers, bias adjustment, and final output. Only a fraction of the parameters is accessed per request, which keeps the latency within a few milliseconds.
Follow-Up Question 1
How does bucketizing continuous features provide a benefit over passing raw numeric values directly, especially in deep learning systems?
Explanation
Networks can learn any function given sufficient parameters and data. Bucketizing frees the model from having to learn the partitioning boundaries from scratch. Each bucket has its own embedding. Quantile-based bucketization assigns roughly equal amounts of data to each bin, improving information density in these representations. This helps the model capture discontinuities or threshold effects in the underlying distribution more easily. It also encourages better generalization across rarely populated ranges.
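For concreteness, quantile boundaries can be computed offline in a few lines; the NumPy snippet below uses synthetic, purely illustrative data:

```python
import numpy as np

# Illustrative training values for one continuous feature (e.g., trip distance in km).
values = np.random.lognormal(mean=1.0, sigma=0.8, size=100_000)

# 16 equal-mass buckets: each boundary is a quantile of the training data,
# so every bucket receives roughly the same number of examples.
num_buckets = 16
boundaries = np.quantile(values, np.linspace(0, 1, num_buckets + 1)[1:-1])

bucket_ids = np.digitize(values, boundaries)  # integer bucket per value, 0..15
```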
Follow-Up Question 2
How do multi-resolution geospatial embeddings enhance ETA predictions?
Explanation
Locations have hierarchical patterns at different scales. A single high-resolution grid might have too many sparse cells. A single coarse grid might lose local details. Multiple grids allow the model to see both a coarse partition (region-level insights) and a fine partition (precise local patterns) simultaneously. Hash collisions are mitigated by hashing the coordinates to multiple embedding tables, letting the network combine partial information from each. This yields more robust location understanding with limited memory overhead.
Follow-Up Question 3
Why choose an encoder-decoder architecture with self-attention rather than a simpler feed-forward structure?
Explanation
A plain feed-forward network mixes features only through fixed learned weights, so the importance it assigns to each feature cannot adapt to the specific request. Self-attention directly reweights each feature by looking at every other feature. This is beneficial for highly correlated features, such as origin, destination, traffic signals, or time of day, since these interactions can be pivotal to accurately modeling arrival times. Self-attention captures these interactions explicitly in the attention computation, enabling dynamic feature importance. A linear transformer keeps the complexity manageable for real-time serving.
Follow-Up Question 4
How does an asymmetric Huber loss accommodate multiple business constraints within a single model?
Explanation
Different business lines can have distinct tolerances for overestimation versus underestimation, as well as different exposure to outliers. A single model can shift the loss function accordingly. Lowering delta moves the transition to the linear region earlier, which de-emphasizes large outliers, while weighting omega more heavily for one sign of the error penalizes that direction more strongly. This is crucial for tasks where being late carries a bigger penalty than being early (or vice versa). Tuning these hyperparameters customizes the same architecture for diverse cost functions, ensuring the final estimates remain practical for a wide range of applications.
Follow-Up Question 5
How does discretizing inputs and storing them as embedding tables keep inference latency low despite a large parameter count?
Explanation
Most parameters reside in embedding tables. Each request only indexes a small subset of these tables, leading to sparse lookups. Sparse lookups are O(1) per queried bucket. This avoids dense matrix multiplications. A shallow decoder plus a linear transformer encoder further reduces computations. The space-time tradeoff is: large tables are pre-learned, but at serving time, only a tiny slice is used per request. This cuts down inference latency dramatically compared to computing complex transformations of each feature in fully connected layers.
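A rough back-of-the-envelope calculation with hypothetical sizes illustrates the space-time tradeoff:

```python
# Hypothetical sizing, purely illustrative.
num_features   = 40          # embedded features per request
embed_dim      = 8           # dimension of each embedding
rows_per_table = 100_000     # buckets / hash slots per feature table

total_params   = num_features * rows_per_table * embed_dim   # ~32M parameters stored
params_touched = num_features * embed_dim                    # 320 values read per request

fraction = params_touched / total_params                     # ~1e-5 of the model per lookup
```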
Follow-Up Question 6
What are the practical challenges for maintaining this system in production, especially with changing traffic patterns or new business lines?
Explanation
Traffic patterns, new locations, and new service types can shift data distributions quickly. Periodic auto retraining and model validation are essential. Embedding tables may need expansions or new resolutions for novel regions. The bias adjustment layer must update to reflect new segments. Monitoring tools track feature drift, latency, and error metrics. An online platform typically automates data ingestion, scheduling retraining jobs, evaluating metrics, and pushing models to production. Careful deployment ensures minimal disruption and consistent performance.