ML Case-study Interview Question: Scalable Granular Forecasting: Efficient Ensemble Stacking with Ray
Case-Study Question
A food-delivery enterprise needs to forecast demand with high accuracy while keeping computational costs low. They have numerous granular forecasting targets (like city-level or even smaller region-level time series), causing standard model-selection frameworks with exhaustive grid search to run for hours and consume large amounts of compute resources. They seek a single solution that produces stable, accurate forecasts for many targets, and that also accommodates external covariates such as weather or holiday promotions. Propose an end-to-end system design to handle this. Include suggestions on data preprocessing, model architecture, an efficient training pipeline, and infrastructure choices.
Provide full details on how to:
Balance high accuracy and speed.
Aggregate forecasts from multiple candidate models.
Ensure maintainability, modularization, and transparency into how each model contributes to the final prediction.
Maintain system stability and handle edge cases.
Detailed Solution
Overview of the Proposed Approach
Use a two-layer ensemble design to capture different strengths from diverse base models. Replace exhaustive grid searches with a more efficient stacking scheme. Generate partial forecasts from faster base learners in parallel, then feed those forecasts into a secondary model that fuses them into a final prediction. Minimize repeated training and evaluate base forecasts using a specially designed time-based cross validation. Combine external features with stacked forecasts in the final layer.
Ensemble Forecast Formula
F_ensemble(t) = w_1*F_1(t) + w_2*F_2(t) + ... + w_N*F_N(t)
w_i: weight (or learned coefficient) assigned to the i-th base model. F_i(t): forecast generated by the i-th base model at time t. N: number of base models. t: time index (e.g., weekly, daily, etc.).
Base Layer
Train multiple candidate models in parallel. Each candidate differs by:
Underlying algorithm (e.g. ARIMA, Prophet, LightGBM).
Configuration settings (additive vs. multiplicative seasonality, holiday adjustments, outlier treatments).
Any customized logic (special handling for extreme weather or promotions).
Generate forecasts for each time horizon segment. Collect these partial forecasts as inputs to the top-layer model.
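As a minimal illustration, the candidate pool can be expressed as a list of configuration dictionaries, one per base learner; the keys below (model, seasonality_mode, and so on) are hypothetical and would map to whatever each model wrapper expects.

# Hypothetical candidate pool: each entry fully specifies one base learner.
base_model_params_list = [
    {"model": "prophet", "seasonality_mode": "additive", "holidays": True},
    {"model": "prophet", "seasonality_mode": "multiplicative", "holidays": True},
    {"model": "arima", "order": (2, 1, 2), "outlier_capping": True},
    {"model": "lightgbm", "lags": [1, 7, 14], "weather_features": True},
]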
Ensemble Layer
Fit a flexible model (could be a linear model or neural net) that learns how each base model performs across time segments. Use time-based splitting for validation to ensure no data leakage. Stack the forecasts from the base models, combine them with external covariates (e.g. holiday flags, weather signals), then fit the ensemble. Produce final predictions by applying the learned weights or coefficients to each base model’s forecast.
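A minimal sketch of the ensemble-layer fit, assuming the out-of-fold base forecasts and external covariates have already been assembled into a single frame; Ridge and the column-name arguments are illustrative choices, not a fixed API.

import pandas as pd
from sklearn.linear_model import Ridge

def fit_ensemble(stacked: pd.DataFrame, forecast_cols, covariate_cols, target_col="y"):
    """Fit the top-layer model on out-of-fold base forecasts plus covariates."""
    feature_cols = list(forecast_cols) + list(covariate_cols)
    ensemble = Ridge(alpha=1.0)  # regularization keeps the learned weights stable
    ensemble.fit(stacked[feature_cols], stacked[target_col])
    # Expose how much each base forecast and covariate contributes.
    weights = dict(zip(feature_cols, ensemble.coef_))
    return ensemble, weights

Calling it with forecast columns such as ["f_prophet", "f_arima"] and covariates such as ["holiday_flag", "temp"] returns both the fitted model and a readable weight per input, which keeps each base model's contribution transparent.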
Infrastructure and Parallelization
Distribute training at two levels:
Parallelize by target. Different time series (e.g. region A vs. region B) run in separate tasks.
Parallelize among base models inside each target’s workflow.
Use Ray for orchestration of distributed tasks. Spin up Ray clusters on your existing container infrastructure, minimizing overhead for data distribution and speeding up cross validation.
Example Code Snippet
import ray
import pandas as pd

@ray.remote
def train_base_model(data, model_params):
    # Placeholder: fit one base model (ARIMA, Prophet, LightGBM, ...) according
    # to model_params and return its partial forecasts for the required horizon.
    pass

def ensemble_forecast(train_data, test_data, base_model_params_list):
    # Train the base models in parallel; each Ray task returns one forecast vector.
    futures = [
        train_base_model.remote(train_data, params)
        for params in base_model_params_list
    ]
    base_forecasts = ray.get(futures)

    # Stack the forecasts: one column per base model, one row per time point.
    stacked_df = pd.DataFrame(base_forecasts).T

    # Fit the ensemble layer (a regularized linear model or a small ML model):
    #   y = w1*f1 + w2*f2 + ... + wN*fN (+ external covariates)
    # using time-based cross validation on train_data, then return the final
    # predictions for test_data.
    pass

# Example usage
if __name__ == "__main__":
    ray.init()
    train_data = ...  # your training time series
    test_data = ...   # your test time series
    base_model_params_list = [...]
    final_forecast = ensemble_forecast(train_data, test_data, base_model_params_list)
Model wrappers for each base learner ensure plug-and-play integration. Returning partial forecasts only, rather than refitting the entire pipeline, keeps the system fast and inexpensive.
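A minimal sketch of such a wrapper interface, assuming a simple fit/predict contract; the BaseForecaster name and method signatures are illustrative, not a fixed API.

from abc import ABC, abstractmethod
import pandas as pd

class BaseForecaster(ABC):
    """Illustrative plug-and-play contract that every base learner implements."""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def fit(self, history: pd.DataFrame) -> "BaseForecaster":
        """Train on the historical series (plus any covariates)."""

    @abstractmethod
    def predict(self, horizon: int) -> pd.Series:
        """Return a forecast indexed by future time steps."""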
Practical Benefits
Reduced Cost: Limited repeated training. Only generate partial forecasts from each model once, then combine them.
Better Accuracy: Each base learner contributes its strength, improving performance over any single model.
Scalability: Parallelize across thousands of granular time series. Ray-based clusters handle load with nested parallelization.
Maintainability: Simplify system checks. Fewer big training loops, fewer model-selection complexities.
Edge Cases
Uncommon data patterns might favor a specialized model for extreme behavior. The ensemble approach, by averaging across multiple models, may weaken specialized signals. Future enhancements could incorporate dynamic weighting or fallback models for rare but high-impact scenarios.
Follow-Up Question 1
How do you handle missing data or extreme outliers in time series before stacking base models?
Answer and Explanation
Use standardized preprocessing steps. Fill missing values with approaches such as forward-filling or interpolation if the gap is small. For large gaps, segment the time series, train local models, then reconcile. Handle outliers via capping, log transforms, or robust scalers. Feed the cleaned data into each base model to stabilize training. Ensure consistent transformations across all base learners so their forecasts remain comparable.
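A minimal preprocessing sketch in pandas, assuming a demand series y indexed by date; the gap threshold and capping quantiles are illustrative defaults, not recommendations.

import pandas as pd

def clean_series(y: pd.Series, max_gap: int = 3,
                 lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    # Fill only short gaps; longer gaps are left for segment-level handling.
    y = y.interpolate(limit=max_gap, limit_direction="forward")
    # Cap extreme outliers at the chosen quantiles so no single spike
    # dominates base-model training.
    lower, upper = y.quantile(lower_q), y.quantile(upper_q)
    return y.clip(lower=lower, upper=upper)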
Follow-Up Question 2
How do you decide the split for temporal cross validation in this stacking setup?
Answer and Explanation
Use rolling windows aligned with how data is generated in production. Suppose you have T total time points. Partition T into blocks such that older blocks train each base model, and the immediate next block tests that model. Stack each model’s predictions across folds to form a training set for the ensemble layer. Preserve chronological order so future information is never leaked backward.
Example: For weekly data from week 1 to week 52, train on weeks 1–30, forecast weeks 31–35; then train on weeks 1–35, forecast weeks 36–40; etc. The ensemble layer fits these forecast vs. actual pairs, learning the best combination rules for each time horizon.
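A minimal sketch of that rolling-window split, assuming weekly data indexed 1 through 52; the window sizes mirror the example above.

def rolling_splits(n_weeks=52, initial_train=30, horizon=5):
    """Yield (train_weeks, test_weeks) pairs in chronological order."""
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_weeks:
        train_weeks = list(range(1, train_end + 1))
        test_weeks = list(range(train_end + 1, train_end + horizon + 1))
        splits.append((train_weeks, test_weeks))
        train_end += horizon
    return splits

# First two folds: train weeks 1-30 / test 31-35, then train 1-35 / test 36-40.
for train_w, test_w in rolling_splits()[:2]:
    print(train_w[-1], test_w[0], test_w[-1])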
Follow-Up Question 3
How do you ensure the final ensemble predictions remain stable and not excessively oscillatory?
Answer and Explanation
Feed the stacked forecasts into a model (e.g. a regularized linear regressor) that emphasizes smoothness in coefficients. Use a small set of base models with complementary behaviors. Combine them with external features that explain sudden changes, reducing abrupt swings. Run validation checks that compare short-term and long-term predictions against continuity constraints. If large sudden jumps appear, investigate the base models contributing to that spike and adjust weights or introduce smoothing in the ensemble layer.
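A minimal continuity check of the kind described, assuming the final forecast is a pandas Series; the 20% jump threshold is an arbitrary illustration of such a guardrail.

import pandas as pd

def flag_large_jumps(forecast: pd.Series, max_rel_change: float = 0.20) -> pd.Series:
    # Relative change between consecutive forecast points.
    rel_change = forecast.pct_change().abs()
    # Time points whose jump exceeds the tolerance are flagged for inspection.
    return rel_change[rel_change > max_rel_change]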
Follow-Up Question 4
Why switch from a Spark-based pipeline to Ray?
Answer and Explanation
Ray provides more transparent logging, simpler debugging, and native Python concurrency that fits well with Python-based model code. Nested parallelization is easier to implement. Spark can obscure error messages when tasks fail on worker nodes. Ray addresses these issues by giving direct access to distributed objects and returning clear exception traces. This leads to faster iteration, lower operational overhead, and better resource utilization.
Follow-Up Question 5
How would you incorporate specialized deep learning models into the ensemble?
Answer and Explanation
Wrap the specialized architecture in the same base-model interface. For example, define a class that trains a neural net and outputs forecasts. Integrate it with identical input-output columns. The ensemble layer then treats that neural net as just another candidate. If this specialized model performs well in certain patterns, the ensemble weights it more heavily there. The rest of the framework remains unchanged, benefiting from the neural net’s strengths without rewriting the entire pipeline.
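A minimal sketch of wrapping a neural model behind the same fit/predict contract, using scikit-learn's MLPRegressor on lag features purely as a stand-in for whatever specialized architecture is actually chosen.

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

class NeuralForecaster:
    """Stand-in deep model exposing the same fit/predict contract as other wrappers."""

    def __init__(self, n_lags: int = 7, **nn_params):
        self.n_lags = n_lags
        self.model = MLPRegressor(**nn_params)

    def fit(self, y: pd.Series) -> "NeuralForecaster":
        # Build simple lag features: predict y[t] from y[t-1] ... y[t-n_lags].
        X = np.column_stack([y.shift(i).values for i in range(1, self.n_lags + 1)])
        mask = ~np.isnan(X).any(axis=1)
        self.model.fit(X[mask], y.values[mask])
        self._history = y
        return self

    def predict(self, horizon: int) -> np.ndarray:
        # Recursive one-step-ahead forecasting over the requested horizon.
        window = list(self._history.values[-self.n_lags:])
        preds = []
        for _ in range(horizon):
            x = np.array(window[-self.n_lags:][::-1]).reshape(1, -1)
            yhat = float(self.model.predict(x)[0])
            preds.append(yhat)
            window.append(yhat)
        return np.array(preds)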
Follow-Up Question 6
What if you need to forecast multiple horizons (e.g., 1-day ahead, 7-day ahead, 14-day ahead) at once?
Answer and Explanation
Produce separate partial forecasts from each base model for each future time offset. Create columns in the stacked data for horizon_1, horizon_7, horizon_14, etc. Train an ensemble layer that either learns a separate model per horizon or learns a multi-output approach. The same parallel strategy still applies, just with multiple sub-forecasts. Ensure the cross validation accounts for each horizon independently so the ensemble layer sees which base model works best at each lead time.
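A minimal sketch of a per-horizon ensemble, assuming the stacked validation frame carries one forecast column per (base model, horizon) pair; all column names are hypothetical.

import pandas as pd
from sklearn.linear_model import Ridge

HORIZONS = [1, 7, 14]

def fit_per_horizon_ensembles(stacked: pd.DataFrame,
                              base_models=("prophet", "arima", "lgbm")):
    """Fit one ensemble per lead time; returns {horizon: fitted model}."""
    ensembles = {}
    for h in HORIZONS:
        cols = [f"{m}_h{h}" for m in base_models]   # e.g. prophet_h7
        target = f"y_h{h}"                          # actual value h steps ahead
        ensembles[h] = Ridge(alpha=1.0).fit(stacked[cols], stacked[target])
    return ensembles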
Follow-Up Question 7
How do you mitigate the risk of choosing too many base learners and overfitting at the ensemble layer?
Answer and Explanation
Limit the number of candidate models or apply regularization methods in the ensemble. Use shrinkage (e.g. L2 regularization) to penalize the sum of squared coefficients. Perform a validation-based feature selection: drop base models that consistently underperform. Monitor out-of-sample error curves. If adding more base learners fails to improve validation metrics, stop adding them or reduce their weights. Keep an eye on correlation among base model forecasts. Highly correlated models offer redundant information and can overcomplicate the ensemble.
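A minimal redundancy check, assuming a frame whose columns are out-of-fold forecasts from each base model; the 0.98 cutoff is illustrative.

import pandas as pd

def highly_correlated_pairs(base_forecasts: pd.DataFrame, threshold: float = 0.98):
    """Return base-model pairs whose forecasts are nearly interchangeable."""
    corr = base_forecasts.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs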
Follow-Up Question 8
How do you manage sudden operational or data changes in production?
Answer and Explanation
Retrain periodically and adopt a champion-challenger approach. Keep an older ensemble pipeline as a fallback. Monitor error metrics in near real time. If data shifts (e.g. new region or major changes in customer behavior), schedule faster retraining or partial updates. Automate checks that compare recent performance with historical ranges. If the model drifts beyond tolerance, trigger a re-run of the ensemble pipeline on the new data. Keep logs for all retraining runs to pinpoint issues quickly.
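A minimal drift check, assuming daily MAPE values are logged over time; the tolerance band (historical mean plus three standard deviations) is one simple choice among many.

import pandas as pd

def drift_detected(recent_mape: pd.Series, historical_mape: pd.Series,
                   n_sigma: float = 3.0) -> bool:
    """Flag retraining when recent error leaves the historical tolerance band."""
    tolerance = historical_mape.mean() + n_sigma * historical_mape.std()
    return bool(recent_mape.mean() > tolerance)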
Follow-Up Question 9
What if specialized outlier models are needed for a highly unusual event?
Answer and Explanation
Incorporate a specialized model only for that event and add it to the base layer. Let the ensemble layer learn that model’s high contribution in those unusual intervals. If the outlier event is extremely rare, partition data or use a known event flag. Another approach is to detect the event beforehand, then route the forecast request to the specialized model directly or blend it more heavily. This ensures the ensemble can still handle general scenarios while the specialized model captures the rare extreme pattern.
Follow-Up Question 10
How does the system scale when the number of time series expands dramatically?
Answer and Explanation
Employ the nested parallelization design. Distribute each target to different nodes in the Ray cluster, then inside each node parallelize the base model training. Keep the base layer fast by using simpler methods or by partially fitting deep models on truncated training windows. If memory or node constraints arise, add more cluster nodes or reduce concurrency to avoid overhead. Ray's cluster autoscaling can adapt to varying loads, spinning up or down resources based on queue length or CPU usage.
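A minimal sketch of that nested layout, assuming the ensemble_forecast function from the earlier snippet; forecast_one_target, data, and target_ids are hypothetical names used only to show the fan-out pattern.

import ray

@ray.remote
def forecast_one_target(target_id, train_data, test_data, base_model_params_list):
    # Outer level: one Ray task per time series (city, region, ...).
    # Inner level: ensemble_forecast launches its own train_base_model tasks,
    # so the base learners for this target also run in parallel.
    return target_id, ensemble_forecast(train_data, test_data, base_model_params_list)

# Fan out across all targets, then gather the per-target forecasts.
futures = [
    forecast_one_target.remote(tid, data[tid]["train"], data[tid]["test"],
                               base_model_params_list)
    for tid in target_ids  # hypothetical collection of granular targets
]
results = dict(ray.get(futures))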
Follow-Up Question 11
How do you handle a scenario where different time series have different seasonality or data frequency?
Answer and Explanation
Allow the base models to set their own seasonality parameters or differ in how they handle daily vs. weekly frequency. Keep these details encapsulated within each base-model wrapper. During stacking, the ensemble layer only sees the forecast outputs plus any shared external features. This architecture permits each base model to adapt to the specific target’s frequency. If some series are daily and others weekly, ensure you keep separate pipelines or align them in consistent time units internally.
Follow-Up Question 12
How do you evaluate success and confirm that the system meets business requirements?
Answer and Explanation
Track key metrics (MAE, MAPE, RMSE) at each horizon. Compare them against baseline methods (simple lags, previous single-model pipelines). Monitor runtime and compute costs for the entire pipeline. Validate real-world performance with backtests over multiple time windows. Confirm that forecasts integrate smoothly with downstream services, such as resource planning or variance reduction for experiments. For final acceptance, show that the ensemble approach consistently outperforms old frameworks in both accuracy and speed under realistic workloads.
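A minimal sketch of the accuracy metrics, assuming aligned arrays of actuals and forecasts from one backtest window.

import numpy as np

def forecast_metrics(actual: np.ndarray, forecast: np.ndarray) -> dict:
    """MAE, RMSE, and MAPE over one backtest window."""
    err = forecast - actual
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Guard against zero actuals when computing MAPE.
    nonzero = actual != 0
    mape = np.mean(np.abs(err[nonzero] / actual[nonzero])) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}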