ML Interview Q Series: How would you design an occupancy forecasting model, gather training data, and evaluate its performance?

May 01, 2025

Comprehensive Explanation

Building a model that predicts hotel occupancy for a particular date involves treating occupancy rate as a continuous target. In practice, occupancy rates typically range from 0 to 1 if interpreted as a fraction, or from 0% to 100%. A suitable strategy is to collect relevant features (including historical occupancy data, seasonal patterns, local events, pricing fluctuations, and marketing information) and train a supervised regression model that forecasts occupancy on future dates. Another approach is to treat it as a time-series forecasting problem, leveraging either classical or deep learning methods.

Data Requirements

Data is central to creating a reliable forecast. For occupancy prediction, common data sources include historical occupancy rates, the total number of bookings made each day, cancellations, events or holidays, room availability, pricing schemes, and external factors such as weather. Integrating promotional campaigns, competitor actions, and traveler demand trends can further improve prediction accuracy.

Historical occupancy data for each hotel-date pair is particularly important because it captures recurrent temporal patterns and demand cycles. Many hotels exhibit significant seasonality due to vacation periods, holidays, or local festivals. Incorporating these aspects helps the model detect and quantify the impact of recurring spikes or dips in occupancy. Data on booking lead times, especially the distribution of how far in advance guests book, can also improve the model’s ability to predict occupancy for future dates.

Modeling Approaches

A typical way to formulate the problem is via a regression model, with occupancy as a continuous variable. When dealing with time-varying data, a time-series forecasting model, such as ARIMA or LSTM (Long Short-Term Memory) networks, can be employed. In more complex settings, ensembles of tree-based regressors (like Random Forest or Gradient Boosting) can yield robust performance if fed with the right time-dependent features.

Including features that capture temporal aspects (e.g., the day of the week, holiday flags, special events) and deriving lag features (e.g., occupancy from the previous day, week, or month) are common feature engineering steps. If deep learning architectures such as an LSTM or a Temporal Convolutional Network (TCN) are used, they can learn long-term dependencies automatically, although they may require larger amounts of data.
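As a minimal sketch of the feature engineering described above, the snippet below builds calendar and lag features with pandas. The column names (`date`, `occupancy_rate`), the helper `add_time_features`, and the synthetic data are assumptions made for illustration. The point to notice is that every lag is produced with `shift`, so a row only sees occupancy from earlier dates.

```python
import numpy as np
import pandas as pd

# Hypothetical daily history for one hotel (synthetic values for illustration).
dates = pd.date_range("2024-01-01", periods=60, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": dates,
    "occupancy_rate": rng.uniform(0.4, 0.95, size=len(dates)),
})

def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with calendar and lagged-occupancy features."""
    out = frame.sort_values("date").reset_index(drop=True).copy()
    out["day_of_week"] = out["date"].dt.dayofweek  # 0 = Monday
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["month"] = out["date"].dt.month
    # shift(k) moves values down k rows, so row t holds occupancy from day t - k.
    for lag in (1, 7, 28):
        out[f"occ_lag_{lag}"] = out["occupancy_rate"].shift(lag)
    # Mean over the previous 7 days, excluding the current day.
    out["occ_roll_mean_7"] = out["occupancy_rate"].shift(1).rolling(7).mean()
    return out

features = add_time_features(df)
# Rows without a full lag history contain NaN and are dropped before training.
train_ready = features.dropna().reset_index(drop=True)
```

Note that a lag of one day is only available when forecasting one day ahead. For a longer horizon, the smallest usable lag should be at least as long as the horizon.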

Example of a Regression Loss Function

One of the most common objective functions for regression-based occupancy forecasting is Mean Squared Error (MSE). A well-known measure derived from MSE is Root Mean Squared Error (RMSE). The formula for MSE can be expressed as follows:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

where n is the number of samples, y_i is the actual occupancy rate on the i-th sample, and \hat{y}_i is the predicted occupancy rate on the i-th sample. Minimizing MSE pushes the model to reduce the squared difference between the forecasted and actual values.

RMSE is simply the square root of MSE, which keeps it in the same unit as the original target:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

Model Evaluation Strategy

Predictive accuracy on unseen future dates is critical. To evaluate time-series predictions, it is recommended to use a rolling or sliding window approach that respects temporal ordering rather than random shuffling, ensuring that training is always performed on data preceding the test set. Common error metrics for occupancy forecasting include RMSE, Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). In some hotel management scenarios, MAPE can be especially intuitive, because management might care about percentage deviations in predicted occupancy.
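The evaluation strategy above can be sketched with plain NumPy. The function names (`rmse`, `mae`, `mape`, `expanding_window_splits`) and the toy numbers are assumptions for illustration, not part of any library. One caveat worth making explicit: MAPE divides by the actual value, so it becomes unstable when actual occupancy is close to zero; this sketch simply skips such rows.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred, eps=1e-8):
    """MAPE in percent; rows with |y_true| <= eps are skipped to avoid division by zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, test_idx) pairs where every training index precedes the test block."""
    for k in range(n_folds):
        test_end = n_samples - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        yield np.arange(0, test_start), np.arange(test_start, test_end)

# Toy example: every prediction is off by 0.1 in absolute terms.
y_true = [0.5, 0.8, 1.0, 0.4]
y_pred = [0.6, 0.7, 0.9, 0.5]
scores = {"rmse": rmse(y_true, y_pred), "mae": mae(y_true, y_pred), "mape": mape(y_true, y_pred)}
splits = list(expanding_window_splits(n_samples=10, n_folds=3, test_size=2))
```

In the toy example the absolute error is constant, yet the percentage error is largest on the low-occupancy day, which illustrates why MAPE and MAE can rank models differently.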

Potential Implementation Outline

Training a regression or time-series model in Python might follow these steps:

import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Assume we have a DataFrame df with columns:
# 'date', 'occupancy_rate', 'day_of_week', 'is_holiday', 'average_daily_rate', etc.

# Sort by date
df = df.sort_values('date')

# Create features and labels
features = ['day_of_week', 'is_holiday', 'average_daily_rate', 'lead_time', 'event_indicator']
X = df[features]
y = df['occupancy_rate']

# Time-based split
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RandomForestRegressor(n_estimators=100)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
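    # Take the square root explicitly: the `squared=False` argument was deprecated
    # and later removed in newer scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y_test, predictions))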
    print("Fold RMSE:", rmse)

The approach above uses a Random Forest regressor, but the same idea could be implemented with an LSTM or a gradient boosting model, where the primary difference would lie in how time-series features are fed or how the network is shaped.

Practical Considerations

A variety of real-world considerations can arise. In hotels, occupancy is influenced by unexpected events, major citywide conferences, and changes in marketing strategy. Data might also have missing values or outliers from partial system outages or measurement errors. Ensuring data quality and maintaining model retraining procedures are essential.

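Since occupancy has an upper bound (the total number of rooms), forecasts should respect that capacity constraint, for example by modeling a bounded target such as a logit-transformed rate. Quantile regression can additionally provide prediction intervals rather than a single point estimate, and hierarchical forecasting can keep room-type or property-level forecasts consistent with their totals. Proper hyperparameter tuning and regular model updates can help maintain accuracy over time.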

How to Handle Edge Cases

One subtlety arises in cases of limited historical data for a newly opened hotel. Bootstrapping methods or transferring knowledge from similar hotels in the same region can help. Another edge scenario involves abrupt shifts in external conditions, such as global travel restrictions. In that context, model performance heavily depends on how quickly retraining is performed or how effectively the model can adapt to distribution shifts.

Follow-up Questions

How would you incorporate external data such as local events or promotions into the model?

Such external variables can be encoded as additional features (e.g., event_indicator=1 if a major event is in town). Depending on data granularity, one might also create numerical features capturing expected event attendance or promotional campaign intensity, which could then be fed into a regression or time-series model. In a deep learning setup, one could add this information as separate input channels or embedding vectors. The goal is for the model to correlate these external factors with changes in occupancy.
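A minimal sketch of this encoding, assuming a hypothetical `events` table with one row per event and an `expected_attendance` column (both names are illustrative, as are the dates and numbers): events are aggregated to one row per date and left-joined onto the daily frame, so days without events get zeros.

```python
import pandas as pd

# Hypothetical daily frame and events calendar (illustrative values only).
daily = pd.DataFrame({"date": pd.date_range("2025-03-01", periods=7, freq="D")})
events = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-03", "2025-03-03", "2025-03-06"]),
    "expected_attendance": [12000, 3000, 800],
})

# Collapse multiple events on the same day into per-day totals.
per_day = events.groupby("date", as_index=False).agg(
    event_count=("expected_attendance", "count"),
    event_attendance=("expected_attendance", "sum"),
)

# Left join keeps every calendar day; days without events become 0.
enriched = daily.merge(per_day, on="date", how="left")
enriched[["event_count", "event_attendance"]] = (
    enriched[["event_count", "event_attendance"]].fillna(0).astype(int)
)
enriched["event_indicator"] = (enriched["event_count"] > 0).astype(int)
```

Keeping both the binary indicator and the attendance total lets the model distinguish a small local meetup from a citywide conference.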

Why might you prefer a time-series model like ARIMA over a standard regression model, or vice versa?

A purely time-series approach like ARIMA can capture autocorrelation and seasonality in the data with minimal domain knowledge. However, if you have many diverse covariates (pricing, competitor moves, campaigns, external data), a regression or a machine learning method that handles multiple features might be more suitable. Complex time-series architectures like LSTM can combine the best of both worlds by learning from both temporal patterns and exogenous features.
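One quick way to judge whether a purely time-series model has structure to exploit is to inspect the autocorrelation of the occupancy series at candidate seasonal lags. The sketch below is an illustration under stated assumptions: the helper `autocorr` and the perfectly repeating weekly pattern are invented for the example, and real data will be noisier.

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    x = np.asarray(series, float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic occupancy with a repeating weekly shape (busier toward the weekend).
weekly_pattern = [0.50, 0.50, 0.55, 0.60, 0.80, 0.95, 0.90]
occupancy = np.tile(weekly_pattern, 12)  # 12 identical weeks

acf_weekly = autocorr(occupancy, lag=7)
acf_midweek = autocorr(occupancy, lag=3)
```

A strong value at lag 7 and a weak or negative value at lag 3 suggests weekly seasonality, which either a seasonal time-series model or lag-7 features in a regression model could capture.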

What strategies would you use to mitigate cold start issues for newly opened hotels?

When a hotel has insufficient booking history, transfer learning can be employed by training a global model across multiple hotels. Parameters learned from established hotels, especially those in a similar market, can be fine-tuned for the new hotel. Alternatively, one could rely more on externally sourced signals (like market-level demand trends or competitor occupancy if available) until sufficient data for the new hotel is accumulated.
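The idea of leaning on market-level signals until the hotel has its own history can be illustrated with a simple shrinkage blend. This is a deliberately simpler stand-in for the transfer-learning approach described above, not an implementation of it; the function `blended_forecast` and the `prior_strength` parameter are hypothetical.

```python
def blended_forecast(hotel_mean, n_obs, market_mean, prior_strength=30.0):
    """Blend a hotel's own average occupancy with a market-level average.

    The weight on the hotel's own data is n_obs / (n_obs + prior_strength),
    so it is 0 with no history and approaches 1 as history accumulates.
    `prior_strength` (an assumed tuning knob) is the number of observations
    at which both sources are weighted equally.
    """
    if n_obs < 0 or prior_strength <= 0:
        raise ValueError("n_obs must be >= 0 and prior_strength must be > 0")
    weight = n_obs / (n_obs + prior_strength)
    return weight * hotel_mean + (1.0 - weight) * market_mean

# A brand-new hotel falls back entirely on the market average.
day_zero = blended_forecast(hotel_mean=0.0, n_obs=0, market_mean=0.70)
# After 30 observed days the two sources are weighted equally.
day_thirty = blended_forecast(hotel_mean=0.90, n_obs=30, market_mean=0.70)
# After a few years the hotel's own history dominates.
mature = blended_forecast(hotel_mean=0.90, n_obs=1000, market_mean=0.70)
```

The same weighting idea carries over to richer models, for example by blending the predictions of a global multi-hotel model with those of a hotel-specific model.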

How do you ensure your model remains accurate over time?

Regular model retraining is essential to address distribution shifts in the data. If the hotel changes its pricing drastically, or if economic conditions fluctuate, existing historical patterns might no longer apply. Monitoring forecast errors and implementing automated retraining pipelines can keep the model aligned with recent changes. Training on a rolling window or an expanding window is another viable way to adapt the model continuously.
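Error monitoring can be as simple as comparing recent forecast error against the error measured at validation time. The sketch below assumes a hypothetical helper `needs_retraining` with illustrative defaults (a 14-day window and a 1.5x tolerance); suitable thresholds depend on the business and are not prescribed by the text.

```python
import numpy as np

def needs_retraining(y_true, y_pred, baseline_mae, window=14, tolerance=1.5):
    """Return True when the MAE over the last `window` days exceeds
    `tolerance` times the MAE observed at validation time."""
    errors = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    if len(errors) < window:
        return False  # too few recent observations to judge
    recent_mae = errors[-window:].mean()
    return bool(recent_mae > tolerance * baseline_mae)

# Stable period: errors stay around the validation-time level of 0.03.
actual = np.full(30, 0.80)
stable_pred = actual - 0.03
# Drifted period: the last 14 forecasts miss by 0.10.
drifted_pred = stable_pred.copy()
drifted_pred[-14:] = actual[-14:] - 0.10

stable_flag = needs_retraining(actual, stable_pred, baseline_mae=0.03)
drift_flag = needs_retraining(actual, drifted_pred, baseline_mae=0.03)
```

A check like this can gate an automated retraining job, so the model is refreshed when errors degrade rather than only on a fixed schedule.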

How would you handle scenario-based forecasting?

Scenario-based forecasting might be needed if management wants to see how occupancy would respond to hypothetical strategies such as a 10% price cut or a new marketing campaign. This could be achieved by retraining or re-running inference with modified feature inputs that reflect the hypothetical condition (price=price*0.9 for the 10% discount case) and observing how the predicted occupancy changes. Advanced techniques like counterfactual analysis can also be integrated to estimate causal impact.
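A minimal sketch of re-running inference under a modified input, assuming a synthetic dataset and a plain least-squares linear model as a stand-in for whatever model is actually deployed. All names and coefficients are invented for illustration. The resulting uplift reflects the association the model learned between price and occupancy, which is not necessarily the causal effect of a price cut.

```python
import numpy as np

# Synthetic history: occupancy falls with price and rises on weekends.
rng = np.random.default_rng(42)
n = 200
price = rng.uniform(80, 160, size=n)
is_weekend = rng.integers(0, 2, size=n)
occupancy = 1.2 - 0.004 * price + 0.1 * is_weekend + rng.normal(0, 0.02, size=n)
occupancy = np.clip(occupancy, 0.0, 1.0)

def design(price, is_weekend):
    """Design matrix with an intercept, price, and weekend flag."""
    return np.column_stack([np.ones(len(price)), price, is_weekend])

# Fit the stand-in model by ordinary least squares.
coef, *_ = np.linalg.lstsq(design(price, is_weekend), occupancy, rcond=None)

def predict(price, is_weekend):
    """Predict occupancy and clip to the feasible [0, 1] range."""
    return np.clip(design(price, is_weekend) @ coef, 0.0, 1.0)

baseline = predict(price, is_weekend)
scenario = predict(price * 0.9, is_weekend)  # hypothetical 10% price cut
uplift = float(np.mean(scenario - baseline))
```

Comparing `scenario` with `baseline` row by row also shows where the cut has no effect, for instance on dates already predicted to be near full capacity.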

What if there is an upper limit on occupancy?

Since occupancy cannot exceed 100% (or 1.0 if it is expressed as a fraction), some practitioners might train the model on a logit-transformed version of the occupancy rate. This ensures predictions remain within feasible bounds once inverted back. Another approach is to clamp predictions at the maximum possible occupancy level, though a transformation-based approach is often more mathematically consistent.
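The transformation can be sketched as follows. One practical detail is that the logit is undefined at exactly 0 or 1, so rates are clipped slightly inside the interval before transforming; the margin `EPS` and the helper names `to_logit` and `from_logit` are assumptions for illustration.

```python
import numpy as np

EPS = 1e-3  # assumed clipping margin; fully booked days map to 0.999

def to_logit(rate, eps=EPS):
    """Map occupancy in [0, 1] to the real line, clipping away exact 0 and 1."""
    r = np.clip(np.asarray(rate, float), eps, 1.0 - eps)
    return np.log(r / (1.0 - r))

def from_logit(z):
    """Inverse transform (sigmoid): maps any real value back into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, float)))

rates = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
targets = to_logit(rates)         # train the regressor on these values
recovered = from_logit(targets)   # invert model outputs the same way

# Even an extreme raw model output stays within the feasible range.
extreme = from_logit(np.array([-8.0, 8.0]))
```

A regressor trained on `targets` can output any real number, and applying `from_logit` to its predictions guarantees a feasible occupancy rate without after-the-fact clamping.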

All of these considerations highlight how building a robust occupancy prediction system requires careful feature engineering, thoughtful model selection, appropriate evaluation strategies, and a plan for continual maintenance to address real-world complexities.

Below are additional follow-up questions

How do you handle situations where the ratio of out-of-date features to live features becomes large over time, leading to potential data leakage?

When a model leverages features that indirectly reveal future information, it can lead to data leakage. For instance, if an external dataset is not properly synchronized (e.g., a pricing feature updated after the booking date), it might inadvertently leak “future” knowledge. To mitigate this, you should:

Strictly Align Feature Timestamps: Ensure that each feature represents only the information available at the moment the prediction is made. If you are predicting occupancy for a date a month out, do not include any data captured after the time the forecast is issued, even if it was recorded before the stay date.
Maintain a Consistent Feature Update Process: The same pipeline that runs in production to feed real-time or near real-time features must be reflected in your historical data preparation. This avoids introducing signals in your training dataset that the model would not have at inference time.
Periodic Audits: Conduct regular audits on your data engineering pipeline. Validate that the time-based dependencies are logically correct and that you are not mixing past and future data in a way that the model gains unfair foresight.
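Timestamp alignment can be enforced mechanically with a point-in-time join. The sketch below uses `pandas.merge_asof` with `direction="backward"`, which attaches to each forecast request the most recent feature value recorded at or before the forecast time. The table and column names (`requests`, `price_log`, `forecast_time`, `updated_at`) and all values are hypothetical.

```python
import pandas as pd

# Each row is a forecast issued at `forecast_time` for a future `stay_date`.
requests = pd.DataFrame({
    "forecast_time": pd.to_datetime(["2025-05-01", "2025-05-10"]),
    "stay_date": pd.to_datetime(["2025-06-01", "2025-06-01"]),
})

# Price change log: each rate becomes known at `updated_at`.
price_log = pd.DataFrame({
    "updated_at": pd.to_datetime(["2025-04-20", "2025-05-05", "2025-05-20"]),
    "average_daily_rate": [120.0, 135.0, 150.0],
})

# Backward as-of join: for each request, take the latest price known at or
# before the forecast time. Both frames must be sorted by their join keys.
point_in_time = pd.merge_asof(
    requests.sort_values("forecast_time"),
    price_log.sort_values("updated_at"),
    left_on="forecast_time",
    right_on="updated_at",
    direction="backward",
)

# Audit check: no feature row may be newer than the forecast it feeds.
no_leakage = bool((point_in_time["updated_at"] <= point_in_time["forecast_time"]).all())
```

The rate updated on 2025-05-20 is never attached to either request, even though it precedes the stay date, because it was not known when those forecasts were issued.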