ML Case-study Interview Question: Two-Stage ML Ranking for Travel Lodging with Real-Time Pricing & Availability.
Case-Study question
A global travel platform faces a ranking challenge for lodging properties. The system must handle queries for destinations of widely varying sizes. The ranking engine operates under strict latency constraints, especially because each property's availability and pricing must be fetched live. The Company wants a two-stage process in which a smaller set of relevant candidates is pre-selected, then a more expensive re-ranking step is performed on that subset. Devise a comprehensive solution that includes:
How to implement candidate generation with machine learning or heuristic scores.
How to handle real-time availability and pricing constraints.
How to address the cold start problem to give new properties fair exposure.
How to measure success and optimize performance through experiments.
Provide all the details of your approach, including the data pipeline, modeling choices, and the final re-ranking. Propose suitable strategies for different markets (small and large) and ensure solutions scale with growing inventory.
Proposed Solution
Candidate generation narrows a large property set into a smaller subset. A final re-ranking step follows. Both steps demand low latency and high relevance.
Two-stage ranking pipeline
The full set of properties is often huge, especially for large search regions. Live availability and pricing checks for every property are expensive. Using a two-stage approach reduces the workload. First, a simpler model selects candidates. Second, a more complex model re-ranks those candidates using real-time price and availability signals.
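A rough sketch of this two-stage flow follows; the scoring and enrichment callables are placeholders standing in for the models described below, not a specific production implementation.

def two_stage_rank(query, inventory, cheap_score, enrich_live, rerank_score, k=300):
    # Stage 1: inexpensive scoring over the full inventory, static or cached features only
    candidates = sorted(inventory, key=lambda prop: cheap_score(query, prop), reverse=True)[:k]
    # Fetch live price and availability only for the k surviving candidates
    enriched = [enrich_live(prop) for prop in candidates]
    # Stage 2: expensive model over the enriched subset
    return sorted(enriched, key=lambda prop: rerank_score(query, prop), reverse=True)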
Machine learning candidate generation
Historical shopping logs provide training data. These logs contain search parameters, property features (static and dynamic), and observed ranks. Train a candidate-generation model to predict the final ranks. If final ranks are unavailable for all properties, training can still proceed on data from existing partial ranking methods, provided selection bias is not too severe.
Models can be linear, factorized, or deep neural networks. Deep networks often excel because they can jointly incorporate user context, property features, and approximate price or availability signals.
Relevancy can be binary or fractional:
Binary relevancy sets the label to 1 if rank <= n and 0 otherwise. Fractional relevancy can be 1/rank, which decays smoothly with position. Models learn to push relevant items up.
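A minimal labeling sketch, assuming logs with one row per (search, property) pair and an observed final rank; the cutoff n is a tunable assumption:

import pandas as pd

logs = pd.DataFrame({
    "search_id":   [1, 1, 1, 2, 2],
    "property_id": [11, 12, 13, 11, 14],
    "final_rank":  [1, 2, 3, 2, 1],
})

n = 2  # top-n cutoff for binary relevancy (assumed)
logs["binary_label"] = (logs["final_rank"] <= n).astype(int)
logs["fractional_label"] = 1.0 / logs["final_rank"]

Either label can feed a pointwise model; pairwise or listwise objectives are natural extensions when order within the top-n matters.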
After training, top-k items by predicted score form the candidate set. Use larger k in big destinations to ensure coverage.
Recall is a key metric. It measures how many top-ranked items from the final pass were correctly retrieved in the candidate set.
For a single query, relevant items can be the top-n final ranked properties. Predicted items are the top-k from the candidate generation. The fraction of overlap yields recall.
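A minimal recall-at-k helper under those definitions:

def recall_at_k(final_top_n, candidate_top_k):
    # Fraction of the final top-n that candidate generation retrieved
    relevant = set(final_top_n)
    return len(relevant & set(candidate_top_k)) / len(relevant) if relevant else 0.0

# Example: all 3 of the final top-3 appear among the candidates
recall_at_k([101, 102, 103], [102, 300, 103, 101])  # -> 1.0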
Dynamic features (like approximate price or time-sensitive signals) boost performance but increase complexity. Static-only features are easier to cache. A combined approach works best if user context and partial pricing or availability estimates are available.
Heuristic candidate generation
If no logs exist, build heuristic scores from past bookings, click-through rate (CTR), and conversion rate (CVR). Normalize by a visibility score to compensate for position or selection bias. If a property was rarely shown, you are less certain about its performance data.
To make the system explore new or underexposed properties, apply Upper Confidence Bound (UCB) logic: add a larger uncertainty bonus for less frequently shown items. This ensures better coverage of properties that might convert well but lack exposure.
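A minimal UCB-style sketch, assuming per-property impression and conversion counts are available; the exploration constant c is an assumed tuning knob:

import numpy as np

def ucb_score(conversions, impressions, total_impressions, c=2.0):
    # Empirical conversion rate
    cvr = conversions / max(impressions, 1)
    # Bonus grows for rarely shown properties, forcing exploration
    bonus = c * np.sqrt(np.log(total_impressions + 1) / max(impressions, 1))
    return cvr + bonus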
Cold start handling
New properties require fair visibility. They have little historical data, so they might get lost if models rely purely on past engagement metrics. Inject them into the candidate set through a boost mechanism. One method interleaves new and established properties, ensuring at least p new properties in the final subset. Another uses a geospatial indexing system to detect pockets of new listings and boost them proportionally. The fraction of new items is tracked to keep coverage consistent.
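A minimal interleaving sketch, assuming ranked lists of established and new properties and a floor of p new items in a k-item subset:

import random

def interleave_new(established, new, p, k):
    # Reserve at least p slots for new listings, fill the rest with established ones
    chosen_new = new[:p]
    chosen_old = established[: max(k - len(chosen_new), 0)]
    merged = chosen_old + chosen_new
    random.shuffle(merged)  # the second stage re-ranks, so order here is not final
    return merged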
Example Python snippet for candidate scoring
import numpy as np
import pandas as pd

# Toy shopping-log aggregates; in practice these come from the data pipeline
df = pd.DataFrame({
    "property_id": [101, 102, 103],
    "bookings":    [12, 3, 0],
    "clicks":      [340, 80, 15],
    "visibility":  [5000, 600, 40],  # number of times the property was shown
})

def candidate_score(row):
    # row contains features like bookings, clicks, visibility
    # Bookings carry more weight than clicks
    performance = row["bookings"] * 5 + row["clicks"]
    # Inflate scores for rarely shown properties to offset uncertainty
    adj = np.sqrt(1 / (row["visibility"] + 1e-5))
    return performance * adj

df["score"] = df.apply(candidate_score, axis=1)
df_sorted = df.sort_values("score", ascending=False)
candidates = df_sorted.head(300)
This code applies a simple heuristic formula. Bookings carry more weight than clicks, and the visibility adjustment inflates scores for rarely shown properties, giving under-exposed listings an exploration boost.
Final re-ranking
Properties from the candidate set are enriched with live availability and accurate pricing. A more complex ranking model then produces the final order. This final model can incorporate deep learning using search context, user preferences, dynamic prices, or discount flags.
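A minimal sketch of the second stage, assuming live rates are fetched only for the candidate set and model_score is a stand-in for the trained re-ranker:

def final_rank(candidates, live_rates, model_score):
    # candidates: dicts of static features; live_rates: property_id -> {"price", "available"}
    enriched = []
    for prop in candidates:
        rate = live_rates.get(prop["property_id"])
        if rate is None or not rate["available"]:
            continue  # drop sold-out or unpriced properties
        enriched.append({**prop, **rate})
    return sorted(enriched, key=model_score, reverse=True)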
Follow-up question 1: How would you ensure your trained model is not biased by pre-existing candidate generation?
Candidate generation biases training data by skipping unseen properties. Explain how to measure the impact of this bias and mitigate it.
Answer:
Train on multiple candidate generation thresholds. If data is only collected from the top-300 properties, enlarge to top-1000 for a portion of the data. Compare performance changes. If item distributions differ, adjust training or implement randomization. Calibrate your model on a more diverse sample, possibly by adding exploration. Evaluate recall for multiple k to confirm you retrieve the truly relevant items.
Follow-up question 2: How do you handle dynamic price fluctuations that occur between candidate generation and final ranking?
Answer:
Cache approximate pricing signals or keep them updated in near-real time. At final ranking, fetch exact live prices for only the candidate set. This ensures minimal overhead. If stale data creeps in, refresh it at short intervals or incorporate fallback logic. If exact price changes lead to big ranking shifts, refine dynamic features or shorten caching windows.
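A minimal TTL-cache sketch for approximate prices; fetch_fn stands in for a hypothetical live-pricing call:

import time

class PriceCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # property_id -> (price, fetched_at)

    def get(self, property_id, fetch_fn):
        entry = self._store.get(property_id)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]  # still fresh
        price = fetch_fn(property_id)  # refresh on stale or missing data
        self._store[property_id] = (price, time.time())
        return price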
Follow-up question 3: Why is recall at k relevant for measuring success, and when might you use other metrics?
Answer:
Recall at k gauges how many top-ranked results make it into your candidate set. If recall is too low, your final stage lacks key items. You might use normalized Discounted Cumulative Gain (nDCG) or Mean Average Precision (MAP) if the system cares about the exact order of retrieved items. When final rank positions matter strongly, nDCG is beneficial. For candidate generation, recall is enough because exact ranking is done in the second stage.
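A minimal nDCG-at-k helper, assuming graded relevances listed in predicted order:

import numpy as np

def ndcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, k + 2))  # positions 1..k discounted by log2(pos + 1)
    dcg = np.sum(rel[:k] / discounts[: len(rel[:k])])
    ideal = np.sort(rel)[::-1][:k]  # best achievable ordering
    idcg = np.sum(ideal / discounts[: len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0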
Follow-up question 4: How would you implement an online test to evaluate candidate generation changes?
Answer:
Run an A/B test. Split traffic so half uses your new candidate strategy, half uses the old one. Gather metrics on latency, recall, and overall conversion. Observe user engagement. A significant improvement in conversion or satisfied searches justifies rolling out the new method. Control for confounders, ensure both variants face similar conditions, and run tests for a sufficient duration to capture representative behavior.
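A minimal significance check on conversion, using a two-proportion z-test (the test choice and thresholds here are illustrative assumptions, not the platform's actual procedure):

from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, normal CDF
    return z, p_value

# Example: variant B converts 2.2% vs 2.0% on 100k users each
two_proportion_z(2000, 100000, 2200, 100000)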