ML Case-study Interview Question: ML-Driven Real-Time Grocery Availability with Hybrid Refresh and Dynamic Thresholds.
Case-Study question
A rapidly growing online grocery marketplace serves millions of items from more than 80,000 retail locations. During a sudden surge in orders triggered by an unexpected global event, accurate item availability predictions became challenging. Inventory systems varied across retailers, and frequent stockouts led to lower customer satisfaction. The company decided to design a Machine Learning-driven real-time availability (RTA) solution that could handle hundreds of millions of item predictions daily while keeping latency low and consistency high across multiple user-facing surfaces. Propose a complete system to solve for real-time availability, scale these predictions efficiently, ensure consistent in-app availability signals, and enable fast experimentation for new models. Discuss how you would store large volumes of predictions, how you would update them in near-real-time, how you would experiment with multiple ML models at once, and how you would handle the trade-off between maximizing selection and maintaining a high found rate. Include thresholds, dynamic adjustments, and any feedback loop approaches.
In-Depth Solution
System Architecture
Data arrives from diverse retail inventory feeds and is scored by a real-time ML service. A storage layer persists the model output. An ingestion pipeline updates scores in a database multiple times a day through a full sync, and on demand via a lazy refresh when items appear in search results. Downstream services read these scores from the database through low-latency lookups and receive consistent availability signals across surfaces. Centralizing scores this way enables bulk retrieval with SQL joins and avoids the overhead of many per-item real-time RPC calls.
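A minimal sketch of the bulk-retrieval pattern, assuming a hypothetical Postgres-style schema (catalog_items and item_availability_scores tables); real table and column names will differ.

BULK_SCORE_QUERY = """
    SELECT i.item_id, s.availability_score
    FROM catalog_items i
    JOIN item_availability_scores s ON s.item_id = i.item_id
    WHERE i.retailer_id = %s
"""

def fetch_scores_for_retailer(conn, retailer_id):
    # conn is any DB-API connection, for example one from psycopg2.connect().
    # One query returns scores for a retailer's whole catalog in a single
    # round trip instead of one RPC per item.
    with conn.cursor() as cur:
        cur.execute(BULK_SCORE_QUERY, (retailer_id,))
        return dict(cur.fetchall())  # {item_id: availability_score}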
Lazy Refresh vs. Full Sync
A full sync happens a few times per day to ensure rarely accessed items still get refreshed. Lazy refresh triggers when specific items appear in search results and exceed a certain age threshold. A background job reads these updates from a stream and upserts them. This hybrid model cuts ingestion load while keeping frequently accessed items fresh.
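A sketch of the lazy-refresh trigger, assuming a boto3 Kinesis client, a hypothetical stream name, an illustrative staleness threshold, and a hypothetical score_item call into the scoring service.

import json
import time
import boto3

kinesis = boto3.client("kinesis")
MAX_SCORE_AGE_SECONDS = 6 * 3600  # illustrative staleness threshold

def enqueue_stale_items(search_result_item_ids, score_metadata, score_item):
    # score_metadata: {item_id: last_scored_unix_ts}, read alongside the scores.
    # score_item: hypothetical call into the real-time ML scoring service.
    now = time.time()
    for item_id in search_result_item_ids:
        if now - score_metadata.get(item_id, 0) > MAX_SCORE_AGE_SECONDS:
            payload = {"item_id": item_id, "score": score_item(item_id)}
            # The background job described later reads this message and upserts it.
            kinesis.put_record(
                StreamName="rta-lazy-refresh",  # hypothetical stream name
                Data=json.dumps(payload).encode("utf-8"),
                PartitionKey=str(item_id),
            )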
Multi-Model Experimentation
Separate columns in the database store scores for each experiment model. A configuration file maps each model version to its column. Both full sync and lazy refresh processes use this mapping to keep multiple columns updated simultaneously. A feature flag toggles which column is used in real-time. This lets ML teams swap in a new model and collect metrics without waiting for complex engineering changes.
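A sketch of the model-to-column mapping and feature-flag lookup described above; the model names, column names, and flag store are illustrative assumptions.

# Maps each experiment model version to the database column that holds its
# scores; both full sync and lazy refresh write every mapped column.
MODEL_COLUMN_MAP = {
    "availability_v3": "score_v3",            # current production model
    "availability_v4_candidate": "score_v4",  # experiment candidate
}

def serving_column(feature_flags):
    # feature_flags: hypothetical flag store; the flag decides which column
    # is read at serving time without any code deployment.
    active_model = feature_flags.get("rta_active_model", "availability_v3")
    return MODEL_COLUMN_MAP[active_model]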
Threshold and Delta Framework
Thresholds define the cutoff for labeling an item as in-stock or out-of-stock. Different product segments, retailers, and user segments may require distinct thresholds. A base threshold is set for each ML model, discovered through experiments. Adjustments (deltas) shift the base threshold up or down for specific segments. Multiple deltas can stack to form a final threshold. This keeps the system modular, letting teams fine-tune certain segments without re-deriving an entire new global threshold.
Without this framework, the total number of threshold combinations is the product of all relevant segment dimensions (product segment, retailer, user segment), which grows large quickly. The delta framework avoids enumerating every combination by applying segment-specific adjustments on top of a single base threshold.
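A minimal sketch of threshold composition under these assumptions; the base threshold, segment keys, and delta values are illustrative.

BASE_THRESHOLDS = {"availability_v3": 0.55}  # illustrative, found via experiments

# Segment-specific adjustments; every delta that applies stacks on the base.
DELTAS = {
    ("retailer", "retailer_123"): -0.05,  # lenient: strong historical found rate
    ("category", "produce"): 0.03,        # strict: perishable, volatile stock
}

def final_threshold(model, segments):
    # segments: e.g. [("retailer", "retailer_123"), ("category", "produce")]
    threshold = BASE_THRESHOLDS[model]
    for segment in segments:
        threshold += DELTAS.get(segment, 0.0)
    return threshold

def is_in_stock(score, model, segments):
    return score >= final_threshold(model, segments)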
Dynamic Deltas
An offline optimization loop monitors selection, found rate, and user behavior over time. It adjusts segment-specific deltas to keep the overall system close to the optimal operating point. This can be done with bias-correction techniques or advanced methods like contextual bandits. The system gradually converges on thresholds that balance selection with minimal disappointment for each segment.
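A minimal sketch of one bias-correction step for a single segment's delta, assuming an illustrative found-rate target, step size, and delta cap; the production loop described above may instead use contextual bandits.

TARGET_FOUND_RATE = 0.93  # illustrative business target
STEP_SIZE = 0.02          # small step so thresholds move gradually
MAX_ABS_DELTA = 0.10      # keep any single segment's adjustment bounded

def update_segment_delta(current_delta, observed_found_rate):
    # Found rate below target: raise the threshold (hide marginal items).
    # Found rate above target: lower it to win back selection.
    error = TARGET_FOUND_RATE - observed_found_rate
    new_delta = current_delta + STEP_SIZE * error
    return max(-MAX_ABS_DELTA, min(MAX_ABS_DELTA, new_delta))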
Example Code Snippet for Lazy Score Refresh
import time

# Placeholder clients: `kinesis` stands in for a stream consumer (for
# example a Kinesis or Kafka client) and `database` for the data-access
# layer of the availability score table.
import kinesis
import database

def lazy_refresh_processor():
    # Continuously drain lazy-refresh messages and upsert fresh scores.
    while True:
        messages = kinesis.read_stream()
        for msg in messages:
            item_id = msg['item_id']
            new_score = msg['score']
            database.upsert_item_score(item_id, new_score)
        time.sleep(2)  # sleep before next poll
This illustrates how a background job can poll a stream, read item updates, and push them to the database.
Practical Observations
Scores change more frequently for popular items. Less popular items rarely need updates, so relying on a full sync for them is efficient. The system must maintain consistent states across surfaces to avoid user confusion. A core table in the database serves as the single source of truth.
Follow-up Question 1
How do you ensure the real-time availability scores stay consistent across different services?
Answer
Storing scores in a single database table keeps the data consistent. Each service queries this table and merges the score into its responses, either on demand or through scheduled refreshes. Using the same ingestion pipelines prevents stale discrepancies. A version identifier in the table ensures each service references the correct model version. If an item’s score is updated by a lazy refresh, the table is quickly updated, and all querying services see the fresh value. In-memory caches, if needed, must have a short Time to Live (TTL) or event-driven invalidation.
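A sketch of a short-TTL, version-checked cache in front of the shared table; the TTL value, cache shape, and read_from_db helper are assumptions.

import time

CACHE_TTL_SECONDS = 60  # illustrative short TTL to bound staleness
_cache = {}             # {item_id: (score, model_version, cached_at)}

def get_score(item_id, read_from_db, expected_model_version):
    entry = _cache.get(item_id)
    if (entry
            and time.time() - entry[2] < CACHE_TTL_SECONDS
            and entry[1] == expected_model_version):
        return entry[0]
    # Cache miss, expiry, or model-version mismatch: hit the shared table.
    score, model_version = read_from_db(item_id)
    _cache[item_id] = (score, model_version, time.time())
    return score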
Follow-up Question 2
How would you handle scaling when the number of items grows even larger and models become more complex?
Answer
Sharding the database by retailer or region reduces read and write contention. Horizontal scaling with multiple read replicas helps absorb high query traffic. Partitioning the ingestion pipeline by retailer or item categories ensures partial failures do not affect the entire pipeline. For the ML scoring service, parallelizing inference across multiple worker nodes handles higher throughput. For advanced load, specialized key-value stores or columnar systems can be used for ultra-low-latency retrieval. Efficient streaming services can handle spikes in refresh frequency.
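A sketch of deterministic shard routing by retailer, assuming a fixed shard count; production systems would more likely rely on consistent hashing or the datastore's native partitioning.

import hashlib

NUM_SHARDS = 32  # illustrative shard count

def shard_for_retailer(retailer_id):
    # Stable hash so a retailer's rows always land on the same shard,
    # isolating its read/write load from other retailers.
    digest = hashlib.md5(str(retailer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS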
Follow-up Question 3
How do you balance searching for new optimal thresholds versus the risk of frequently changing thresholds that might destabilize user experience?
Answer
Thresholds are changed gradually in the delta framework. A slower feedback loop allows time to gather stable metrics on how changes affect user behavior. The offline optimization observes patterns over days or weeks. If the system sees a sudden event that strongly shifts user behavior, the approach can accelerate threshold changes, but still in small increments. Feature flags help test new thresholds on a small percentage of traffic first. This prevents drastic changes from affecting all users at once.
Follow-up Question 4
How do you measure success and iterate on ML models when you have multiple objectives like selection, found rate, and retention?
Answer
Success is tracked by metrics such as order completion rates, item-level found rates, and long-term customer engagement. Each model version logs performance at the user segment level. The multi-model experimentation framework compares these results side by side using identical update frequencies. Over time, the best model is promoted, and the system collects more data to refine the next iteration. Retention metrics can be measured by comparing user cohorts exposed to different models.
Follow-up Question 5
How would you address data drift in these predictions if customer buying patterns or retailer stocks change unexpectedly?
Answer
Daily full sync captures any broad shifts, while lazy refresh rapidly updates specific items. Monitoring pipelines watch for mismatches between predicted availability and the actual found rate. Large deviations trigger retraining or threshold adjustments. The system can incorporate time-based weighting in the model to place more emphasis on recent data. Continual retraining ensures the model adapts to new trends or seasonal shifts. If the environment changes radically, the offline optimization loop can adjust deltas more aggressively.
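A minimal sketch of the monitoring check, assuming the two rates are already aggregated upstream and using an illustrative alert threshold.

DRIFT_ALERT_THRESHOLD = 0.05  # illustrative tolerated gap

def drift_detected(predicted_in_stock_rate, actual_found_rate):
    # Compare the share of items predicted in-stock against the share
    # shoppers actually found; a persistent large gap signals drift and
    # should trigger retraining or delta adjustments.
    return abs(predicted_in_stock_rate - actual_found_rate) > DRIFT_ALERT_THRESHOLD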
Follow-up Question 6
How does contextual bandit or similar online learning improve threshold optimization compared to manually tuning deltas?
Answer
Contextual bandits dynamically learn from real user interactions. The system can adjust item availability thresholds in near real-time based on reward signals such as basket completion or reorder rates. Instead of relying on static thresholds, the bandit framework explores different threshold variations, then exploits the best performing ones for each segment context. This reduces guesswork in manual tuning and can respond to shifts in user preferences or retailer dynamics automatically.
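A simplified epsilon-greedy sketch rather than a full contextual bandit; the candidate deltas, reward signal, and exploration rate are assumptions.

import random

EPSILON = 0.1  # illustrative exploration rate
CANDIDATE_DELTAS = [-0.05, 0.0, 0.05]

def choose_delta(segment, reward_history):
    # reward_history: {(segment, delta): [rewards, e.g. basket completions]}
    if random.random() < EPSILON:
        return random.choice(CANDIDATE_DELTAS)  # explore
    def avg_reward(delta):
        rewards = reward_history.get((segment, delta), [])
        return sum(rewards) / len(rewards) if rewards else 0.0
    return max(CANDIDATE_DELTAS, key=avg_reward)  # exploit the best so far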
Follow-up Question 7
How would you maintain data quality when integrating new retailers or reconfiguring the storage schema?
Answer
A well-defined data contract is mandatory for each new retailer’s feed. Validation checks reject malformed or incomplete entries. Feature flags control the roll-out of new schema changes. A separate migration process handles the addition of new columns or tables, ensuring backward compatibility. The multi-model storage approach is robust to schema updates because each model’s column is added independently. Metrics on ingestion error rates and missing data are closely tracked and automatically alerted.
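A sketch of one data-contract validation check with an illustrative set of required fields; a real contract would also cover types, ranges, and freshness.

REQUIRED_FIELDS = {"item_id", "retailer_id", "quantity", "updated_at"}  # illustrative

def validate_feed_record(record):
    # Reject malformed or incomplete entries before they reach ingestion.
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "quantity" in record and not isinstance(record["quantity"], int):
        errors.append("quantity must be an integer")
    return errors  # an empty list means the record passes the contract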
Follow-up Question 8
If query latency starts creeping up when joining large tables, what optimizations can you apply?
Answer
Precomputing join results with materialized views cuts repeated join overhead. A separate denormalized table for high-frequency lookups can hold the most needed attributes. Partition pruning on date or retailer keys reduces scan ranges. In-memory caching for short intervals can help. Query hints like index usage or parallel scans further reduce latency. Sharding large tables by geography or product category helps isolate hot partitions.
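A sketch of the precomputation idea as Postgres-style SQL held in Python constants; the table and column names are illustrative.

# Hypothetical materialized view joining items with their availability scores
# once, so hot read paths stop repeating the join.
CREATE_LOOKUP_VIEW = """
CREATE MATERIALIZED VIEW item_availability_lookup AS
SELECT i.item_id, i.retailer_id, s.availability_score
FROM catalog_items i
JOIN item_availability_scores s ON s.item_id = i.item_id;
"""

# Refreshed on a schedule, for example after each full sync.
REFRESH_LOOKUP_VIEW = "REFRESH MATERIALIZED VIEW item_availability_lookup;"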
Follow-up Question 9
How do you extend this framework if you later introduce new signals, like shelf sensor data or frequent updates from store employees?
Answer
A modular ingestion pipeline design allows new signals to merge into the central predictions flow. The multi-model architecture can incorporate additional features in the ML models. The threshold and delta system remains consistent because new signals just shift the distribution of predicted availability. Offline experiments determine how these signals impact the base threshold. Adding more columns for new models lets you assess the improvement before rolling it out to production.
Follow-up Question 10
What practical steps do you take to ensure your entire system is robust against downstream failures?
Answer
Retries and backoff handle temporary database or stream outages. Kinesis or similar streaming services can persist messages for a buffer period. Circuit breakers and fallback logic can serve older but still relevant data if fresh updates fail. Monitoring, alerting, and dashboards quickly notify the team of anomalies. A system health check can degrade gracefully by switching to a simpler availability estimate if the ML pipeline stalls.
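A minimal retry-with-backoff sketch; the attempt count, base delay, and jitter are illustrative defaults.

import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5):
    # Retry transient database or stream failures with exponential backoff
    # plus jitter so many workers do not retry in lockstep.
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; let circuit breaker / fallback take over
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)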
Follow-up Question 11
How do you reduce the risk of confounding factors when you run multiple experiments?
Answer
Every model or threshold experiment uses identical update frequencies to avoid biases from timing differences. Randomized traffic splits ensure each experiment sees a representative user mix. Metric tracking compares only slices that belong to the same population distribution. Statistical significance tests confirm genuine performance differences. The multi-model approach keeps all models synchronized to the same ingest pipeline, isolating differences to model behavior rather than ingestion timing.
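A sketch of a deterministic, hash-based traffic split; salting by experiment name (an assumption here) keeps concurrent experiments independent of one another.

import hashlib

def experiment_bucket(user_id, experiment_name, num_buckets=100):
    # The same user always lands in the same bucket for a given experiment,
    # while different experiments split traffic independently.
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % num_buckets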
Follow-up Question 12
When the system transitions from a failing threshold to a better one, how do you handle the possibility of partial updates or race conditions?
Answer
Transactions or atomic upserts ensure consistent writes to the database. The system might update all threshold columns first, then swap in the new threshold with a final atomic configuration change. Feature flags commit fully or roll back in case of errors. If partial updates occur, re-running the full sync ensures consistency. A short-lifetime cache or version checks guarantee stale thresholds are not served to users.
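A sketch of the atomic-upsert idea as Postgres-style SQL in a Python constant; the table and column names are illustrative.

# Hypothetical single-statement upsert: readers never observe a half-written
# score row, because the insert-or-update happens atomically per item.
UPSERT_SCORE = """
INSERT INTO item_availability_scores (item_id, score_v4, updated_at)
VALUES (%s, %s, NOW())
ON CONFLICT (item_id)
DO UPDATE SET score_v4 = EXCLUDED.score_v4, updated_at = EXCLUDED.updated_at;
"""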
Follow-up Question 13
How do you address user trust if real-time availability mistakenly hides items that are actually in stock or shows items that are out-of-stock?
Answer
Recovery starts with quick detection through signals like user complaints or shopper feedback. Automatic correction triggers a forced lazy refresh of suspicious items. The system can show fallback messages like “Might be low in stock” instead of outright hiding an item. Regular offline evaluation catches patterns where the model severely misjudges certain products or retailers, prompting improved training data or threshold adjustments. Consistency across surfaces is enforced, avoiding contradictory signals.
Follow-up Question 14
How do you measure long-term impact of these real-time predictions on customer loyalty and retention?
Answer
Cohort analysis tracks repeat orders, average order sizes, and churn rates. A/B tests measure how accurate availability predictions affect user satisfaction and ordering habits. Correlating improved found rates with higher repeat purchases or net promoter scores shows the long-term business value. If mispredictions erode trust, it appears in repeat metrics. A continuous feedback loop integrates these insights into model enhancements and threshold tuning.