ML Case-study Interview Question: Personalizing Recommendations for Repeat Listens: Heuristics vs. Neural Networks
Case-Study question
A large streaming platform observed that most users repeatedly listen to a small set of content. The product team introduced a new feature at the top of the homepage displaying six items that represent a user’s recent or heavily played music or podcasts. The goal is to let users quickly resume their familiar content without searching. How would you, as a Senior Data Scientist, design and build this personalized recommendation feature from scratch? Explain how you would (1) analyze usage data to prove there is a strong need for showing repeated items, (2) build a simple heuristic for these six recommendations, (3) compare it to an advanced neural network model, and (4) ensure reliable evaluation and production performance for hundreds of millions of users?
Detailed solution
Understanding repeated usage
Usage logs for the past month can reveal what fraction of streams comes from a small group of items. Engineers can check how often the same albums, playlists, or podcasts appear in a user's weekly history. If a large share of streams comes from a handful of familiar items, a dedicated space for these favorites would cut the effort of replaying them.
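A minimal sketch of this analysis, assuming a pandas DataFrame of play events with hypothetical columns user_id, week, and item_id:
import pandas as pd
def repeat_share(plays: pd.DataFrame, top_n: int = 6) -> float:
    # plays: one row per stream, with hypothetical columns user_id, week, item_id
    counts = (plays.groupby(['user_id', 'week', 'item_id'])
                   .size()
                   .rename('plays')
                   .reset_index())
    top = (counts.sort_values('plays', ascending=False)
                 .groupby(['user_id', 'week'])
                 .head(top_n))
    # Fraction of all streams that come from each user's weekly top_n items
    return top['plays'].sum() / counts['plays'].sum()
A high value (for example, above one half) supports dedicating homepage space to repeated items.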
Heuristic approach
A heuristic can rank items by a decay-weighted frequency. One option is to count how many times each item was played over the last 90 days, assigning greater weight to recent streams. Each play can be weighted exponentially as w = e^(-alpha * t), where t is the age of that play in days and alpha controls how quickly older plays decay. The top six items by total weight become candidates for the homepage. This method is straightforward, easy to tweak, and fast to serve.
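A minimal sketch of the heuristic, assuming each play record carries an item_id and an age_days field (days since that play); alpha and the 90-day cutoff are tunable:
import math
from collections import defaultdict
def top_items(plays, alpha=0.05, k=6):
    # plays: iterable of dicts with 'item_id' and 'age_days' for the last 90 days
    scores = defaultdict(float)
    for play in plays:
        scores[play['item_id']] += math.exp(-alpha * play['age_days'])  # w = e^(-alpha * t)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:k]]
# Example: top_items(user_plays_last_90_days) yields the six shelf candidates.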
Neural network model
When heuristics grow complicated, a trained model can output a probability for each item being played next. The training set includes sequences of user-item interactions. A typical approach uses dense layers to embed each play event’s metadata (timestamp, active vs passive plays, item IDs), then aggregates these embeddings per item to predict the probability of future plays.
Offline evaluation
Researchers use historical logs to see if the top six items chosen by the model appear in a user's subsequent sessions. A standard metric is Normalized Discounted Cumulative Gain:
NDCG@k_u = DCG@k_u / IDCG@k_u
where DCG@k_u measures the relevance of recommended items for user u, and IDCG@k_u is the best possible DCG@k for that user. If the neural network significantly outperforms heuristics in offline NDCG@k and coverage metrics, the model is a strong candidate for production.
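A minimal sketch of the metric, assuming binary relevance (1 if a recommended item was played in the user's subsequent sessions, 0 otherwise):
import math
def dcg_at_k(relevances, k):
    # Position i (0-indexed) contributes rel / log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
def ndcg_at_k(relevances, k=6):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
# Average ndcg_at_k over users to compare the heuristic and the model offline.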
Online experiments
A/B tests show different recommendation strategies to distinct user groups. One group sees the heuristic, another sees the model. Engineers measure how often users play items from the six-item shelf over a given period and verify that offline gains correlate with online uplift. If the model shows a notable boost, it replaces the heuristic.
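A minimal sketch of testing whether the model arm's click-through rate differs significantly from the heuristic arm's, using a standard two-proportion z-test on hypothetical click and exposure counts:
import math
from scipy.stats import norm
def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    # Pooled two-sided z-test for a difference in click-through rate between arms
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))  # z statistic and two-sided p-value
# Example: z, p = two_proportion_z_test(52000, 1_000_000, 53500, 1_000_000)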
Production and monitoring
Serving the final model requires pipelines for feature generation, model inference, and real-time updates. Monitoring dashboards track item coverage, user engagement, and potential data anomalies. Hourly batch pipelines emit statistics on the serving features held in storage. Automated alerts warn of suspicious drops in coverage or surges in error rates. When issues arise, logs and counters help isolate the root cause in upstream data, feature transformations, or the model itself.
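A minimal sketch of one such alert, assuming an hourly job records the fraction of active users who received a full six-item shelf; the thresholds are illustrative:
def check_coverage(hourly_coverage, floor=0.95, drop_tolerance=0.02):
    # hourly_coverage: fractions of users served a full six-item shelf, oldest to newest
    latest, previous = hourly_coverage[-1], hourly_coverage[-2]
    if latest < floor or (previous - latest) > drop_tolerance:
        return f"ALERT: coverage {latest:.3f} (previous {previous:.3f})"
    return "OK"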
Example training code snippet
import tensorflow as tf
# Suppose we have a feature dict: user_id, item_id, timestamp, active_flag
# We create embeddings for each feature and combine them in dense layers.
user_id_inp = tf.keras.Input(shape=(1,), name='user_id', dtype=tf.int32)
item_id_inp = tf.keras.Input(shape=(1,), name='item_id', dtype=tf.int32)
timestamp_inp = tf.keras.Input(shape=(1,), name='timestamp', dtype=tf.float32)
active_flag_inp = tf.keras.Input(shape=(1,), name='active_flag', dtype=tf.float32)
# Simple embedding layers (sizes and vocab for illustration only):
user_emb = tf.keras.layers.Embedding(input_dim=1000000, output_dim=32)(user_id_inp)
item_emb = tf.keras.layers.Embedding(input_dim=2000000, output_dim=32)(item_id_inp)
# Flatten embeddings:
user_vec = tf.keras.layers.Flatten()(user_emb)
item_vec = tf.keras.layers.Flatten()(item_emb)
# Combine everything:
concat_vec = tf.keras.layers.Concatenate()([user_vec, item_vec, timestamp_inp, active_flag_inp])
dense1 = tf.keras.layers.Dense(64, activation='relu')(concat_vec)
dense2 = tf.keras.layers.Dense(32, activation='relu')(dense1)
output = tf.keras.layers.Dense(1, activation='sigmoid')(dense2)
model = tf.keras.Model(inputs=[user_id_inp, item_id_inp, timestamp_inp, active_flag_inp], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')
# Train this model on historical data, predict the probability of a user replaying an item next.
# Then rank items by predicted score for the top 6 recommendations.
How to handle tricky follow-ups
Handling changing user behavior over the day
A user might listen to a news podcast in the morning, an energetic playlist mid-afternoon, and an audiobook at night. One solution is to incorporate time-of-day features in both the heuristic and the neural net. Timestamp features such as hour of day and day of week can capture these repeating patterns. If the data shows high variance in listening behavior across the day, the system can refresh predictions more frequently or maintain separate daypart models.
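A minimal sketch of cyclical time-of-day features, assuming hour_of_day is extracted from each play timestamp (the sine/cosine encoding is an illustrative choice, not the platform's actual feature set):
import numpy as np
def time_of_day_features(hour_of_day):
    # Encode the hour on a circle so 23:00 and 00:00 end up close together
    angle = 2 * np.pi * hour_of_day / 24.0
    return np.sin(angle), np.cos(angle)
# These two values can accompany or replace the raw timestamp input in the Keras model above.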
Deciding offline vs online experiments
Pure offline analysis is fast but does not reveal interaction changes caused by the new interface. Online A/B tests are more realistic but require careful rollout and enough traffic for significance. Teams typically start offline for rapid iteration. Models that look promising go through small-scale online tests before a full rollout. If the feature lifts engagement, it proceeds to production.
Ensuring system reliability
A real-time pipeline can fail if an upstream data feed breaks. Alerts must check for data completeness, e.g., verifying that the count of events per hour stays within normal ranges. A staged rollout for new models reduces risk. The old heuristic can remain as a fallback. If the model produces unexpected outputs for certain user segments, the system logs can show how features were computed, making it simpler to debug.
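A minimal sketch of such a completeness check, assuming a list of hourly event counts from the upstream feed; the trailing one-week window and three-sigma band are illustrative choices:
import statistics
def events_within_normal_range(hourly_counts, window=168, sigmas=3.0):
    # Compare the latest hourly event count against a trailing band of recent hours
    history, latest = hourly_counts[-window:-1], hourly_counts[-1]
    mean, stdev = statistics.mean(history), statistics.stdev(history)
    return abs(latest - mean) <= sigmas * stdev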
Justifying model complexity
A neural network’s potential benefit is better personalization at scale. It can learn nuanced patterns that simple frequency-based methods miss. If offline gains do not translate online, engineers check data coverage, training/test mismatches, or latency overhead. If a simpler heuristic is nearly as good and is more reliable, the heuristic might be enough until it fails to keep up with new personalization demands.