ML Interview Q Series: How would you use participant ratings of 100 new TV pilots to prioritize them on a streaming platform?
Comprehensive Explanation
One approach to extracting insights from the focus group data is to begin by computing basic descriptive statistics, then progressively refine the analysis to account for sources of bias and variability. These are some essential steps:
Initial Data Preparation
Aggregate each pilot’s ratings, capturing the total number of participants who rated it and the average rating. We might store this in a structure containing pilot_id, rating_count, sum_of_ratings, and average_rating. Since each of the 1000 participants rated 10 pilots, the total number of ratings recorded is 10,000. However, each pilot may not have the same number of ratings, since the assignment of 10 random pilots to participants might not be uniformly distributed.
Average Ratings and Confidence Intervals
A basic comparison of pilots can start with their average rating. For pilot i, let the rating from participant j be y_{ij}. If n_i participants rated pilot i, you can compute the mean rating as a straightforward sample mean. In plain text, an average rating for pilot i is: mean_i = (sum over j of y_{ij}) / n_i.
To incorporate the uncertainty in the estimate of each pilot’s mean rating, you might calculate a confidence interval around the mean to see how precise that estimate is. A typical 95% confidence interval for the mean rating might use the pilot’s sample variance and the t-distribution (if n_i is not extremely large). If the sample is moderately large, a z-approximation might be acceptable.
The estimated mean rating for pilot i is

$$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$$

where n_i is the total number of participants who rated pilot i, and y_{ij} is the rating participant j gave to pilot i.

The sample variance of ratings for pilot i,

$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2,$$

captures how spread out the ratings are around that pilot's average. If we assume approximate normality, we can derive a confidence interval for the mean:

$$\bar{y}_i \pm z_{\alpha/2} \sqrt{\frac{s_i^2}{n_i}}$$

where z_{alpha/2} is typically around 1.96 for 95% coverage under a normal approximation. This interval provides a sense of how confident we are in the average rating estimate.
Adjusting for Participant Bias
It’s possible that some participants consistently give higher or lower scores relative to others. For example, some might rate almost every pilot between 8 and 10, while others might use the full range. To account for this, you can consider each participant’s rating offset and correct for individual bias.
One method is to transform each participant’s ratings by subtracting that participant’s personal mean rating across the 10 pilots they viewed. This yields a centered rating that captures how a participant's score for a specific pilot deviates from their own average. After centering, the pilot’s average can be recalculated based on these adjusted ratings.
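As a minimal sketch of that centering step, assuming the same df (with columns participant_id, pilot_id, rating) used in the implementation example later in this answer:

# Center each rating by the participant's own mean across the pilots they rated
df['participant_mean'] = df.groupby('participant_id')['rating'].transform('mean')
df['centered_rating'] = df['rating'] - df['participant_mean']

# Recompute each pilot's average on the bias-adjusted ratings
adjusted_pilot_means = (
    df.groupby('pilot_id')['centered_rating']
      .mean()
      .sort_values(ascending=False)
)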
You could also pursue a more complex hierarchical or mixed-effects model, where you treat participant bias and pilot rating as separate effects. This type of model more systematically captures differences among participants and among pilots.
Ranking the Pilots
Once you have corrected for potential biases, you can rank the pilots by their adjusted average rating, or by a confidence bound if you prefer a more conservative (lower bound) or more optimistic (upper bound) ranking. Ranking by the lower confidence bound is a common conservative choice: a pilot whose lower bound exceeds another pilot's has a strong average that is also supported by enough data, so you can be more confident it is genuinely performing better.
Alternatively, a Bayesian approach can incorporate prior beliefs (e.g., all new pilots start from an assumed baseline rating) and then update these beliefs with the observed data. This approach often includes shrinkage, which can help avoid overconfidence in pilots that were rated by fewer participants.
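To convey the flavor of that shrinkage without fitting a full Bayesian model, you can blend each pilot's observed mean with the global mean, weighted by the number of ratings. This is only a sketch, not a full posterior computation; it assumes the same df used in the implementation example below, and m is a hypothetical prior strength acting like a pseudo-count of ratings at the global mean:

# Shrink each pilot's observed mean toward the global mean
global_mean = df['rating'].mean()
m = 20  # hypothetical prior strength; larger values pull small-sample pilots harder toward the global mean

shrunk = (
    df.groupby('pilot_id')['rating']
      .agg(['count', 'mean'])
      .assign(shrunk_mean=lambda s: (s['count'] * s['mean'] + m * global_mean) / (s['count'] + m))
      .sort_values('shrunk_mean', ascending=False)
)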
Identifying Outliers
Certain participants might have outlier behavior in their ratings. Some might give all 10s, while others might give all 1s. You can perform an outlier analysis at both the participant and the pilot level. If a pilot receives unusually polarizing ratings, that might be valuable information (it could indicate a cult-classic style show that only appeals to a subset of watchers). At the participant level, extremely inconsistent rating patterns might be excluded or down-weighted if you believe they do not represent genuine feedback.
Implementation Example in Python
import numpy as np
import pandas as pd
from scipy import stats
# Suppose we have a DataFrame: df
# with columns: ['participant_id', 'pilot_id', 'rating']
# 1. Compute average ratings per pilot
pilot_stats = df.groupby('pilot_id').agg(
rating_count=('rating','count'),
rating_mean=('rating','mean'),
rating_var=('rating','var')
).reset_index()
# 2. Compute 95% confidence intervals (assuming normal approximation)
z = 1.96
pilot_stats['std_error'] = np.sqrt(pilot_stats['rating_var'] / pilot_stats['rating_count'])
pilot_stats['ci_lower'] = pilot_stats['rating_mean'] - z * pilot_stats['std_error']
pilot_stats['ci_upper'] = pilot_stats['rating_mean'] + z * pilot_stats['std_error']
# 3. Sort by mean rating or by lower confidence bound
pilot_stats_sorted = pilot_stats.sort_values('rating_mean', ascending=False)
# or
pilot_stats_lower_sorted = pilot_stats.sort_values('ci_lower', ascending=False)
# pilot_stats_lower_sorted lists pilots with the highest lower confidence bound first
Practical Considerations
Analyzing such focus group data often involves thinking about the broader strategy for show selection. Real viewing behavior can differ from focus group evaluations. Some series might attract niche audiences or have the potential to generate buzz even if the median rating isn’t the highest. Thus, purely numerical ranking is only part of the picture.
Possible Follow-up Questions
How would you handle the fact that each pilot is not rated by the same number of participants, which might bias certain average ratings?
One way is to use statistical methods that explicitly account for each pilot's sample size. For instance, a Bayesian approach with a prior typically shrinks the mean estimates of pilots with fewer ratings toward the overall mean. From a frequentist perspective, confidence intervals naturally become wider for pilots with fewer ratings. If certain pilots have far fewer raters than average, it may be useful to discount them or collect additional ratings before drawing conclusions.
How do you decide whether to remove or adjust outlier ratings, given that extreme values might sometimes be valid?
It often depends on the study’s objective. If you have reason to suspect that certain participant responses are not genuine, you might remove them. However, if extreme feedback is a genuine consumer reaction, you keep it. A robust approach is to perform a sensitivity analysis both with and without outliers to see how much the recommendations change. If the final ranking is drastically altered, you should investigate further. If it remains similar, outliers might not be exerting a significant effect.
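One way to make that sensitivity analysis concrete is to compare the pilot rankings computed with and without the suspect ratings, for example via a rank correlation. The sketch below assumes a boolean is_outlier column has already been added to df by whatever outlier rule you chose:

from scipy.stats import spearmanr

# Pilot means with and without the flagged ratings
means_all = df.groupby('pilot_id')['rating'].mean()
means_clean = df[~df['is_outlier']].groupby('pilot_id')['rating'].mean()

# Align on pilots present in both versions and compare the two rankings
common = means_all.index.intersection(means_clean.index)
rho, p_value = spearmanr(means_all.loc[common], means_clean.loc[common])
print(f"Spearman correlation between rankings with vs. without outliers: {rho:.3f}")

A rho close to 1 suggests the outliers barely move the ranking; a noticeably lower value signals that the final recommendation hinges on how you treat them.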
If there is a tendency for some participants to give higher or lower scores in general, how exactly do you incorporate participant-level effects in a formal model?
You can introduce random intercepts for participants in a mixed-effects model. In such a model, each pilot gets a fixed effect representing its “true” quality, and each participant gets a random effect that captures how much they deviate from the overall mean rating. Formally, you could write:
pilot_rating = global_mean + alpha_pilot + beta_participant + error
where alpha_pilot is a fixed effect for each pilot, and beta_participant is a random effect for each participant. Fitting this model using maximum likelihood or Bayesian methods allows you to isolate the effect of each pilot’s quality from each participant’s rating bias.
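With data in the df format from the implementation example, one way to fit such a model is with statsmodels' MixedLM, using pilots as fixed effects and a random intercept per participant. This is a sketch under those assumptions, not a tuned specification:

import statsmodels.formula.api as smf

# Random intercept per participant captures individual rating bias;
# C(pilot_id) gives each pilot its own fixed effect (relative to a baseline pilot)
model = smf.mixedlm("rating ~ C(pilot_id)", data=df, groups=df["participant_id"])
result = model.fit()
print(result.summary())

The fitted pilot coefficients can then be used to rank pilots on quality with participant bias partialled out.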
How might you interpret a situation in which a particular pilot receives a very large variance in its ratings?
A large variance suggests polarizing opinions. Some participants might love it while others dislike it. This can still be valuable information. Shows that create buzz due to controversy or strong differences of opinion can be successful in niche markets. One approach is to segment participants who liked it from those who didn’t and examine their demographics or preferences. You might then consider a more targeted marketing or distribution strategy. Or you might combine the high variance with the pilot’s average rating or median rating to gauge its overall potential appeal.
Could you apply a recommender system technique like matrix factorization to this focus group data?
Yes, you can treat the pilot ratings as part of a user-item matrix where users are participants and items are pilots. Even with the sparse nature of each participant rating only 10 shows, you could attempt matrix factorization or latent factor models. Such models can help uncover relationships between pilots and participants that aren’t apparent through a simple mean rating. For instance, it might reveal that certain genres or features are popular with specific user segments. However, given the limited coverage of the 10 random pilots per user, the matrix would be quite sparse, so additional data or constraints might be needed to get robust latent factors.
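As a rough illustration, one could factorize the sparse participant-by-pilot matrix with plain stochastic gradient descent. The sketch below assumes the df DataFrame from earlier and uses small, untuned hyperparameters:

import numpy as np

# Map participant and pilot ids to row/column indices
participants = {p: i for i, p in enumerate(df['participant_id'].unique())}
pilots = {p: i for i, p in enumerate(df['pilot_id'].unique())}

k, lr, reg, epochs = 5, 0.01, 0.1, 50   # latent dimension and hyperparameters (untuned)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(len(participants), k))   # participant factors
V = rng.normal(scale=0.1, size=(len(pilots), k))         # pilot factors

triples = [(participants[r.participant_id], pilots[r.pilot_id], r.rating)
           for r in df.itertuples()]

for _ in range(epochs):
    for u, v, rating in triples:
        err = rating - U[u] @ V[v]
        u_old = U[u].copy()
        U[u] += lr * (err * V[v] - reg * U[u])
        V[v] += lr * (err * u_old - reg * V[v])

# One crude summary: each pilot's predicted rating for the "average" participant
pilot_scores = U.mean(axis=0) @ V.T

In practice you would also model a global offset and per-participant/per-pilot bias terms, and with only 10 ratings per participant the factors should be heavily regularized or kept very low-dimensional.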
Below are additional follow-up questions
How do you account for varying engagement levels among participants, such as those who watched only part of a pilot before rating it?
One subtlety is that participants might differ in how attentively they watch each pilot. For instance, some participants could lose interest halfway through and provide a hurried rating. If we treat all ratings equally, we risk over- or underestimating a pilot's true quality.
Potential Pitfalls and Real-World Issues
Partial Exposure Bias: Participants who didn’t complete a pilot might rate lower or higher arbitrarily based on incomplete impressions.
Unlogged Exits: Some participants might leave early and never submit a final rating, skewing the data if those partial watchers systematically differ from those who remain till the end.
Overestimation of Engagement: If many participants watch only half a pilot, you might incorrectly conclude the pilot is boring or unengaging, when it could be that participants were simply short on time.
Possible Approaches
Filter Out Partial Watchers: In the simplest case, drop ratings from those who didn’t watch a significant portion of the pilot. Though this reduces data points, it can yield cleaner comparisons.
Weighting by Engagement: Another option is to weight each rating by the fraction of the pilot watched, so a rating from someone who watched 80% counts more than one from someone who watched only 20% (see the sketch after this list).
Additional Features: Track the total watch time and incorporate it into a combined metric, e.g., a blended rating plus watch duration factor.
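A minimal sketch of the weighting idea, assuming df also carries a hypothetical fraction_watched column between 0 and 1:

import numpy as np

# Weight each rating by how much of the pilot the participant actually watched
weighted_means = (
    df.groupby('pilot_id')
      .apply(lambda g: np.average(g['rating'], weights=g['fraction_watched']))
      .rename('engagement_weighted_mean')
      .sort_values(ascending=False)
)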
What if the rating scale is ordinal (1-10) rather than truly numeric, and some participants only used a narrow range (e.g., 7-9)?
Ordinal ratings imply that the difference between a “7” and an “8” is not necessarily a fixed, meaningful quantity, so standard arithmetic on the raw scores can mislead. A participant who rarely uses values below 6 effectively compresses their scale into a narrow band.
Potential Pitfalls and Real-World Issues
Invalid Aggregation: Simply computing a mean rating of 7.5 vs. 8.0 might not truly reflect a meaningful difference in perceived quality for ordinal scales.
Participant-Specific Bias: One participant’s 8 could be another participant’s 5 in relative terms.
Possible Approaches
Ordinal Models: Use statistical techniques (e.g., ordinal regression) that respect the rank-based nature of the scale; instead of treating all one-point differences on a 1–10 scale as equal, model the ratings as ordered categories (see the sketch after this list).
Distribution Analysis: Inspect how each participant uses the range. You might attempt a normalizing transformation—e.g., centering around each participant’s median rating—to standardize usage of the scale before aggregating.
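As a sketch of the ordinal-model route, assuming a reasonably recent statsmodels version and the same df as before, a proportional-odds model can be fit with OrderedModel, using pilot identity as the only predictor:

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Dummy-code pilots; ratings are treated as ordered categories rather than numbers
exog = pd.get_dummies(df['pilot_id'], prefix='pilot', drop_first=True).astype(float)
model = OrderedModel(df['rating'], exog, distr='logit')
result = model.fit(method='bfgs', disp=False)

# Larger pilot coefficients indicate pilots that shift the rating distribution upward
print(result.params.head())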
How would you address possible correlations between pilot characteristics (e.g., genre or cast) and participant demographic groups?
A hidden factor might be that certain demographic groups prefer, say, comedies over dramas, or have a bias toward well-known actors.
Potential Pitfalls and Real-World Issues
Confounded Ratings: If a large portion of participants for a specific pilot are from a demographic that strongly favors its genre, the pilot may get inflated ratings relative to the general audience.
Skewed Sampling: The random assignment of pilots to participants might still end up with correlated patterns if not carefully stratified.
Possible Approaches
Stratified Random Sampling: Before the focus group, ensure that each pilot is shown to a demographically representative subset of participants.
Post-stratification Weighting: After collecting data, reweight the ratings by demographic proportions so that the final aggregated score reflects a more balanced view of the target audience (see the sketch after this list).
Hierarchical Modeling with Demographics: Build a regression or mixed-effects model where you include participant-specific demographics as additional covariates to partial out demographic preferences.
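A rough sketch of post-stratification weighting, assuming df has a hypothetical demographic_group column and that population_share gives each group's share in the target audience:

import numpy as np

# Hypothetical target-audience composition (e.g., from subscriber or census data)
population_share = {'18-29': 0.30, '30-44': 0.35, '45+': 0.35}

# Weight = target share / observed share, so over-represented groups count less
sample_share = df['demographic_group'].value_counts(normalize=True)
df['ps_weight'] = df['demographic_group'].map(lambda g: population_share[g] / sample_share[g])

# Demographically adjusted pilot means
adjusted_means = (
    df.groupby('pilot_id')
      .apply(lambda g: np.average(g['rating'], weights=g['ps_weight']))
      .sort_values(ascending=False)
)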
What if the distribution of ratings for some pilots is strongly non-normal, with heavy tails or strong skewness?
A standard mean and standard deviation approach assumes roughly normal-like behavior. But real-world ratings can exhibit heavy-tailed distributions, especially for polarizing content.
Potential Pitfalls and Real-World Issues
Misleading Means and Standard Deviations: Means can be heavily influenced by a small number of extreme ratings.
Bi-modal or Multi-modal Distributions: Some pilots could get mostly very high or very low ratings with fewer “middle” ratings, invalidating typical “average plus standard deviation” analyses.
Possible Approaches
Robust Statistics: Use medians or trimmed means that reduce the impact of extreme outliers.
Distribution-Fitting: If data is clearly heavy-tailed, consider a distribution like a skewed logistic or a beta distribution for rating data, and estimate parameters accordingly.
Non-parametric Inference: Use rank-based or bootstrapping methods for constructing confidence intervals around medians or percentiles.
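For instance, a percentile-bootstrap confidence interval for a single pilot's median rating could look like the sketch below, where ratings is assumed to be the array of ratings for one pilot:

import numpy as np

def bootstrap_median_ci(ratings, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of one pilot's ratings."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    medians = np.array([
        np.median(rng.choice(ratings, size=len(ratings), replace=True))
        for _ in range(n_boot)
    ])
    return np.percentile(medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example with a small, made-up set of polarized ratings
print(bootstrap_median_ci([9, 10, 2, 8, 9, 1, 10, 9]))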
How might you handle pilots that have only a handful of ratings and thus lack statistical reliability?
Random assignment might not yield uniform coverage, so some pilots might end up with far fewer than the average number of ratings.
Potential Pitfalls and Real-World Issues
Excessive Variance: A pilot with only 5 ratings might show an average rating of 9.0, but that estimate is highly unstable and could shift substantially with just a few more ratings.
Uncertain Comparison: It’s risky to compare a pilot with 5 ratings to one with 200 ratings purely by average.
Possible Approaches
Minimum Rating Threshold: Exclude or discount pilots with too few ratings until additional data is collected.
Bayesian Shrinkage: Start with a prior (such as an overall average rating across all pilots) and pull the observed mean for small-n pilots toward that prior to avoid overconfidence.
Separate Analysis: Group pilots by rating count and consider a separate analysis or data-collection plan for low-sample pilots.
How can you detect and address the possibility that certain participants are “speed rating” or inattentive during the focus group?
Some participants might rush through the process, giving potentially random or meaningless scores.
Potential Pitfalls and Real-World Issues
Uniform Ratings: A participant who gives every pilot the same score within a very short time frame might not have truly watched them.
Misleading Extremes: An inattentive participant might use only the lowest or highest possible rating for convenience.
Possible Approaches
Time-based Filtering: Track the time it takes each participant to go through each pilot. If it’s unreasonably short, consider dropping or flagging that participant’s ratings.
Consistency Checks: Look for suspicious patterns, such as rating every pilot identically or in a repeating sequence of scores (see the sketch after this list).
Engagement Quizzes: In real-world studies, you can incorporate short quizzes about the pilot content to verify participant engagement. Inattentive participants who consistently fail can have their data removed or down-weighted.
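A simple screening pass along those lines, assuming df also records a hypothetical seconds_to_rate column for each rating:

# Per-participant summaries: rating spread and typical time spent before rating
participant_stats = df.groupby('participant_id').agg(
    rating_std=('rating', 'std'),
    median_seconds=('seconds_to_rate', 'median'),
)

# Flag participants who gave identical scores everywhere or rated implausibly fast
suspicious = participant_stats[
    (participant_stats['rating_std'].fillna(0) == 0)
    | (participant_stats['median_seconds'] < 30)   # hypothetical threshold in seconds
]

# Drop (or down-weight) their ratings before re-aggregating pilot scores
df_clean = df[~df['participant_id'].isin(suspicious.index)]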