ML Interview Q Series: How would you use participant ratings of 100 new TV pilots to prioritize them on a streaming platform?
Comprehensive Explanation
One approach to extracting insights from the focus group data is to begin by computing basic descriptive statistics, then progressively refine the analysis to account for sources of bias and variability. These are some essential steps:
Initial Data Preparation
Aggregate each pilot’s ratings, capturing the total number of participants who rated it and the average rating. We might store this in a structure containing pilot_id, rating_count, sum_of_ratings, and average_rating. Since each of the 1000 participants rated 10 pilots, the total number of ratings recorded is 10,000. However, each pilot may not have the same number of ratings, since the assignment of 10 random pilots to participants might not be uniformly distributed.
Average Ratings and Confidence Intervals
A basic comparison of pilots can start with their average rating. For pilot i, let the rating from participant j be y_{ij}. If n_i participants rated pilot i, you can compute the mean rating as a straightforward sample mean. In plain text, an average rating for pilot i is: mean_i = (sum over j of y_{ij}) / n_i.
To incorporate the uncertainty in the estimate of each pilot’s mean rating, you might calculate a confidence interval around the mean to see how precise that estimate is. A typical 95% confidence interval for the mean rating might use the pilot’s sample variance and the t-distribution (if n_i is not extremely large). If the sample is moderately large, a z-approximation might be acceptable.
The estimated mean rating for pilot i is

$$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$$

where n_i is the total number of participants who rated pilot i, and y_{ij} is the rating participant j gave to pilot i.

The sample variance of ratings for pilot i,

$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2,$$

captures how spread out the ratings are around that pilot's average. If we assume approximate normality, we can derive a confidence interval for the mean:

$$\bar{y}_i \pm z_{\alpha/2} \sqrt{\frac{s_i^2}{n_i}}$$

where z_{alpha/2} is typically around 1.96 for 95% coverage under a normal approximation. This interval provides a sense of how confident we are in the average rating estimate.
Adjusting for Participant Bias
It’s possible that some participants consistently give higher or lower scores relative to others. For example, some might rate almost every pilot between 8 and 10, while others might use the full range. To account for this, you can consider each participant’s rating offset and correct for individual bias.
One method is to transform each participant’s ratings by subtracting that participant’s personal mean rating across the 10 pilots they viewed. This yields a centered rating that captures how a participant's score for a specific pilot deviates from their own average. After centering, the pilot’s average can be recalculated based on these adjusted ratings.
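As a minimal sketch of that centering step, assuming the same df (with columns participant_id, pilot_id, rating) used in the implementation example later in this answer:

# Center each rating by the participant's own mean across the pilots they rated
df['participant_mean'] = df.groupby('participant_id')['rating'].transform('mean')
df['centered_rating'] = df['rating'] - df['participant_mean']

# Recompute each pilot's average on the bias-adjusted ratings
adjusted_pilot_means = (
    df.groupby('pilot_id')['centered_rating']
      .mean()
      .sort_values(ascending=False)
)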
You could also pursue a more complex hierarchical or mixed-effects model, where you treat participant bias and pilot rating as separate effects. This type of model more systematically captures differences among participants and among pilots.
Ranking the Pilots
Once you have corrected for potential biases, you can rank the pilots by their adjusted average rating, or by a confidence bound if you prefer a more conservative (lower bound) or more optimistic (upper bound) ranking. Ranking by the lower confidence bound is a common conservative choice: a pilot whose lower bound exceeds another pilot's has a strong average that is also supported by enough data, so you can be more confident it is genuinely performing better.
Alternatively, a Bayesian approach can incorporate prior beliefs (e.g., all new pilots start from an assumed baseline rating) and then update these beliefs with the observed data. This approach often includes shrinkage, which can help avoid overconfidence in pilots that were rated by fewer participants.
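To convey the flavor of that shrinkage without fitting a full Bayesian model, you can blend each pilot's observed mean with the global mean, weighted by the number of ratings. This is only a sketch, not a full posterior computation; it assumes the same df used in the implementation example below, and m is a hypothetical prior strength acting like a pseudo-count of ratings at the global mean:

# Shrink each pilot's observed mean toward the global mean
global_mean = df['rating'].mean()
m = 20  # hypothetical prior strength; larger values pull small-sample pilots harder toward the global mean

shrunk = (
    df.groupby('pilot_id')['rating']
      .agg(['count', 'mean'])
      .assign(shrunk_mean=lambda s: (s['count'] * s['mean'] + m * global_mean) / (s['count'] + m))
      .sort_values('shrunk_mean', ascending=False)
)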
Identifying Outliers
Certain participants might have outlier behavior in their ratings. Some might give all 10s, while others might give all 1s. You can perform an outlier analysis at both the participant and the pilot level. If a pilot receives unusually polarizing ratings, that might be valuable information (it could indicate a cult-classic style show that only appeals to a subset of watchers). At the participant level, extremely inconsistent rating patterns might be excluded or down-weighted if you believe they do not represent genuine feedback.
Implementation Example in Python
import numpy as np
import pandas as pd
from scipy import stats
# Suppose we have a DataFrame: df
# with columns: ['participant_id', 'pilot_id', 'rating']
# 1. Compute average ratings per pilot
pilot_stats = df.groupby('pilot_id').agg(
rating_count=('rating','count'),
rating_mean=('rating','mean'),
rating_var=('rating','var')
).reset_index()
# 2. Compute 95% confidence intervals (assuming normal approximation)
z = 1.96
pilot_stats['std_error'] = np.sqrt(pilot_stats['rating_var'] / pilot_stats['rating_count'])
pilot_stats['ci_lower'] = pilot_stats['rating_mean'] - z * pilot_stats['std_error']
pilot_stats['ci_upper'] = pilot_stats['rating_mean'] + z * pilot_stats['std_error']
# 3. Sort by mean rating or by lower confidence bound
pilot_stats_sorted = pilot_stats.sort_values('rating_mean', ascending=False)
# or
pilot_stats_lower_sorted = pilot_stats.sort_values('ci_lower', ascending=False)
# pilot_stats_lower_sorted lists pilots with the highest lower confidence bound first
Practical Considerations
Analyzing such focus group data often involves thinking about the broader strategy for show selection. Real viewing behavior can differ from focus group evaluations. Some series might attract niche audiences or have the potential to generate buzz even if the median rating isn’t the highest. Thus, purely numerical ranking is only part of the picture.
Possible Follow-up Questions
How would you handle the fact that each pilot is not rated by the same number of participants, which might bias certain average ratings?
One way is to use statistical methods that explicitly account for each pilot's sample size. For instance, a Bayesian approach with a prior typically shrinks the mean estimates of pilots with fewer ratings toward the overall mean. From a frequentist perspective, confidence intervals naturally become wider for pilots with fewer ratings. If certain pilots have far fewer raters than average, it may be useful to discount them or collect additional ratings before drawing conclusions.
How do you decide whether to remove or adjust outlier ratings, given that extreme values might sometimes be valid?
It often depends on the study’s objective. If you have reason to suspect that certain participant responses are not genuine, you might remove them. However, if extreme feedback is a genuine consumer reaction, you keep it. A robust approach is to perform a sensitivity analysis both with and without outliers to see how much the recommendations change. If the final ranking is drastically altered, you should investigate further. If it remains similar, outliers might not be exerting a significant effect.
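One way to make that sensitivity analysis concrete is to compare the pilot rankings computed with and without the suspect ratings, for example via a rank correlation. The sketch below assumes a boolean is_outlier column has already been added to df by whatever outlier rule you chose:

from scipy.stats import spearmanr

# Pilot means with and without the flagged ratings
means_all = df.groupby('pilot_id')['rating'].mean()
means_clean = df[~df['is_outlier']].groupby('pilot_id')['rating'].mean()

# Align on pilots present in both versions and compare the two rankings
common = means_all.index.intersection(means_clean.index)
rho, p_value = spearmanr(means_all.loc[common], means_clean.loc[common])
print(f"Spearman correlation between rankings with vs. without outliers: {rho:.3f}")

A rho close to 1 suggests the outliers barely move the ranking; a noticeably lower value signals that the final recommendation hinges on how you treat them.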
If there is a tendency for some participants to give higher or lower scores in general, how exactly do you incorporate participant-level effects in a formal model?
You can introduce random intercepts for participants in a mixed-effects model. In such a model, each pilot gets a fixed effect representing its “true” quality, and each participant gets a random effect that captures how much they deviate from the overall mean rating. Formally, you could write:
pilot_rating = global_mean + alpha_pilot + beta_participant + error
where alpha_pilot is a fixed effect for each pilot, and beta_participant is a random effect for each participant. Fitting this model using maximum likelihood or Bayesian methods allows you to isolate the effect of each pilot’s quality from each participant’s rating bias.
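With data in the df format from the implementation example, one way to fit such a model is with statsmodels' MixedLM, using pilots as fixed effects and a random intercept per participant. This is a sketch under those assumptions, not a tuned specification:

import statsmodels.formula.api as smf

# Random intercept per participant captures individual rating bias;
# C(pilot_id) gives each pilot its own fixed effect (relative to a baseline pilot)
model = smf.mixedlm("rating ~ C(pilot_id)", data=df, groups=df["participant_id"])
result = model.fit()
print(result.summary())

The fitted pilot coefficients can then be used to rank pilots on quality with participant bias partialled out.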
How might you interpret a situation in which a particular pilot receives a very large variance in its ratings?
A large variance suggests polarizing opinions. Some participants might love it while others dislike it. This can still be valuable information. Shows that create buzz due to controversy or strong differences of opinion can be successful in niche markets. One approach is to segment participants who liked it from those who didn’t and examine their demographics or preferences. You might then consider a more targeted marketing or distribution strategy. Or you might combine the high variance with the pilot’s average rating or median rating to gauge its overall potential appeal.
Could you apply a recommender system technique like matrix factorization to this focus group data?
Yes, you can treat the pilot ratings as part of a user-item matrix where users are participants and items are pilots. Even with the sparse nature of each participant rating only 10 shows, you could attempt matrix factorization or latent factor models. Such models can help uncover relationships between pilots and participants that aren’t apparent through a simple mean rating. For instance, it might reveal that certain genres or features are popular with specific user segments. However, given the limited coverage of the 10 random pilots per user, the matrix would be quite sparse, so additional data or constraints might be needed to get robust latent factors.
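As a rough illustration, one could factorize the sparse participant-by-pilot matrix with plain stochastic gradient descent. The sketch below assumes the df DataFrame from earlier and uses small, untuned hyperparameters:

import numpy as np

# Map participant and pilot ids to row/column indices
participants = {p: i for i, p in enumerate(df['participant_id'].unique())}
pilots = {p: i for i, p in enumerate(df['pilot_id'].unique())}

k, lr, reg, epochs = 5, 0.01, 0.1, 50   # latent dimension and hyperparameters (untuned)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(len(participants), k))   # participant factors
V = rng.normal(scale=0.1, size=(len(pilots), k))         # pilot factors

triples = [(participants[r.participant_id], pilots[r.pilot_id], r.rating)
           for r in df.itertuples()]

for _ in range(epochs):
    for u, v, rating in triples:
        err = rating - U[u] @ V[v]
        u_old = U[u].copy()
        U[u] += lr * (err * V[v] - reg * U[u])
        V[v] += lr * (err * u_old - reg * V[v])

# One crude summary: each pilot's predicted rating for the "average" participant
pilot_scores = U.mean(axis=0) @ V.T

In practice you would also model a global offset and per-participant/per-pilot bias terms, and with only 10 ratings per participant the factors should be heavily regularized or kept very low-dimensional.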
Below are additional follow-up questions
How do you account for varying engagement levels among participants, such as those who watched only part of a pilot before rating it?
One subtlety is that participants might differ in how attentively they watch each pilot. For instance, some participants could lose interest halfway through and provide a hurried rating. If we treat all ratings equally, we risk over- or underestimating a pilot's true quality.
Potential Pitfalls and Real-World Issues
Partial Exposure Bias: Participants who didn’t complete a pilot might rate lower or higher arbitrarily based on incomplete impressions.
Unlogged Exits: Some participants might leave early and never submit a final rating, skewing the data if those partial watchers systematically differ from those who remain till the end.
Overestimation of Engagement: If many participants watch only half a pilot, you might incorrectly conclude the pilot is boring or unengaging, when it could be that participants were simply short on time.
Possible Approaches
Filter Out Partial Watchers: In the simplest case, drop ratings from those who didn’t watch a significant portion of the pilot. Though this reduces data points, it can yield cleaner comparisons.
Weighting by Engagement: Another option is to weight each rating by the fraction of the pilot watched, so a rating from someone who watched 80% counts more than one from someone who watched only 20% (see the sketch after this list).
Additional Features: Track the total watch time and incorporate it into a combined metric, e.g., a blended rating plus watch duration factor.
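A minimal sketch of the weighting idea, assuming df also carries a hypothetical fraction_watched column between 0 and 1:

import numpy as np

# Weight each rating by how much of the pilot the participant actually watched
weighted_means = (
    df.groupby('pilot_id')
      .apply(lambda g: np.average(g['rating'], weights=g['fraction_watched']))
      .rename('engagement_weighted_mean')
      .sort_values(ascending=False)
)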
What if the rating scale is ordinal (1-10) rather than truly numeric, and some participants only used a narrow range (e.g., 7-9)?
Ordinal ratings imply that the difference between a “7” and an “8” is not necessarily a fixed, meaningful quantity, so standard arithmetic on the raw scores can mislead. A participant who rarely uses values below 6 effectively compresses their scale into a narrow band.
Potential Pitfalls and Real-World Issues
Invalid Aggregation: Simply computing a mean rating of 7.5 vs. 8.0 might not truly reflect a meaningful difference in perceived quality for ordinal scales.
Participant-Specific Bias: One participant’s 8 could be another participant’s 5 in relative terms.
Possible Approaches
Ordinal Models: Use statistical techniques (e.g., ordinal regression) that respect the rank-based nature of the scale; instead of treating all one-point differences on a 1–10 scale as equal, model the ratings as ordered categories (see the sketch after this list).
Distribution Analysis: Inspect how each participant uses the range. You might attempt a normalizing transformation—e.g., centering around each participant’s median rating—to standardize usage of the scale before aggregating.
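As a sketch of the ordinal-model route, assuming a reasonably recent statsmodels version and the same df as before, a proportional-odds model can be fit with OrderedModel, using pilot identity as the only predictor:

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Dummy-code pilots; ratings are treated as ordered categories rather than numbers
exog = pd.get_dummies(df['pilot_id'], prefix='pilot', drop_first=True).astype(float)
model = OrderedModel(df['rating'], exog, distr='logit')
result = model.fit(method='bfgs', disp=False)

# Larger pilot coefficients indicate pilots that shift the rating distribution upward
print(result.params.head())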
How would you address possible correlations between pilot characteristics (e.g., genre or cast) and participant demographic groups?
A hidden factor might be that certain demographic groups prefer, say, comedies over dramas, or have a bias toward well-known actors.
Potential Pitfalls and Real-World Issues
Confounded Ratings: If a large portion of participants for a specific pilot are from a demographic that strongly favors its genre, the pilot may get inflated ratings relative to the general audience.
Skewed Sampling: The random assignment of pilots to participants might still end up with correlated patterns if not carefully stratified.
Possible Approaches
Stratified Random Sampling: Before the focus group, ensure that each pilot is shown to a demographically representative subset of participants.
Post-stratification Weighting: After collecting data, reweight the ratings by demographic proportions so that the final aggregated score reflects a more balanced view of the target audience (see the sketch after this list).
Hierarchical Modeling with Demographics: Build a regression or mixed-effects model where you include participant-specific demographics as additional covariates to partial out demographic preferences.
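A rough sketch of post-stratification weighting, assuming df has a hypothetical demographic_group column and that population_share gives each group's share in the target audience:

import numpy as np

# Hypothetical target-audience composition (e.g., from subscriber or census data)
population_share = {'18-29': 0.30, '30-44': 0.35, '45+': 0.35}

# Weight = target share / observed share, so over-represented groups count less
sample_share = df['demographic_group'].value_counts(normalize=True)
df['ps_weight'] = df['demographic_group'].map(lambda g: population_share[g] / sample_share[g])

# Demographically adjusted pilot means
adjusted_means = (
    df.groupby('pilot_id')
      .apply(lambda g: np.average(g['rating'], weights=g['ps_weight']))
      .sort_values(ascending=False)
)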
What if the distribution of ratings for some pilots is strongly non-normal, with heavy tails or strong skewness?
A standard mean and standard deviation approach assumes roughly normal-like behavior. But real-world ratings can exhibit heavy-tailed distributions, especially for polarizing content.
Potential Pitfalls and Real-World Issues
Misleading Means and Standard Deviations: Means can be heavily influenced by a small number of extreme ratings.
Bi-modal or Multi-modal Distributions: Some pilots could get mostly very high or very low ratings with fewer “middle” ratings, invalidating typical “average plus standard deviation” analyses.
Possible Approaches
Robust Statistics: Use medians or trimmed means that reduce the impact of extreme outliers.
Distribution-Fitting: If data is clearly heavy-tailed, consider a distribution like a skewed logistic or a beta distribution for rating data, and estimate parameters accordingly.
Non-parametric Inference: Use rank-based or bootstrapping methods for constructing confidence intervals around medians or percentiles.
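For instance, a percentile-bootstrap confidence interval for a single pilot's median rating could look like the sketch below, where ratings is assumed to be the array of ratings for one pilot:

import numpy as np

def bootstrap_median_ci(ratings, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of one pilot's ratings."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    medians = np.array([
        np.median(rng.choice(ratings, size=len(ratings), replace=True))
        for _ in range(n_boot)
    ])
    return np.percentile(medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example with a small, made-up set of polarized ratings
print(bootstrap_median_ci([9, 10, 2, 8, 9, 1, 10, 9]))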
How might you handle pilots that have only a handful of ratings and thus lack statistical reliability?
Random assignment might not yield uniform coverage, so some pilots might end up with far fewer than the average number of ratings.
Potential Pitfalls and Real-World Issues
Excessive Variance: A pilot with only 5 ratings might show an average rating of 9.0, but that estimate is highly unstable and could shift substantially with just a few more ratings.
Uncertain Comparison: It’s risky to compare a pilot with 5 ratings to one with 200 ratings purely by average.
Possible Approaches
Minimum Rating Threshold: Exclude or discount pilots with too few ratings until additional data is collected.
Bayesian Shrinkage: Start with a prior (such as an overall average rating across all pilots) and pull the observed mean for small-n pilots toward that prior to avoid overconfidence.
Separate Analysis: Group pilots by rating count and consider a separate analysis or data-collection plan for low-sample pilots.
How can you detect and address the possibility that certain participants are “speed rating” or inattentive during the focus group?
Some participants might rush through the process, giving potentially random or meaningless scores.
Potential Pitfalls and Real-World Issues
Uniform Ratings: A participant who gives every pilot the same score within a very short time frame might not have truly watched them.
Misleading Extremes: An inattentive participant might use only the lowest or highest possible rating for convenience.
Possible Approaches
Time-based Filtering: Track the time it takes each participant to go through each pilot. If it’s unreasonably short, consider dropping or flagging that participant’s ratings.
Consistency Checks: Look for suspicious patterns, such as rating every pilot identically or in a repeating sequence of scores (see the sketch after this list).
Engagement Quizzes: In real-world studies, you can incorporate short quizzes about the pilot content to verify participant engagement. Inattentive participants who consistently fail can have their data removed or down-weighted.
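A simple screening pass along those lines, assuming df also records a hypothetical seconds_to_rate column for each rating:

# Per-participant summaries: rating spread and typical time spent before rating
participant_stats = df.groupby('participant_id').agg(
    rating_std=('rating', 'std'),
    median_seconds=('seconds_to_rate', 'median'),
)

# Flag participants who gave identical scores everywhere or rated implausibly fast
suspicious = participant_stats[
    (participant_stats['rating_std'].fillna(0) == 0)
    | (participant_stats['median_seconds'] < 30)   # hypothetical threshold in seconds
]

# Drop (or down-weight) their ratings before re-aggregating pilot scores
df_clean = df[~df['participant_id'].isin(suspicious.index)]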