ML Interview Q Series: How would you validate if human-rated relevance scores influence click-through rate using Facebook search logs?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
The main idea is to calculate the CTR (click-through rate) for each rating level. By comparing CTR across rating levels, you can see whether highly rated results (e.g., rating 5) tend to have a higher CTR than lower-rated ones (e.g., rating 1).
CTR is commonly defined as the number of clicks divided by the total number of impressions (or total results shown to the user):
CTR = Number of Clicks / Number of Impressions
Where:
Number of Clicks is the count of rows where has_clicked is true (or 1).
Number of Impressions is the total number of search results displayed (i.e., the total number of rows in the joined table for each rating).
By grouping the data according to the rating column and aggregating clicks versus total impressions, we can see whether a higher rating correlates with an increased CTR.
Potential SQL Query Approach
Below is an example SQL query (syntax may vary by SQL dialect). The idea is to join the search_events table with the search_results table on the matching query and position columns (assuming these two tables can be joined this way), and then group by the rating:
SELECT
    sr.rating AS rating,
    COUNT(*) AS total_impressions,
    SUM(CASE WHEN se.has_clicked = 1 THEN 1 ELSE 0 END) AS total_clicks,
    SUM(CASE WHEN se.has_clicked = 1 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS CTR
FROM search_events se
JOIN search_results sr
    ON se.query = sr.query
    AND se.position = sr.position
GROUP BY sr.rating
ORDER BY sr.rating;
Explanation of the query:
We join search_events (se) with search_results (sr) on the columns that identify the same result: query and position.
We group the rows by sr.rating, ensuring each rating level gets aggregated.
COUNT(*) gives the total impressions for each rating (i.e., the total number of times a result of that rating was shown).
We then compute the total number of clicks for each rating by summing 1 whenever has_clicked = 1.
Finally, CTR is computed as total_clicks / total_impressions. Multiplying by 1.0 (or casting to FLOAT) ensures floating-point division rather than integer division.
Ordering by sr.rating makes the output neatly sorted by rating level.
Why This Query Can Support or Disprove the Hypothesis
If CTR tends to be higher for higher-rated search results and lower for poorly rated ones, that indicates a correlation between the rating and CTR. Conversely, if all rating levels show roughly the same CTR, it suggests rating may not influence clicks.
Potential Pitfalls
It’s essential to keep in mind:
Position Bias: Users often click the top results regardless of rating. Consider controlling for position if you see skewed data from top ranks.
Query Type Differences: If some queries have inherently more “clickable” content, rating might correlate with specific query classes rather than direct user preference.
Sample Size Issues: Rarely assigned ratings (like rating 1 or rating 5) might have too few impressions to yield reliable CTR metrics.
Human Labeling Variances: The rating is a human-provided score. Different raters or rating guidelines can introduce variability into what “high relevance” means.
Further Considerations
One might refine the investigation with:
Position-Level Adjustment: Group by (rating, position) to see if rating still matters once you control for position.
Statistical Significance: Calculate confidence intervals or run a hypothesis test (e.g., chi-square) to see if any observed CTR differences are statistically significant.
Time-Based or Demographic Splits: CTR might also vary depending on time (e.g., day vs. night) or user demographics (if available and allowed for analysis).
Possible Follow-Up Questions
How would you handle confounding variables like position when checking if rating affects CTR?
Position plays a major role in CTR. Even if a result has a low rating, if it ranks first, it might still get clicked often. To account for that, you could break down the CTR by both rating and position. For instance:
SELECT
    sr.rating AS rating,
    sr.position AS position,
    COUNT(*) AS total_impressions,
    SUM(CASE WHEN se.has_clicked = 1 THEN 1 ELSE 0 END) AS total_clicks,
    SUM(CASE WHEN se.has_clicked = 1 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS CTR
FROM search_events se
JOIN search_results sr
    ON se.query = sr.query
    AND se.position = sr.position
GROUP BY sr.rating, sr.position;
Then you compare CTR within the same position but across different ratings. If rating still shows a consistent effect within each position bucket, it strengthens the hypothesis that rating drives CTR.
Should we consider any statistical testing approach to confirm the hypothesis?
Yes. Merely observing different CTRs may not guarantee statistical significance. To formalize this, you can:
Conduct a chi-square test comparing clicks versus no-clicks across different rating categories.
Use a z-test or logistic regression to see if rating is a significant predictor of clicks while controlling for other factors like position.
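As a rough, hedged sketch (one of several valid approaches), the snippet below assumes the merged Pandas DataFrame df_merged built in the next answer, with a 0/1 has_clicked column plus rating and position. It runs a chi-square test of independence and a logistic regression that controls for position:
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

# Contingency table: rows = rating levels, columns = clicked (1) vs. not clicked (0)
contingency = pd.crosstab(df_merged['rating'], df_merged['has_clicked'])

# Chi-square test of independence between rating and clicking
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")

# Logistic regression: is rating still predictive once position is controlled for?
model = smf.logit('has_clicked ~ C(rating) + C(position)', data=df_merged).fit()
print(model.summary())
A statistically significant effect for the rating terms, after controlling for position, is stronger evidence than a raw CTR gap alone.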
How would you implement this analysis in a data science environment, like Pandas or PySpark?
Below is a small snippet in Python using Pandas. Assume df_events contains the search events and df_results contains the search results:
import pandas as pd
# Merge on query and position
df_merged = pd.merge(df_events, df_results, on=['query', 'position'], how='inner')
# Compute aggregations
grouped = df_merged.groupby('rating').agg(
    total_impressions=('has_clicked', 'size'),
    total_clicks=('has_clicked', 'sum')
).reset_index()
# Compute CTR
grouped['CTR'] = grouped['total_clicks'] / grouped['total_impressions']
print(grouped)
This code groups by rating and computes total impressions (the count of rows) and total clicks (the sum of has_clicked, assuming has_clicked is 0/1). It then calculates CTR by dividing total clicks by total impressions.
How could query-level differences distort this analysis?
Some queries might be navigational (where the user is searching for a specific known entity) versus exploratory or informational (where the user might browse multiple results). This can affect CTR in ways unrelated to rating. If the user is explicitly searching for “Facebook login,” they might click the top link no matter its rating.
To mitigate this, you could:
Group or segment data by query type (navigational, informational, transactional).
Run separate analyses to see if the rating’s influence on CTR holds across different query types.
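A hedged Pandas sketch of this segmentation, assuming a hypothetical query_type column (not part of the original tables; it would come from a separate query-classification step) has been added to df_merged:
# 'query_type' is a hypothetical label (navigational / informational / transactional)
# produced upstream; it is not in the original schema.
ctr_by_type = (
    df_merged
    .groupby(['query_type', 'rating'])
    .agg(total_impressions=('has_clicked', 'size'),
         total_clicks=('has_clicked', 'sum'))
    .reset_index()
)
ctr_by_type['CTR'] = ctr_by_type['total_clicks'] / ctr_by_type['total_impressions']
# Check whether the rating-vs-CTR trend holds within each query type
print(ctr_by_type.sort_values(['query_type', 'rating']))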
How would you present this result to non-technical stakeholders?
Show a simple bar chart of CTR across each rating bucket (1 through 5); a small plotting sketch follows below.
If needed, highlight confidence intervals to indicate uncertainty in measurements.
Emphasize practical recommendations, such as ranking or surfacing higher-rated content more prominently if indeed it shows a statistically higher CTR.
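A minimal plotting sketch for the bar chart mentioned above, assuming the grouped DataFrame from the earlier Pandas snippet (error bars could be added from the confidence-interval calculation discussed later):
import matplotlib.pyplot as plt

# Bar chart of CTR by rating, using the 'grouped' DataFrame computed earlier
plt.bar(grouped['rating'].astype(str), grouped['CTR'])
plt.xlabel('Human relevance rating')
plt.ylabel('Click-through rate')
plt.title('CTR by relevance rating')
plt.show()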
By diving deeper into these considerations and follow-up questions, an interviewer can assess a candidate’s ability to handle real-world complexities and interpret results beyond a simple aggregated metric.
Below are additional follow-up questions.
What if there is missing data in the rating column?
One significant concern is incomplete or missing ratings in the dataset. If many rows lack a valid rating, it can distort the overall CTR comparison. Sometimes, human raters might skip rating results or systems might record incomplete data. This can introduce bias because rows with missing ratings may disproportionately belong to a certain query category or position (for example, older queries might not have been rated when the rating system was first introduced).
Potential pitfalls and considerations
Excluding Missing Data: Simply dropping rows without a rating can bias the analysis if missingness correlates with user clicks or particular types of queries.
Imputation Strategies: If missing data is widespread, you may consider techniques like median or mode imputation (e.g., assume missing ratings are some typical value). However, improper imputation can artificially inflate or deflate CTR for a certain rating bucket.
Segment Analysis: Analyze the distribution of missing data by position, query type, or time to see if it follows any pattern. If missingness skews heavily toward a certain dimension, you might need a more nuanced approach to avoid misinterpretation.
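A small diagnostic sketch along these lines, assuming missing ratings show up as nulls in the joined data:
# Share of impressions with no rating, overall and by position,
# to check whether missingness is concentrated in particular slices
missing_rate_overall = df_merged['rating'].isna().mean()
missing_rate_by_position = (
    df_merged.assign(rating_missing=df_merged['rating'].isna())
             .groupby('position')['rating_missing']
             .mean()
)
print(f"Overall missing-rating rate: {missing_rate_overall:.2%}")
print(missing_rate_by_position)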
How do you handle the possibility that the rating distribution is heavily skewed?
When rating data is imbalanced (e.g., most ratings are 4 or 5), it can be difficult to compare the CTR across sparse categories like rating 1 or 2. A small number of rows in a lower rating bucket may produce a CTR that looks artificially high or low just by chance.
Potential pitfalls and considerations
Confidence Intervals: Whenever working with limited data in a rating bucket, the sample size might be too small for conclusive interpretation. Estimating confidence intervals or using Bayesian approaches can highlight the uncertainty (see the sketch below).
Combining Adjacent Ratings: If certain ratings are extremely rare (e.g., rating 1 or rating 2 have very few rows), you might group them together for statistical reliability. However, grouping different ratings might lose granularity.
Stratified Sampling: If feasible, consider oversampling underrepresented ratings when gathering fresh data or weighting them more carefully when computing summary statistics.
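For the confidence-interval point, a minimal sketch using statsmodels' proportion_confint on the per-rating counts from the earlier grouped DataFrame (Wilson intervals behave reasonably even for small buckets):
from statsmodels.stats.proportion import proportion_confint

# Wilson confidence intervals for CTR in each rating bucket
ci_lower, ci_upper = proportion_confint(
    count=grouped['total_clicks'],
    nobs=grouped['total_impressions'],
    alpha=0.05,
    method='wilson'
)
grouped['ctr_ci_lower'] = ci_lower
grouped['ctr_ci_upper'] = ci_upper
print(grouped[['rating', 'CTR', 'ctr_ci_lower', 'ctr_ci_upper']])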
What about user-level or session-level correlation?
A single user might conduct many searches in one session. If the same user consistently clicks or does not click certain ratings, the data points from that user are not independent. Similarly, some sessions may have an exploratory mindset (leading to multiple clicks) while others might be navigational (where the user clicks once and leaves).
Potential pitfalls and considerations
Overcounting Active Users: Heavily active users could dominate the dataset, skewing CTR calculations toward their behavior.
Session Grouping: Consider grouping data by user session to analyze how CTR evolves within a single session. If a user has already clicked a result with rating 5, does that affect subsequent clicks on other high-rated results?
Mixed-Effects Modeling: A more advanced statistical approach could treat user or session as a random effect, isolating the influence of rating on CTR from individual user behavior patterns.
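A lighter-weight alternative to a full mixed-effects model is a logistic regression with cluster-robust standard errors by user. A hedged sketch, assuming a hypothetical user_id column in the merged data:
import statsmodels.formula.api as smf

# Cluster-robust standard errors treat repeated observations from the same user
# as correlated; 'user_id' is a hypothetical column, not in the original tables.
model = smf.logit('has_clicked ~ C(rating) + C(position)', data=df_merged).fit(
    cov_type='cluster',
    cov_kwds={'groups': df_merged['user_id']}
)
print(model.summary())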
Could there be multiple ratings for the same search result across different queries or time frames?
Sometimes a particular URL or item might get different ratings for different queries or over time if re-labeled by different human raters. This can create inconsistencies in the dataset, with the “same” result (in terms of final destination) carrying multiple relevance scores.
Potential pitfalls and considerations
Conflicting Ratings: If one rater gives a result a 5 and another rater gives it a 3, it might be unclear which rating to use. Merging or averaging these ratings might be necessary, but it can dilute the signal.
Contextual Relevance: A result can be highly relevant for one type of query and barely relevant for another. Splitting the analysis by query clusters can help keep the rating meaningful to the query intent.
Time Decay: Relevance might change as content gets updated or becomes outdated. Tracking ratings over time (e.g., using the most recent rating or applying a time-based weight) might produce a more accurate reflection of real relevance.
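For the time-decay point, one hedged option is to keep only the most recent rating per (query, position), assuming a hypothetical rated_at timestamp on the ratings table:
import pandas as pd

# 'rated_at' is a hypothetical timestamp for when the human rating was assigned
latest_ratings = (
    df_results
    .sort_values('rated_at')
    .drop_duplicates(subset=['query', 'position'], keep='last')
)
df_merged_latest = pd.merge(df_events, latest_ratings,
                            on=['query', 'position'], how='inner')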
How could delayed clicks or revisits affect the measured CTR?
Not all clicks occur immediately. Sometimes users return to the search results page after exploring a link, or they might click a result hours or days later if the session remains tracked. Standard CTR calculations often assume an impression and click happen closely in time.
Potential pitfalls and considerations
Session Definitions: The definition of a “session” can vary. If the user returns after a long interval, should it count as a new impression? This can affect the denominator in CTR.
Attribution Windows: Decide on a specific time window to attribute a click to an impression. For instance, anything beyond 30 minutes or an hour might be considered a new session or a separate impression (a small sketch follows below).
Multiple Clicks: A single user might click the same result multiple times. Typically, CTR analysis focuses on whether at least one click occurred or not. Multiple clicks might indicate deeper interest, but it complicates the standard CTR definition.
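A hedged sketch of a fixed attribution window, assuming hypothetical impression_time and click_time timestamp columns (not in the original schema):
import pandas as pd

# Count a click only if it happened within 30 minutes of the impression;
# 'impression_time' and 'click_time' are hypothetical timestamp columns.
window = pd.Timedelta('30min')
df_merged['attributed_click'] = (
    df_merged['has_clicked'].eq(1)
    & ((df_merged['click_time'] - df_merged['impression_time']) <= window)
)
ctr_windowed = df_merged.groupby('rating')['attributed_click'].mean()
print(ctr_windowed)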
How might you incorporate alternative engagement metrics, like dwell time or bounce rate, instead of just clicks?
Sometimes a short click (quick bounce) is not as meaningful as a longer engagement with the page. Dwell time—the time spent on a link’s landing page—could be more indicative of user satisfaction than a simple click.
Potential pitfalls and considerations
Measuring True Engagement: CTR doesn’t necessarily distinguish between random clicks and genuinely helpful visits. Including dwell time or bounce rates can filter out misleading clicks.
Data Availability: Tracking dwell time or bounce might require additional logging or instrumentation. Not all datasets will have consistent or reliable dwell time metrics.
Complexity of Analysis: Summarizing dwell time might require a distribution-based approach (e.g., average session length, proportion of extremely short visits). This is more nuanced than a simple click/no-click measure.
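One hedged way to operationalize a dwell-based metric, assuming a hypothetical dwell_seconds column is logged for clicked results (the 30-second threshold is purely illustrative):
# Treat a click with at least 30 seconds of dwell time as a 'long click'
df_merged['long_click'] = (
    df_merged['has_clicked'].eq(1) & (df_merged['dwell_seconds'] >= 30)
)
engagement = df_merged.groupby('rating').agg(
    CTR=('has_clicked', 'mean'),
    long_click_rate=('long_click', 'mean'),
)
print(engagement)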
What if the rating system itself is subjective or inconsistent across different raters?
Human-generated ratings can be subjective, with each rater having a slightly different interpretation of the scale. If the training or calibration of raters is uneven, rating 5 for one rater may be equivalent to rating 4 for another. This introduces noise or bias in the rating distribution.
Potential pitfalls and considerations
Inter-Rater Reliability: Track metrics like Cohen’s kappa or Krippendorff’s alpha to evaluate how consistently raters apply the rating scale. Low inter-rater reliability means the rating data may be noisy (see the sketch below).
Consensus Ratings: Use an average of multiple raters for each result, if available, to mitigate individual biases. This can be more accurate but requires more resources.
Adaptive Rating Scales: If feasible, employ an adaptive rating process (e.g., pairwise comparisons or repeated measures) to refine uncertain ratings. This may improve the consistency of labels over time.
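For the inter-rater reliability point, a minimal sketch using scikit-learn's cohen_kappa_score on two hypothetical raters' labels for the same (query, result) pairs (the toy arrays are illustrative only):
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two raters on the same eight items
rater_a = [5, 4, 4, 3, 5, 2, 1, 4]
rater_b = [5, 4, 3, 3, 4, 2, 2, 4]

# Quadratic weighting penalizes large disagreements more than small ones
kappa = cohen_kappa_score(rater_a, rater_b, weights='quadratic')
print(f"Weighted Cohen's kappa: {kappa:.2f}")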
Could bot-generated or spammy clicks inflate the CTR for certain ratings?
Automated traffic or malicious actors might click certain links repeatedly, influencing the overall click count. If bots specifically target certain positions or types of results, the aggregated CTR could become less meaningful.
Potential pitfalls and considerations
Anomaly Detection: Implement rules or models that flag unusual click patterns, like excessively high clicks from the same user or IP in a short timespan (a rough sketch follows below).
Filtering Known Bots: Exclude known bot traffic sources from the dataset, often identified by user-agent strings or IP addresses. This helps ensure you’re tracking only genuine user interactions.
Data Sanitization: If there’s any suspicion that certain rating categories were targeted in an experiment or had inflated traffic, segment your analysis or remove outliers to avoid skewed CTR results.
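For the anomaly-detection point, a crude sketch that flags unusually click-heavy users, assuming a hypothetical user_id column (the 99th-percentile threshold is arbitrary and would be tuned in practice):
# Flag users whose total click volume is far above typical
clicks_per_user = df_merged.groupby('user_id')['has_clicked'].sum()
threshold = clicks_per_user.quantile(0.99)
suspected_bots = clicks_per_user[clicks_per_user > threshold].index

# Recompute CTR by rating after excluding suspected bot traffic
df_clean = df_merged[~df_merged['user_id'].isin(suspected_bots)]
ctr_clean = df_clean.groupby('rating')['has_clicked'].mean()
print(ctr_clean)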
How do we ensure the final analysis is interpretable for stakeholders who do not have a deep analytics background?
While the numeric outputs (CTR values) can be shown in a table, it’s crucial to provide clear, intuitive visualizations and context for why CTR might vary by rating. Non-experts may focus on overall trends rather than granular statistics.
Potential pitfalls and considerations
Overloading With Detail: Bombarding stakeholders with too many technical details (like position bias corrections or advanced statistical tests) can be overwhelming. Focus on the main CTR trend while making the deeper analyses available for those who request it.
Storytelling: Use charts that illustrate how CTR changes with rating, highlighting confidence intervals or top-level insights (e.g., “Results with rating 5 yield 15% higher CTR on average compared to rating 3”).
Actionable Insights: Frame the conversation around what can be done with the findings—like increasing the exposure of content that’s rated 5 or refining the rating guidelines to ensure consistency.