ML Case-study Interview Question: Causal Inference with Synthetic Controls & DML for Platform Growth
Case Study
A large online platform wants to measure the impact of newly released product features on user growth, retention, and revenue. They also have mobile games in their ecosystem and roll out different campaigns at the country level. They rarely observe users for a full year in an A/B test, but they must project long-term gains or losses. They sometimes skip setting up controlled A/B tests before launching major game updates or marketing pushes. They run surveys to gather insights on user preferences, but survey non-response bias skews their data. They need a robust end-to-end causal inference framework to handle these scenarios.
Question
You are a Senior Data Scientist. You must propose a method to:
Project long-term business metrics from short A/B tests.
Evaluate causal impact of game-related interventions when classic A/B tests are impossible.
Weigh tradeoffs between competing metrics.
Address heterogeneous non-response bias in surveys.
Ensure that results dashboards are designed so stakeholders can make quick, data-driven decisions.
How would you build and justify your approach? Outline each step in detail. Provide algorithms, code snippets, or pseudo-code where appropriate.
Detailed Solution
1) Projecting Long-Term Metrics from Short A/B Tests
Goal: Estimate annualized revenue or retention impact from a short test window.
Core Idea: Use a surrogate variable (e.g., early retention) that reliably captures part of the causal path toward the final outcome. Combine this with assumptions about how the treatment effect evolves over unobserved billing cycles and cohorts.
Steps:
Train a retention model that predicts user survival or churn probability.
Observe short-run treatment effects during the test (e.g., first and second billing cycles).
Project forward the unobserved periods by using the surrogate-based retention predictions for cycles 3 through 12.
Assume transportability of treatment effects across subsequent cohorts if long-running tests are not available.
Validate by comparing projections with actual data from extended tests where feasible.
Why This Helps: It automates manual forecasting, saves time, and consistently incorporates retention data to approximate missing later-cycle effects.
Example Code Snippet (Plain Text):
# Pseudocode for projecting future revenue impact
# observed_treatment_effects: dict keyed by billing period (1, 2, ...),
#   with values = revenue lift measured during the test
# retention_model: fitted model that predicts the retention ratio for a given billing period
future_projections = {}
for period in range(3, 13):
    predicted_retention_ratio = retention_model.predict(period)
    # Scale the last observed effect (billing period 2) by the predicted retention ratio.
    projected_effect = observed_treatment_effects[2] * predicted_retention_ratio
    future_projections[period] = projected_effect

# Sum observed and projected effects to get the annualized impact.
annualized_effect = sum(observed_treatment_effects.values()) + sum(future_projections.values())
2) Evaluating Game Interventions Without Traditional A/B Tests
Goal: Measure impact of region-level or game-level changes when no control group is available or when the entire population receives the treatment.
Core Idea: Implement synthetic control approaches to construct a pseudo-control from historical data or comparable units.
Steps:
Choose a set of control units that did not receive the intervention (e.g., other games or countries that remained unchanged).
Combine these controls into a “synthetic” unit that mirrors pre-intervention metrics for the treated game/region.
Compare post-intervention outcomes of the actual treated unit with the synthetic counterpart.
Use robustness checks like backdating to confirm stability.
Why This Helps: It provides a way to measure causal impact when randomization is not feasible.
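A minimal sketch of the weight-fitting step described above, assuming numpy and scipy are available; pre_treated, pre_controls, post_treated, and post_controls are illustrative arrays holding the pre- and post-intervention metric for the treated unit and the untouched control units (names are not from the original case).
Example Code Snippet (Plain Text):
# Fit non-negative weights that sum to 1 so the weighted controls track the
# treated unit's pre-intervention metric, then use them as the counterfactual.
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_weights(pre_treated, pre_controls):
    # pre_treated: shape (T_pre,); pre_controls: shape (T_pre, J) for J control units
    J = pre_controls.shape[1]
    loss = lambda w: np.sum((pre_treated - pre_controls @ w) ** 2)
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * J
    result = minimize(loss, x0=np.full(J, 1.0 / J), bounds=bounds, constraints=constraints)
    return result.x

# weights = fit_synthetic_weights(pre_treated, pre_controls)
# synthetic_post = post_controls @ weights            # counterfactual trajectory
# estimated_effect = post_treated - synthetic_post    # per-period causal impact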
3) Double Machine Learning for Metric Tradeoffs
Goal: When multiple metrics move in different directions, quantify each metric’s causal effect on a north-star outcome (e.g., overall retention).
Core Idea: Use Double Machine Learning (DML) with an Augmented Inverse Propensity Weighting (AIPW) estimator. Discretize treatment levels to compute Average Treatment Effects on a consistent population.
Steps:
Choose the metric or feature you suspect impacts retention (e.g., gameplay time vs. streaming hours).
Fit a propensity model to estimate probability of each treatment level.
Compute AIPW estimates of the effect on retention for each treatment level.
Weight these estimates by the real baseline distribution to form an apples-to-apples comparison.
Why This Helps: It yields fair comparisons of different interventions or metrics, especially when user-level heterogeneity causes bias in simpler partially linear models.
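As a rough illustration of the final weighting step, the per-level AIPW estimates can be rolled up using the baseline population mix so every metric is judged on the same population; the level names and numbers below are made up.
Example Code Snippet (Plain Text):
# Weight per-level effects by the baseline share of users at each level.
baseline_shares = {"low": 0.5, "medium": 0.3, "high": 0.2}      # observed population mix
level_effects = {"low": 0.010, "medium": 0.018, "high": 0.025}  # AIPW retention lift vs. a common reference
overall_effect = sum(baseline_shares[k] * level_effects[k] for k in level_effects)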
4) Survey A/B Tests with Heterogeneous Non-Response Bias
Goal: Correct for the fact that a treatment can alter who responds to a survey, which introduces confounding and skews downstream guardrail metrics.
Core Idea: Combine propensity score re-weighting for internal validity (balancing responders in treatment vs. control) with iterative proportional fitting for external validity (matching overall population profiles).
Steps:
Split respondents into strata by key covariates that drive response rates (e.g., user tenure).
Compute conditional average treatment effects within each stratum.
Use propensity scores to align the distribution of stratum membership across treatment arms.
Apply iterative proportional fitting to re-align respondents to the known distribution of the total user base if needed.
Calculate final average treatment effects with these corrected weights.
Why This Helps: It corrects for different response biases that can appear across test variants.
5) Designing Visual Dashboards for Clarity
Goal: Present causal inference results in a way that stakeholders can interpret correctly.
Key Considerations:
For proportions or parts of a whole, visualize with clear pie or stacked bar charts.
For direct numeric comparisons, simple bar charts or tables.
Keep interactive elements minimal if the data is small and mostly static.
Provide short annotations for key metrics (e.g., “Projected 3% revenue lift in the next billing cycle”).
Why This Matters: Even correct statistical results fail if users misinterpret them. Aligning design with the question ensures rapid comprehension.
Double Machine Learning with Augmented Inverse Propensity Weighting
Purpose: Estimate the causal impact of one or more “treatments” on an outcome (e.g., retention) using flexible machine learning methods. Remove confounding biases by modeling both how treatments are assigned (propensity) and how outcomes are generated.
Key Elements
Propensity Model: Estimate the probability of receiving a given treatment (or treatment level) for each unit.
Outcome Model: Estimate the outcome conditional on observed features and treatment.
Why “Double” Machine Learning
Two models are fitted: one for treatment assignment (propensity) and one for outcome.
Use cross-fitting to reduce overfitting: split the data into folds, train the propensity/outcome models on one fold, and compute predicted values on the hold-out fold. Aggregate predictions for final estimates.
Augmented Inverse Propensity Weighting (AIPW)
AIPW combines inverse-propensity-weighting (IPW) with outcome modeling.
In plain text (for a single binary treatment example): ATE = (1/N) * SUM_over_i[ {W_i / p_i} * (Y_i - m1(X_i)) - {(1 - W_i)/(1 - p_i)} * (Y_i - m0(X_i)) + (m1(X_i) - m0(X_i)) ]
Where:
W_i is treatment indicator (0 or 1).
p_i is predicted probability of treatment for unit i (from the propensity model).
m1(X_i) is predicted outcome under treatment (from the outcome model).
m0(X_i) is predicted outcome under control.
The “augmented” part comes from adding the outcome model predictions to correct errors in propensity-based weighting.
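A minimal cross-fitted sketch of this binary-treatment AIPW estimator, assuming scikit-learn and numpy arrays X (covariates), W (0/1 treatment), and Y (outcome); the specific model choices are illustrative, not prescribed by the case.
Example Code Snippet (Plain Text):
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def aipw_ate(X, W, Y, n_splits=5, clip=1e-3):
    psi = np.zeros(len(Y))
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        X_tr, W_tr, Y_tr = X[train_idx], W[train_idx], Y[train_idx]
        X_te, W_te, Y_te = X[test_idx], W[test_idx], Y[test_idx]

        # Propensity model: P(W = 1 | X), clipped away from 0/1 for stability.
        p = GradientBoostingClassifier().fit(X_tr, W_tr).predict_proba(X_te)[:, 1]
        p = np.clip(p, clip, 1 - clip)

        # Outcome models: E[Y | X, W = 1] and E[Y | X, W = 0], fit on the training fold.
        m1 = GradientBoostingRegressor().fit(X_tr[W_tr == 1], Y_tr[W_tr == 1]).predict(X_te)
        m0 = GradientBoostingRegressor().fit(X_tr[W_tr == 0], Y_tr[W_tr == 0]).predict(X_te)

        # AIPW score per held-out unit (matches the plain-text formula above).
        psi[test_idx] = (m1 - m0
                         + W_te * (Y_te - m1) / p
                         - (1 - W_te) * (Y_te - m0) / (1 - p))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))   # ATE and an approximate standard error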
For multiple discrete treatment levels, you define separate propensity scores for each treatment level and adapt the formula accordingly. For instance, with treatments in a set {0,1,2,...}, you estimate p_i(0), p_i(1), p_i(2), etc., and predicted outcomes for each level: m0(X_i), m1(X_i), m2(X_i), etc. Then you compare each level pairwise or against a reference.
Discretizing Treatment Levels
When treatments are continuous (e.g., “hours of exposure”), create bins or intervals.
Estimate separate propensity probabilities for each bin.
Compute ATE contrasts between these bins (low vs. medium, medium vs. high, etc.).
Weight each contrast by the appropriate share of the overall population to compare treatments on the same base.
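A rough sketch of the discretization and per-level propensity step, assuming pandas and scikit-learn; df, weekly_hours, and covariate_cols are illustrative names, not from the original case.
Example Code Snippet (Plain Text):
# Bin a continuous exposure into three levels and estimate multinomial propensities.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df["exposure_bin"] = pd.qcut(df["weekly_hours"], q=3, labels=["low", "medium", "high"])

propensity_model = LogisticRegression(max_iter=1000)
propensity_model.fit(df[covariate_cols], df["exposure_bin"])
level_propensities = propensity_model.predict_proba(df[covariate_cols])  # columns ordered as propensity_model.classes_

# Baseline shares used later to weight per-level contrasts onto a common population.
baseline_shares = df["exposure_bin"].value_counts(normalize=True)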
Outcome
Yields consistent estimates of average treatment effects even when the outcome and treatment assignment have complex nonlinear relationships with covariates.
Helps compare different “treatment levels” fairly without letting extreme subgroups overly influence the results.
Combining Propensity Score Re-Weighting and Iterative Proportional Fitting
Problem: Surveys or other data-collection methods might suffer from two kinds of bias:
Internal Validity Issues: Treatment and control groups differ on key covariates.
External Validity Issues: The subset of responders differs from the broader population.
Propensity Score Re-Weighting for Internal Validity
Estimate the probability p_i(treatment) that unit i will end up in the treatment condition.
When analyzing the treated group vs. control respondents, assign each unit a weight = 1 / p_i for treated units, or 1 / (1 - p_i) for control units. This aligns the distribution of covariates in treated vs. control groups.
Result: balanced distributions of relevant covariates, removing confounding inside the sample of responders.
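A minimal sketch of that weighting step, assuming scikit-learn; X_resp (responder covariates) and treated (a 0/1 flag) are placeholders.
Example Code Snippet (Plain Text):
# Estimate response-sample propensities and convert them into IPW weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

ps_model = LogisticRegression(max_iter=1000).fit(X_resp, treated)
p = np.clip(ps_model.predict_proba(X_resp)[:, 1], 0.01, 0.99)   # clip to avoid extreme weights
ipw_weights = np.where(treated == 1, 1.0 / p, 1.0 / (1.0 - p))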
Iterative Proportional Fitting (IPF)
Also called “raking” or “post-stratification.”
Align the weighted sample to match known population margins or distributions (e.g., age, region proportions) from external data or overall user logs.
IPF iteratively adjusts the weights so that the sample frequencies in each subgroup match the target population frequencies.
Example: If 18-24 year-olds are 20% of the overall population but only 10% of your final weighted sample, IPF will increase the weight for that subgroup until they represent 20%.
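A rough raking sketch under the assumption that the target margins are known population shares; survey_df, ipw_weights, and the age_band categories are illustrative.
Example Code Snippet (Plain Text):
# Iteratively scale weights so each column's weighted category shares match the targets.
import numpy as np

def rake(df, weights, targets, n_iter=50):
    # targets: {column_name: {category: population_share}}; shares per column should sum to 1
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(n_iter):
        for col, target_shares in targets.items():
            total = w.sum()
            for category, target in target_shares.items():
                mask = (df[col] == category).to_numpy()
                current = w[mask].sum() / total
                if current > 0:
                    w[mask] *= target / current
    return w

# Example: lift 18-24 year-olds from 10% of the weighted sample to their 20% population share.
# final_weights = rake(survey_df, ipw_weights,
#                      {"age_band": {"18-24": 0.20, "25-34": 0.35, "35+": 0.45}})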
Combined Effect
Propensity Score Re-Weighting ensures the control vs. treatment comparison is unbiased among those who responded.
Iterative Proportional Fitting scales up or down those respondents so that the final distribution matches the total population, addressing who did or did not respond.
Outcome
Both internal and external validity are addressed.
Resulting estimates of average treatment effects better reflect the population-level effect (external validity) without confounding (internal validity).
This approach is common in survey analysis, but the principles extend to any observational study where the eventual analyzed sample may not represent the population of interest.
Follow-Up Question 1
How do you assess whether the surrogate-based projection is reliable for new cohorts and new billing cycles?
Answer Explanation:
Evaluate accuracy on historical tests that ran longer. Compare the surrogate-based projections from the early part of those tests with the actual observed outcomes in later cycles.
Check if the surrogate variable (retention) maintains a high correlation with final outcomes. If correlation drops over time or across certain user segments, refine or re-train the retention model.
Examine differences in user behavior between the newly launched feature and past features. If you suspect differences, incorporate updated user attributes or behavior data into the surrogate model.
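One way to make this validation concrete is a simple backtest; historical_pairs below is a hypothetical list of (early-window projection, realized annualized effect) tuples from past long-running tests.
Example Code Snippet (Plain Text):
# Compare early-window projections against realized outcomes from extended tests.
import numpy as np

projected = np.array([p for p, _ in historical_pairs])   # projections made from the short window
actual = np.array([a for _, a in historical_pairs])      # realized annualized effects
mape = np.mean(np.abs((projected - actual) / actual))    # mean absolute percentage error
bias = np.mean(projected - actual)                       # systematic over- or under-projection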
Follow-Up Question 2
What if you cannot find suitable control units for the synthetic control method?
Answer Explanation:
Attempt hierarchical methods where partial comparators are formed by combining multiple datasets or smaller subgroups.
Use advanced matching algorithms on historical trends to find “closest” units, even if not perfect.
If no valid control exists, consider difference-in-differences with time-series modeling or advanced causal inference frameworks (e.g., structural models) to capture counterfactual trends.
Revisit the intervention scope. If possible, stagger rollout by region or segment to create partial randomization and improve comparability.
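If the difference-in-differences fallback is used, a minimal sketch with statsmodels could look like the following; panel is a hypothetical long-format DataFrame with an outcome column y and 0/1 columns treated and post.
Example Code Snippet (Plain Text):
# Two-way interaction regression: the treated x post coefficient is the DiD estimate.
import statsmodels.formula.api as smf

did_model = smf.ols("y ~ treated * post", data=panel).fit()
did_estimate = did_model.params["treated:post"]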
Follow-Up Question 3
Why does the partially linear model approach sometimes yield incorrect rankings for treatments with different baselines?
Answer Explanation:
The partially linear model places higher weight on observations with unpredictable assignment, leading to biased average treatment effect estimates for heterogeneous treatments.
When two treatments have different baseline distributions, the model might emphasize sub-populations where the assignment is less predictable, distorting global comparisons.
The AIPW method more directly targets average treatment effects at each treatment level, bypassing the weighting distortions of the partially linear model.
Follow-Up Question 4
How do you handle survey completion incentives that inflate completion rates but might reduce data quality?
Answer Explanation:
Treat the incentive as a treatment factor in the survey experiment. Measure its effect on completion rate and on any data-quality guardrails (e.g., time to complete, random answering).
Segment responders by incentive level and re-weight their responses via propensity scores to match the distribution of typical non-incentivized respondents.
Validate data quality by checking correlations between baseline user behaviors and survey responses. If certain segments appear to be “speed-clicking,” remove or down-weight those records.
Follow-Up Question 5
How can you ensure that dashboard designs do not mislead stakeholders?
Answer Explanation:
Use consistent color schemes and axis scales across related charts so viewers do not over-interpret small differences.
Avoid 3D charts or distorted axis scaling.
Provide concise tooltips or labels describing precisely what each visual element means.
Conduct quick user testing with internal stakeholders. Ask them to interpret the charts. If results differ from your intentions, adjust design or labels.
This approach addresses each problem: projecting long-term effects from short tests, inferring causal impacts without perfect randomization, balancing metrics via double machine learning, mitigating survey bias, and presenting the results effectively.