ML Interview Q Series: How would you design an A/B test and use bootstrap sampling to get confidence intervals for conversion rates?
Comprehensive Explanation
An A/B test is typically designed by dividing the user traffic into two groups: one that sees version A (control) and another that sees version B (treatment). The primary metric of interest here is the conversion rate, defined as the proportion of visitors who complete a desired action, such as making a purchase.
In practical terms, each user is randomly assigned to either version A or version B to ensure there is no systematic bias. After collecting enough data in both groups (number of visits and number of conversions), statistical methods are used to compare the two observed conversion rates. The test’s conclusion is whether the difference in these rates is statistically significant.
When performing this analysis, it is critical to ensure an adequate sample size for each variant so that the test has enough power to detect meaningful differences. Common practices include defining a significance level (often 0.05) and a desired power (often 0.8). The analysis then typically measures whether the difference in conversion rates is large enough, relative to its variability, to conclude that one version outperforms the other beyond random chance.
Bootstrapping can further aid in assessing the robustness of these results by constructing confidence intervals for the conversion rates (and for their difference) through repeated sampling from the empirical data. The bootstrap approach sidesteps assumptions about the data’s underlying distribution and allows a more direct estimation of the variability in the observed metric.
The Core Mathematical Formulas for Conversion Rates
To analyze conversion rates in an A/B test, denote the observed conversion rate in variant A as p_A and in variant B as p_B. Each is computed as the ratio of the number of conversions to the total number of visitors in that variant:

p_A = (Conversions in A) / (Total visitors in A)

p_B = (Conversions in B) / (Total visitors in B)

In these expressions, Conversions in A and B refer to the counts of users who converted under each condition, and Total visitors in A and B refer to the total users exposed to each condition.
Setting Up the A/B Test
You randomly assign incoming users to either variant A (control) or variant B (treatment). The crucial steps include:
Collecting data for each group, including the total number of visitors and how many of them converted.
Monitoring the test duration to ensure the sample size is large enough to meet statistical power requirements.
Ensuring consistent assignment so that each user sees only one version throughout the test, preventing contamination of the data.
Once enough data is accumulated, you can estimate p_A and p_B, compute their difference, and perform hypothesis testing (for instance, a two-proportion z-test) or generate a bootstrap distribution to see if there is a significant difference between them.
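For concreteness, here is a minimal sketch of a pooled two-proportion z-test in Python; the conversion counts in the example are illustrative, not real data.

import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test for H0: p_A == p_B
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under the null
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error under the null
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                       # two-sided p-value
    return z, p_value

# Illustrative counts: 120/2400 conversions in A, 150/2380 in B
z_stat, p_val = two_proportion_ztest(120, 2400, 150, 2380)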
Bootstrap Sampling for Confidence Intervals
Bootstrapping is a resampling technique that draws repeated samples (with replacement) from the observed data to approximate the sampling distribution of a statistic. For conversion rates, we typically do the following:
Resample with replacement from the original data: for variant A, draw from its set of visitors (each labeled as converted or not), and do the same for variant B from its own set.
In each bootstrap sample, calculate p_A, p_B, and their difference.
Repeat this process many times (often 10,000 or more).
Use the distribution of these bootstrapped statistics to find the percentiles corresponding to the desired confidence interval (for example, the 2.5th and 97.5th percentiles for a 95% interval).
By observing how the difference p_B - p_A varies across the bootstrapped samples, you can check whether zero is contained within that interval. If zero lies outside the interval, it suggests a significant advantage for one version over the other.
Here is a simple Python demonstration of constructing a bootstrap confidence interval for a conversion rate. Assume you have a list of binary outcomes for each visitor in A (1 if converted, 0 if not) and similarly for B.
import numpy as np

def bootstrap_confidence_interval(data, num_bootstraps=10000, alpha=0.05):
    # data is an array-like of 0s and 1s (1 = converted)
    data = np.asarray(data)
    n = len(data)
    estimates = []
    for _ in range(num_bootstraps):
        # Draw n indices with replacement and record the resampled conversion rate
        sample_indices = np.random.randint(0, n, n)
        sample = data[sample_indices]
        estimates.append(sample.mean())
    # Percentile bootstrap interval
    lower_bound = np.percentile(estimates, 100 * (alpha / 2))
    upper_bound = np.percentile(estimates, 100 * (1 - alpha / 2))
    return lower_bound, upper_bound

# Example usage with illustrative synthetic outcomes (replace with real observations).
# A_outcomes and B_outcomes are arrays of 0s/1s for each visitor in a variant;
# conversions in A / B are the sums of these arrays,
# total visitors in A / B are the lengths of these arrays.
A_outcomes = np.random.binomial(1, 0.10, size=5000)
B_outcomes = np.random.binomial(1, 0.12, size=5000)

A_mean = np.mean(A_outcomes)
B_mean = np.mean(B_outcomes)
diff = B_mean - A_mean

A_lower, A_upper = bootstrap_confidence_interval(A_outcomes)
B_lower, B_upper = bootstrap_confidence_interval(B_outcomes)
# For the difference, resample from each dataset independently (see the sketch below)
# or resample from the combined data under an appropriate null assumption.
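To make that closing comment concrete, here is a minimal sketch that bootstraps the difference p_B - p_A by resampling each group independently; it assumes the same A_outcomes and B_outcomes arrays defined above.

def bootstrap_diff_confidence_interval(data_a, data_b, num_bootstraps=10000, alpha=0.05):
    # Percentile bootstrap CI for p_B - p_A, resampling each group independently
    data_a, data_b = np.asarray(data_a), np.asarray(data_b)
    n_a, n_b = len(data_a), len(data_b)
    diffs = []
    for _ in range(num_bootstraps):
        sample_a = data_a[np.random.randint(0, n_a, n_a)]
        sample_b = data_b[np.random.randint(0, n_b, n_b)]
        diffs.append(sample_b.mean() - sample_a.mean())
    lower = np.percentile(diffs, 100 * (alpha / 2))
    upper = np.percentile(diffs, 100 * (1 - alpha / 2))
    return lower, upper

diff_lower, diff_upper = bootstrap_diff_confidence_interval(A_outcomes, B_outcomes)
# If this interval excludes zero, the data suggest a real difference between the variants.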
Follow-Up Questions
What are the assumptions behind a two-proportion z-test, and how do they relate to a bootstrap approach?
The two-proportion z-test relies on a normal approximation of the underlying distribution of conversion counts, assuming a large enough sample size. It also assumes independence of observations. The bootstrap approach does not rely on a normal approximation; it uses empirical resampling to approximate the distribution of the statistic. In small sample regimes or where normality assumptions are questionable, bootstrap can provide more robust estimates.
How do you ensure the A/B test is powered correctly?
Determining the required sample size involves specifying the effect size you want to detect (i.e., how large a difference in conversion rates is practically important), the desired significance level, and the test’s power. Typically, you use standard sample size formulas or simulation-based methods to estimate how many users are needed in each group to reliably detect the desired difference if it truly exists.
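As a sketch, the standard normal-approximation formula for the per-group sample size can be coded directly; the 10% baseline rate and one-percentage-point lift below are illustrative choices, not recommendations.

import numpy as np
from scipy.stats import norm

def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.8):
    # Normal-approximation sample size for a two-sided two-proportion test
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance / (p_b - p_a) ** 2
    return int(np.ceil(n))

# Example: detect a lift from a 10% to an 11% conversion rate
print(sample_size_per_group(0.10, 0.11))  # roughly 14,700 users per group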
How might you handle situations where multiple versions (more than two) are tested simultaneously?
When you have multiple variants, the experiment becomes a multi-way (A/B/n) test. If you simply compare the variants pairwise in a naive manner, you must adjust for multiple comparisons to avoid inflating the false positive rate. Techniques like the Bonferroni correction, Tukey’s range test, or false discovery rate procedures might be used, as in the sketch below. Alternatively, Bayesian multi-armed bandit approaches dynamically allocate traffic to better-performing variants while still exploring others.
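A minimal sketch of naive pairwise testing with a Bonferroni-adjusted threshold might look as follows; it assumes statsmodels is available and uses illustrative conversion counts.

from statsmodels.stats.proportion import proportions_ztest

# Illustrative (conversions, visitors) per variant
control = (120, 2400)
variants = {"B": (150, 2380), "C": (140, 2410), "D": (135, 2395)}

alpha = 0.05
adjusted_alpha = alpha / len(variants)  # Bonferroni: divide alpha by the number of comparisons

for name, (conv, n) in variants.items():
    z, p = proportions_ztest([control[0], conv], [control[1], n])
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"A vs {name}: z={z:.2f}, p={p:.4f} ({verdict} at adjusted alpha={adjusted_alpha:.4f})")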
How do you address issues of stopping an A/B test early if one version appears to be winning?
Stopping tests prematurely as soon as a difference looks significant inflates the type I error rate. Sequential testing methods or group-sequential designs allow interim checks while controlling the overall type I error. Techniques like alpha-spending functions allocate the overall significance level across these interim looks, so that repeated peeks at the data do not inflate the false positive rate.
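A small simulation makes the problem tangible: both variants below share the same true conversion rate, yet stopping at the first "significant" peek rejects the null far more often than the nominal 5%. The batch size and number of peeks are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(num_experiments=2000, num_peeks=20, batch=500, p=0.10, alpha=0.05):
    false_positives = 0
    for _ in range(num_experiments):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(num_peeks):
            conv_a += rng.binomial(batch, p); n_a += batch
            conv_b += rng.binomial(batch, p); n_b += batch
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if 2 * (1 - norm.cdf(abs(z))) < alpha:  # stop at the first "significant" peek
                false_positives += 1
                break
    return false_positives / num_experiments

print(peeking_false_positive_rate())  # typically well above the nominal 0.05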
How would you incorporate business metrics beyond conversion rate?
Although conversion rate is often key, you may need to track additional metrics like average order value, customer lifetime value, or user engagement times. The analysis can become more complex because improvements in conversion might come at the cost of other metrics. In practice, you define a primary metric (the main success criterion) but also monitor secondary metrics to avoid unintended consequences.
How do you interpret a confidence interval that includes zero?
If the bootstrap or other intervals for the difference in conversion rate cross zero, you cannot confidently declare a statistically meaningful increase or decrease in conversion. It suggests that the data does not provide strong evidence that one version outperforms the other beyond chance fluctuations.
These clarifications help ensure that not only the statistical methodology is correct, but that real-world complexities in user behavior, business considerations, and data collection are fully addressed when running and interpreting an A/B test.
Below are additional follow-up questions
How would you handle traffic fluctuations or seasonal effects during the A/B test?
When running an A/B test, traffic volume and user behavior can fluctuate significantly over time. Seasonality (such as holidays, weekends, or special events) might skew results if one variant is disproportionately affected by these patterns.
You could run the test for a full cycle covering all relevant time periods. If the test only spans a brief snapshot, day-to-day or week-to-week variability may bias the outcome. One practice is to randomize assignments daily or weekly and collect data across multiple cycles of traffic to capture typical user behavior patterns.
Edge cases include:
Sudden traffic spikes from promotions or media coverage that disproportionately affect one variant.
Holidays or end-of-year events that drastically change user spending behavior.
To address these, you can track user behavior data across seasonal periods, segment data by time windows, and verify that neither variant is exclusively exposed to abnormal traffic patterns. If a traffic spike occurs, you might analyze those days separately to see if the behavior under that spike is different from the normal baseline.
How do you deal with delayed conversions or a long funnel where the final outcome is not immediate?
Sometimes users do not convert instantly. They might visit the site, leave, return later, and only then make a purchase. This delayed conversion complicates measuring variant performance because short test windows may miss these later conversions.
One strategy is to wait enough time after the last user’s initial exposure so that most conversions can be tracked. You define a “cool-down” period to let conversions unfold naturally. Another approach is to track intermediate metrics (e.g., add-to-cart rates) if the final conversions occur too far in the future.
Pitfalls:
Incomplete data if you stop measuring too soon and some portion of your sample has not had time to convert.
Attribution issues when multiple sessions or marketing channels interact with your test.
In practice, you might incorporate a tracking system that ties each visitor to their future conversions. You then ensure your final dataset captures that entire timeline, or you create partial metrics that remain consistent across both variants.
How would you respond if the bootstrap confidence interval shows a different conclusion than a standard z-test?
It can happen that a parametric test (two-proportion z-test) and a nonparametric or resampling-based approach (bootstrap) yield slightly different p-values or confidence intervals. If the sample size is not large, or if the data distribution deviates from normality assumptions, the bootstrap method often gives a more reliable representation of the variance.
In such a scenario:
Investigate the assumptions made by the z-test, such as normal approximation or equal variance assumptions.
Check sample size to see if it meets the typical thresholds for the normal approximation to hold.
Use additional methods (like a Bayesian approach or a permutation test) for a robustness check.
Ultimately, if the bootstrap approach consistently indicates a different conclusion, you might give it more weight, especially when the data distribution is skewed or the sample size is borderline for normal approximations.
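As one such robustness check, here is a minimal permutation-test sketch for the difference in conversion rates, assuming per-visitor 0/1 outcome arrays like those used earlier.

import numpy as np

def permutation_test_diff(data_a, data_b, num_permutations=10000, seed=0):
    # Two-sided permutation test for the difference in conversion rates
    rng = np.random.default_rng(seed)
    data_a, data_b = np.asarray(data_a), np.asarray(data_b)
    observed = data_b.mean() - data_a.mean()
    pooled = np.concatenate([data_a, data_b])
    n_a = len(data_a)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)                       # shuffle labels under the null of no difference
        perm_diff = pooled[n_a:].mean() - pooled[:n_a].mean()
        if abs(perm_diff) >= abs(observed):
            count += 1
    return observed, count / num_permutations     # observed difference and permutation p-value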
How would you incorporate user segmentation into your analysis?
Sometimes different segments of users (e.g., new vs. returning customers, different geographic regions, distinct marketing channels) respond differently to each variant. You might see a globally insignificant difference, but find that in specific high-value segments there is a significant improvement.
After obtaining overall results, you can split the data by relevant user attributes and re-compute metrics:
Segment by location: p_A and p_B might differ by country or region.
Segment by traffic source: conversions might differ between users who arrive organically versus paid campaigns.
Potential pitfalls:
Multiple hypothesis testing across many segments can inflate false positives. A stringent correction (e.g., Bonferroni or false discovery rate methods) or Bayesian hierarchical modeling can help manage these repeated checks.
Small sample sizes in certain segments leading to unreliable inferences.
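A minimal sketch of the per-segment re-computation, assuming an illustrative pandas DataFrame with 'variant', 'segment', and 'converted' columns (not a prescribed schema):

import pandas as pd

# Assumed schema: one row per visitor
df = pd.DataFrame({
    "variant": ["A", "A", "B", "B", "A", "B"],
    "segment": ["US", "EU", "US", "EU", "US", "US"],
    "converted": [0, 1, 1, 0, 0, 1],
})

# Conversion rate and sample size per (segment, variant)
per_segment = (df.groupby(["segment", "variant"])["converted"]
                 .agg(conversion_rate="mean", visitors="size")
                 .reset_index())
print(per_segment)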
How do you manage “novelty effects” or changes in user behavior over time?
Users initially exposed to a new design might behave differently simply because it is new or visually striking. This is known as a novelty effect. Over time, as novelty wears off, the conversion rate might settle at a different level.
Practical countermeasures:
Longer testing windows to allow the effect of novelty to subside.
Monitoring conversion trends across the entire test duration rather than only a single aggregated metric.
In some cases, you may see an initial spike in conversion for variant B that levels off after users get accustomed to the change. Investigating the trend by day or week and ensuring the test runs long enough to capture post-novelty behavior is key.
How do you ensure consistent user assignment to control or treatment if users revisit the site?
Randomization usually happens at the user level: a unique identifier (cookie or user account) is assigned to control or treatment, and the user remains in that group on subsequent visits. If users switch devices or clear cookies, they might be misassigned on a future visit, which can dilute the observed effect and introduce bias.
Typical practices:
Server-side assignment linked to user login or a robust user ID that persists across sessions and devices.
Cookie synchronization with fallback methods for consistent user recognition.
Edge cases:
Logged-out users who frequently clear cookies or use privacy modes.
Enterprise environments where multiple users may share the same IP or device.
Losing track of these returning users can lead to contamination across variants. The final analysis might then measure less of the true effect. Therefore, robust user tracking infrastructure is crucial for accurate assignment and measurement.
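One common way to implement such deterministic assignment is to hash a persistent user ID together with an experiment name; the sketch below is illustrative, and the identifiers and experiment name are assumptions rather than a prescribed scheme.

import hashlib

def assign_variant(user_id: str, experiment: str = "checkout_test", variants=("A", "B")) -> str:
    # Deterministic assignment: the same user ID always maps to the same variant,
    # regardless of session, device, or time, as long as the ID itself is stable.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("user_12345"))  # always returns the same variant for this ID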
How would you approach an A/B test when the core metric is not a simple binary conversion?
In many scenarios, metrics are more complex than a binary outcome. Examples include revenue per visitor, time on site, or number of items in a basket. These metrics can be highly skewed (a few users making very large purchases).
Some considerations:
Log transformation for highly skewed continuous metrics (like revenue) if you use parametric tests.
Nonparametric or bootstrap methods that do not assume a normal distribution.
Quantile-based measures if you are specifically interested in medians or certain quantiles (e.g., top 10% of spenders).
With these metrics, you might rely heavily on resampling (bootstrapping) or distribution-free tests (like Mann-Whitney U) to estimate significance without incorrectly assuming normality.
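For instance, here is a minimal sketch comparing a skewed revenue-per-visitor metric with scipy's Mann-Whitney U test; the synthetic revenue arrays are illustrative only.

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Illustrative, heavily skewed revenue-per-visitor data: most visitors spend nothing
revenue_a = rng.exponential(scale=20.0, size=3000) * rng.binomial(1, 0.10, size=3000)
revenue_b = rng.exponential(scale=22.0, size=3000) * rng.binomial(1, 0.11, size=3000)

stat, p_value = mannwhitneyu(revenue_a, revenue_b, alternative="two-sided")
print(f"Mann-Whitney U={stat:.0f}, p={p_value:.4f}")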
How do you validate your testing framework with an A/A test?
In an A/A test, you present the same version (or two identical versions) to users but label the groups as different variants (A and A'). If your assignment and measurement systems are correct, you expect no significant difference in conversion rates.
To conduct an A/A test:
Split traffic into two identical experiences.
Track conversion rates and see if any statistical method flags a difference where none should exist.
Potential pitfalls:
A biased randomization might lead to an artificial difference.
Technical issues in event logging might cause counting inconsistencies between the two groups.
By examining the distribution of your outcome metrics in this scenario, you can detect fundamental flaws in data capture, randomization, or analysis pipelines.
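A minimal simulation sketch of this idea: both groups are drawn from the same true conversion rate, so the share of "significant" z-tests should stay near the nominal 5% if the randomization and analysis pipeline are sound. All parameters below are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def aa_test_false_positive_rate(num_experiments=2000, n=5000, p=0.10, alpha=0.05):
    rejections = 0
    for _ in range(num_experiments):
        conv_a = rng.binomial(n, p)
        conv_b = rng.binomial(n, p)   # same true rate: any "difference" is noise
        p_pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
        z = (conv_b / n - conv_a / n) / se
        if 2 * (1 - norm.cdf(abs(z))) < alpha:
            rejections += 1
    return rejections / num_experiments

print(aa_test_false_positive_rate())  # should be close to 0.05 if the framework is unbiased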
How do you decide whether to rerun a test if results are inconclusive?
Sometimes, the difference in conversion rates is not statistically significant, or the confidence interval includes zero. You might wonder if it’s due to insufficient sample size or a genuinely negligible effect.
Decision factors:
Power analysis: Evaluate whether you achieved the intended sample size and power. If the test was underpowered, it might be reasonable to extend or rerun.
Data quality: Verify there were no data collection errors, misassigned users, or external events interfering with results.
Business needs: Even if results are statistically inconclusive, you might adopt the simpler or cheaper variant if performance is roughly the same.
A possible approach is to gather more data or refine the hypothesis (e.g., redesign a different aspect of the product) before rerunning. Repeatedly testing the same small difference can lead to “p-hacking” if not handled carefully.
How do you incorporate upstream or downstream events influenced by your test?
An A/B test can have indirect consequences. For example, a user might convert at a higher rate, but subsequently request more refunds or contact customer support more frequently. You must examine not only the immediate conversion but also the later user journey.
Practical steps:
Define both a primary metric (immediate conversion) and secondary or downstream metrics (refund rate, churn, long-term revenue).
Compare these metrics for both variants, looking for consistent or conflicting effects.
Pitfalls:
A design might boost short-term conversions at the expense of post-purchase dissatisfaction.
Combining short-term gains with negative long-term impact can mislead decision-making.
Studying the entire user lifecycle and collecting enough longitudinal data ensures you capture the full picture of the test’s impact.