ML Interview Q Series: Determining Experiment Duration: Beyond Fixed P-Values Using Power Analysis and Sequential Testing.
2. How can you decide how long to run an experiment? What are some problems with just using a fixed p-value threshold and how do you work around them?
Deciding how long to run an experiment is deeply connected to statistical power, effect size, and the variance of the metric of interest. Stopping criteria are rarely as simple as “run until we reach a fixed p-value threshold.” When relying solely on a fixed p-value approach, certain pitfalls arise that can lead to false conclusions or suboptimal business decisions. Below is a thorough discussion of these nuances and how practitioners commonly mitigate them.
Choosing the Duration of an Experiment
One practical way to figure out how long an experiment should run is to estimate the number of samples required to detect a prespecified effect size with a desired statistical power and significance level. In typical hypothesis testing scenarios, we pick a significance level alpha (often 0.05) and a desired power (often 0.8 or 0.9). The effect size you hope to detect is the smallest difference worth acting on. Once you have a sense of the effect size, the standard deviation of your metric, and your significance and power requirements, you can solve for the required sample size.
n = 2 * ((z_(1 - α/2) + z_(1 - β)) * σ / δ)^2
Here n is the required sample size per group, δ is the minimum detectable difference, σ is the standard deviation of the metric, and z_(1 - α/2) and z_(1 - β) are the corresponding standard normal quantiles.
In real-world scenarios, the standard deviation and effect size might be partially unknown, so pilot data or historical data can help estimate them. Once you know your required samples, you can estimate how long it takes to gather enough observations, considering average traffic or user engagement. If you run an A/B test on a website with one million visits a day, you can quickly gather enough observations. If you have an internal tool with only a few hundred daily active users, you might need more days or even weeks to gather a statistically meaningful sample. Moreover, if your experiment has metrics that vary over time (for instance, behavior might differ on weekdays versus weekends), you should run the test for enough time to capture those potential temporal patterns.
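For intuition, here is a minimal sketch of the closed-form calculation above, assuming an illustrative standard deviation and minimum detectable difference; with real data you would plug in estimates from pilot or historical data:
from scipy.stats import norm

# Illustrative inputs (assumed, not from real data)
alpha = 0.05          # significance level
power = 0.80          # desired power (1 - beta)
sigma = 1.0           # assumed standard deviation of the metric
delta = 0.1           # smallest difference worth acting on

z_alpha = norm.ppf(1 - alpha / 2)   # z_(1 - alpha/2) for a two-sided test
z_beta = norm.ppf(power)            # z_(1 - beta)
n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print("Approximate sample size per group:", round(n_per_group))
Dividing the required total by your expected daily traffic then gives a rough experiment duration.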
Drawbacks of Using a Fixed p-value Threshold
A major issue with deciding upon a fixed p-value threshold at the outset is that real experiments rarely proceed under ideal conditions. People often peek at the p-value during the experiment and might stop early if they see significance. This repeated significance testing inflates the Type I error rate, meaning you are more likely to claim a difference when there is none. Another pitfall is that p-values do not communicate the magnitude of the difference; a small effect in a huge sample might yield a very tiny p-value that is “significant” but operationally unimportant. Conversely, a practically important effect might fail to reach significance if the sample size is too small or the experiment is not run for a sufficient length of time.
Problems like p-hacking or optional stopping (repeatedly checking the p-value and stopping as soon as you see significance) effectively change the distribution of the test statistic. The nominal alpha no longer accurately reflects your true false-positive probability. There can also be time trends or novelty effects in user behavior, which might produce fleeting significance if you look at the wrong time or too frequently. If you rely only on a fixed threshold of 0.05, you may see an artificially inflated number of “statistically significant” findings.
Working Around These Pitfalls
One standard strategy is to define a fixed sample size in advance based on power calculations. You collect data until you hit the predetermined sample size (or duration) and then conduct a single significance test. By specifying how many observations you want ahead of time (and not peeking too often), you preserve the nominal Type I error rate. However, in many real business scenarios, it is impractical to avoid interim checks, because early stopping can be beneficial if an experiment is obviously detrimental to user experience or revenue.
To handle repeated significance checks, approaches such as alpha spending or group sequential methods can be used. Alpha spending plans allow you to partition your overall Type I error budget across multiple checks. You might say, for example, that you want to spend half of your alpha budget on early checks and the remaining half on a final confirmatory test. Group sequential approaches (like Pocock’s method or O’Brien-Fleming) adjust the critical p-value thresholds depending on the number of looks you plan to make. This ensures that your overall experiment-wide false positive rate remains near the nominal alpha. Another option is a Bayesian framework, which lets you continuously update your posterior estimate of the difference between variants without relying solely on p-values. However, Bayesian methods come with their own complexity and require you to interpret posterior distributions and credible intervals.
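As a rough illustration of how a spending plan allocates the error budget, the sketch below evaluates two common Lan-DeMets-style spending functions at five equally spaced looks; the number of looks and information fractions are assumptions for the example, and turning spent alpha into exact per-look critical values requires the joint distribution of the sequential test statistics, which dedicated group-sequential software handles:
import numpy as np
from scipy.stats import norm

alpha = 0.05
looks = np.linspace(0.2, 1.0, 5)   # information fraction at each of 5 planned looks

# O'Brien-Fleming-type spending: almost no alpha is spent at early looks.
obf = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(looks)))

# Pocock-type spending: the budget is spent more evenly across looks.
pocock = alpha * np.log(1 + (np.e - 1) * looks)

for t, a_obf, a_poc in zip(looks, obf, pocock):
    print(f"information {t:.1f}: cumulative alpha OBF {a_obf:.4f}, Pocock {a_poc:.4f}")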
Methods like sequential testing (e.g., the Sequential Probability Ratio Test) or a repeated measure that monitors the running experiment’s metrics in real time can also mitigate p-hacking by offering principled stopping rules. Power-based stopping criteria, or ensuring a minimal clinically important difference, can also help keep the experiment’s duration within practical limits while maintaining statistical rigor.
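To make the sequential idea concrete, here is a minimal sketch of Wald’s SPRT for a conversion rate, with simulated data and illustrative hypotheses (H0: p = 0.10 vs. H1: p = 0.11); a real deployment would add a maximum sample size and more careful metric definitions:
import numpy as np

rng = np.random.default_rng(0)
p0, p1 = 0.10, 0.11          # null and alternative conversion rates (assumed)
alpha, beta = 0.05, 0.20     # target Type I and Type II error rates

upper = np.log((1 - beta) / alpha)   # crossing this boundary favors H1
lower = np.log(beta / (1 - alpha))   # crossing this boundary favors H0
llr = 0.0                            # running log-likelihood ratio

for n, x in enumerate(rng.binomial(1, 0.11, size=200_000), start=1):
    llr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
    if llr >= upper:
        print(f"Stopped at n={n}: evidence favors H1")
        break
    if llr <= lower:
        print(f"Stopped at n={n}: evidence favors H0")
        break
else:
    print("No boundary crossed within the simulated sample")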
Below is a Python snippet that illustrates how you might perform a power calculation for a basic A/B test with a known baseline conversion rate, an expected minimum detectable effect, and a significance level:
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
# Suppose your baseline conversion rate is 0.1 (10%)
# You expect a 1% improvement (from 0.1 to 0.11)
# So the effect size is based on the difference between proportions 0.10 and 0.11
baseline_rate = 0.10
new_rate = 0.11
effect_size = proportion_effectsize(new_rate, baseline_rate)
# We want alpha=0.05 and power=0.8
alpha = 0.05
power = 0.80
analysis = NormalIndPower()
# Solve for the sample size in each group.
# alternative='larger' specifies a one-sided test for an improvement over baseline.
sample_size_per_group = analysis.solve_power(effect_size=effect_size,
                                             alpha=alpha,
                                             power=power,
                                             alternative='larger')
print("Required sample size per group:", math.ceil(sample_size_per_group))
This example shows how, by specifying alpha, power, baseline rate, and the desired improvement, you can derive the sample size per group. You then multiply by two for a two-arm test to see the total needed across both variants. This approach helps you avoid artificially short experiments. It also discourages deciding on the fly by looking at p-values every few hours.
What if We Want to Stop the Experiment Early if the Results Are Clearly Significant or Clearly Bad?
In many real settings, people want a chance to intervene early if a test variant is performing very poorly. For that scenario, group sequential designs can be used. You pre-plan the number of “looks” at the data, each with a stricter threshold for significance. If you see a massive difference at an interim look, you can stop early. If not, you continue until the final planned sample size. This preserves the overall significance level alpha.
A typical approach is the O’Brien-Fleming boundary, which sets a very stringent early boundary for detection. The threshold might be extremely low early on, allowing you to stop the experiment only if it is overwhelmingly obvious that the difference is real. At later looks, the threshold becomes less stringent. This ensures you do not inflate Type I error.
Why Do p-values Become Problematic When We Look Too Often?
If you perform multiple hypothesis tests and check the p-value after every new batch of data, you are inflating the chance of observing at least one random fluctuation that meets the p < 0.05 threshold. The probability of at least one false positive over many repeated checks can become much larger than 0.05. A naive approach that stops as soon as p < 0.05 is reached will systematically bias your experiment and lead to a large fraction of false discoveries.
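A quick simulation (with assumed batch sizes and number of looks) makes the inflation tangible: both arms are generated from the same distribution, yet stopping at the first p < 0.05 declares a “winner” far more than 5% of the time:
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_batches, batch_size = 1000, 10, 200

false_positives = 0
for _ in range(n_experiments):
    # Both arms come from the same distribution, so any "significance" is a false positive.
    a = rng.normal(size=(n_batches, batch_size))
    b = rng.normal(size=(n_batches, batch_size))
    for k in range(1, n_batches + 1):
        _, p = stats.ttest_ind(a[:k].ravel(), b[:k].ravel())
        if p < 0.05:
            false_positives += 1
            break

print("False-positive rate with peeking:", false_positives / n_experiments)
With ten looks per experiment, the realized rate typically lands well above the nominal 5%.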
One practical approach to mitigate this is alpha spending. Suppose you plan a total of, say, five interim looks. You decide how to “spend” your alpha budget of 0.05 across these looks. If you reach significance early, you stop and declare a difference; if not, you continue until you have used your entire alpha budget or you reach your final look. This ensures the overall false-positive probability across all these checks stays around 0.05.
What If the Observed Effect Size Differs from Our Assumptions?
Sample size calculations necessarily rely on estimates of the standard deviation and effect size. If you overestimate the effect size, you might end up with insufficient power for the actual (smaller) effect. If you underestimate the standard deviation, you might also be underpowered. Conversely, if you overestimate the variance or the effect size is bigger than assumed, you may reach significance more quickly. The best practice is to rely on historical data or pilot tests to refine these assumptions. If you discover during the experiment that your assumptions were drastically off, you might need to recast your hypothesis or rerun the power calculation.
How Do We Account for Type II Error and Power?
Many beginners focus primarily on p-values and Type I error (false positives), but failing to detect a real effect (Type II error) is equally critical. That’s why deciding how long to run an experiment (i.e., how many samples you collect) depends so heavily on the effect size you want to detect and the power you require. If you do not want to miss an improvement that would have a sizable business impact, you need to gather enough data. Underpowered experiments lead to inconclusive or misleading results: you risk shipping beneficial features late or discarding good ideas prematurely.
Power calculations before running the experiment help ensure you can detect the effect you care about with the probability (power) you require. You can also do a sensitivity analysis: for example, if the true effect is smaller than your minimum detectable effect, maybe you are okay with missing it, because it is not worth the engineering or user disruption cost. This shapes the minimal clinically or practically meaningful difference you aim for.
Could a Bayesian Approach Solve These Issues?
A Bayesian approach shifts the framing from “Is the p-value < 0.05?” to “What does the posterior distribution of the difference between the two variants look like?” One can continuously update the posterior as data arrives and potentially stop when the posterior probability that one variant is better than the other crosses a certain threshold. This can be more intuitive for business stakeholders, because you can say something like, “We are 95% sure that the new variant is at least 0.5% better than the old one.” However, Bayesian methods require careful choice of priors, credible interval thresholds, and an understanding that these thresholds serve a role similar to alpha in a frequentist test. Moreover, if you continuously look at the posterior, you still need guidelines for when to stop. This might take the form of a region of practical equivalence or a certain posterior probability boundary.
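A minimal sketch of this read-out for two conversion rates, using Beta-Binomial conjugacy and illustrative counts (not from a real experiment):
import numpy as np

rng = np.random.default_rng(1)

# Assumed observed data: conversions / users per arm
conv_a, n_a = 1000, 10_000    # control
conv_b, n_b = 1080, 10_000    # treatment

# Beta(1, 1) priors updated with the observed counts, then sampled
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

lift = post_b - post_a
print("P(treatment beats control):", (lift > 0).mean())
print("95% credible interval for the lift:", np.percentile(lift, [2.5, 97.5]))
A stopping rule might then be, for example, “stop when P(treatment beats control) exceeds 0.95 and the credible interval lies outside the region of practical equivalence.”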
Follow-up Question: How Should We Think About Time-based Variations During the Experiment?
Sometimes user behavior changes over days of the week or from one month to another. You might run into a scenario where the difference is significant one week but not the next. A recommended practice is to run the experiment over complete “blocks” of time that represent cyclical patterns, such as ensuring you have at least one complete weekend cycle for each experimental arm. If your product usage is highly seasonal, you might need to account for that or run the experiment over multiple relevant weeks or months. A more advanced approach is “blocking” or “stratified randomization,” where you randomize within relevant demographic or time blocks to reduce variance and ensure each variant sees a fair share of different times or user segments.
Follow-up Question: Is There a Danger in Using a Running Average or Real-time Visualization to Decide When to Stop?
Teams often watch their experiment metrics in real time. That is not inherently bad, as it is important to ensure you do not harm users or degrade key performance indicators. The danger is drawing premature conclusions. If you see the running average for the new variant is trending upward initially, you might be tempted to declare victory early. To mitigate that, you can use a group sequential design with alpha spending so that your repeated looks are statistically valid. Alternatively, you can maintain real-time monitoring strictly for safety checks, but only do the formal significance test at the predetermined end or at scheduled interim analyses.
Follow-up Question: What is the Difference Between Hypothesis Testing with a p-value vs. Confidence Intervals?
Confidence intervals communicate the range of plausible values for the difference between the control and treatment. If a 95% confidence interval for the difference does not include zero, it corresponds to p < 0.05 in a two-sided test. However, intervals are often more interpretable for business stakeholders, as you can say something like, “We estimate the difference in the conversion rate is between 0.8% and 1.3%.” You can track how that confidence interval evolves over time. But you still need to be cautious about repeated looks, as your intervals can also be biased if you stop the experiment as soon as the interval excludes zero.
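As a small sketch with illustrative counts, a simple Wald interval for the difference in conversion rates looks like this (other interval methods exist and may behave better with small samples or extreme rates):
import numpy as np
from scipy.stats import norm

conv_c, n_c = 1000, 10_000    # control conversions / users (assumed)
conv_t, n_t = 1110, 10_000    # treatment conversions / users (assumed)

p_c, p_t = conv_c / n_c, conv_t / n_t
diff = p_t - p_c
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z = norm.ppf(0.975)           # 95% two-sided coverage

print(f"Estimated lift: {diff:.4f}")
print(f"95% CI: ({diff - z * se:.4f}, {diff + z * se:.4f})")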
Follow-up Question: Could We Use Non-parametric Methods If We Are Unsure of the Distribution?
If your data are not normally distributed or you suspect heavy tails, you can use non-parametric tests such as the Mann-Whitney U test (also called the Wilcoxon rank-sum test) for comparing two independent samples. The same considerations about sample size, effect size, repeated looks, and alpha inflation still apply. Power analysis can be more involved, but modern statistical software packages include procedures for non-parametric power calculations too. You also might transform the data or use robust statistical methods. The fundamental principle remains that you should plan how many observations to collect, how frequently to look at your results, and how to avoid p-hacking.
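A minimal example with simulated heavy-tailed data (the lognormal parameters are illustrative), using scipy:
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
control = rng.lognormal(mean=3.0, sigma=1.0, size=5000)     # e.g., time-on-site
treatment = rng.lognormal(mean=3.05, sigma=1.0, size=5000)

stat, p_value = mannwhitneyu(treatment, control, alternative='two-sided')
print(f"U statistic: {stat:.0f}, p-value: {p_value:.4f}")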
Follow-up Question: How Do We Communicate Results to Stakeholders Who Only Know About p-values?
Translating test outcomes to business stakeholders can be done by focusing on metrics like the estimated lift, confidence intervals, and potential business impact. You can explain that a “significant p-value at 0.05” means that if there were truly no difference, data this extreme or more so would only occur about 5% of the time in repeated experiments. Stakeholders often want to know the likely improvement in revenue, conversion, or user satisfaction, rather than just a p-value. It’s important to mention that peeking early can produce false positives. Explaining alpha spending and the importance of a well-powered test can help them appreciate why you have to wait the full run period or follow a well-structured stopping rule.
Follow-up Question: Could We Use a Multi-armed Bandit Instead?
Multi-armed bandits shift the question from purely offline hypothesis testing to online learning of which variant is better. If you want to adaptively allocate traffic to better-performing variants, a bandit method is ideal. However, bandit methods typically do not provide as direct a measure of p-value or confidence intervals in the classical sense. They aim at maximizing cumulative reward rather than a once-and-done significance statement. If your ultimate goal is to identify the best variation while minimizing regret, bandits can be a good choice. If you need a confirmatory test or want to do standard hypothesis testing, a bandit approach may not be as straightforward, although you can design confidence-based bandit approaches or incorporate Bayesian bandits.
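As an illustration of the adaptive-allocation idea, here is a small Thompson sampling sketch for two Bernoulli arms with assumed true conversion rates; traffic drifts toward the better arm rather than staying in a fixed 50/50 split, which is exactly why a classical fixed-sample test no longer applies cleanly:
import numpy as np

rng = np.random.default_rng(3)
true_rates = np.array([0.10, 0.11])   # assumed, unknown to the algorithm
successes = np.ones(2)                 # Beta(1, 1) prior per arm
failures = np.ones(2)

for _ in range(50_000):
    draws = rng.beta(successes, failures)   # one posterior sample per arm
    arm = int(np.argmax(draws))             # play the arm with the best draw
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

pulls = successes + failures - 2
print("Traffic share per arm:", pulls / pulls.sum())
print("Posterior mean rate per arm:", successes / (successes + failures))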
In conclusion, you should generally pre-plan the sample size or experiment duration based on power calculations and effect sizes. You need to be mindful of how many times you peek at the data and adopt alpha spending or group sequential methods if you must stop early. Relying on a single fixed p-value threshold without pre-specifications leads to inflated Type I error and can mislead decision-making. A well-structured design that either uses classical hypothesis testing with careful alpha control or a Bayesian framework with a clear stopping rule is the best way to ensure accurate conclusions from your experiments.
Below are additional follow-up questions
How do you handle experiments where multiple metrics need to be tested simultaneously?
When running an A/B test (or any controlled experiment), it’s common to track more than one metric. For instance, you might look at conversion rate, average order value, and user engagement time. Each of these metrics may serve a different business goal, and the effect of a new feature could differ across them.
One pitfall is that if you run statistical tests on multiple metrics independently using the same alpha level (for example, 0.05), you inflate the probability of at least one false positive. This is known as the multiple comparisons problem. Even if each test individually has a 5% chance of a Type I error, the chance that at least one returns a false positive is greater than 5% when you consider multiple tests together.
To work around this, one strategy is to apply a correction procedure such as the Bonferroni correction or a more powerful method like the Holm-Bonferroni or Benjamini-Hochberg procedure. These methods adjust your p-value threshold or your confidence intervals so that the overall false positive rate remains close to the nominal alpha. For example, if you have three metrics and you use a Bonferroni correction with alpha = 0.05, you would test each metric at alpha = 0.05 / 3 = 0.0167. This ensures that the family-wide Type I error rate remains near 0.05 across the three tests.
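A small sketch of these corrections applied to three illustrative p-values, using statsmodels:
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.030, 0.045]   # assumed raw p-values, one per metric

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, "-> adjusted p-values:", p_adjusted.round(4), "reject:", reject)
Bonferroni is the most conservative; Holm controls the same family-wise error rate with more power, and Benjamini-Hochberg controls the false discovery rate instead, typically rejecting more hypotheses.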
Another approach is to designate a single primary metric, which is the primary outcome you care about most. You apply the standard alpha threshold (e.g., 0.05) for that primary metric. Other metrics can be considered secondary, and you might apply more exploratory or descriptive thresholds or correct for multiple comparisons to get a sense of how the feature performs along other dimensions.
Real-world pitfalls and edge cases:
If you see conflicting results (for example, the experiment improves one metric significantly but another metric significantly worsens), there can be confusion on what to do next. In such cases, you need to decide which metric is most critical or whether you can accept a negative trade-off in a secondary metric.
The correlation structure among the metrics can complicate your interpretation. If the metrics are highly correlated (e.g., two ways of measuring conversion), the classical Bonferroni correction might be too conservative. More sophisticated corrections or multivariate testing approaches might be warranted.
Sometimes your secondary metrics will serve as guardrail metrics that should not worsen beyond an acceptable threshold. For example, you might allow a new feature to degrade performance by up to 2% on some secondary metric. If it degrades beyond that, you call the experiment unsuccessful even if your primary metric is positive.
In practice, the best approach is to:
Decide on a single primary metric.
Carefully consider which metrics are purely exploratory or “nice to have” and which are critical guardrails.
Use corrections for multiple comparisons if you will make decisions from multiple p-values.
How do you design experiments when your metric of interest is extremely volatile?
Some metrics, such as revenue or time-on-site, can have a heavy-tailed distribution, where a small fraction of users account for a large share of the total. This can make the variance of your metric large and the distribution highly skewed, which complicates classical power calculations and standard parametric tests.
Potential pitfalls:
Outliers can dominate your analysis. You might see a few users who purchase in huge quantities or spend hours on the site. This can inflate standard deviations and might require an impractically large sample to detect the effect you want.
The naive use of the standard two-sample t-test might be inappropriate if the normality assumptions are severely violated. Even with the central limit theorem eventually applying, you may need an extremely large sample size before the distribution becomes approximately normal.
Ways to deal with this:
Consider a non-parametric test like the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which relies less on assumptions of normality. However, interpreting the results can be less straightforward, and the effect size measure is not directly about the mean difference.
Winsorize or trim your data. You might cap the top 1% of extreme values to reduce the variance; a short sketch after this list illustrates the idea. This must be done carefully, as it can distort the interpretation if high-value users are truly part of your target population.
Use metrics that reduce variance, such as log-transforming revenue. Log transformations can help handle multiplicative effects and reduce the influence of extremely large values.
Employ robust statistical methods or specialized parametric distributions (e.g., heavy-tailed distributions like the Pareto or lognormal). If you use a Bayesian approach, choose priors that better capture heavy-tailed data.
Consider median-based comparisons if you care about typical users rather than outliers. However, a median test might miss improvements that occur mostly in the high-spending minority.
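The sketch below, using simulated lognormal revenue, shows how much winsorizing or a log transform can shrink the variance that drives your required sample size:
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(11)
revenue = rng.lognormal(mean=2.0, sigma=1.5, size=50_000)   # simulated heavy-tailed revenue

capped = winsorize(revenue, limits=(0, 0.01))   # cap the top 1% of values
logged = np.log1p(revenue)                      # log(1 + x) also handles zero revenue

print(f"Raw std:        {revenue.std():.2f}")
print(f"Winsorized std: {capped.std():.2f}")
print(f"Log-scale std:  {logged.std():.2f}")
Remember that after a log transform the comparison is effectively about geometric means, so interpret and communicate the lift accordingly.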
Edge cases to watch for:
If your user base is highly segmented (for instance, enterprise users vs. individual consumers), a single heavy-spending enterprise user in the test group could skew results. Stratifying or segmenting could help you analyze subsets of users separately.
You may find yourself collecting data for a very long time if your effect is subtle and overshadowed by outlier noise. Re-check your experimental design to ensure your effect size is realistically detectable given the variance.
How do you handle experiments where the control or baseline changes over time?
In some cases, the “control” variant is not static but is itself evolving. This can happen if your baseline system is frequently updated for other product reasons, or if user behavior drifts over time. Traditional A/B test assumptions presume that the control is stable during the experiment.
Potential pitfalls:
If the control changes in the middle of the experiment, it’s effectively a new experiment. Your original baseline assumptions (for example, around baseline metrics or variance) may no longer hold.
The results you get might partially conflate improvements made to the control variant with changes in the test variant, making it difficult to interpret the net effect of the new feature you are testing.
Strategies to mitigate:
Freeze the control environment for the duration of the experiment if possible. This ensures the only significant difference between the test and control is the feature being tested.
If you must update the control side, treat it as a new experiment period. You can break your experiment into phases (Control vs. Test in Phase 1, then Control vs. Test in Phase 2 after the update) and analyze each phase separately.
Keep thorough logs of all changes deployed to the control environment. If an emergency fix is unavoidable, record it meticulously to ensure you can interpret the data properly.
Edge cases to consider:
Unexpected external events like major market changes or holidays that cause a shift in user behavior. This can appear like a “change in the control,” although it is external. Blocking by time or using a difference-in-differences approach can help separate a global time effect from the effect of your specific feature.
If you are running continuous deployment where small updates happen daily, consider a more advanced approach such as a multi-armed bandit or a short, repeated testing cycle. Alternatively, do not run multi-week experiments that are overshadowed by many rolling changes to the system.
How do you analyze the results if user assignment was not truly random?
Randomization is the bedrock of controlled experiments. However, in real-world settings, your assignment might inadvertently be non-random due to technical glitches, user self-selection, or certain constraints in your data pipeline. If your assignment is not truly random, the standard assumptions for hypothesis testing do not hold and your p-values can be misleading.
Pitfalls:
If certain segments of users (e.g., advanced users or specific geographic regions) disproportionately end up in one treatment group, the observed difference might reflect these segment differences rather than the feature effect.
If partial rollouts or gating features cause new sign-ups to land in the treatment more often than returning users, your test is confounded by user tenure differences.
Approaches to mitigate:
Conduct thorough checks to confirm the randomization process. For instance, compare user demographics or pre-experiment behavior in the control vs. treatment group. If they differ significantly, your randomization might be broken.
If you discover biases, attempt to re-weight or re-match data post-hoc. One example is propensity score modeling, in which you estimate the probability of being assigned to treatment based on user characteristics and then match or re-weight users across groups to reduce the imbalance (a small re-weighting sketch follows this list). This is not as good as truly random assignment, but it can partially correct for known biases.
Segment your analysis. If you find that the test group has proportionally more new users, analyze new vs. returning users separately to isolate the effect.
Fix the assignment process and re-run the experiment if that is a viable option. This might be necessary if the bias is large and cannot be corrected after the fact.
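Here is a small inverse-propensity-weighting sketch on simulated data, where an assumed “tenure” variable influences both assignment and outcome; it is meant only to show the mechanics of re-weighting, not to replace fixing the randomization:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 20_000
tenure = rng.exponential(scale=1.0, size=n)                 # simulated covariate

# Biased assignment: users with shorter tenure are more likely to be treated.
p_treat = 1 / (1 + np.exp(tenure - 1.0))
treated = rng.random(n) < p_treat

# Outcome depends on tenure plus a true +0.02 treatment effect.
outcome = (rng.random(n) < 0.10 + 0.03 * tenure + 0.02 * treated).astype(float)

naive = outcome[treated].mean() - outcome[~treated].mean()

# Estimate the propensity from observed covariates, then re-weight each group.
model = LogisticRegression().fit(tenure.reshape(-1, 1), treated)
e = model.predict_proba(tenure.reshape(-1, 1))[:, 1]
weights = np.where(treated, 1 / e, 1 / (1 - e))
ipw = (np.average(outcome[treated], weights=weights[treated])
       - np.average(outcome[~treated], weights=weights[~treated]))

print(f"Naive estimate: {naive:.4f}  IPW estimate: {ipw:.4f}  (true effect: 0.02)")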
Subtle real-world issues:
Your user ID generation might be flawed, causing collisions or skew. Some systems might inadvertently bucket certain user IDs into the same variant.
If a feature is discoverable only by power users, those users might gravitate to it even if you “randomly” assign them. This effectively introduces self-selection. A solution is to forcibly push the new variant to users in the treatment group in a way that does not rely on them opting in.
What if the experiment’s impact is delayed?
In many cases, you won’t see an immediate effect of your change. Perhaps you release a new feature that improves user retention over a span of weeks or months, or you change the onboarding flow that only affects new sign-ups. If your primary metric is something that manifests slowly (like long-term retention or lifetime value), a typical short A/B test might not capture the full effect.
Pitfalls:
If you only measure immediate conversion events, you might underestimate the benefit (or harm) of a feature whose main impact surfaces later.
If you decide to run the experiment long enough to see the effect, you might run into user and environment changes that confound your results (control environment changes, seasonality, competitor actions).
Ways to approach it:
Collect data over the entire user lifecycle relevant to the change. For instance, if the feature primarily affects new users, you might track each new cohort for a sufficient period.
Use delayed feedback models that attribute future events back to the original assignment at sign-up.
If your key metric is something like “retention at day 30,” you must ensure your experiment collects enough new users and then waits at least 30 days (possibly more) to measure that retention outcome.
Sometimes you can use leading indicators (short-term proxies) that correlate strongly with the eventual long-term outcome. For instance, if you know that 80% of users who come back for a second session become weekly active users, then second-session rate might be a leading metric you can measure sooner.
Edge cases:
If you run the test for several months, there is a higher risk of changes in the product or external environment.
You might have partial data for some cohorts (e.g., a user who joined near the end of the experiment window). You need to decide whether to wait until they reach day 30 or drop them from the analysis. This can create left-truncation or right-censoring issues in your data.
How do you interpret experiments when there is a learning or novelty effect?
When a new feature is introduced, users might initially be curious and engage with it more than they will in the long run. Conversely, some features require a learning curve before users can reap their benefits. These novelty or learning effects can distort your measurements.
Pitfalls:
You might see a strong positive spike in the early days of the experiment that diminishes over time. If you run a short test, you might incorrectly conclude that the feature is a big success, only to see metrics revert to baseline later.
A new interface might initially annoy users, leading to lower engagement, but over a longer period, they might adapt and find it beneficial.
Mitigations:
Monitor your metric’s trajectory over time to see if it is trending up or down. Look at day-by-day or week-by-week breakdowns rather than just an aggregate.
If the effect is purely novelty-driven, you might see an initial spike that flattens. Consider extending the experiment until the metric stabilizes.
You can compare new vs. returning users to see if the feature’s effect depends on prior user familiarity. A feature that benefits novices might have a more lasting effect among brand-new users.
If you suspect a learning curve, you might run user training or tooltips to help them adopt the new feature. The eventual success might hinge on how well you guide them.
Edge cases:
Seasonal or marketing events might coincide with your launch, artificially boosting usage overall. This can mask or confound novelty effects.
Certain user segments (advanced vs. casual) might adopt the feature differently. Understanding that segmentation can help you see if the novelty effect is universal or restricted to a certain segment.
How do you handle continuity and rollback after concluding the experiment?
Once you have run an experiment for the planned duration, you typically decide whether to ship the new feature to all users or revert entirely. However, in some scenarios, the experiment might show a mixed outcome, or the difference might be significant but the effect size is smaller than you hoped.
Pitfalls:
If you gradually roll out the feature to all users after seeing a promising result, it’s no longer a strict A/B environment. Any subsequent changes might be confounded by the newly introduced feature.
If the experiment is inconclusive, you might be tempted to run it longer or repeatedly. But repeated experiments without changes can degrade user trust or cause repeated user churn.
Strategies:
If the result is clearly positive and meets your success criteria, adopt the new feature with a planned rollout schedule. Monitor key metrics during the rollout to the full user base to ensure no unanticipated side effects occur.
If it is marginally positive but you see potential, you might do an internal or partial rollout for advanced users or employees, gathering more feedback qualitatively.
If the test indicates a negative or neutral effect, strongly consider rolling back. However, if you suspect the test was underpowered or external factors skewed the results, you might plan a new experiment with improved design or a different timeframe.
Have a plan to preserve or archive the data. The experiment logs can serve future analysis, especially if you revisit the feature idea later.
Edge cases:
The feature might be beneficial for certain subsets of users but harmful for others. You might do a “targeted rollout” or personalization approach, rolling it out only to the users who see a net positive. This requires careful segmentation analysis to ensure you do not inadvertently exclude large groups or hamper fairness.
If you discovered that the test variant had certain side effects (e.g., more user support tickets), weigh that operational cost against the benefit on your main metrics.
How do you handle experiments in systems where metrics are updated in near real-time (e.g., streaming data systems)?
In fast-moving data environments (e.g., certain ad-tech platforms or real-time recommender systems), you might collect metrics continuously and potentially react in real time to signals from the experiment. Traditional “collect all data, analyze at the end” may be less relevant when decisions happen moment by moment.
Potential pitfalls:
Real-time adaptation can cause non-stationarity in the user population. For instance, if the system starts favoring the better variant more aggressively, you lose the strict randomization that underpins your standard test.
The environment might shift during the test, or competing product changes might happen simultaneously.
Best practices:
If you need real-time adaptation, consider multi-armed bandit algorithms or Bayesian adaptive experiments. These are designed to allocate more traffic to better-performing arms over time while still maintaining some exploration.
If you want a final “statistical test” conclusion, you might keep a small portion of traffic randomized to each arm consistently for a fixed period to preserve a clean comparison group. Meanwhile, the rest of the traffic is adaptively allocated in real time.
Keep a stable logging mechanism and identify whether data is missing or delayed in the streaming pipeline. In some real-time systems, partial data might be processed out of order or not at all, which complicates measurement.
Subtle real-world issues:
In streaming ad systems, a single user might generate events many times a day, so your user-level randomization must be consistently enforced across those repeated interactions to maintain the validity of the test.
If the real-time system modifies bids or recommendations based on the performance observed, you effectively have a feedback loop. This can lead to scenario drift, where each group sees different segments of traffic over time due to the system’s adaptation.
How do you interpret results when some participants encounter technical errors that prevent them from experiencing the test properly?
In large-scale online experiments, there can be technology failures where some fraction of the “treatment” group never actually sees the intended treatment (perhaps due to front-end JavaScript errors or partial outages). That means your “treatment” group is effectively a mixture of participants who got the new experience and participants who remained on something akin to the control.
Risks:
Your measured effect size might be underestimated because not everyone is actually treated.
If the errors are not random and systematically affect certain user types (for example, certain browsers, network speeds, or countries), this introduces bias.
Mitigations:
Conduct “intent-to-treat” analysis, which compares everyone assigned to treatment vs. everyone assigned to control, regardless of whether they actually received the treatment. This preserves randomization but might dilute the measured effect if many treatment assignments failed.
Also conduct a “treatment-on-the-treated” analysis by filtering out users who did not receive the feature, but be aware that this filtering might break the randomization assumption if there is a systematic reason they did not receive it.
Track the fraction of users in the treatment group who actually experience the new feature. If that fraction is too low, address the root causes of these technical failures first before concluding the experiment’s effect is minimal.
Edge cases:
A small, random glitch might be acceptable if it impacts only a tiny percentage of users similarly across control and treatment. But if it disproportionately affects treatment, your results are confounded.
If the glitch is widespread, it may be better to fix the error, re-randomize users, and restart the experiment to ensure a valid measurement.
How do you set the correct alpha level and confidence interval coverage in extremely large-scale experiments?
In massive-scale experiments, even tiny differences can become statistically significant if you have enough data. Consequently, using a standard alpha of 0.05 might result in declaring many small but operationally insignificant differences as “significant.”
Challenges:
You risk shipping changes that are “statistically significant” but have a negligible business impact. Over time, this can clutter your product with minor changes that add complexity without real value.
Conversely, you might repeatedly declare significance on minuscule effects, leading to a high false discovery rate overall if you are running many parallel experiments daily.
Strategies:
Lower your alpha threshold or consider the practical significance. For example, you might say you need at least a 0.1% absolute lift that is significant at alpha = 0.01 to be actionable.
Use confidence intervals and effect size estimates to check whether the observed difference is truly meaningful. If your 95% confidence interval is (0.05%, 0.08%), maybe it is statistically significant, but the effect might be too small to justify engineering resources.
Keep track of how many experiments you run per unit time. If you run dozens or hundreds of experiments in parallel, you need to control the overall false discovery rate across them. A method like the Benjamini-Hochberg correction can help manage many p-values simultaneously.
Edge cases:
A small effect might accumulate large business value if you have a massive user base. For example, a 0.1% improvement in a multi-billion-dollar operation is still substantial. So even seemingly tiny lifts can matter if they impact a huge scale.
If you are uncertain about the minimal meaningful difference, consult with stakeholders on cost-benefit analyses. Sometimes it is worth acting on a tiny improvement if the cost is low and the user experience remains clean.