ML Interview Q Series: Robust A/B Testing with Mann-Whitney U for Small, Non-Normal Datasets.
Question
Imagine you have immense data available for experiments in the Uber Rider context, but for Uber Fleet the experimental data volume is relatively small. You run an AB test for Uber Fleet and realize the data distribution is not normal. Which form of analysis would you use in that situation, and how would you decide which variant is superior?
Comprehensive Explanation
When the sample size is small and the underlying data distribution is not normal, a nonparametric approach becomes crucial. Parametric tests like the two-sample t-test rely on assumptions of normality and equal variances, which can lead to misleading results if those conditions are not met. By contrast, nonparametric methods do not require strict distributional assumptions, making them well-suited for smaller datasets where normality is questionable.
A common choice here is the Mann-Whitney U test (also called the Wilcoxon rank-sum test when dealing with two independent samples). This test assesses whether one of the two samples tends to have larger values than the other by considering the rank ordering of the combined datasets, rather than the raw data values themselves. Because it uses ranks, it is more robust to skewed distributions and outliers.
For two independent samples, the test statistic for the first group is

U_{1} = R_{1} - n_{1}(n_{1} + 1)/2

where R_{1} is the sum of ranks assigned to the data points in sample 1 after the values from both samples are combined and ranked, and n_{1} is the sample size of the first group. The test statistic U is used to determine the probability (p-value) that any difference between the two samples could have arisen from random chance under the null hypothesis, which states there is no difference between the two distributions.
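As a sanity check, the U statistic can be computed by hand from the pooled ranks and compared against SciPy. The sketch below uses small hypothetical samples; note that, depending on the SciPy version, the reported statistic may correspond to U_{1} or to U_{2} = n_{1} n_{2} - U_{1}.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Small hypothetical samples
sample1 = np.array([12, 15, 14, 16, 20])
sample2 = np.array([10, 8, 9, 11, 13])

combined = np.concatenate([sample1, sample2])
ranks = rankdata(combined)              # ranks of the pooled data (ties get average ranks)
R1 = ranks[:len(sample1)].sum()         # sum of ranks belonging to sample 1
n1, n2 = len(sample1), len(sample2)

U1 = R1 - n1 * (n1 + 1) / 2             # the formula above
u_scipy, _ = mannwhitneyu(sample1, sample2, alternative='two-sided')
print("Manual U1:", U1)
print("SciPy statistic:", u_scipy)      # matches U1 in recent SciPy versions; older ones may report U2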
To determine which variant won, one typically examines:
The test statistic (U-statistic)
The associated p-value
Effect size and direction (whether A generally has higher or lower ranks than B)
A p-value below the chosen significance level lets us reject the null hypothesis of no difference. Beyond statistical significance, we also want to look at magnitude (i.e., practical significance), often by examining the median of each group or computing an effect size measure such as the rank-biserial correlation. In AB tests, practical significance is key to deciding whether one version truly outperforms the other in a meaningful way.
Practical steps to implement the Mann-Whitney U test in Python:
import numpy as np
from scipy.stats import mannwhitneyu
# Example data for two variants
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])
u_stat, p_value = mannwhitneyu(variantA, variantB, alternative='two-sided')
print("Mann-Whitney U Statistic:", u_stat)
print("p-value:", p_value)
# Compare median or rank distribution to see which variant has overall higher performance
print("Median A:", np.median(variantA))
print("Median B:", np.median(variantB))
In the above snippet, we compare variant A and variant B. After confirming whether the p-value is below a chosen significance threshold (e.g., 0.05), we also look at medians (or mean ranks) to decide the winning variant. Even if the result is statistically significant, the difference between medians needs to be practically relevant before concluding which variant is better for production deployment.
If the sample is extremely small, other methods such as randomization (permutation) tests or bootstrap confidence intervals can also be employed, because they provide distribution-free inference without the assumptions required by parametric approaches.
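As a rough illustration of the randomization idea, the sketch below runs a permutation test on the difference in medians, reusing the hypothetical variantA and variantB values from the snippet above; the number of permutations and the choice of the median as the test statistic are illustrative.
import numpy as np

# Same hypothetical values as the earlier snippet
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])

rng = np.random.default_rng(0)
observed = np.median(variantA) - np.median(variantB)

pooled = np.concatenate([variantA, variantB])
n_a = len(variantA)
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                             # random reassignment of group labels
    diff = np.median(pooled[:n_a]) - np.median(pooled[n_a:])
    extreme += abs(diff) >= abs(observed)

p_perm = (extreme + 1) / (n_perm + 1)               # add-one correction for small samples
print("Permutation p-value:", p_perm)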
Potential Follow-up Questions
What are the main pitfalls if someone still tries to use a parametric test under non-normal distributions?
Using a standard t-test under strongly non-normal conditions with small sample sizes can lead to inflated Type I or Type II error rates. The t-test relies on the sampling distribution of the mean difference being approximately normal; if the data are heavily skewed or contain outliers and the sample is small, that approximation breaks down and the results may be misleading. The consequence is that you might incorrectly conclude a variant is effective (Type I error) or fail to detect an actual difference (Type II error).
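One way to make this concrete is a small Monte Carlo comparison of how often each test detects a genuine shift when the data are heavy-tailed; the lognormal model, group size, shift, and significance level below are purely illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(7)
n_sims, n, alpha, shift = 5_000, 8, 0.05, 1.0
reject_t = reject_u = 0
for _ in range(n_sims):
    a = rng.lognormal(sigma=1.5, size=n) + shift    # treatment: a true additive shift is present
    b = rng.lognormal(sigma=1.5, size=n)            # control
    reject_t += ttest_ind(a, b)[1] < alpha
    reject_u += mannwhitneyu(a, b, alternative='two-sided')[1] < alpha

# Lower power means more Type II errors; for heavy-tailed data like this,
# the rank-based test typically retains more power than the t-test.
print("t-test power:      ", reject_t / n_sims)
print("Mann-Whitney power:", reject_u / n_sims)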
When is it acceptable to still use a t-test?
A t-test can be acceptable if:
The Central Limit Theorem applies, meaning you have a sufficiently large sample size so that the distribution of the sample mean approximates normality.
You can verify, perhaps through residual plots or Shapiro-Wilk tests (a quick check is sketched below), that deviations from normality are minimal.
The data are not heavily skewed or contain extreme outliers that invalidate the typical assumptions.
With large datasets (like in Uber Rider experiments), mild deviations from normality usually get mitigated by the Central Limit Theorem, making parametric tests more acceptable.
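A quick screening sketch with scipy.stats.shapiro on the hypothetical variant data follows; keep in mind that with very small samples the test has little power, so histograms or Q-Q plots should accompany it.
import numpy as np
from scipy.stats import shapiro

# Same hypothetical values as the earlier snippet
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])

for name, data in [("A", variantA), ("B", variantB)]:
    stat, p = shapiro(data)
    print(f"Variant {name}: W={stat:.3f}, p={p:.3f}")   # small p suggests departure from normality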
How do you decide on an effect size measure for the Mann-Whitney test?
The Mann-Whitney U test checks whether one distribution is stochastically larger than another. An effect size helps quantify the magnitude of the difference. One widely used measure is rank-biserial correlation, which is computed from the Mann-Whitney U statistic and the sample sizes. Another simpler approach is to compare descriptive statistics such as median or interquartile range of the two groups. For decision-making in AB tests, the difference in medians often holds intuitive meaning.
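A minimal sketch of the rank-biserial correlation derived from the U statistic and sample sizes of the earlier snippet; the sign convention depends on which group's U your SciPy version reports, so treat the direction as an assumption to verify.
import numpy as np
from scipy.stats import mannwhitneyu

# Same hypothetical values as the earlier snippet
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])

u_stat, _ = mannwhitneyu(variantA, variantB, alternative='two-sided')
n1, n2 = len(variantA), len(variantB)

# Equals (favorable pairs - unfavorable pairs) / total pairs and ranges from -1 to 1.
# Assumes u_stat is the U statistic for variantA (recent SciPy versions report the
# statistic for the first argument); older versions may flip the sign.
rank_biserial = 2 * u_stat / (n1 * n2) - 1
print("Rank-biserial correlation:", rank_biserial)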
Could we apply a Bayesian approach here?
Yes, a Bayesian AB test can be used to estimate posterior distributions of each variant’s performance metric without requiring normality assumptions. Bayesian methods can provide a probability that one variant is better than another. This approach is beneficial for small sample sizes because it allows incorporating prior distributions, which can guide the inference when data are sparse. However, properly specifying priors requires domain expertise, and some organizations may find frequentist methods more straightforward to communicate.
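As an illustration only: if the metric were a binary conversion rate, a conjugate Beta-Binomial model with uniform Beta(1, 1) priors gives a quick Monte Carlo estimate of the probability that one variant beats the other. The counts below are made up.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: successes out of trials for each variant
success_a, trials_a = 40, 220
success_b, trials_b = 55, 230

# Posterior is Beta(1 + successes, 1 + failures) under a Beta(1, 1) prior
post_a = rng.beta(1 + success_a, 1 + trials_a - success_a, size=100_000)
post_b = rng.beta(1 + success_b, 1 + trials_b - success_b, size=100_000)
print("P(B beats A):", np.mean(post_b > post_a))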
How would you handle multiple metrics or multiple variations in the experiment?
If testing multiple metrics simultaneously, adjustments for multiple comparisons may be needed, such as a Bonferroni correction to control the family-wise error rate or a Benjamini-Hochberg correction to control the false discovery rate. If testing more than two variations, a nonparametric extension like the Kruskal-Wallis test can be used; if a significant omnibus result is observed, you can then perform pairwise comparisons with appropriate corrections. For multi-armed experiments, adaptive methods like multi-armed bandits can be employed to allocate traffic based on observed performance.
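A rough sketch of that workflow, assuming a hypothetical third arm variantC and that statsmodels is available for the Benjamini-Hochberg adjustment:
import numpy as np
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Same hypothetical values as the earlier snippet, plus a made-up third arm
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])
variantC = np.array([11, 14, 13, 17, 19, 16, 15, 12])
groups = {"A": variantA, "B": variantB, "C": variantC}

h_stat, p_omnibus = kruskal(*groups.values())
print("Kruskal-Wallis p-value:", p_omnibus)

if p_omnibus < 0.05:
    pairs = list(combinations(groups, 2))
    raw_p = [mannwhitneyu(groups[a], groups[b], alternative='two-sided')[1] for a, b in pairs]
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='fdr_bh')
    for (a, b), p_adj, rej in zip(pairs, adj_p, reject):
        print(f"{a} vs {b}: adjusted p={p_adj:.3f}, significant={rej}")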
Are there situations where a bootstrap approach might be better?
Bootstrapping is often preferred when sample sizes are small, or when the metric’s distribution is heavily skewed or complicated. It does not rely on normality assumptions because it uses resampling to build empirical distributions. By repeatedly sampling from the observed dataset, you can estimate confidence intervals for the difference between two variants. This approach can give you robust inference even when parametric tests fail to capture unusual distributional shapes.
How would you define a winning variant if the difference is statistically significant but operationally small?
Statistical significance alone does not always guarantee practical relevance. The difference might be too slight to justify rolling out a new variant, given potential costs or risks. One would typically set a minimum detectable effect in the design phase of the experiment, meaning you only consider a variant the winner if its improvement meets or surpasses a threshold (for instance, a certain percentage lift in key metrics). This ensures the difference is both statistically and practically significant for the business context.
Below are additional follow-up questions
How do you handle missing data or incomplete records in a small-scale A/B test?
In a small dataset, every data point matters more, so even a few missing entries can skew results. One approach is to perform simple imputation strategies, like filling missing values with the median or mean, but this can dilute natural variability. More advanced methods such as multiple imputation can capture the uncertainty around missingness by creating multiple plausible datasets and pooling results. However, any imputation strategy can introduce biases if the data are not missing at random. For example, if users who drop out are systematically different (perhaps users on older devices or slower networks), your analysis can become distorted. Whenever possible, it is best to minimize missingness upfront with careful data-collection protocols (e.g., ensuring stable connectivity or verifying that a user event was indeed logged).
A subtle pitfall arises if the imputed values for one group systematically shift the distribution, making nonparametric tests appear more or less significant. For small samples, always double-check final distributions and consider sensitivity analyses (e.g., worst-case or best-case assumptions for the missing data) to see how robust the test is.
What if your performance metric changes over time because of seasonality or user behavior shifts?
Time-dependent factors like seasonality, holidays, or even software releases can dramatically alter user behavior. For instance, a new rideshare feature might appear more effective simply because it was tested during a major festival period when ridership is high, not because of any inherent superiority of the product change. In small-scale tests, this effect can be amplified because you have fewer data points. One way to address this is to collect data over multiple time blocks or cycles, so each variant is exposed to various conditions. If the test cannot be run simultaneously (e.g., a sequential rollout), consider randomizing time blocks carefully or employing techniques like a blocked experimental design to account for known seasonal variations.
A real-world edge case is when user behavior systematically changes within the test window, perhaps due to media coverage or competitor promotions. To minimize confounding, you might incorporate a time-series approach that accounts for temporal trends, or at least ensure that each experimental variant encounters these conditions equally by randomly assigning time blocks in a balanced manner if possible.
Could you ever combine nonparametric approaches with sequential testing in a low-data setting?
Yes, but it requires careful planning. Sequential testing, which formalizes "peeking" at the data as it accrues, lets you stop the experiment early if a clear difference emerges. However, sequential methods have complex error-rate considerations: each unscheduled look at the data inflates the probability of a false positive unless it is accounted for. Most established sequential testing frameworks (such as Pocock or O'Brien-Fleming spending functions) assume something akin to normality, so using a Mann-Whitney U or other rank-based test requires specialized or custom sequential boundaries. With small data, it is easy to falsely conclude significance or to continue the test longer than necessary if the effect size estimate is unstable. A robust alternative is a Bayesian sequential approach, but then you need to be comfortable specifying priors and interpreting posterior probabilities.
A hidden pitfall is that if you keep checking the data at frequent intervals, the random fluctuations in a small dataset can be misinterpreted as meaningful. This underscores the need for strong methodological discipline: either predefine the number of checks and how to correct for multiple looks, or adopt a well-vetted Bayesian approach with threshold rules.
How would you address outliers that might disproportionately affect nonparametric tests?
Nonparametric tests like the Mann-Whitney U are more robust than parametric tests but not entirely immune to extreme outliers. In a small dataset, a single extreme value can still heavily influence rank positions. Sometimes, these outliers are genuine high-usage customers or anomalies like test accounts. A typical approach is to analyze them separately (e.g., a separate analysis for high-volume or power users) or cap their values using winsorization. Yet, this should not be done arbitrarily. For instance, capping a valid outlier could remove an important signal if your goal is to understand heavy-user behavior.
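As a small illustration of winsorization, the sketch below caps only the single largest value of a made-up metric vector; the 12.5% upper limit is an arbitrary choice for eight data points and should be justified by the business context.
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical metric values with one extreme outlier (e.g., a test account)
values = np.array([10, 12, 11, 14, 13, 15, 12, 250])
capped = np.asarray(winsorize(values, limits=[0.0, 0.125]))   # cap the top 12.5% (one of eight points)
print("Original:  ", values)
print("Winsorized:", capped)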
An edge case occurs when both groups might have outliers but only one set’s outliers are systematically high. That pattern can yield large rank discrepancies. Always investigate the context around these extreme values. If they represent a consistent subset of users who are fundamentally different from the rest (e.g., enterprise clients in an otherwise retail user base), it might be more appropriate to analyze them in a separate bucket or run a more targeted experiment for that subgroup.
How do you decide if your small-sample test is worth continuing versus pooling data from other sources?
In practice, sometimes you realize that the test is underpowered and your data cannot yield statistically reliable insights. At that point, it might be more beneficial to pool data with a related experiment or gather more historical data if the metric is comparable and the circumstances are similar (e.g., the same user segment, same time window conditions). However, combining data sets requires caution about whether the older data truly represent the current environment. Pooling can also mask changes that have occurred between time periods (e.g., shifting user demographics).
A subtle drawback arises if your earlier data are systematically different from the new data because of new policies or user acquisition strategies. That mismatch may compromise the validity of your findings. So you always have to confirm that the data you are combining align in definitions, user behavior, and time frame to avoid introducing confounders.
What if you suspect that the control and treatment groups might differ in user types, despite randomization?
Random assignment is usually the best safeguard against systematic differences, but in small-sample scenarios, even randomization can fail to fully balance groups. If you suspect an imbalance—perhaps the control group contains more new users, while the treatment group has more power users—you can attempt post-stratification. This involves segmenting your population into meaningful strata (e.g., new vs. existing users) and analyzing the treatment effect within each stratum. You then combine the results in a weighted fashion to produce an overall estimate.
A potential pitfall is over-segmentation: with a small dataset, further splitting can lead to minuscule sample sizes within each stratum, undermining the statistical power. You must ensure that each stratum still has enough data to support a meaningful comparison. Another edge case is the existence of unknown lurking variables—factors you have not even considered measuring—that might affect the outcome and are unevenly distributed across variants.
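A minimal sketch of the weighting step described above, using hypothetical "new" versus "existing" strata with made-up values; a real analysis would also carry per-stratum uncertainty estimates.
import numpy as np

strata = {
    "new":      {"control": np.array([8, 9, 10, 11]),   "treatment": np.array([10, 11, 12, 13])},
    "existing": {"control": np.array([15, 16, 18, 20]), "treatment": np.array([16, 17, 19, 22])},
}

total_n = sum(len(g["control"]) + len(g["treatment"]) for g in strata.values())
weighted_effect = 0.0
for name, g in strata.items():
    effect = np.median(g["treatment"]) - np.median(g["control"])   # within-stratum median difference
    weight = (len(g["control"]) + len(g["treatment"])) / total_n   # weight by stratum size
    weighted_effect += weight * effect
    print(f"{name}: median difference = {effect:.2f}")
print("Post-stratified estimate:", weighted_effect)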
How do you interpret results if both the control and treatment groups appear to have overlapping rank distributions?
Unlike a classic t-test where you might compare the difference in means, nonparametric tests compare the ranks. If the distributions overlap considerably but still produce a statistically significant result, it means that one variant systematically ranks higher more often, even if the absolute difference in raw values might be small. The main question is whether that difference is practically relevant. You can look at how frequently one variant outperforms the other on a per-user basis (sometimes called the probability of superiority). Even if these rank-based metrics suggest significance, the real-world payoff might be negligible.
An overlooked subtlety is that in smaller data sets, the difference in ranks might come from a few key data points rather than a broad-based shift in distribution. Always plot both distributions—boxplots or violin plots of the two groups’ data can help reveal how much overlap there is and whether a few data points are driving the conclusion. If the difference is driven by a tiny subset, reevaluate whether that subset is critical to your business objectives or if those data points represent anomalies.
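The probability of superiority mentioned above can be computed directly from all pairwise comparisons (with ties counted as half, it equals U divided by n1*n2); the sketch reuses the hypothetical arrays from the main snippet.
import numpy as np

# Same hypothetical values as the earlier snippet
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])

wins = (variantA[:, None] > variantB[None, :]).sum()
ties = (variantA[:, None] == variantB[None, :]).sum()
prob_superiority = (wins + 0.5 * ties) / (len(variantA) * len(variantB))
print("P(A outranks B):", prob_superiority)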
How do you set up confidence intervals for the median differences in a nonparametric setting?
Confidence intervals around medians can be trickier than around means when distributions are non-normal. One method involves bootstrapping: resampling your data with replacement many times to compute the median difference each time. The percentile bounds of that distribution serve as the confidence interval. This approach is robust, but it can be computationally expensive, and in small datasets, the interval might be quite wide, reflecting high uncertainty.
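A minimal sketch of the percentile bootstrap just described, applied to the difference in medians of the hypothetical variantA and variantB arrays; the number of resamples and the 95% level are illustrative choices.
import numpy as np

# Same hypothetical values as the earlier snippet
variantA = np.array([12, 15, 14, 16, 20, 22, 22, 18])
variantB = np.array([10, 8, 9, 11, 13, 15, 14, 10])

rng = np.random.default_rng(0)
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    resample_a = rng.choice(variantA, size=len(variantA), replace=True)
    resample_b = rng.choice(variantB, size=len(variantB), replace=True)
    diffs[i] = np.median(resample_a) - np.median(resample_b)

lower, upper = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for median difference: [{lower:.2f}, {upper:.2f}]")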
An edge case arises when the distribution has multiple modes (i.e., multiple peaks). The median might not fully capture the central tendency if the data cluster around distinct modes. Confidence intervals could reflect that complexity in ways that are non-intuitive (e.g., the interval might span multiple modes). Always verify that focusing on the median is truly representative of your business or product metric and consider alternative central tendency measures (like the trimmed mean) if your distribution is extremely skewed or multi-modal.
How might business constraints influence decisions even if the statistical result is inconclusive?
In real scenarios, decisions often hinge on timelines, resource allocations, or strategic imperatives. Even if your nonparametric test returns a p-value above the significance threshold, the business might still decide to roll out the new feature if it aligns with broader goals or user feedback. Conversely, you might hold off on implementing a new variant that shows minor improvement but requires significant engineering work or poses higher risk.
A subtle real-world issue here is stakeholder pressure. In some cases, a product manager or executive has a strong preference or a tight deadline, which can override inconclusive statistical evidence. The risk is that you might deploy a feature that is not actually beneficial or miss out on a beneficial change due to corporate or political reasons. Communication of test results—emphasizing both the uncertainty (e.g., wide confidence intervals) and the potential upsides or downsides—is crucial so that business stakeholders can make an informed decision.