ML Interview Q Series: Average Wait Time for First User Selection: A Geometric Distribution Approach.
Imagine there are one million total users, and each day we randomly pick 1000 of them to run an experiment. A user might be chosen multiple times across different days. How long, on average, does a person wait before they are chosen for the first time, and what is the probability of being selected on the very first day? Is that probability closer to zero or to one?
Comprehensive Explanation
Each day, 1000 users are chosen out of 1,000,000. The probability for any specific user to be picked on a given day is 1000/1,000,000 = 0.001. If we let X be the random variable denoting the number of days until a particular user is selected for the first time, then X follows a geometric distribution with parameter p = 0.001.
The expected waiting time of a geometric random variable with success probability p is E(X) = 1/p.
Here, p is 0.001, so E(X) = 1/0.001 = 1000 days. In other words, on average, it takes about 1000 days for a user to get selected for the first time.
The likelihood that someone is chosen on the first day is p = 0.001. Since that value is 0.001, it is much closer to 0 than to 1.
Detailed Logic
When a user’s chance of selection each day is p, the waiting time until first selection is often modeled by the geometric distribution. This distribution captures scenarios where each trial (in our case, each day) offers an independent probability p of “success” (being selected). The expectation of a geometric distribution indicates the average number of trials needed to see one success.
For the probability of selection on the first day, we simply note that 1000 of the 1,000,000 users are chosen each day, independently of previous days, so the probability is 1000/1,000,000 = 0.001. Because 0.001 is quite small, most users will not be selected on day one, so the probability is clearly closer to 0 than to 1.
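Before running the simulation below, it is worth confirming the closed form numerically. This minimal sketch, using only the standard library, sums the geometric probability mass P(X = k) = (1 - p)^(k - 1) * p over a long truncated horizon and checks that the mean lands near 1/p:
import math

p = 0.001
horizon = 50_000   # truncate the infinite sum where the remaining tail is negligible

pmf = [(1 - p) ** (k - 1) * p for k in range(1, horizon + 1)]
total_probability = sum(pmf)                                   # should be very close to 1
expected_wait = sum(k * prob for k, prob in enumerate(pmf, start=1))
print(total_probability)   # ~1.0
print(expected_wait)       # ~1000, matching 1/p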
Example Simulation in Python
import random

def simulate_selection(num_users=1_000_000, picks_per_day=1000, simulations=10000):
    """Estimate the average number of days until one particular user is first selected."""
    p = picks_per_day / num_users   # daily selection probability for any single user
    waiting_times = []
    for _ in range(simulations):
        days = 0
        selected = False
        while not selected:
            days += 1
            # Probability that the user is chosen on this day
            if random.random() < p:
                selected = True
        waiting_times.append(days)
    return sum(waiting_times) / len(waiting_times)

average_wait = simulate_selection()
print("Approximate average waiting time:", average_wait)
This code uses a simple loop where each iteration checks if the user is picked that day with probability picks_per_day/num_users. The simulation tallies the number of days until the user is chosen. After many simulations, the average number of days approximates 1/p = 1000.
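As a faster cross-check (not part of the original snippet), and assuming NumPy is available, NumPy's Generator.geometric draws waiting times from the same distribution directly:
import numpy as np

# 100,000 geometric waiting times with daily success probability p = 0.001
rng = np.random.default_rng(seed=0)
samples = rng.geometric(p=0.001, size=100_000)
print(samples.mean())   # should land close to 1/p = 1000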
Potential Pitfalls or Edge Cases
Sampling with or without replacement could alter the probability p on a given day. The question statement implies sampling is effectively with replacement across days, so each day’s probability for a single user remains constant at 0.001.
If a product team tried to avoid re-testing the same user within a window, that would mean a changing probability across days. In that scenario, the waiting time distribution would no longer be strictly geometric. Such modifications would alter expected waiting time calculations and the probability of being chosen on any particular day.
Large-scale experiments with a big user pool often require checks that a randomization scheme remains valid over time. If the daily picks are not truly random, it can shift the probability distribution.
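To make the replacement point from the first pitfall concrete, here is a sketch that draws distinct users each day (no repeats within a day, repeats allowed across days) and follows one fixed, arbitrary user. The population is scaled down to 100,000 users and 100 picks per day purely to keep the loop fast; the ratio, and therefore p = 0.001, is unchanged, so the average wait should still hover around 1000 days:
import random

def simulate_exact_daily_draws(num_users=100_000, picks_per_day=100, trials=200):
    target_user = 0   # arbitrary user we follow
    waits = []
    for _ in range(trials):
        day = 0
        while True:
            day += 1
            # Draw today's picks without replacement within the day;
            # draws on different days are independent of each other.
            todays_picks = set(random.sample(range(num_users), picks_per_day))
            if target_user in todays_picks:
                waits.append(day)
                break
    return sum(waits) / len(waits)

print(simulate_exact_daily_draws())   # roughly 1000 on average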
Follow-Up Questions
How does variance factor into this waiting time analysis?
The variance of the geometric distribution with parameter p = 0.001 is (1 - p) / p². With p = 0.001, that is (0.999) / (0.001²) = 999,000. This means the waiting times can vary widely, and some users might get selected relatively quickly, while others might wait much longer than the average.
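A quick numeric check, using only the standard library, of how spread out these waits are; the tail probability shows that waiting more than twice the average is far from rare:
import math

p = 0.001
variance = (1 - p) / p ** 2
std_dev = math.sqrt(variance)

print(variance)            # 999,000
print(std_dev)             # about 999.5 days
print((1 - p) ** 2000)     # P(wait exceeds 2000 days) ≈ 0.135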
What if we wanted to ensure each user is tested exactly once?
If every user were to be tested exactly once without any repeats, the problem would transform into a different scenario, sometimes loosely compared to a coupon collector–style problem, even though only 1000 users are still exposed per day. The probability that a user is selected on a particular day would no longer remain constant, and the expected waiting time would depend on how the experiment orchestrates selection without repetition. You would have to adjust for the fact that once a user has been selected, they are effectively removed from the pool for future draws, as sketched below.
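As a minimal illustration of that exactly-once setup, assume the simplest schedule: each day 1000 not-yet-tested users are drawn uniformly from the remaining pool, so the whole population is covered in 1,000,000 / 1000 = 1000 days and a given user's test day is uniform over those 1000 days. The expected wait then drops to about 500.5 days rather than 1000:
num_users = 1_000_000
picks_per_day = 1000
num_days = num_users // picks_per_day   # 1000 days to cover everyone exactly once

# With a uniform random schedule, a user's test day is uniform over {1, ..., num_days}
expected_wait = (num_days + 1) / 2
print(expected_wait)   # 500.5 days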
Is there any practical consideration about testing fatigue if users can be chosen multiple times?
In real systems, picking a user multiple times can cause annoyance or bias in feedback. In practice, many experiment frameworks implement logic to avoid repeated selections within short time windows. This means the actual daily probability for each user would change dynamically, requiring a more complex analysis to determine the true waiting time.
Below are additional follow-up questions
How does the random sampling strategy affect the interpretation of experimental results?
One of the subtle issues with sampling users randomly each day is that certain groups might be overrepresented or underrepresented by chance, especially if the sample size is relatively small compared to the population. In the scenario of 1000 picks out of one million, the overall fraction of the population tested each day is small (0.1%). However, purely random selection is still subject to statistical fluctuation.
Edge cases might arise if there is natural clustering in the user base (for example, different time-zone distributions, different levels of user engagement, or multiple user attributes that might influence the outcome). If the distribution of randomly picked users ends up leaning too heavily toward one subgroup by chance, it can skew the test results.
In a real-world setting, teams often use stratified sampling or weighted sampling to ensure sufficient representation of different segments. This helps reduce the probability that random chance will produce biased experimental groups. If random sampling is poorly executed or not truly random, significant biases can creep in (e.g., selection restricted by device types or partial coverage due to network constraints). Such biases could alter the test's validity.
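A minimal sketch of proportional (stratified) allocation, with hypothetical segment sizes standing in for real user data: each day's 1000 picks are split across segments in proportion to their share of the population, and users are then drawn at random within each segment:
import random

# Hypothetical segment sizes; in practice these come from the user database
segments = {"mobile": 600_000, "desktop": 300_000, "tablet": 100_000}
picks_per_day = 1000
total_users = sum(segments.values())

daily_picks = {}
for name, size in segments.items():
    quota = round(picks_per_day * size / total_users)       # proportional allocation
    daily_picks[name] = random.sample(range(size), quota)   # ids are illustrative

print({name: len(ids) for name, ids in daily_picks.items()})
# {'mobile': 600, 'desktop': 300, 'tablet': 100}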
How do we handle users who join the system after the experiment has already started?
In practice, user populations are not static. For example, if the system is growing, new users may join daily. If we keep the approach of picking 1000 users out of one million each day, we might overlook the dynamic nature of daily growth. After some months, the total user base might exceed one million, and the initial probability estimate of 0.001 for a user being selected each day would need re-evaluation.
In many real-world systems, the daily picks must adapt to the changing user base. If new users join, they should be included in the random selection pool from the day they sign up. That implies the population on day d might be 1,000,000 + new_users(d), and the probability for each user to be picked becomes picks_per_day / current_population. Failing to account for these changes could mean new users are never tested until the daily randomization logic is updated, which distorts both the test coverage and the experimental results.
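A small sketch of that effect, assuming a hypothetical constant influx of new users per day: with a fixed 1000 picks per day, each user's daily probability shrinks as the population grows, and the chance of being selected within a year falls below what the static 0.001-per-day figure would suggest:
picks_per_day = 1000
initial_population = 1_000_000
new_users_per_day = 2_000   # hypothetical growth rate

prob_not_yet_selected = 1.0
for day in range(1, 366):
    population = initial_population + new_users_per_day * (day - 1)
    p_day = picks_per_day / population
    prob_not_yet_selected *= (1 - p_day)

# Chance an original user is selected at least once within a year
print(1 - prob_not_yet_selected)   # below the static-population value of 1 - 0.999**365 ≈ 0.306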
What if the experiment’s capacity of 1000 picks per day is not constant?
In some real-world environments, the experiment capacity is not fixed. For instance, marketing or technical constraints might dictate that on some days, 2000 users can be tested, while on other days only 500 can be tested. That inconsistency affects the daily probability for each user and also changes the expected waiting time distribution. If the number of daily picks is variable, the probability p each day is picks_per_day(d) / total_users(d), meaning the waiting time is no longer a straightforward geometric distribution.
From an implementation standpoint, if you expect to scale up or down the daily test size, you have to recalculate probabilities dynamically. Teams also need to consider user fairness and coverage over time—if the test size is drastically reduced on certain days, some users might wait much longer than expected. Conversely, if test size spikes significantly, many users might get selected at once, causing a sudden surge of exposure (which could be beneficial or problematic, depending on the experiment’s goal).
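Even with variable capacity, the expected waiting time can still be computed from survival probabilities, since E(X) equals the sum over days d of P(X >= d), and P(X >= d) is the product of (1 - p_t) over the days t before d. A sketch with an assumed repeating weekly capacity pattern:
total_users = 1_000_000
weekly_capacity = [2000, 1000, 1000, 1000, 500, 0, 0]   # hypothetical Mon..Sun pick counts

def expected_wait_days(max_days=100_000):
    expected = 0.0
    survival = 1.0   # P(not yet selected before the current day)
    for day in range(1, max_days + 1):
        expected += survival                                  # adds P(X >= day)
        p_day = weekly_capacity[(day - 1) % 7] / total_users
        survival *= (1 - p_day)
    return expected

print(expected_wait_days())   # above 1000 days, since weekly picks total 5500 rather than 7000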
How can correlation across days impact the assumption of independence in the geometric distribution?
The typical assumption in a geometric model is that daily picks are independent events. However, real experimental frameworks might implement constraints or weighting mechanisms that make selection on day d correlated with selection on day d+1. For example, if the system tries to avoid choosing the same user on two consecutive days, then having been selected on day d modifies the probability of selection on subsequent days.
This breaks the assumption of identical, independent Bernoulli trials. The waiting time until first selection in such correlated scenarios is no longer purely geometric. If the experiment is designed to reduce repeated picks over consecutive days, many users could end up with a slightly higher chance on days following a missed selection, or a lower chance if they have already been selected. The expected time to first selection could be shorter or longer than the standard geometric mean of 1/p, depending on the nature of the correlation.
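A concrete version of one such rule, under the assumption that yesterday's 1000 selected users are excluded from today's draw: a user who has not yet been selected then competes in a pool of 999,000 rather than 1,000,000, so their daily probability rises slightly and the expected wait to first selection drops a little below 1000 days:
picks_per_day = 1000
total_users = 1_000_000

# With yesterday's picks ineligible, today's draw covers only 999,000 eligible users
p_excluding_yesterday = picks_per_day / (total_users - picks_per_day)
print(p_excluding_yesterday)       # ≈ 0.001001
print(1 / p_excluding_yesterday)   # ≈ 999 days instead of 1000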
What if some days are skipped (e.g., no tests over weekends)?
In many real-world applications, production releases or new experiments do not happen daily—perhaps only on weekdays. If the test picks 1000 people on each “active” test day and 0 on non-active days, the user’s effective probability of selection is not uniform day to day. Over a calendar with weekend gaps, the user’s waiting time in actual calendar days is no longer purely geometric with daily probability 0.001.
Even though the experiment might view these “active” test days as consecutive in an index sense, real clock time passes differently. If a user must wait through weekends when no picks occur, the real waiting time is inflated compared to a scenario where picks are made continuously every day. This difference becomes relevant when analyzing how quickly any user might see the new feature or test condition in real-world time.
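The conversion from active test days to calendar time is simple arithmetic. If picks happen only on weekdays, the expected wait of 1000 active days stretches by a factor of 7/5 in calendar terms:
expected_active_days = 1000    # expected wait counted only over days when picks occur
active_days_per_week = 5       # weekday-only testing

expected_calendar_days = expected_active_days * 7 / active_days_per_week
print(expected_calendar_days)  # 1400 calendar days, roughly 3.8 years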
How do we ensure fairness if certain user segments are more valuable to the business?
Business or product considerations might dictate that particular cohorts (e.g., premium users or highly active users) be tested more frequently. In that case, the daily selection is not simply uniform random sampling with probability 0.001 for each user. Instead, a weighting scheme might be used that assigns higher probability to certain users each day.
This introduces a different expected waiting time for different user groups. Premium users might get a daily probability of 0.002 while regular users get 0.0005, or some other ratio that aligns with business goals. Understanding and documenting these weighting choices is crucial to avoid misinterpretation of results: certain segments might see the test earlier and more frequently, while others might remain untested for longer, thus potentially biasing overall conclusions.
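Using the illustrative weights above, each segment gets its own geometric waiting time, and the gap between segments is substantial:
segment_daily_probability = {"premium": 0.002, "regular": 0.0005}   # example weights from above

for segment, p in segment_daily_probability.items():
    print(segment, 1 / p)   # premium: 500 days on average, regular: 2000 days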
How might repeated exposure affect the data collected from the same user?
When a user can be selected multiple times, the nature of the data you collect may change with repeated exposure. For instance, the second or third time the same user sees the test feature, their behavior or feedback might differ from their initial reaction. This can complicate analysis because the experiment might mix “first exposure” data with “subsequent exposure” data.
If you are measuring user engagement, repeated test exposure can inflate or deflate engagement metrics. Some users might find repeated exposure annoying, leading them to disengage, while others might become more accustomed to the feature. Thus, the raw results need to be carefully segmented between first-time exposures and subsequent exposures to avoid conflating the true effect of the test on a never-before-seen user with the effect on a repeat user.
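One lightweight way to separate first exposures from repeats, assuming the exposure log lives in a pandas DataFrame with hypothetical user_id and day columns:
import pandas as pd

# Hypothetical exposure log: one row per (user, day) exposure
log = pd.DataFrame({
    "user_id": [7, 7, 12, 7, 12],
    "day":     [3, 10, 10, 25, 40],
})

log = log.sort_values("day")
# cumcount() numbers each user's exposures 0, 1, 2, ...; 0 marks the first exposure
log["is_first_exposure"] = log.groupby("user_id").cumcount() == 0
print(log)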
How can we handle the analysis if the same user is tested multiple times but responds differently?
Even beyond general engagement metrics, users can have inconsistent behaviors. For example, a user might rate the product positively the first time but negatively the next time, or purchase certain items on their first exposure yet ignore the feature on the second exposure. If the testing platform logs each user’s response every time they are exposed to the test, the analysis pipeline has to account for repeated measurements from the same subject.
In statistical modeling terms, repeated observations from the same user can violate the independence assumption in simpler analytical approaches (like a straightforward logistic regression treating every observation as independent). More robust methods (e.g., mixed-effects models, hierarchical models, repeated-measures ANOVA) might be required. These approaches explicitly account for the fact that each user contributes multiple data points, improving the accuracy of the final inference.
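A minimal mixed-effects sketch with statsmodels on synthetic data, where the column names (engagement, exposure_number, user_id) are assumptions for illustration; the per-user random intercept is what accounts for repeated measurements from the same person:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic repeated-measures data: 100 users with 3 exposures each (illustrative only)
rng = np.random.default_rng(seed=0)
n_users, n_exposures = 100, 3
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(n_users), n_exposures),
    "exposure_number": np.tile(np.arange(1, n_exposures + 1), n_users),
})
user_baseline = rng.normal(0, 1, n_users)            # user-specific level (random intercept)
df["engagement"] = (user_baseline[df["user_id"]]
                    - 0.1 * df["exposure_number"]    # mild fatigue effect
                    + rng.normal(0, 0.5, len(df)))   # observation noise

# A random intercept per user handles the correlation among that user's repeated observations
model = smf.mixedlm("engagement ~ exposure_number", data=df, groups=df["user_id"])
print(model.fit().summary())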
How do we track and measure coverage after a large number of days?
Over many days or months, it becomes important to know the fraction of the user base that has experienced the test at least once. Because users can be picked multiple times, you can’t just multiply daily picks by the number of days to find the number of unique users exposed. Some fraction of picks each day will be repeats.
One approach is to maintain a record (e.g., a flag in a database) whenever a user is first tested. Each day you update the count of newly exposed users and the total tested so far. Over time, you can estimate coverage: total tested so far / total user base. Pitfalls arise if the user base is growing fast—thus, you may never truly test all users if the growth outpaces the daily testing. In product decision-making, awareness of actual coverage helps you gauge if the experiment has reached enough unique individuals for meaningful aggregate insights.
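A sketch of that coverage tracking alongside the closed-form expectation: with a constant per-user daily probability p, the expected fraction covered after d days is 1 - (1 - p)^d. The simulation is scaled down to 10,000 users and 10 picks per day purely for speed; the ratio, and hence p = 0.001, matches the original scenario:
import random

num_users = 10_000       # scaled-down population, same p = 0.001 as the full scenario
picks_per_day = 10
days = 365

exposed = set()          # stand-in for the "first tested" flag in a database
for _ in range(days):
    exposed.update(random.sample(range(num_users), picks_per_day))

simulated_coverage = len(exposed) / num_users
expected_coverage = 1 - (1 - picks_per_day / num_users) ** days
print(simulated_coverage)   # fluctuates around the expectation below
print(expected_coverage)    # ≈ 0.306 after one year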
What complexities arise if results from the test are used to update the pool or the probability of user selection?
Sometimes, an experiment might be adaptive, using the results from prior days to shape the next day’s test picks. For instance, if an early result suggests a particular segment responds favorably, you might increase the probability of picking similar users on subsequent days to gather more granular data. This adaptivity means the picking process is no longer memoryless. The waiting time for some users might shorten if they become more likely to be picked because of the new evidence, while it could lengthen for others who are deprioritized.
This can complicate the interpretation of waiting times and the distribution of exposures. You effectively have a feedback loop where the experiment learns from daily outcomes and modifies selection probabilities. Properly analyzing such adaptive experiments often requires specialized statistical methods, such as multi-armed bandit algorithms or Bayesian adaptive designs, which specifically account for the iterative update of sampling distributions based on observed results.
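A toy batched Thompson sampling sketch over two hypothetical segments illustrates the feedback loop: each day the experiment samples a plausible response rate for every segment from its Beta posterior and sends the day's picks to whichever segment currently looks best, so selection probabilities drift with the observed results rather than staying memoryless:
import random

true_response_rate = {"segment_a": 0.05, "segment_b": 0.08}   # unknown to the experimenter
successes = {s: 1 for s in true_response_rate}                # Beta(1, 1) priors
failures = {s: 1 for s in true_response_rate}
picks_per_day = 1000

for day in range(30):
    # Thompson sampling: draw a plausible response rate for each segment from its posterior
    sampled = {s: random.betavariate(successes[s], failures[s]) for s in true_response_rate}
    chosen = max(sampled, key=sampled.get)
    # Batched variant: today's entire allocation goes to the segment that looks best right now
    for _ in range(picks_per_day):
        responded = random.random() < true_response_rate[chosen]
        successes[chosen] += responded
        failures[chosen] += not responded

print(successes, failures)   # the better-performing segment accumulates far more observations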