ML Interview Q Series: Calculating Expected Overlap and Probability Using Indicator Variables & Hypergeometric Distribution
Bill and Matt each choose five different numbers at random from the set {1,2,...,100}. What is the expected number of common numbers in their choices? What is the probability that the choices of Bill and Matt have at least one number in common?
Short Compact solution
Define an indicator variable Ik that is 1 if both Bill and Matt choose the number k and 0 otherwise. Because each of them picks 5 distinct numbers from the 100 available, P(Ik = 1) = (5/100)(5/100) = 0.0025. Hence, by the linearity of expectation, the expected number of common numbers is:
E(X) = sum over k=1 to 100 of E(Ik) = 100 × 0.0025 = 0.25.
The probability distribution of the random variable X (the number of common numbers) is hypergeometric:
P(X = k) = [C(5, k) × C(95, 5 − k)] / C(100, 5),
and therefore the probability that Bill and Matt have at least one number in common is:
1 − P(X = 0) = 0.2304.
Comprehensive Explanation
Understanding the problem
We have two individuals, Bill and Matt, each choosing 5 distinct numbers out of 100 possible choices. We are interested in two quantities:
The expected number of common numbers in their chosen sets.
The probability that there is at least one common number between the two sets.
Indicator-variable approach for the expectation
One of the most straightforward ways to compute an expectation of “how many items are shared” is to use indicator variables. Let us define Ik = 1 if Bill and Matt both pick the number k, and Ik = 0 otherwise. Because Bill chooses 5 out of 100 and Matt also independently chooses 5 out of 100:
P(Ik = 1) = (5/100) × (5/100).
The expectation of Ik is simply this probability, 0.0025. There are 100 possible numbers k = 1, 2, …, 100, so the total number of common numbers X can be written as
X = I1 + I2 + ... + I100.
Since expectation is linear regardless of dependencies among the Ik, the expected value is:
E(X) = E(I1) + E(I2) + ... + E(I100) = 100 × 0.0025 = 0.25.
Hypergeometric distribution of the overlap
The exact distribution of X is hypergeometric. Fixing Bill’s 5 chosen numbers, the probability that Matt’s 5 picks contain exactly k of them is
P(X = k) = [C(5, k) × C(95, 5 − k)] / C(100, 5).
Here,
C(5, k) counts the ways Matt can pick k of the numbers that Bill chose,
C(95, 5 − k) counts how Matt picks his remaining 5 − k numbers from those that Bill did not choose,
C(100, 5) is the total number of ways Matt can choose any 5 numbers from 100.
Probability of at least one match
Often, we want to know how likely it is for the two sets to overlap by at least one number. That is P(X ≥ 1). It is simpler to compute 1 − P(X = 0). From the hypergeometric formula,
P(X = 0) = [C(5, 0) × C(95, 5)] / C(100, 5).
Thus,
1 − P(X = 0) = 1 − C(95, 5)/C(100, 5) ≈ 1 − 0.7696 = 0.2304.
Hence, there is about a 23.04% chance that Bill and Matt share at least one number.
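As a quick numerical check, here is a minimal sketch that reproduces both numbers exactly using Python's built-in math.comb (no external libraries assumed):
import math

# Expected overlap from the indicator argument: 100 * (5/100) * (5/100)
expected_overlap = 100 * (5 / 100) * (5 / 100)            # 0.25

# Exact hypergeometric P(X = 0), then P(X >= 1)
p_zero = math.comb(95, 5) / math.comb(100, 5)             # about 0.7696
p_at_least_one = 1 - p_zero                               # about 0.2304

print(expected_overlap, p_at_least_one)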
Reasoning behind the hypergeometric perspective
One way to conceptualize it is: Bill’s 5 numbers can be thought of as “marked” among the 100. Matt then draws 5 from 100 without replacement. The random variable counting how many of Matt’s drawn items are among Bill’s marked items follows the hypergeometric distribution. This is analogous to an urn containing 100 balls, 5 of which are “red” (Bill’s picks) and 95 of which are “white” (not Bill’s picks): Matt draws 5 balls without replacement, and X counts how many red balls he draws.
Simulation approach (optional illustration)
We can verify the result using a quick simulation in Python:
import random
import numpy as np

trials = 1_000_000
common_counts = []

for _ in range(trials):
    # Each person independently samples 5 distinct numbers from 1..100
    bill_nums = random.sample(range(1, 101), 5)
    matt_nums = random.sample(range(1, 101), 5)
    overlap = len(set(bill_nums).intersection(matt_nums))
    common_counts.append(overlap)

avg_common = np.mean(common_counts)
prob_at_least_one = np.mean([1 if x >= 1 else 0 for x in common_counts])

print("Estimated E(X):", avg_common)
print("Estimated P(X >= 1):", prob_at_least_one)
You would see the simulation results converge close to 0.25 for the average number of overlaps, and about 0.23 for the probability of at least one overlap.
Follow-up question: Could we approximate this probability using a Poisson approximation when 5 is small compared to 100?
Yes, one can approximate the occurrence of each number being chosen by both Bill and Matt as a rare event. With 100 possible numbers and a probability of 5/100 × 5/100 = 0.0025 for each “match” event, we might consider X to be the sum of 100 Bernoulli(0.0025) indicator variables. That sum is approximately Poisson(100 × 0.0025) = Poisson(0.25).
Under this approximation:
E(X) = 0.25, matching our exact calculation.
P(X ≥ 1) ≈ 1 − e^(−0.25), which is numerically about 0.2212.
This value (0.2212) is close but not exactly the same as the exact 0.2304, reflecting the approximate nature of the Poisson assumption. However, for larger sets or slightly bigger picks (still small compared to 100), Poisson can be a reasonable shortcut if exact hypergeometric calculations become cumbersome.
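A minimal sketch comparing the exact hypergeometric answer to the Poisson approximation, using the quantities derived above:
import math

# Exact hypergeometric probability of at least one shared number
p_exact = 1 - math.comb(95, 5) / math.comb(100, 5)     # about 0.2304

# Poisson approximation with lambda = 100 * (5/100) * (5/100) = 0.25
lam = 0.25
p_poisson = 1 - math.exp(-lam)                         # about 0.2212

print(f"exact: {p_exact:.4f}, Poisson approximation: {p_poisson:.4f}")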
Follow-up question: How would the result change if Bill and Matt had different numbers of picks?
If Bill chose m numbers and Matt chose n numbers from the set of 100 without replacement, and you still wanted the expected overlap, you could define an indicator Ik for each k in {1,...,100} as before. Then P(Ik = 1) would be (m/100)(n/100) for each k (assuming all are distinct picks within each person’s selection). The expected count would be 100 × (m/100)(n/100) = m × n / 100.
For the probability distribution, you would adapt the hypergeometric formula to P(X = k) = [C(m, k) × C(100−m, n−k)] / C(100, n), with suitable restrictions on k. The formula always aligns with the notion that you have m “marked” items in the universe of 100, and Matt draws n from that same universe. The distribution of the overlap is hypergeometric by definition.
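A small helper that applies these two formulas for arbitrary pick sizes m and n might look like the following sketch (the function name is illustrative, not from the original problem):
import math

def overlap_stats(m, n, population=100):
    # Expected overlap from the indicator argument: m * n / population
    expected = m * n / population
    # P(X = 0): all n of Matt's picks avoid Bill's m "marked" numbers
    p_zero = math.comb(population - m, n) / math.comb(population, n)
    return expected, 1 - p_zero

print(overlap_stats(5, 5))    # (0.25, ~0.2304), the original problem
print(overlap_stats(10, 20))  # larger picks from the same population of 100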
Follow-up question: What if the numbers are chosen with replacement?
In that scenario, each person’s choice of 5 numbers is not forced to be distinct, and the draws are independent across the 5 picks. The probability analysis for the expected overlap changes accordingly: for a single number k, the probability that Bill picks it at least once out of 5 draws (with replacement) is 1 − (probability that Bill does not pick k in any of his 5 draws) = 1 − (99/100)^5. Similar logic applies for Matt. The resulting distribution is no longer hypergeometric but rather based on binomial or multinomial considerations.
However, you could still use indicator variables: Ik = 1 if Bill picks k at least once and Matt picks k at least once. Then find P(Ik = 1) using the complement approach for each person not picking k at all. You would again sum up these indicators for the total overlap. The final probability distribution for X would be more involved (it wouldn’t be as straightforward as the hypergeometric distribution), but the indicator-variable method would still yield the correct expected value.
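A sketch of this indicator-based calculation, with a Monte Carlo check under the same with-replacement assumption:
import random

# P(a given person selects number k at least once in 5 independent draws)
p_hit = 1 - (99 / 100) ** 5

# Indicator argument: E(overlap) = sum over k of P(both pick k at least once)
expected_overlap = 100 * p_hit ** 2
print("Analytical E(X) with replacement:", expected_overlap)   # about 0.2402

# Quick simulation of the same with-replacement process
trials = 200_000
total = 0
for _ in range(trials):
    bill = {random.randint(1, 100) for _ in range(5)}
    matt = {random.randint(1, 100) for _ in range(5)}
    total += len(bill & matt)
print("Simulated E(X):", total / trials)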
Follow-up question: How does correlation between Bill’s and Matt’s picks affect the result?
So far, we have implicitly assumed that Bill and Matt choose their sets independently. If there is correlation (for instance, if Matt tends to pick numbers close to Bill’s choices), then the probability that they share numbers can increase or decrease depending on the nature of that correlation. In practice, you would need to incorporate the joint distribution of Bill’s and Matt’s picks and recalculate P(Ik = 1) accordingly.
For the expectation, the same reasoning via indicator variables still applies:
E(X) = sum over k of P(Ik = 1),
but now P(Ik = 1) might be different from (5/100)(5/100), or might even vary with k. Thus the final average overlap E(X) could be significantly different from 0.25 if the picks are heavily correlated.
In summary, the hypergeometric framework is the natural tool for dealing with “shared distinct picks” from a fixed population when choices are made independently. The key results:
Expected overlap is 0.25.
Probability of at least one overlap is 0.2304.
Below are additional follow-up questions
Follow-up question: What is the variance of the number of common numbers, and how do we derive it?
One common extension after finding the expectation is to ask about the variance of the overlap. Recall that X is the number of common numbers when Bill and Matt each choose 5 distinct elements from a set of 100.
We can continue using indicator variables Ik, which equals 1 if both Bill and Matt choose number k, and 0 otherwise. Then
X = I1 + I2 + ... + I100.
The variance can be expressed in terms of the indicator random variables as
Var(X) = Σk Var(Ik) + Σ over i ≠ j of Cov(Ii, Ij).
Each Ik is a Bernoulli random variable with parameter p = (5/100)(5/100) = 0.0025. Thus, Var(Ik) = p(1 − p).
The tricky part is the covariance between Ii and Ij for i ≠ j. Because the picks are without replacement for Bill and Matt individually, there are slight dependencies. One has to account for the fact that if Bill chooses number i and j, then it changes the probabilities for how Matt can choose or not choose them.
A direct route is to compute these covariances explicitly, but since X follows a hypergeometric distribution, it is usually simpler to use its known variance formula:
X ~ Hypergeometric(5, 5, 100),
for which the known variance is:
Var(X) = m n (N − m) (N − n) / [N² (N − 1)],
where N = 100 is the population size, m = 5 is the number of “marked” items (Bill’s picks), and n = 5 is the number of draws (Matt’s picks). Substituting these values gives:
Var(X) = (5 × 5 × 95 × 95) / (100² × 99) ≈ 0.228,
slightly less than the mean of 0.25. For comparison, naively summing independent Bernoulli variances would give 100 p(1 − p) ≈ 0.2494, which overstates the variance because it ignores the without-replacement dependence between the indicators.
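A short check that computes the mean and variance directly from the hypergeometric pmf, using only math.comb:
import math

# Exact pmf of the overlap X for N = 100, m = 5, n = 5
pmf = [math.comb(5, k) * math.comb(95, 5 - k) / math.comb(100, 5) for k in range(6)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum(k * k * p for k, p in enumerate(pmf)) - mean ** 2

print("E(X) =", round(mean, 4))    # 0.25
print("Var(X) =", round(var, 4))   # about 0.2279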
Pitfalls or edge cases:
Overlooking the dependence between indicators can lead to mistakenly summing p(1 − p) over all 100 and ignoring correlation terms.
In an interview, stating Var(X) = 100 p(1 − p) is incorrect, because it would be a binomial-like assumption ignoring the without-replacement aspect.
Using the hypergeometric variance formula is the most straightforward approach if you recall it directly.
Follow-up question: How might the results change if Bill’s picks are not uniformly random across the 100 numbers?
In realistic scenarios, Bill’s selection might not be uniform. For example, Bill might be more likely to choose certain “favorite” numbers. Let us imagine that Bill picks 5 distinct numbers with a different probability distribution over the 100 numbers, say weighted by personal preference.
In that case, for each number k, define pk = Probability(Bill picks k). Similarly, define qk = Probability(Matt picks k). Then for each k, the probability that both pick it (assuming their picks remain independent across Bill vs. Matt) is:
pk × qk.
Thus, the expected total overlap can be found by summing these:
Expected overlap = Σk=1 to 100 [pk × qk].
However, if Bill’s or Matt’s picks are themselves subject to the constraint “exactly 5 distinct numbers,” the probabilities pk and qk must reflect that. This can make the distribution more complicated than the simple hypergeometric scenario. The fundamental approach to the expectation is still linearity, but the explicit distribution of X changes significantly.
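As one concrete (assumed) model of weighted picks, the sketch below gives numbers 1 through 10 a higher per-draw weight and samples 5 distinct numbers per person by without-replacement sampling proportional to those weights; the weights and the sampling scheme are illustrative choices, not part of the original problem:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preference weights: numbers 1..10 are 5x more likely per draw
weights = np.ones(100)
weights[:10] = 5.0
probs = weights / weights.sum()

trials = 100_000
overlaps = np.empty(trials)
for t in range(trials):
    bill = set(rng.choice(100, size=5, replace=False, p=probs))
    matt = set(rng.choice(100, size=5, replace=False, p=probs))
    overlaps[t] = len(bill & matt)

print("Estimated E(overlap) with preferences:", overlaps.mean())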
Potential pitfalls:
Assuming uniform distribution when there is a known preference for certain numbers may yield misleading results.
Handling the constraint that each person must pick exactly 5 distinct numbers is trickier if each number has different selection probabilities. One must often resort to combinatorial or simulation-based methods to get the exact distribution.
Follow-up question: What if we pick numbers one at a time in sequence? Does the order of picking affect the probability of at least one match?
Sometimes, an interviewer might ask if the order in which Bill and Matt pick their numbers influences the probability of overlap. For example, Bill picks his first number, then Matt picks his first, and so forth, all without replacement within their own sets. However, if each ultimately ends up with a set of 5 distinct numbers, the final probability of overlap does not depend on the order in which they were chosen, assuming no additional information is shared between them during the picking process.
Pitfalls:
Believing that “if Bill picks first, Matt has a higher or lower chance to overlap” can be incorrect if Matt is not reacting to Bill’s picks and if the final outcome is the same (5 distinct picks from Bill, 5 distinct picks from Matt).
If Matt had adaptive knowledge of Bill’s picks, that would drastically change the probabilities. In that case, the distribution is no longer the simple hypergeometric.
Follow-up question: What if there are errors in recording the picks, or if a number can be accidentally chosen twice by the same person?
Real-world processes sometimes introduce errors. For instance, Bill might inadvertently repeat a number or might skip a number. Or there could be data entry mistakes in logging which numbers were chosen. This means we might not truly have “5 distinct picks” per person.
If duplicates are allowed for Bill’s picks, Bill might end up with fewer than 5 unique numbers. That effectively changes the logic behind the counting of overlaps.
If data-entry errors or partial sampling occur, the distribution can become quite different.
Potential pitfalls:
Treating the scenario as hypergeometric when in reality duplicates might exist.
Overlooking the mismatch between the theoretical assumption (“5 distinct choices from 1..100”) and real data collection issues.
Follow-up question: How would we handle continuous analogs of this problem, for example, if Bill and Matt choose points in an interval [0,1] instead of discrete numbers?
Sometimes, an interviewer might push to see if you can generalize from discrete sets to continuous spaces. If Bill and Matt each choose 5 distinct points at random in the interval [0,1], the question “What is the expected number of common points?” becomes somewhat meaningless, as the probability of picking exactly the same point in a continuous space is zero.
A better continuous analog might be: “How many points from Bill’s sample are within some small epsilon of points from Matt’s sample?” However, that changes the entire formulation. We might use Poisson processes or other continuous distribution approaches. In practice, you typically do not expect overlaps in a truly continuous pick unless you define a tolerance for closeness.
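If one adopts that epsilon-tolerance definition, a minimal simulation sketch might look like the following; the value of eps and the counting rule are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(1)
eps = 0.01           # illustrative tolerance for calling two points "common"
trials = 100_000

near_counts = np.empty(trials)
for t in range(trials):
    bill = rng.random(5)
    matt = rng.random(5)
    # Count Bill's points lying within eps of at least one of Matt's points
    close = np.abs(bill[:, None] - matt[None, :]) < eps
    near_counts[t] = close.any(axis=1).sum()

print("Average number of Bill's points within eps of Matt's:", near_counts.mean())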
Pitfalls:
Directly trying to translate discrete combinatorial results to continuous random variables can lead to misunderstandings about measure and probability zero events.
Properly defining “common” in a continuous context often requires defining a neighborhood or distance threshold.
Follow-up question: What if we needed a confidence interval around the empirical probability of at least one match after observing some data?
In an applied setting, you might actually run experiments: Bill and Matt each pick 5 numbers many times, and you observe the fraction of trials in which they share at least one number. You might want a confidence interval on that observed fraction. This is effectively a binomial setting if each trial is “overlap or not,” with probability p = P(X ≥ 1).
Let Y be the number of successful overlaps in T total trials. The empirical estimate of p is p̂ = Y / T.
A straightforward approximate confidence interval can then be found using the standard binomial confidence interval approaches (e.g., Clopper-Pearson or Wilson intervals).
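For illustration, a minimal sketch of the Wilson score interval (the trial counts below are made up for the example):
import math

def wilson_interval(successes, trials, z=1.96):
    # Wilson score interval for a binomial proportion (default ~95% coverage)
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Illustrative data: at least one overlap observed in 230 of 1000 trials
print(wilson_interval(230, 1000))   # roughly (0.205, 0.257)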
Potential pitfalls:
Overlooking the possibility that trials might not be truly independent if Bill and Matt are influenced by previous picks or by external factors.
Failing to account for a small sample size T, which might require exact methods rather than normal approximations.
Follow-up question: If there were more than two people (say Bill, Matt, and Susan), and each picks 5 distinct numbers, how do we find the expected number of common numbers among all three?
This generalizes the problem to an intersection among multiple subsets. If Bill, Matt, and Susan each choose 5 distinct numbers from the same set of 100, we can define:
X to be the number of numbers that all three have in common.
We can extend the indicator variable idea. For each k in {1,...,100}, define Ik = 1 if Bill, Matt, and Susan all pick k, and 0 otherwise. Then:
X = I1 + I2 + ... + I100.
If the three people pick independently, the probability that a particular k is chosen by all three is (5/100)(5/100)(5/100) = (5/100)³ = 0.000125. So the expected intersection is:
E(X) = 100 × (5/100)³ = 100 × (125/1,000,000) = 0.0125.
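A quick sketch that checks this three-way expectation both analytically and by simulation:
import random

# Analytical expectation from the indicator argument
print("E(X) for three people:", 100 * (5 / 100) ** 3)   # 0.0125

# Monte Carlo check with three independent sets of 5 distinct picks
trials = 500_000
total = 0
for _ in range(trials):
    picks = [set(random.sample(range(1, 101), 5)) for _ in range(3)]
    total += len(picks[0] & picks[1] & picks[2])
print("Simulated E(X):", total / trials)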
Pitfalls:
Not realizing that with three or more people, the overlap probability can become much smaller.
Confusing the event “all three share at least one number” with the event “they share exactly one number.” The logic or distribution changes significantly with more sets.
Follow-up question: How can we adapt the logic if the numbers 1..100 do not all have the same chance of being available?
Imagine that we do not have a uniform “pool” of 100 numbers—maybe some numbers are “unavailable” with certain probabilities, or we have partial restrictions (e.g., each player is only allowed to choose from certain subsets). In such scenarios, the straightforward hypergeometric approach is invalid because it relies on equally accessible choices without replacement from the entire set.
We can approach it in steps:
Determine the effective set from which Bill can pick.
Determine the effective set from which Matt can pick.
Model or directly compute how many overlaps can occur given these restricted sets.
Often, the solution might require simulations or a generalized combinatorial formula. The expected value might still be tackled by indicators if you carefully adjust for the probability that each k is actually available and chosen by both parties.
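As an illustration of the simulation route, the sketch below assumes two specific restricted pools, chosen only for the example: Bill may pick from 1..60 and Matt from 41..100, so only the 20 shared numbers 41..60 can ever overlap.
import random

bill_pool = list(range(1, 61))     # assumed restriction on Bill's choices
matt_pool = list(range(41, 101))   # assumed restriction on Matt's choices

trials = 500_000
total_overlap = 0
at_least_one = 0
for _ in range(trials):
    bill = set(random.sample(bill_pool, 5))
    matt = set(random.sample(matt_pool, 5))
    k = len(bill & matt)
    total_overlap += k
    at_least_one += (k >= 1)

# Indicator check: 20 shared numbers, each picked by Bill w.p. 5/60 and by
# Matt w.p. 5/60, so E(overlap) = 20 * (5/60)**2, roughly 0.139
print("Estimated E(overlap):", total_overlap / trials)
print("Estimated P(overlap >= 1):", at_least_one / trials)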
Pitfalls:
Incorrectly applying the standard hypergeometric formula when the real selection process is restricted or truncated.
Overlooking that if certain numbers have zero probability of being chosen by Bill, they cannot appear in the overlap.
Follow-up question: How do we incorporate knowledge about the picks after partial observations?
Suppose you already know that Bill has chosen some subset {b1, b2, b3}, but not the other two of Bill’s picks, and you want the probability that Matt shares at least one number with Bill. This leads us into conditional probability: with partial information about Bill’s picks, you can condition on what is known and account separately for the picks that remain unknown.
If you know 3 of Bill’s picks, the probability that Matt matches at least one of those 3 might already be partially computed.
The distribution for the remaining 2 unknown picks from Bill’s side should be taken into account if you want the exact updated probability.
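As a small worked piece of that decomposition (assuming Matt still picks 5 distinct numbers uniformly and independently of Bill), the probability that Matt hits at least one of the 3 known picks follows the same complement argument:
import math

# P(Matt shares at least one of the 3 known picks):
# all 5 of Matt's numbers must avoid those 3 specific numbers otherwise
p_match_known = 1 - math.comb(97, 5) / math.comb(100, 5)
print("P(match among the 3 known picks):", round(p_match_known, 4))   # about 0.144

# For reference, the overall overlap probability with all 5 of Bill's picks
p_match_any = 1 - math.comb(95, 5) / math.comb(100, 5)
print("P(match with any of Bill's 5 picks):", round(p_match_any, 4))  # about 0.2304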
Pitfalls:
Forgetting that partial information changes the sample space drastically. You cannot simply rely on the unconditional hypergeometric formula if you have partial knowledge of Bill’s set.
Using naive “the remaining picks are just uniformly random” logic can lead to subtle under- or overestimation of the actual overlap probability if you do not condition properly on the observed events.