ML Interview Q Series: Probability of First Repeat: Finding the Most Likely Position in Random Selections.
Twenty-five people each choose at random a number from {1, 2, …, 100}, independently. Next the chosen numbers are announced one by one. The first person (if any) who announces a number that has been announced before wins a bonus. Which person has the largest probability to win the bonus?
Short Compact solution
Let pk be the probability that the k-th announcement is the first repeated number. From the product rule for probabilities, we get:
p1 = 0 (the first announcement can never repeat a previously announced number).
p2 = 1/100.
p3 = (99/100) * (2/100).
In general, for k = 4, …, 25:

pk = (99/100) × (98/100) × … × ((102 − k)/100) × ((k − 1)/100).
Numerical evaluation shows that p11 is the largest among these probabilities, with a value of approximately 0.0628.
Comprehensive Explanation
Understanding the Setup
We have 25 individuals, each independently picking an integer from 1 to 100 with uniform probability. These chosen numbers are then announced in some order (say, person 1 announces first, person 2 second, and so on). The moment someone announces a number that has already been announced by any previous person, that announcer wins a bonus. We want to find which person index k (among the 25 people) maximizes the probability of being the first to repeat a previously announced number.
Decomposing the Probability pk
We define pk as “the probability that the k-th person’s announcement is the very first time a number is repeated.” To achieve this:
The first (k − 1) announcements must be all distinct from each other.
The k-th announced number must match exactly one of the previously announced (k − 1) numbers.
Probability that the first (k − 1) announcements are all distinct
Each of the first (k − 1) people picks a number from 1 to 100. The probability that all (k − 1) chosen values are different can be reasoned as follows:
The first person can pick any number freely; with no previous draws, the probability of being distinct is 1.
The second person must pick a different number from the first person’s pick. The probability is 99/100.
The third person must pick a different number from the first two’s picks. The probability is 98/100.
Continue this pattern until the (k − 1)-th person picks a number that is different from all previously chosen (k − 2) numbers. The probability is (100 − (k − 2))/100 = (102 − k)/100.
Multiplying these probabilities together yields:
(100/100) × (99/100) × (98/100) × … × (100 − (k − 2))/100.
Because the first factor (100/100) is 1, it is often omitted in written form, so effectively we have:
(99/100) × (98/100) × … × (100 − (k − 2))/100.
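As a quick sanity check (a minimal sketch; the helper name is ours), this running product is just the falling factorial 100 × 99 × … × (102 − k) divided by 100^(k−1), which Python exposes via math.perm:

```python
import math

def prob_all_distinct(k, M=100):
    """Probability that the first (k - 1) uniform picks from {1, ..., M}
    are pairwise distinct."""
    # math.perm(M, k - 1) = M * (M - 1) * ... * (M - k + 2)
    return math.perm(M, k - 1) / M ** (k - 1)
```

For k = 3 this returns 99/100 = 0.99, matching the product above.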
Probability the k-th announcement is a repeat of one of the first (k − 1)
Now, person k must pick a number that has already been announced. There are (k − 1) distinct numbers in the set of previously announced numbers (under the condition they were all distinct). So the probability that person k picks one of these (k − 1) previously chosen numbers is (k − 1)/100.
Putting it together
Thus, pk = [Probability that the first (k − 1) picks are all distinct] × [Probability that the k-th pick is in that set of (k − 1) distinct values]. In the fully expanded product form:

pk = (99/100) × (98/100) × … × ((102 − k)/100) × ((k − 1)/100).

This formula is valid for k ≥ 2, but note that p1 = 0 (no chance to repeat on the very first pick).
Why the Maximum is around the 11th Person
When k is small (like 2 or 3), there are very few previously announced numbers to match, so the chance of repeating is relatively small. When k is large (say above 20), it becomes likely that a repeat has already occurred before that point, lowering the probability that the k-th announcement is the very first repeat. There is a sweet spot somewhere in the middle where there are enough previously distinct numbers to create a decent chance of collision, but not so many that a collision probably happened earlier. By direct calculation (or by writing a short script to compute pk for k = 1..25), one sees the maximum occurs at k = 11, with approximate probability 0.0628.
Potential Follow-up Questions
If the number of people were different, how would that change things?
The same formula extends naturally. If there are N people choosing from M possible numbers, the probability that the k-th person is the first repeat is:
Probability that the first (k − 1) picks are all distinct = (M/M) × ((M−1)/M) × ((M−2)/M) × … × ((M−(k−2))/M).
Then multiply by (k − 1)/M for the k-th to coincide with one of the (k − 1) distinct picks.
You could vary N (the total number of participants) and see which k from 1..N maximizes that quantity.
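As a sketch of that generalization (function names are ours), one can compute pk for arbitrary N and M and pick the maximizing position:

```python
import math

def first_repeat_prob(k, M):
    """P(the k-th announcement is the first repeat), uniform picks from M values."""
    if k < 2:
        return 0.0
    # first k-1 picks all distinct, then the k-th hits one of those k-1 values
    return math.perm(M, k - 1) / M ** (k - 1) * (k - 1) / M

def best_position(N, M):
    """The position k in 1..N most likely to be the first repeat."""
    return max(range(1, N + 1), key=lambda k: first_repeat_prob(k, M))
```

With N = 25 and M = 100 this recovers k = 11 with probability about 0.0628.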
How does this relate to the Birthday Paradox?
The Birthday Paradox is a well-known probability phenomenon dealing with collisions in random selections, often with M=365 (days in a year). The “first repeat” notion is closely connected to that idea, though the classical Birthday Paradox focuses on the probability that at least one collision occurs among N picks rather than identifying which specific pick is the first to collide. However, the combinatorial arguments have similar structure. The maximum-likelihood index for the first repeat often ends up near the point where collisions become probable in the general “birthday” sense (roughly around the square root of M).
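To illustrate that square-root heuristic (a rough sketch, not a precise statement; helper names are ours), one can compare the exact maximizing position with sqrt(M) for a few values of M:

```python
import math

def first_repeat_prob(k, M):
    if k < 2:
        return 0.0
    return math.perm(M, k - 1) / M ** (k - 1) * (k - 1) / M

def best_position(M):
    # the peak sits near sqrt(M), so searching up to ~4*sqrt(M) is plenty
    limit = int(4 * math.sqrt(M)) + 1
    return max(range(1, limit + 1), key=lambda k: first_repeat_prob(k, M))

for M in (100, 365, 1000):
    print(M, best_position(M), round(math.sqrt(M), 1))
```

In each case the exact argmax lands within a couple of positions of sqrt(M); for M = 365 (the classical birthday setting) it is the well-known position 20.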
Is there a simple closed-form solution for which k maximizes pk?
Not a neat closed form. Typically, one relies on either:
Direct computation by evaluating pk for each k (especially if N and M are not too large).
Approximate analysis using Poisson or birthday-type approximations to estimate the value of k that maximizes the probability.
How to quickly compute this in Python?
You could do:
```python
import math

M = 100  # range of numbers
N = 25   # total participants

def probability_first_repeat(k):
    # Probability that the first (k - 1) picks are all distinct
    prob_distinct = 1.0
    for i in range(k - 1):
        prob_distinct *= (M - i) / M
    # Probability that the k-th pick repeats one of those (k - 1)
    return prob_distinct * (k - 1) / M

# Evaluate for k = 1..N and keep the maximizer
best_k = None
best_val = 0.0
for k in range(1, N + 1):
    val = probability_first_repeat(k)
    if val > best_val:
        best_val = val
        best_k = k

print("Best k:", best_k, "with probability:", best_val)
```
This code systematically checks each k in {1,2,…,25}, calculates pk, and reports which k gives the highest probability.
Could no one win the bonus?
Yes, if all 25 picks happened to be distinct, then no repeat occurs. The probability that all picks are distinct is:
(100/100) × (99/100) × (98/100) × … × (100 − (25 − 1))/100.
That quantity is nonzero, so it can happen that no bonus is ever awarded (although it’s somewhat unlikely with 25 picks out of 100).
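This can be checked numerically: the events "first repeat at position k" for k = 2..25 and "all 25 picks distinct" are mutually exclusive and exhaustive, so their probabilities must sum to 1 (a small sketch; helper names are ours):

```python
import math

M, N = 100, 25

def first_repeat_prob(k):
    if k < 2:
        return 0.0
    return math.perm(M, k - 1) / M ** (k - 1) * (k - 1) / M

p_no_repeat = math.perm(M, N) / M ** N   # all 25 announcements distinct
p_some_repeat = sum(first_repeat_prob(k) for k in range(1, N + 1))
print(p_no_repeat, p_no_repeat + p_some_repeat)
```

p_no_repeat comes out to roughly 0.04, so in about one game in twenty-five no bonus is awarded.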
Practical Real-World Implications
In practical large-scale settings (where M is large and N is also large), we typically reference the birthday paradox results to gauge how quickly collisions might occur. This interview-style question specifically focuses on the likelihood that each announcement could be the first collision, pinpointing which announcement index is the “sweet spot” for collisions—often around the 10th to 12th pick for 100 possible numbers and 25 participants.
All of these insights reinforce that the person at position k=11 (in this scenario of 25 participants picking from 1..100) has the highest probability, about 0.0628, of being the first to repeat a number.
Below are additional follow-up questions
What if each individual does not choose their number uniformly at random?
When the assumption of uniformity is relaxed, the problem of determining pk becomes more complex. Instead of each person having a 1/100 chance of picking each number, suppose some numbers are more likely than others. For instance, imagine a scenario where half of the people have a high preference for smaller numbers.
Effects on pk:
The probability that the first (k − 1) announcements are distinct now depends on the joint probability distribution of these (k − 1) picks. If certain numbers have significantly higher probability than others, collisions might be more likely even for smaller k.
The chance that the k-th pick matches one of the earlier (k − 1) picks depends on how “overrepresented” certain numbers have been. If the distribution skews toward a specific subset of numbers, the probability of collision can go up or down for different k.
Pitfalls and Edge Cases:
Highly skewed distributions: A few numbers might have extremely large probability mass, making collisions likely very early. This could mean the peak of pk might shift to a smaller k.
Participant-specific distributions: If each participant chooses from a different distribution (e.g., each has a distinct bias), the analysis must consider every combination of who might choose which number. The final expression for pk can become unwieldy without simplifying assumptions.
Dependence on population size: With fewer participants, biases can be less predictable. With more participants, large groups might yield early repeats because many are likely to pick from that popular subset of numbers.
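A Monte Carlo sketch (all names and the specific skew level are our own illustrative choices) makes the shift visible: with a distribution where a few numbers carry most of the mass, early collisions absorb probability that the uniform case would place near k = 11:

```python
import random

def estimate_first_repeat(weights, n_people, trials=20000, seed=0):
    """Monte Carlo estimate of P(first repeat at position k), k = 0..n_people,
    when every participant draws i.i.d. from the distribution `weights`."""
    rng = random.Random(seed)
    values = range(len(weights))
    counts = [0] * (n_people + 1)   # counts[k] = first repeat at position k
    for _ in range(trials):
        seen = set()
        for k in range(1, n_people + 1):
            x = rng.choices(values, weights=weights)[0]
            if x in seen:
                counts[k] += 1
                break
            seen.add(x)
    return [c / trials for c in counts]

skewed = [10] * 10 + [1] * 90   # ten "hot" numbers, ten times as likely
uniform = [1] * 100
```

Comparing estimate_first_repeat(skewed, 25) with estimate_first_repeat(uniform, 25) shows the skewed distribution concentrating far more mass on the early positions.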
How do correlations between picks change the solution?
In the original problem, each participant picks independently. However, if participants coordinate or if they try to avoid previously chosen numbers (or, conversely, mimic them), there will be correlations among picks.
Effects on pk:
If participants avoid numbers already seen, the probability of early repeats might drop (they are intentionally trying to prevent duplication). This strategy would reduce pk for smaller k but might cause a sudden jump in pk for some middle or later k if eventually a forced collision arises.
If participants mimic earlier picks, collisions become more likely and might occur at very small k.
Pitfalls and Edge Cases:
Partial knowledge: Perhaps each new participant only knows some fraction of previously announced numbers or incorrectly remembers them. The induced correlation pattern can be quite complex.
Strategic behavior: If the bonus is desirable, participants might attempt to pick in a way that maximizes their personal chance of collision (e.g., the 10th person might intentionally pick a number frequently chosen by people before them).
Unintended correlations: Real-world scenarios (like numbering patterns or groupthink) might create subtle correlations that distort the distribution from the idealized independent model.
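One toy model of such correlation (entirely hypothetical; the names and the parameter q are ours): each participant, with probability q, deliberately avoids all previously announced numbers, and otherwise picks uniformly at random:

```python
import random

def simulate_with_avoidance(q, M=100, N=25, trials=10000, seed=1):
    """Estimate P(first repeat at position k) when each participant avoids
    announced numbers with probability q (toy correlation model)."""
    rng = random.Random(seed)
    counts = [0] * (N + 1)
    for _ in range(trials):
        seen = set()
        for k in range(1, N + 1):
            if rng.random() < q:
                # deliberate avoidance: pick uniformly among unseen numbers
                x = rng.choice([v for v in range(M) if v not in seen])
            else:
                x = rng.randrange(M)
            if x in seen:
                counts[k] += 1
                break
            seen.add(x)
    return [c / trials for c in counts]
```

With q = 1 and N ≤ M no repeat can ever occur, while q = 0 reduces to the original independent model (where some repeat happens about 96% of the time).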
What happens if the order of announcements is randomized rather than fixed?
In the usual problem setup, we assume person 1 announces first, person 2 second, etc. But suppose all 25 chosen numbers are known privately in advance, and then a random permutation decides the announcement order.
Effects on pk:
The random ordering of announcements means each permutation is equally likely. The index k is no longer tied to a specific person but to the k-th position in the random permutation.
We might still want to know “Which position in the random ordering has the highest chance of being the first repeated announcement?” The probabilities can be recomputed by counting how often that position turns out to be the first repeat in all permutations.
Pitfalls and Edge Cases:
Multiple announcements of the same value: If several participants chose the same number from the start, the question becomes: in the random ordering, where does the second member of such a group appear? The earliest position at which any already-announced value reappears is the first repeat.
Symmetry: If all participants are identical and choose from the same distribution, symmetry arguments can sometimes simplify calculations. But large combinatorial expansions might be needed if the number of participants is large or the distribution is not uniform.
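In fact, under the original i.i.d. model the picks are exchangeable, so randomly permuting the announcement order leaves every positional probability unchanged; a quick empirical sketch (names are ours):

```python
import random

def first_repeat_position(numbers):
    """1-based index of the first announcement that repeats, or None."""
    seen = set()
    for k, x in enumerate(numbers, start=1):
        if x in seen:
            return k
        seen.add(x)
    return None

rng = random.Random(42)
M, N, trials = 100, 25, 20000
fixed_counts = [0] * (N + 1)
shuffled_counts = [0] * (N + 1)
for _ in range(trials):
    picks = [rng.randrange(M) for _ in range(N)]
    k = first_repeat_position(picks)
    if k is not None:
        fixed_counts[k] += 1
    rng.shuffle(picks)               # announce in a random order instead
    k = first_repeat_position(picks)
    if k is not None:
        shuffled_counts[k] += 1
```

Both count vectors agree to within sampling noise, with the peak near position 11 in each case.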
What if there are more participants than 100 (i.e., N > M)?
While the original question had 25 participants picking from 100 possible values, we might consider N > 100. Then it becomes more probable that a repeat occurs at some point.
Effects on pk:
For N > M=100, by the pigeonhole principle, there must be at least one duplicate among the picks. Hence there is a 100% chance of some repeat eventually.
However, finding the index k that maximizes pk still involves the same logic: the first (k − 1) picks must be all distinct, and the k-th is one of those earlier numbers. Once N exceeds M, you will never reach the final participant without a collision, but the question is which participant is most likely to be that first collision.
Pitfalls and Edge Cases:
Very large N: Under the i.i.d. uniform model, pk for a fixed k does not depend on N at all, so the peak does not move as N grows; adding participants only appends later positions, whose probabilities shrink rapidly and are exactly zero beyond k = M + 1 (the first M + 1 picks cannot all be distinct).
Overlap among many picks: If many participants pick the same number, the first collision might happen extremely fast, overshadowing mid-range positions that would normally have the highest probability in more balanced scenarios.
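A short check under the i.i.d. uniform model (helper name is ours): with N = 150 > M = 100, the probabilities pk sum to 1 because a repeat is certain, yet the peak stays at k = 11 since pk itself does not depend on N:

```python
import math

def first_repeat_prob(k, M):
    if k < 2:
        return 0.0
    # math.perm(M, k - 1) is 0 once k - 1 > M, so p_k vanishes beyond M + 1
    return math.perm(M, k - 1) / M ** (k - 1) * (k - 1) / M

M, N = 100, 150                       # more participants than numbers
probs = [first_repeat_prob(k, M) for k in range(1, N + 1)]
total = sum(probs)                    # 1 up to rounding: a repeat is certain
best = 1 + max(range(N), key=probs.__getitem__)
print(best, total)
```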
How to generalize to continuous distributions or very large discrete sets?
Sometimes, you might have a continuous range of values (e.g., real numbers in [0,1]) or a huge discrete space. Then the notion of the “first repeat” might become extremely rare if the sample size is relatively small.
Effects on pk:
Continuous range: Mathematically, the probability of an exact repeat in continuous space (assuming a truly continuous uniform distribution) is zero if you only have a finite number of picks, since collisions occur with probability 0 in an uncountably infinite set.
Huge discrete space: If M is extremely large (e.g., on the order of billions or more) and N is small, repeats are also unlikely: pk is tiny for every position. Since pk increases with k up to roughly sqrt(M), the maximum over k = 1..N simply sits at k = N whenever N is well below sqrt(M).
Pitfalls and Edge Cases:
Floating-point comparisons: In real-world computer systems, if picks are floating-point numbers, rounding might cause “accidental collisions” if you treat certain approximate matches as a collision. This introduces an implementation-level complication where the definition of “repeat” might not be truly identical values but “close enough” values.
Approximation: With very large M, we often use approximate formulas (e.g., Poisson approximation or birthday problem approximations) to estimate where the first repeat is likely to appear. Relying on direct combinatorial multiplication can be numerically unstable or computationally expensive.
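One common birthday-style approximation (a sketch using the standard substitution 1 − i/M ≈ exp(−i/M) for each distinctness factor; the helper names are ours) is pk ≈ ((k − 1)/M) · exp(−(k − 1)(k − 2)/(2M)), whose maximizer lies near sqrt(M). For large M it tracks the exact value closely while avoiding the long product:

```python
import math

def exact_pk(k, M):
    if k < 2:
        return 0.0
    return math.perm(M, k - 1) / M ** (k - 1) * (k - 1) / M

def approx_pk(k, M):
    # replace each factor (1 - i/M) by exp(-i/M) and sum the exponents
    if k < 2:
        return 0.0
    return (k - 1) / M * math.exp(-(k - 1) * (k - 2) / (2 * M))

M, k = 10 ** 6, 1000          # large space, k near sqrt(M)
print(exact_pk(k, M), approx_pk(k, M))
```

For M = 10^6 and k = 1000 the two values agree to well under a percent.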
How would the probability distribution shift if each pick can be repeated by the same person (multiple picks per person)?
In some scenarios, each participant might pick more than one number (or re-pick after a certain event). You could see a single participant effectively “spamming” picks. That changes the structure of the problem considerably.
Effects on pk:
The total number of announcements might exceed the number of participants. Let’s say each of the 25 participants picks multiple times in some sequence. The pk now references the k-th pick overall, but not necessarily the k-th participant.
The chance of a repeat depends on how often each participant re-picks and whether or not participants wait to see previous picks before choosing again (if that is even allowed).
Pitfalls and Edge Cases:
Sequential picking by the same person: If a person picks multiple times in close succession, they might be more likely to repeat a number they themselves announced earlier, especially if picks are not forced to be distinct.
Dependency on re-picking strategy: If the participant tries to “avoid” or “seek” their previous picks, we create correlation structures that break the simple independence assumption.
Does the result depend on whether people learn about all previous picks before choosing?
If participants only hear partial announcements or if they are not paying attention, their picks might still behave like independent draws. But if each participant is fully informed about prior picks, some might consciously or subconsciously be influenced.
Effects on pk:
Complete knowledge and rational players: They might attempt to optimize their probability of winning, e.g., intentionally picking a known existing number if they want to secure the bonus and if they happen to go at a certain time. This breaks the i.i.d. assumption thoroughly.
Partial knowledge: If a participant only knows some subset of previously announced numbers (e.g., they missed the first 5), the probability distribution for their picks shifts but not necessarily in a straightforward way.
Pitfalls and Edge Cases:
Game theory: In a game-theoretic sense, each participant might adopt a strategy to maximize their personal chance of winning the bonus, which could cause earlier collisions or sometimes cause participants to wait for a collision to occur from someone else. The overall distribution of pk could differ drastically from the naive calculation.
Real-world scenario: Could ties or near-simultaneous announcements affect who “wins”?
In some real-world processes, multiple announcements might come in at nearly the same time, or there might be a slight delay in hearing them.
Effects on pk:
Simultaneous collision: If two participants’ announcements are effectively simultaneous, the notion of “first repeat” can become ambiguous. There might need to be a tiebreaker rule or an assumption about the exact ordering.
Technical glitch or latency: In an online environment, one participant’s announcement might arrive a fraction of a second later even though they said it earlier. The order of announcements might differ from real-time order, adding unpredictability.
Pitfalls and Edge Cases:
Defining who wins: A strict ordering must be enforced or a fallback rule specified. Without it, the entire concept of pk (the k-th announcement) loses clarity.
Implementation details: If this process runs on a distributed system, network latencies or asynchronous updates might reorder messages, potentially skewing the actual index of the repeated pick.