ML Interview Q Series: Calculating Lottery Winner Probability for Half-Sold Tickets via Poisson Thinning

May 07, 2025


Question: A lottery organization distributes one million tickets every week. Each ticket has a visible 6-digit number on one end and a hidden 6-digit number on the other end, covered by scratch-away paint. If the hidden number matches the visible number, the ticket is a winner. No two tickets have the same visible number or the same hidden number. In a particular week, only half of the tickets are sold. What is the probability that exactly r of the sold tickets are winners, for r = 0, 1, 2,... ?

Short Compact solution

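The probability that exactly r tickets (out of the half-million sold) are winners is given by the Poisson distribution with parameter 1/2. Hence,

$$P(\text{exactly } r \text{ winners among the sold tickets}) = e^{-1/2}\,\frac{(1/2)^r}{r!}, \qquad r = 0, 1, 2, \ldots$$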

Comprehensive Explanation

Intuitively, any particular ticket has only a very small chance that its hidden number matches its visible number. Because the visible numbers are all distinct and the hidden numbers are all distinct, the hidden numbers form a permutation of the visible ones; if that permutation is random, each ticket is its own match with probability 1 in 1,000,000. Over a large number of tickets (one million), the count of such rare events is well approximated by a Poisson distribution with mean 1. In other words, one winning ticket exists per million printed on average, and the total number of winners among the entire million tickets is approximately Poisson(1).

However, only half of those one million tickets are sold. Buyers cannot see the hidden numbers, so whether a ticket is sold does not depend on whether it is a winner, and each winning ticket has a 1/2 chance of being among the sold half. Starting with Poisson(1) for the total number of winners and then “thinning” each winner with probability 1/2 yields a new Poisson distribution with parameter 1/2 for the winners among the sold tickets.

Thus, the number of winners among the half-million sold is Poisson with mean 1/2, leading to the probability of exactly r winners being:

$$P(\text{exactly } r \text{ winners}) = e^{-1/2}\,\frac{(1/2)^r}{r!}$$

Here:

r is the number of winning tickets among those sold.
1/2 is the “thinned” mean, because the original Poisson mean (1 winner expected in a million) is multiplied by the 1/2 chance that a winning ticket is sold.
e^{-1/2} (1/2)^r / r! is the standard Poisson formula for parameter 1/2.

This result cleanly captures both the low probability of matching and the fact that only half the tickets are actually in circulation.
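The argument above can be checked with a small Monte Carlo sketch. It assumes the hidden numbers are a uniformly random permutation of the visible numbers, uses a scaled-down lottery of 200 tickets (an illustrative choice, not the one million in the question), and treats the first half of the tickets as the sold ones. The function name and parameters are made up for this example.

```python
import random
from collections import Counter
from math import exp, factorial

def simulate_sold_winners(n_tickets=200, trials=5000, seed=0):
    """Empirical distribution of winners among the sold half of the tickets."""
    rng = random.Random(seed)
    n_sold = n_tickets // 2
    hidden = list(range(n_tickets))
    counts = Counter()
    for _ in range(trials):
        rng.shuffle(hidden)  # hidden[i] is the hidden number on ticket i
        # Assumption: tickets 0 .. n_sold-1 are the sold ones. A ticket wins
        # when its hidden number equals its visible number i.
        winners = sum(1 for i in range(n_sold) if hidden[i] == i)
        counts[winners] += 1
    return {r: c / trials for r, c in sorted(counts.items())}

empirical = simulate_sold_winners()
for r in range(4):
    poisson = exp(-0.5) * 0.5 ** r / factorial(r)
    print(f"r={r}: simulated={empirical.get(r, 0.0):.4f}, Poisson(1/2)={poisson:.4f}")
```

The simulated frequencies should sit close to the Poisson(1/2) values even at this small scale, because the fraction sold, not the absolute number of tickets, drives the parameter.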

Possible Follow-up Questions

How do we verify the expected number of winners is 1/2 for the sold tickets?

Once we accept that the total number of winners among all one million tickets follows (approximately) Poisson(1), the expected number of total winners is 1. Each one of those winners has a 50% probability of being sold (because only half the tickets are purchased). Thus the expected number among the sold portion is 1 multiplied by 1/2, which is 0.5. Poisson “thinning” results in a Poisson distribution with parameter 1/2.
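The same value also follows from linearity of expectation alone, without any Poisson approximation. As a sketch, write I_i for the indicator that sold ticket i is a winner (notation introduced here, not in the original problem) and assume each ticket matches with probability 1/1,000,000:

$$\mathbb{E}[\text{sold winners}] = \sum_{i \in \text{sold}} \mathbb{E}[I_i] = \sum_{i \in \text{sold}} \Pr(\text{ticket } i \text{ wins}) = 500{,}000 \times \frac{1}{1{,}000{,}000} = \frac{1}{2}$$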

What if some other fraction f of the tickets were sold?

If, for instance, only a fraction f of the tickets are sold, each winning ticket has probability f of being among the sold group. Therefore, the number of sold winners is Poisson(λf) when the total number of winners is Poisson(λ). In this lottery scenario, λ = 1 for the entire million tickets (on average), so if half are sold (f=1/2), the parameter becomes 1/2. For a fraction f, it would become Poisson(1·f).
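The thinning identity can be checked numerically. The sketch below conditions on the total number of winners K ~ Poisson(λ), keeps each winner with probability f, and compares the result with Poisson(λf). The function names and the truncation point `k_max` are choices made for this illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def thinned_pmf(r, lam, f, k_max=60):
    """P(r sold winners) when total winners K ~ Poisson(lam), each sold w.p. f."""
    # Sum over the total number of winners k; given k, the number sold is
    # Binomial(k, f). The tail beyond k_max is negligible for small lam.
    return sum(
        poisson_pmf(k, lam) * comb(k, r) * f ** r * (1 - f) ** (k - r)
        for k in range(r, k_max + 1)
    )

lam, f = 1.0, 0.5
for r in range(4):
    print(r, thinned_pmf(r, lam, f), poisson_pmf(r, lam * f))
```

The two columns agree to floating-point precision, and the same holds for other values of f such as 0.3.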

Why is the Poisson approximation valid here?

When you have a large number of tickets n=1,000,000 and a very small probability p=1/n=1/1,000,000 that any individual ticket is “its own match,” the number of matches behaves almost like a Binomial(n, p) count. (The matches are not exactly independent, because the hidden numbers are all distinct, but the dependence is negligible for large n.) For large n and small p with np=1 fixed, Binomial(n, p) can be approximated by Poisson(np)=Poisson(1). This is a common approximation in probability theory known as the Poisson limit theorem.
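To see how good the approximation is, the exact distribution can be computed by inclusion-exclusion for a scaled-down lottery. The sketch below assumes the hidden numbers are a uniformly random permutation of the visible ones; `exact_pmf` and the sizes n=1000, m=500 are illustrative choices, not part of the original problem.

```python
from fractions import Fraction
from math import comb, exp, factorial

def exact_pmf(r, n, m):
    """Exact P(exactly r winners among m sold tickets out of n printed).

    Choose which r sold tickets match, then use inclusion-exclusion to
    forbid matches on the other m - r sold tickets. Unsold tickets are
    unrestricted.
    """
    ways = comb(m, r) * sum(
        (-1) ** j * comb(m - r, j) * factorial(n - r - j)
        for j in range(m - r + 1)
    )
    return Fraction(ways, factorial(n))

n, m = 1000, 500  # scaled-down stand-in for 1,000,000 printed, 500,000 sold
for r in range(4):
    exact = float(exact_pmf(r, n, m))
    approx = exp(-0.5) * 0.5 ** r / factorial(r)
    print(f"r={r}: exact={exact:.6f}, Poisson(1/2)={approx:.6f}")
```

Using `Fraction` keeps the arithmetic exact until the final conversion to float, so any gap between the two columns is due to the Poisson approximation rather than rounding.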

How do we compute this probability in Python?

Below is a simple snippet that uses only the standard-library math module to compute Poisson probabilities for given r:

import math

def poisson_pmf(r, lam):
    # lam is the Poisson parameter
    return (math.exp(-lam) * (lam ** r)) / math.factorial(r)

# Example usage for lam=0.5, r from 0 to 5
lam = 0.5
for r in range(6):
    prob = poisson_pmf(r, lam)
    print(f"r={r}, Probability={prob}")

This code calculates P(X=r) for X ~ Poisson(0.5) using the direct formula. In a practical setting, you might instead use libraries like scipy.stats.poisson for numerical stability and convenience.
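If scipy is not available, one standard-library alternative is to evaluate the formula in log space with `math.lgamma`. This is a minimal sketch rather than a replacement for a tested library routine, and the function name is chosen here for illustration.

```python
import math

def poisson_pmf_log(r, lam):
    """Poisson pmf computed in log space to avoid overflow for large r or lam."""
    if lam == 0:
        return 1.0 if r == 0 else 0.0
    # log P = -lam + r*log(lam) - log(r!), with log(r!) = lgamma(r + 1)
    log_p = -lam + r * math.log(lam) - math.lgamma(r + 1)
    return math.exp(log_p)

print(poisson_pmf_log(2, 0.5))      # same value as the direct formula
print(poisson_pmf_log(500, 500.0))  # 500.0 ** 500 is too large for a float
```

For the small parameter in this problem (λ = 0.5 and r up to a handful) the direct formula is perfectly adequate. The log-space form matters only when r or λ is large.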

Does the Poisson(1/2) distribution also describe variance?

Yes. A Poisson random variable with parameter λ has mean λ and variance λ. So for λ=1/2, the variance is also 1/2. That means in this lottery scenario, we expect on average 0.5 winners in the sold half, with the standard deviation being sqrt(0.5) ≈ 0.707.

What if the generation of ticket numbers had some correlation?

The derivation heavily relies on the assumption that each ticket is printed with a unique visible number and a unique hidden number, both chosen in a way that keeps each pairing equally likely to match or not match. If correlations creep in (for instance, some partial deterministic rule or known pattern in the hidden numbers), the match probability might depart from the ideal 1/n scenario. In that case, the Poisson approximation with λ=1 might no longer hold, and a more elaborate analysis would be required to determine the correct distribution.

Could we extend this analysis if not exactly one million tickets are printed?

Yes. The core logic is: (1) The total count of winners (over some large number of tickets) is small compared to the total, (2) each ticket’s chance of being a “self-match” remains about 1/total_tickets, and (3) only a fraction of tickets are sold. These three conditions typically yield a Poisson distribution for large numbers. If you had a different total of N tickets, but still with each ticket uniquely assigned visible and hidden numbers, you would shift the parameter accordingly. The key idea remains a low-probability, large-number scenario leading to a Poisson result.

Below are additional follow-up questions

What if not all the tickets have an equal chance of getting sold? Could that affect the probability distribution for r?

If certain tickets are more likely to be purchased than others—maybe due to perceived “luckiness” of their visible numbers—the thinning argument changes. Poisson thinning assumes each winning ticket has the same probability 1/2 of being sold, independently of others. If some subsets of tickets have a higher chance of sale, while others are less likely to be sold, this breaks the simple thinning assumption.

In that scenario, one could model the probability that any particular ticket is sold as varying by ticket, say p_i for ticket i. The equal-probability assumption then fails, but the conclusion is more robust than it first appears. Because the hidden numbers are covered, a buyer's preference can depend only on the visible number, so whether a ticket is sold is independent of whether it is a winner. Ticket i is then a sold winner with probability p_i/n, and the number of sold winners is a sum of many rare, nearly independent indicators, which is still approximately Poisson with mean (1/n)·Σ p_i. If the p_i average to 1/2, the mean stays at 1/2. The distribution changes materially only if sale and winning become correlated, for example if information about the hidden numbers leaks, or if the preferred visible numbers are also more likely to match because of how the tickets are printed.
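One way to gauge how much unequal sale probabilities matter is to compute the distribution directly under a simple model. The sketch below assumes that ticket i is a sold winner with probability p_i/n independently across tickets (that is, sale is independent of the hidden number and matches are treated as independent). The linear profile of p_i and the size n=2000 are hypothetical.

```python
from math import exp, factorial

def sold_winner_pmf(sale_probs, k_max=8):
    """P(r sold winners) for r = 0..k_max under per-ticket sale probabilities."""
    n = len(sale_probs)
    pmf = [1.0] + [0.0] * k_max
    for p in sale_probs:
        q = p / n  # chance this ticket both wins and is sold (model assumption)
        # Standard dynamic-programming update for a sum of independent
        # Bernoulli variables, processed from high counts to low.
        for k in range(k_max, 0, -1):
            pmf[k] = pmf[k] * (1 - q) + pmf[k - 1] * q
        pmf[0] *= 1 - q
    return pmf

n = 2000  # scaled-down number of tickets
sale_probs = [0.2 + 0.6 * i / (n - 1) for i in range(n)]  # hypothetical, averages 0.5
mean = sum(sale_probs) / n
pmf = sold_winner_pmf(sale_probs)
for r in range(4):
    poisson = exp(-mean) * mean ** r / factorial(r)
    print(f"r={r}: unequal p_i={pmf[r]:.5f}, Poisson({mean:.2f})={poisson:.5f}")
```

Under these assumptions the result is governed by the average sale probability, so the printed columns stay close to each other.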

How does the probability change if each ticket can be purchased multiple times by the same buyer, or if group purchases occur?

The assumption in the original problem is typically that each ticket is uniquely bought or not bought, but in real scenarios, group purchases or single buyers taking large blocks of tickets could occur. If a single buyer purchases multiple tickets, that usually does not change the Poisson assumption so long as each ticket’s underlying chance of being a winner remains independent, and the fraction of tickets sold remains about 1/2. The distribution for r winners among sold tickets only changes if there is some correlation introduced by group buying.

A subtle edge case is if the lottery rules allow “repeat printing” or reissuing the same ticket ID for different buyers (which was not assumed in the original problem). Then the uniqueness assumptions for visible and hidden numbers might not hold. That scenario would require a separate analysis.

Could a large prize or marketing campaign influence the distribution from week to week?

Yes. If, for example, the lottery runs a special promotion that draws more attention or changes which tickets get sold, the fraction of sold tickets could differ from 1/2. Moreover, there might be a scenario where the lottery deliberately adjusts the printing process (e.g., intentionally placing more matching pairs) to generate publicity. The Poisson model relies on the assumption that the event of a match is purely random with probability 1 in a million for each ticket. If that assumption is violated, the distribution might no longer be Poisson(1) for the total winners. Instead, it might have a higher or lower mean, or even a different shape, depending on how the lottery modifies the printing or marketing approach.

What if the data indicates that the average number of winners in sold tickets consistently deviates from 0.5 over time?

In practice, lotteries collect empirical data on how many winning tickets are discovered each week. If you observe that the long-term average of winners among sold tickets significantly exceeds or falls below 0.5, this suggests that either:

The randomization process is not uniform (some hidden correlation or system flaw), or
The fraction of tickets sold is not truly 1/2 each week on average, or
There is a more complicated phenomenon at play (e.g., some unscrupulous behavior, or certain patterns in the ticket printing).

In such cases, you would re-estimate the Poisson parameter λ based on observed data. Instead of assuming λ=1 for the total printed and 1/2 for the sold subset, you would gather a historical record of winners and fit the best Poisson (or other distribution) to the data.
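For a Poisson model, the maximum-likelihood estimate of the mean is simply the sample average of the weekly counts. A minimal sketch, using a made-up record of weekly winners:

```python
def estimate_sold_rate(weekly_winners):
    """Return (lambda_hat, approx_std_error) for Poisson weekly counts."""
    weeks = len(weekly_winners)
    lam_hat = sum(weekly_winners) / weeks  # MLE of the Poisson mean
    std_error = (lam_hat / weeks) ** 0.5   # since Var(X) = lambda for Poisson
    return lam_hat, std_error

# Hypothetical record of winners found in 12 consecutive weeks.
weekly_winners = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0, 1]
lam_hat, std_error = estimate_sold_rate(weekly_winners)
print(f"estimated mean among sold tickets: {lam_hat:.3f} +/- {std_error:.3f}")
```

With only a dozen weeks the standard error is large relative to the estimate, so a long record is needed before concluding that the mean really differs from 0.5.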

Do we need to worry about a maximum of 500,000 winners, since only 500,000 tickets are sold?

Technically, yes: the event “500,001 or more winners among sold tickets” is impossible because there are only 500,000 sold tickets, whereas the Poisson model assigns it a positive, though astronomically small, probability. That mismatch is the usual price of a limit approximation. For all realistic values of r (i.e., 0 through a few), the Poisson(1/2) model is accurate, and the boundary at r>500,000 does not matter in practice because the probability involved is vanishingly small.

What if the visible numbers and the hidden numbers do not have the same exact uniform distribution over 10^6 possibilities?

If there is a bias in generating the digits—for example, some six-digit combinations occur more frequently than others—this might shift the distribution. In the ideal setting, each visible and hidden code is equally likely among the million possible codes, and each is unique across tickets. If the distribution is skewed, the chance of a random match might not remain precisely 1/1,000,000. The model might be more complicated, and one could get a slightly different expected count of winners. However, the binomial-to-Poisson approximation can still hold if the total expected number of matches is close to 1, and independence can be reasonably assumed. Slight biases might not drastically change the end result, but in a real-world scenario, these potential biases would be investigated or tested for.

Is it possible to combine tickets across multiple weeks to analyze a long-term trend in winners?

Yes. Over multiple weeks, you could consider the distribution of total winners across T weeks. If each week yields a Poisson(1/2)-distributed count of winners (assuming the conditions remain stable), then the sum of these weekly counts over T independent weeks follows a Poisson with parameter (T * 1/2). This helps detect anomalies or biases over time, because you can compare the observed total winners with the predicted Poisson(T/2). If there is a consistent discrepancy, it signals that one or more assumptions (randomness of matching, fraction sold, etc.) may be invalid.
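Such a comparison can be sketched as an upper-tail probability under Poisson(T/2). The observed total of 40 winners over 52 weeks below is hypothetical, and the helper names are chosen for this example.

```python
import math

def poisson_pmf(k, lam):
    # Log-space evaluation keeps this stable for larger lam.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def upper_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(observed))

weeks = 52
lam_total = weeks * 0.5   # Poisson(T/2) under the stable-conditions assumption
observed_total = 40       # hypothetical yearly total of sold winners
p_value = upper_tail(observed_total, lam_total)
print(f"P(total >= {observed_total}) under Poisson({lam_total}) = {p_value:.4f}")
```

A very small tail probability would suggest that the weekly mean is higher than 1/2, although it does not say which assumption failed.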

Could extreme outliers in a single week’s result invalidate the Poisson assumption?

Outlier weeks—say, observing five or six winners in a single week, or zero winners for many weeks in a row—do not necessarily invalidate the Poisson assumption. A Poisson distribution with mean 0.5 certainly allows such outcomes, albeit with low probability. However, if outliers occur more frequently than the Poisson probabilities would predict, this suggests the presence of overdispersion. Overdispersion means the variance of the observed data over time exceeds the Poisson theoretical variance (which equals the mean). When overdispersion is detected, one might switch to a more flexible model, such as a negative binomial or a Poisson mixture, to capture the extra variability. Analyzing real data is crucial for confirming or refuting the pure Poisson assumption.
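A quick diagnostic for overdispersion is the ratio of sample variance to sample mean, which should be near 1 for Poisson data. The two weekly records below are invented for illustration.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    return variance(counts) / mean(counts)

typical_weeks = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0, 1]  # hypothetical
bursty_weeks = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6]   # hypothetical

print(f"typical: {dispersion_index(typical_weeks):.2f}")
print(f"bursty:  {dispersion_index(bursty_weeks):.2f}")
```

With short records this ratio is noisy, so a value somewhat above 1 is weak evidence on its own. A persistent, large excess is what would motivate a negative binomial or mixture model.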

What considerations arise if the lottery is required by law to pay out a minimum number of winners each week?

In some jurisdictions, there could be a legal or contractual requirement that ensures a minimum number of winning tickets. If the lottery manipulates the printing of hidden codes to guarantee at least one match among the printed tickets each week, the total is no longer Poisson(1). It might instead be a “zero-truncated Poisson,” or one planted match plus a Poisson-distributed number of accidental ones. Thinning still applies to the sold tickets, but a guaranteed printed winner is itself sold only with probability 1/2, so zero winners among the sold tickets remains possible. For example, with one planted match plus approximately Poisson(1) accidental matches, the sold count is a Bernoulli(1/2) variable plus an independent Poisson(1/2) variable, and the probability of no sold winner is (1/2)·e^{-1/2} ≈ 0.30. Only if the guarantee applies to the sold tickets themselves does the probability of zero winners drop to zero, leaving a distribution supported on r≥1. In either case, the standard Poisson(1/2) formula no longer applies.
