ML Interview Q Series: Identifying a Double-Headed Coin Using Bayesian Inference After Repeated Heads.
A box contains 10,000 coins. One of the coins has heads on both sides but all the other coins are fair coins. You choose at random one of the coins. Use Bayes’ rule in odds form to find the probability that you have chosen the two-headed coin given that the first 15 tosses all have resulted in heads. What is the answer when you would have obtained 25 heads in a row in the first 25 tosses?
Short Compact Solution
For n=15, this probability is approximately 0.7662. For n=25, this probability is approximately 0.9997.
Comprehensive Explanation
Overview of the Problem
We have a large box with 10,000 coins. Among these:
Exactly one coin is double-headed (heads on both sides).
The other 9,999 coins are fair, meaning each of those has a 0.5 probability of landing heads on any toss.
We randomly select one coin from the box. We then flip that coin n times and observe that all n tosses come up heads. We want to compute the probability that the chosen coin is the double-headed coin, conditioned on the evidence that all n flips are heads.
Setting up the Bayesian Inference
We denote:
H: the event that the chosen coin is the two-headed coin.
H̄: the event that the chosen coin is one of the 9,999 fair coins.
E: the event that the first n tosses are all heads.
Our ultimate goal is to find P(H|E).
Prior Probabilities
P(H) = 1/10000 because there is exactly 1 special coin out of 10,000.
P(H̄) = 9999/10000 because the remaining 9,999 coins are fair.
Likelihood of Observing n Heads
Likelihood if we have chosen the double-headed coin (H): The probability of getting a head on any toss is 1. So the probability of observing n heads in n tosses is 1^n = 1.
Likelihood if we have chosen a fair coin (H̄): The probability of getting heads on any single toss is 0.5. Therefore, the probability of getting n heads in n tosses is (0.5)^n.
Using Bayes’ Rule in Odds Form
Bayes’ rule in odds form states: P(H|E) / P(H̄|E) = [P(H) / P(H̄)] * [P(E|H) / P(E|H̄)].
Plugging in the values:
P(H) / P(H̄) = (1/10000) / (9999/10000) = 1/9999.
P(E|H) / P(E|H̄) = 1 / (0.5^n) = 2^n.
Hence, the posterior odds are P(H|E) / P(H̄|E) = (1/9999) * 2^n = 2^n / 9999, and converting odds back to a probability gives P(H|E) = 2^n / (2^n + 9999).
Numerical Results
For n = 15, we get P(H|E) = 2^15 / (2^15 + 9999) = 32768 / (32768 + 9999) ≈ 0.7662.
For n = 25, we get P(H|E) = 2^25 / (2^25 + 9999) = 33554432 / (33554432 + 9999) ≈ 0.9997.
Hence, once you observe 25 consecutive heads, the probability that you picked the double-headed coin becomes extremely high.
Simple Python Code to Compute This
Below is a short Python snippet that illustrates how one could compute these probabilities:
def probability_double_headed(n):
    numerator = 2**n                 # likelihood weight of the double-headed coin
    denominator = numerator + 9999   # plus the combined weight of the 9,999 fair coins
    return numerator / denominator

print("Probability with 15 heads:", probability_double_headed(15))
print("Probability with 25 heads:", probability_double_headed(25))
This code computes the probabilities for n=15 and n=25 heads in a row; Python's arbitrary-precision integers keep 2**n exact, and the final division returns a float.
Potential Follow-Up Questions
How Does This Generalize to a Different Number of Special Coins?
If instead of 1 double-headed coin, there were s double-headed coins among a total of T coins, then the prior P(H) would be s/T, and P(H̄) would be (T - s)/T. You could re-derive the same formula:
P(E|H) = 1 for a double-headed coin.
P(E|H̄) = (0.5)^n for a fair coin.
Then,
P(H|E) = [ s/T * 1 ] / [ s/T * 1 + (T - s)/T * (0.5)^n ] = [ s ] / [ s + (T - s) (0.5)^n ].
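This generalized formula can be sketched in a few lines; the function name and the `s`, `T` argument names are just illustrative:

```python
def posterior_special(n, s, T):
    """Posterior probability that the chosen coin is double-headed,
    given n heads in a row, with s special coins among T total."""
    return s / (s + (T - s) * 0.5**n)

# The original problem is the special case s=1, T=10000.
print(posterior_special(15, 1, 10_000))  # ~0.7662
print(posterior_special(25, 1, 10_000))  # ~0.9997
```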
What If We Observe Tails at Some Point?
If, at any toss, you observe a tail, then it is immediately certain that the coin cannot be double-headed. In that case, P(H|E) = 0 as soon as a single tail appears. So the interesting scenario is only when you see heads on every toss up to some count n.
How Many Tosses Are Usually Needed to Be “Almost Sure”?
“Almost sure” typically means a probability extremely close to 1. From the formula 2^n / (2^n + 9999), you can see that once 2^n is significantly larger than 9999, P(H|E) approaches 1. Since log2(9999) ≈ 13.29, the posterior first exceeds 0.5 at n = 14, and as n increases further (20 or more), the probability converges quickly to 1.
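A small helper makes this concrete by inverting the posterior formula to find the smallest n that reaches a target probability (the function name and the `n_fair` default are illustrative):

```python
import math

def tosses_needed(target, n_fair=9999):
    """Smallest n with 2**n / (2**n + n_fair) >= target.
    Solves 2**n >= n_fair * target / (1 - target) for n."""
    return math.ceil(math.log2(n_fair * target / (1 - target)))

print(tosses_needed(0.5))    # 14: first n where the posterior exceeds 1/2
print(tosses_needed(0.99))   # 20
print(tosses_needed(0.999))
```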
Could There Be Any Practical Edge Cases?
In real-life applications, no coin is truly perfect, so there might be a scenario where a “two-headed” coin is actually just extremely biased (e.g., 0.9999 probability of heads). However, from a theoretical standpoint and strictly following the problem statement where one coin is double-headed, the probability formula is exact for all n. If you incorporate coin imperfections, you’d need to adjust the likelihood terms to reflect a biased model.
Could We Use a Beta Prior Instead of a Simple Prior on 0.5?
If we wanted to be Bayesian about the potential bias in the fair coins themselves (rather than believing they are exactly 0.5 heads/tails), we might place a Beta distribution on the fairness parameter. Then, after observing heads, we would update that Beta prior. For the double-headed coin, we would treat it as a degenerate distribution at probability = 1. This extends the problem but is more complicated than the original question, which assumes a strict fair-coin scenario.
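As a hedged sketch of this extension: under a Beta(a, b) prior on an ordinary coin's bias, the marginal probability of n straight heads has the closed Beta-Binomial form B(a+n, b) / B(a, b), which simply replaces 0.5^n in the posterior formula. The function names below are illustrative:

```python
from math import lgamma, exp

def marginal_heads(n, a, b):
    """P(n heads in n tosses) when the coin's bias p ~ Beta(a, b):
    the Beta-Binomial marginal B(a+n, b) / B(a, b), computed in log space."""
    log_beta = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return exp(log_beta(a + n, b) - log_beta(a, b))

def posterior_double_headed(n, a, b, n_other=9999):
    """Posterior that the coin is double-headed when the other coins'
    biases are drawn from Beta(a, b) rather than fixed at exactly 0.5."""
    return 1 / (1 + n_other * marginal_heads(n, a, b))

# A sharply peaked Beta(100, 100) models "nearly fair" coins; the answer
# stays close to the 0.7662 of the exact-0.5 model.
print(posterior_double_headed(15, 100, 100))
```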
All these considerations show that the fundamental concept—comparing likelihoods with Bayesian updating—remains the same: a prior weighting is multiplied by the likelihood ratio for the evidence, resulting in the posterior probability.
Below are additional follow-up questions
What if we only flip the coin a random number of times until we see the first tail, and then stop? How does that affect the posterior probability?
When we stop flipping as soon as a tail appears (or when we decide on some stopping criterion that depends on the outcome itself), we are dealing with a form of adaptive sampling. In such scenarios:
If even a single tail is observed, we know immediately that we do not have the double-headed coin. So at the exact moment we see a tail, the posterior probability that we have the two-headed coin becomes 0.
If we manage to flip the coin k times without ever seeing a tail and then stop (possibly by choice or because we reached a time limit), we have collected the evidence E: “k heads in a row, no tails observed, but we do not continue flipping.” We would then apply the same Bayesian update using k as the number of heads in a row.
Hence, the posterior probability after k heads (and stopping before seeing a tail) is 2^k / (2^k + 9999). The adaptive nature of stopping does not change the fundamental formula—our evidence is simply that we saw k consecutive heads but no tails up to that point. If we stop flipping for practical reasons (like running out of time), the probability that the coin is double-headed is determined by how many heads we have seen consecutively so far.
A potential pitfall is if the decision to stop flipping is correlated with some hidden factor that influences the coin’s performance (for example, if we suspect the coin might be rigged and keep flipping only while we see heads). That could bias the interpretation of results. However, in the idealized scenario where we merely watch until the first tail appears or we have a predetermined stopping point, the analysis remains straightforward.
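The claim that stopping at the first tail leaves the formula unchanged can be checked with a small Monte Carlo simulation. A smaller box (1 special coin among 99 fair ones) is used here so the conditioning event is common enough to sample; all names and sizes are illustrative:

```python
import random

def simulate(trials=200_000, n_fair=99, k=7, seed=1):
    """Among runs where the first k tosses are all heads (i.e., we never
    stopped early at a tail), estimate the fraction of coins that were
    double-headed."""
    rng = random.Random(seed)
    hits = double = 0
    for _ in range(trials):
        is_double = rng.randrange(n_fair + 1) == 0  # pick a coin at random
        # The double-headed coin always shows k heads; a fair coin must
        # survive k independent 50/50 tosses.
        if is_double or all(rng.random() < 0.5 for _ in range(k)):
            hits += 1
            double += is_double
    return double / hits

print(simulate())            # Monte Carlo estimate of the posterior
print(2**7 / (2**7 + 99))    # exact formula: ~0.5639
```

The simulated fraction matches 2^k / (2^k + 99), confirming that the stopping rule adds no extra information beyond the count of consecutive heads.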
What if there are several coins in the box with unknown biases, not just one special coin with probability 1 for heads?
This scenario generalizes the problem. Suppose each coin has some bias p_i for heads. We do not necessarily know p_i for each coin. One approach is:
Assign a prior distribution for each coin’s bias (for example, a Beta distribution for each coin if they are all drawn from some population).
Include the two-headed coin as a special case with bias p = 1.
The posterior probability that our chosen coin has a particular bias p_i given observed heads is proportional to prior(p_i) * p_i^n.
For the double-headed coin, the likelihood contribution is 1^n = 1.
A key subtlety here is that if there are multiple unknown biases, we have to integrate over all possible values of each bias for the other coins. This can get complicated quickly, requiring more advanced Bayesian inference techniques (like Markov Chain Monte Carlo or variational inference). One major pitfall is overlooking how quickly the posterior might concentrate around certain bias values if you see many heads in a row. Another is incorrectly assuming that some large sample size will always let you ignore the possibility of a near-one bias coin—when in fact, repeated heads can drastically skew the posterior toward that possibility.
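One minimal illustration of integrating over unknown biases, using a uniform prior on each ordinary coin's bias and simple midpoint-rule numerical integration (a stand-in for the more advanced techniques mentioned above; names are illustrative):

```python
def posterior_with_unknown_biases(n, n_other=9999, grid=10_000):
    """Posterior for the double-headed coin when each ordinary coin's
    bias p has a uniform prior on [0, 1]. The marginal likelihood of
    n straight heads is the integral of p**n dp, approximated here by
    the midpoint rule (exact value: 1 / (n + 1))."""
    ps = [(i + 0.5) / grid for i in range(grid)]   # midpoints of the grid
    marginal = sum(p**n for p in ps) / grid        # ≈ 1 / (n + 1)
    return 1 / (1 + n_other * marginal)

print(posterior_with_unknown_biases(15))
```

Note how different this is from the strict fair-coin model: under a uniform prior, 15 straight heads has marginal probability 1/16 rather than 1/32768, so the posterior for the double-headed coin collapses to roughly 0.0016. This is a concrete demonstration that the assumed prior on the other coins' biases matters enormously.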
How do numerical or computational issues arise for large n?
When n becomes very large (say hundreds or thousands of flips all heads), direct computation of (0.5)^n or 2^n can exceed the range of standard floating-point representations in many programming languages. This may lead to floating-point underflow or overflow. A common strategy is to work in log space to avoid these numerical issues:
Instead of computing 2^n directly, compute log(2^n) = n * log(2) and then exponentiate only at the final step (or keep everything in log form and use a log-sum-exp trick).
A pitfall is that failing to handle large exponents correctly can yield P(H|E) values numerically stuck at 1.0 (or 0.0) prematurely due to floating-point limitations, which might mask the true probabilities if one wishes to maintain more nuanced calculations.
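A minimal log-space version of the posterior computation, using the log-sum-exp trick described above (the function name is illustrative):

```python
import math

def log_posterior_double_headed(n, n_fair=9999):
    """Compute log P(H|E) = a - log(exp(a) + exp(b)) with
    a = n * log(2) and b = log(n_fair), stabilized by subtracting the
    maximum before exponentiating (the log-sum-exp trick)."""
    a = n * math.log(2)       # log-weight of the double-headed hypothesis
    b = math.log(n_fair)      # log-weight of the fair-coin hypotheses
    m = max(a, b)
    log_denom = m + math.log(math.exp(a - m) + math.exp(b - m))
    return a - log_denom

# At n = 10_000, a float computation of 0.5**n would underflow to 0,
# but the log-space version still behaves sensibly:
print(log_posterior_double_headed(10_000))
print(math.exp(log_posterior_double_headed(15)))  # ~0.7662
```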
What if after a large number of heads, we suddenly observe a small number of tails and want to estimate the probability of having the double-headed coin?
Strictly, if we see at least one tail, the probability of having a strictly double-headed coin is zero in the classic sense: a true double-headed coin cannot produce tails at all. However, in a real-world scenario where the coin might be “almost” double-headed (extremely biased) but not truly guaranteed to be heads every time, the posterior is no longer forced to zero. Instead, we would update based on some model of “almost double-headed” coin.
But under the exact problem statement: if you see one tail at any point, the posterior probability that the coin is truly double-headed is zero. A subtle real-world pitfall is if an apparent “tail” was a reading mistake, or if the coin flip didn’t land flat. Real experiments can have errors, so purely applying the theoretical result might fail to account for measurement or experimental uncertainties.
How does prior knowledge about the fairness of coins alter the result?
If you have prior knowledge that the 9,999 “fair” coins are not exactly fair but might have a distribution of biases around 0.5 (for instance, some are more likely to produce heads 0.51 of the time, others 0.49, etc.), then your prior for each “fair” coin changes. For instance, you might adopt a Beta(α, β) distribution for the bias of each coin. The double-headed coin is still a degenerate case at bias=1. Then, after seeing n heads, you apply Bayesian updating for each coin’s possible bias. The updated posterior probability that your chosen coin is the special coin with p=1 depends on integrating out the posterior for all possible biases in the “normal” coins.
A subtlety is that if your prior strongly suggests that many coins have biases slightly above 0.5, then seeing n heads in a row is not as improbable under those coins as it would be under a strict 0.5 assumption. The pitfall is to fail to incorporate that knowledge, which might lead you to overestimate the posterior that you have the double-headed coin.
Would the calculation change if we suspected there was also a coin with tails on both sides?
Suppose in addition to one double-headed coin, there is one double-tailed coin, and 9,998 fair coins. If we observe heads repeatedly, the event of having chosen the double-tailed coin is instantly zero probability after the first head. Formally:
If you observe even one head, the posterior probability of having the double-tailed coin is 0.
The presence of a double-tailed coin does not affect the calculation for the double-headed coin unless you consider extremely unlikely scenarios of partial bias or observational errors.
In the pure theoretical sense, seeing a head even once discards the double-tailed coin from the posterior. Therefore, the original formula for the double-headed scenario remains nearly the same: we just exclude the double-tailed coin after the first head. A common pitfall is forgetting to remove logically impossible hypotheses from the posterior once contradictory evidence is observed.
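A sketch of the three-hypothesis update (one double-headed coin, one double-tailed coin, 9,998 fair coins; function and variable names are illustrative):

```python
def posteriors(heads, tails, n_fair=9998):
    """Posterior probabilities for three hypotheses after observing
    `heads` heads and `tails` tails. Each weight is the prior coin
    count times the likelihood of the observed sequence."""
    w_hh = 1 if tails == 0 else 0             # double-headed: any tail is fatal
    w_tt = 1 if heads == 0 else 0             # double-tailed: any head is fatal
    w_fair = n_fair * 0.5 ** (heads + tails)  # 9,998 fair coins
    total = w_hh + w_tt + w_fair
    return w_hh / total, w_tt / total, w_fair / total

print(posteriors(15, 0))  # double-tailed ruled out; ~(0.766, 0.0, 0.234)
print(posteriors(14, 1))  # one tail also rules out the double-headed coin
```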
Could these results be used in A/B testing scenarios where we keep seeing one version “win” repeatedly?
Yes, there is a conceptual analogy in A/B testing: imagine you have two “coins” (two versions of a website), each with some unknown “bias” (conversion probability). Observing heads would correspond to conversions. If you see one version consistently outperform the other to an extreme extent, you might suspect that version is drastically better. The difference is that in typical A/B testing, both versions have uncertain but possible probabilities for “heads,” whereas in the coin puzzle, one coin is known to be a degenerate case (probability=1). A pitfall arises when data is suspiciously perfect—one must question if there is something systematically biased about how the experiment is conducted.
In practical A/B tests, we would never have a scenario where a variant truly has a 100% conversion rate, but we can still apply a Bayesian approach to update beliefs and see how likely it is that one variant is drastically better. If we keep seeing “wins,” the posterior for that variant’s conversion rate will shift accordingly. However, real-world experiments also face confounding factors, user segmentation issues, or temporal changes—unlike the simple coin model, which can lead to overconfidence if such real-world complexities are ignored.
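For concreteness, here is a minimal Bayesian A/B sketch under uniform Beta priors; all the traffic numbers and names are made up for illustration:

```python
import random

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B) under independent
    Beta(1 + conversions, 1 + failures) posteriors (uniform priors
    on each variant's conversion rate)."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        > rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        for _ in range(draws)
    )
    return wins / draws

# Variant A converts 90/100 visitors vs. B's 60/100: A is very likely better.
print(prob_a_beats_b(90, 100, 60, 100))
```

Unlike the coin puzzle, neither hypothesis is degenerate here: even an unbroken run of "wins" only shifts the Beta posterior toward 1, it never fixes it there.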