ML Interview Q Series: Calculating Double-Headed Coin Probability Using Bayesian Inference After Ten Heads
There's a jar containing 1000 coins, with 999 of them being fair and 1 that has heads on both sides. You pick one coin at random from the jar and flip it 10 times. After seeing 10 heads in a row, what is the probability that this chosen coin is the double-headed one, and what is the probability the next flip will also be heads?
Comprehensive Explanation
Bayesian reasoning is the most straightforward way to address this problem: we update our prior belief about which coin we picked in light of observing 10 heads in a row.
Probability That the Coin Is Double-Headed
Initially, before flipping, the chance of having picked the double-headed coin is 1/1000, and the chance of having picked a fair coin is 999/1000. After we observe 10 heads consecutively, we apply Bayes' Theorem to revise these probabilities.
Below is the central Bayesian formula we are applying:
P(Double-Headed Coin | 10 Heads) = [P(10 Heads | Double-Headed Coin) * P(Double-Headed Coin)] / P(10 Heads)
Where the terms on the right-hand side are:
P(10 Heads | Double-Headed Coin) = 1 because a two-headed coin always shows heads.
P(Double-Headed Coin) = 1/1000 because that's the prior probability of choosing the double-headed coin.
P(10 Heads) is the total probability of seeing 10 heads in a row from either coin.
We compute P(10 Heads) by considering the two scenarios (picking the double-headed coin or a fair coin) and weighting by their respective probabilities:
P(10 Heads) = [P(Double-Headed Coin) * P(10 Heads | Double-Headed Coin)] + [P(Fair Coin) * P(10 Heads | Fair Coin)]
We know:
P(Double-Headed Coin) = 1/1000
P(Fair Coin) = 999/1000
P(10 Heads | Double-Headed Coin) = 1
P(10 Heads | Fair Coin) = (1/2)^10 = 1/1024
Hence:
P(10 Heads) = (1/1000 * 1) + (999/1000 * 1/1024)
Substitute this into the numerator and denominator of Bayes' Theorem:
Posterior = [ (1/1000) * 1 ] / [ (1/1000) * 1 + (999/1000)*(1/1024) ]
Simplifying this fraction shows the probability of having the double-headed coin after observing 10 heads is 1024 / (1024 + 999) = 1024 / 2023, which is approximately 0.506.
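As a quick sanity check, the exact arithmetic can be reproduced with Python's fractions module (a minimal sketch of the computation above):

from fractions import Fraction

prior_double = Fraction(1, 1000)
prior_fair = Fraction(999, 1000)
lik_double = Fraction(1)            # P(10 Heads | Double-Headed Coin)
lik_fair = Fraction(1, 2) ** 10     # P(10 Heads | Fair Coin) = 1/1024

posterior_double = (prior_double * lik_double) / (
    prior_double * lik_double + prior_fair * lik_fair
)
print(posterior_double, float(posterior_double))   # 1024/2023 ≈ 0.506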
Probability That the Next Toss Is Also Heads
After updating our belief that the coin is double-headed with probability 0.506 (approximately) and a fair coin with probability 0.494 (approximately), the probability of seeing a head on the next toss is computed by taking a weighted average:
Probability of Next Head = [Posterior Probability of Double-Headed Coin] * 1 + [Posterior Probability of Fair Coin] * 0.5
= 0.506 * 1 + 0.494 * 0.5 ≈ 0.506 + 0.247 ≈ 0.753
Hence there is about a 75.3% chance that the next toss will also show heads.
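The same weighted average is easy to check in code; here is a minimal sketch that reuses the exact posterior 1024/2023 derived above:

from fractions import Fraction

posterior_double = Fraction(1024, 2023)     # posterior from the Bayes update above
posterior_fair = 1 - posterior_double       # 999/2023

p_next_head = posterior_double * 1 + posterior_fair * Fraction(1, 2)
print(p_next_head, float(p_next_head))      # 3047/4046 ≈ 0.753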
Potential Follow-Up Questions
If we had tossed the coin 11 times and observed 11 heads, how would that change the calculations?
To handle 11 heads in a row, we would again apply Bayesian reasoning. The prior probabilities remain the same (1/1000 for the double-headed coin vs 999/1000 for a fair coin), but now P(11 Heads | Fair Coin) would be (1/2)^11 = 1/2048. The update step remains structurally identical; only the likelihood changes to reflect 11 heads instead of 10. Numerically, this increases our confidence that the coin is the double-headed one, as the sketch below shows.
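A small helper function makes the comparison concrete; the function name and structure below are just one way to sketch it, using the same 1/1000 prior:

def posterior_double_headed(num_heads, prior_double=1/1000):
    # Posterior probability of the double-headed coin after num_heads consecutive heads
    prior_fair = 1 - prior_double
    lik_double = 1.0                    # a double-headed coin always shows heads
    lik_fair = 0.5 ** num_heads         # 1/1024 for 10 heads, 1/2048 for 11
    evidence = prior_double * lik_double + prior_fair * lik_fair
    return prior_double * lik_double / evidence

print(posterior_double_headed(10))   # ≈ 0.506
print(posterior_double_headed(11))   # ≈ 0.672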
Would the result change if there was a different prior distribution of coins in the jar?
Yes. If there were more or fewer double-headed coins, we would change P(Double-Headed Coin) accordingly. This directly affects the posterior probability through Bayes' Theorem. A larger number of double-headed coins in the jar makes it more probable we picked one initially, so observing consecutive heads would yield a higher posterior probability that the coin is double-headed.
How do we handle a scenario where there might be coins with different biases instead of just fair or double-headed?
In a more general setting, we sum over all possible coin types and their biases. For instance, if the jar contained multiple biased coins, you would compute the overall probability of 10 heads by taking a weighted sum of all possible bias probabilities times their prior probabilities. The main difference would be in P(10 Heads | coin type), which would be p^10 if the coin's bias is p for heads.
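As an illustration, here is a minimal sketch with a hypothetical jar holding three coin types (the specific mix of biases and prior weights is invented for the example):

# Hypothetical jar composition: (heads probability p, prior weight)
coin_types = [
    (0.5, 0.900),   # fair coins
    (0.7, 0.099),   # mildly biased coins
    (1.0, 0.001),   # the double-headed coin
]

num_heads = 10
# Unnormalized posterior weight for each coin type: prior * p^10
weights = [prior * (p ** num_heads) for p, prior in coin_types]
total = sum(weights)
for (p, _), w in zip(coin_types, weights):
    print(f"p = {p}: posterior = {w / total:.4f}")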
Is there a direct computational way to verify these probabilities?
Yes, you could simulate a large number of random picks and flips programmatically:
import random

def simulate(num_trials=10_000_000):
    double_sided_count = 0   # trials with 10 heads where the coin was the double-headed one
    next_head_count = 0      # trials with 10 heads where the 11th toss was also heads
    ten_heads_count = 0      # trials where all of the first 10 tosses were heads
    for _ in range(num_trials):
        # Pick a coin: 1-in-1000 chance of drawing the double-headed coin
        coin_type = 'double' if random.random() < 1/1000 else 'fair'
        # Toss it 10 times
        heads_count = 0
        for _ in range(10):
            if coin_type == 'double':
                heads_count += 1
            elif random.random() < 0.5:
                heads_count += 1
        if heads_count == 10:
            # We have observed 10 heads; record whether the coin is double-sided
            ten_heads_count += 1
            if coin_type == 'double':
                double_sided_count += 1
            # Check the outcome of the next (11th) toss
            if coin_type == 'double':
                next_flip_head = True
            else:
                next_flip_head = (random.random() < 0.5)
            if next_flip_head:
                next_head_count += 1
    return double_sided_count, next_head_count, ten_heads_count

ds_count, nh_count, th_count = simulate()
print("Probability coin is double-headed after 10 heads:", ds_count / th_count)
print("Probability next toss is heads after 10 heads:", nh_count / th_count)
This code (though simplified and not optimized) illustrates a Monte Carlo approach. By counting how often the coin is double-sided among all trials that yield 10 heads, and how often the next toss is heads under the same condition, the simulation would approximate the analytical answer.
What if the jar had more coins, say 10,000 coins, but still only 1 double-headed coin?
Increasing the total number of coins in the jar while keeping only 1 double-headed coin makes its prior even smaller. After observing 10 heads, the posterior probability of having the double-sided coin is still found via Bayes' Theorem, but the initial prior of picking the double-sided coin would be 1/10000 instead of 1/1000. The net effect is that we would need to see more consecutive heads to achieve the same level of confidence that the picked coin is double-headed.
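A short sketch makes this concrete by asking how many consecutive heads are needed before the posterior crosses 0.5 (the threshold is just an illustrative choice):

def posterior_double_headed(num_heads, prior_double):
    prior_fair = 1 - prior_double
    evidence = prior_double + prior_fair * 0.5 ** num_heads
    return prior_double / evidence

for prior in (1/1000, 1/10000):
    n = 1
    while posterior_double_headed(n, prior) < 0.5:
        n += 1
    print(f"prior {prior}: {n} consecutive heads needed for posterior > 0.5")

With these priors the loop reports 10 heads for the 1/1000 jar and 14 heads for the 1/10,000 jar, matching the intuition that a smaller prior demands a longer streak.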
By considering all these scenarios, we see that Bayesian thinking and careful breakdown of likelihoods and priors allow us to adapt to various real-world changes in problem assumptions.
Below are additional follow-up questions
If we slightly miscount the number of heads (due to human error), how does that affect the probability updates?
When you rely on observed outcomes, any error in counting heads and tails can distort the Bayesian update. Suppose there's a small chance that you mistake a tail for a head or vice versa. This modifies the likelihood term in Bayes’ Theorem. Instead of treating P(10 heads | fair coin) as (1/2)^10, for example, you must factor in the probability of miscounting. You might end up with an “effective” probability of heads that is slightly larger or smaller than 1/2 for a fair coin, depending on the error rate. The posterior probability that the coin is double-headed may be either inflated or deflated, depending on how often such errors occur.
The subtle pitfall is that even a small systematic error (like frequently miscounting heads more than tails) can introduce a bias in your inference, potentially leading you to conclude the coin is double-headed more often than warranted. In real-world scenarios, it may be necessary to perform repeated experiments or independent verifications to mitigate miscounting risks.
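A hedged sketch of this idea: model per-flip recording errors explicitly and fold them into the likelihood (the error rates below are made-up values purely for illustration):

def observed_head_prob(p_true_head, err_head_to_tail=0.02, err_tail_to_head=0.05):
    # Probability a single flip is *recorded* as heads, given the (hypothetical) error rates
    return p_true_head * (1 - err_head_to_tail) + (1 - p_true_head) * err_tail_to_head

prior_double, prior_fair = 1/1000, 999/1000
q_double = observed_head_prob(1.0)   # now below 1, since a true head can be miscounted
q_fair = observed_head_prob(0.5)     # shifted slightly away from 0.5 by asymmetric errors

lik_double = q_double ** 10
lik_fair = q_fair ** 10
posterior_double = prior_double * lik_double / (
    prior_double * lik_double + prior_fair * lik_fair
)
print(posterior_double)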
What if the fair coins are not perfectly fair but have a slight bias, say a heads probability of 0.51?
In real coin tosses, coins might have a mild bias, so the heads probability for a “fair” coin may deviate slightly from 0.5. When that happens, your likelihood term changes from (1/2)^10 to (0.51)^10 (for instance). This increases the probability of seeing many heads from a nominally fair coin, making it a bit trickier to differentiate between a slightly biased coin and a truly double-headed coin.
The main effect is on P(10 heads | fair coin). If p = 0.51, that term is (0.51)^10. This is larger than (1/2)^10, so the posterior probability of having picked the double-headed coin will drop slightly compared to the assumption of a perfectly fair coin. If the bias is strong enough, it becomes less obvious whether a streak of heads is truly indicative of a double-headed coin or just a highly biased coin. This underscores the importance of quantifying all possible biases in a Bayesian framework.
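A minimal sketch comparing the two assumptions side by side:

prior_double, prior_fair = 1/1000, 999/1000

for p_fair in (0.50, 0.51):
    lik_fair = p_fair ** 10                     # P(10 heads | "fair" coin with heads prob p_fair)
    posterior = prior_double / (prior_double + prior_fair * lik_fair)
    print(f"assumed heads probability {p_fair}: posterior ≈ {posterior:.3f}")

The posterior drops from roughly 0.51 to roughly 0.46 when the nominally fair coins are assumed to land heads 51% of the time.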
How would the conclusion change if we discover there might be a tiny fraction of coins that are double-tailed as well?
If there is a possibility (however small) that a coin is double-tailed, you must include that scenario in the overall probability model. For instance, let q be the fraction of coins that are double-tailed. Then your prior distribution of coin types expands to three categories: fair, double-headed, and double-tailed. Observing 10 heads makes the double-tailed scenario extremely unlikely (practically zero in terms of probability of generating heads), but it still impacts how the probabilities normalize because it competes with the fair coin scenario for the prior mass. Hence:
P(10 heads) = P(fair) * P(10 heads | fair) + P(double-headed) * P(10 heads | double-headed) + P(double-tailed) * P(10 heads | double-tailed)
Since P(10 heads | double-tailed) = 0, that term vanishes. However, if misclassifications or observational errors are possible, even double-tailed coins might appear to show heads occasionally. Then the calculation becomes more complex, emphasizing how additional coin types reshape your posterior probabilities.
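A minimal sketch with a hypothetical three-way prior (the double-tailed fraction below is invented for illustration, and observation is assumed to be perfect):

# Hypothetical prior over coin types
priors = {'fair': 0.998, 'double-headed': 0.001, 'double-tailed': 0.001}
likelihood_10_heads = {'fair': 0.5 ** 10, 'double-headed': 1.0, 'double-tailed': 0.0}

evidence = sum(priors[c] * likelihood_10_heads[c] for c in priors)
for c in priors:
    print(c, priors[c] * likelihood_10_heads[c] / evidence)

The double-tailed term contributes nothing to the numerator or the evidence, but the prior mass it absorbs is taken away from the fair-coin category, which nudges the double-headed posterior slightly upward.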
The pitfall is failing to account for all relevant coin types. If you ignore double-tailed coins but they exist, you might incorrectly inflate or deflate your estimate for how likely the coin is double-headed because your model is missing a key competitor state.
What are the computational constraints if we try to extend this to a large number of coin types and toss outcomes?
In large-scale scenarios with many coin biases or coin types, enumerating every possibility quickly becomes computationally expensive. A straightforward approach involves summing over all coin types to compute P(10 heads). If you have thousands or millions of different biases (for example, a distribution over the heads probability of each coin), the summation can be large.
One workaround is to use approximation methods:
• Monte Carlo sampling: Randomly sample a coin type from the prior distribution, simulate the tosses, and update your counts (a sketch follows at the end of this answer).
• Variational inference or Markov chain Monte Carlo (MCMC): If the distribution of coin biases is continuous, you can use these techniques to approximate posterior distributions.
A subtle pitfall is that naive summation over a vast state space can become infeasible. Another challenge is ensuring your sampling or approximate methods converge to a stable solution. In an interview or real project, showing awareness of these computational methods and limitations indicates an understanding that Bayesian updates in complex systems can exceed straightforward computations.
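As a minimal sketch of the Monte Carlo route, assume (purely for illustration) that coin biases are drawn from a Beta(2, 2) prior; keeping only the sampled coins that happen to produce 10 heads approximates the posterior over the bias:

import random

def sample_posterior_bias(num_samples=200_000, num_heads=10):
    kept = []
    for _ in range(num_samples):
        p = random.betavariate(2, 2)                            # draw a bias from the prior
        if all(random.random() < p for _ in range(num_heads)):  # did this coin give 10 heads?
            kept.append(p)
    return kept

samples = sample_posterior_bias()
print("approximate posterior mean bias:", sum(samples) / len(samples))

For this conjugate choice the exact posterior is Beta(12, 2), whose mean is 12/14 ≈ 0.857, so the estimate gives a quick check that the sampler behaves sensibly.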
How does the result change if we observe a large number of total tosses but the fraction of heads is close to 1?
If, for instance, you see 100 tosses and 99 are heads, the direct approach remains the same: use Bayes’ Theorem with the likelihood term p^(number of heads) * (1 - p)^(number of tails), where p is the coin's heads probability, if the coin is assumed to be fair or slightly biased. For the double-headed coin, the likelihood remains 1 if all outcomes are heads (and 0 if any tail is observed).
In the scenario of 99 heads and 1 tail out of 100 flips, the likelihood of the double-headed coin is zero, because you cannot get a tail from that coin. This drastically changes the posterior: the double-headed coin scenario is ruled out entirely. Meanwhile, for a fair or slightly biased coin, the probability of exactly 99 heads out of 100 is something you can compute using the Binomial distribution.
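A minimal sketch using the Binomial likelihood for the fair coin and a hard zero for the double-headed coin whenever a tail appears:

from math import comb

prior_double, prior_fair = 1/1000, 999/1000
n, k = 100, 99                                     # 99 heads out of 100 tosses

lik_fair = comb(n, k) * 0.5 ** k * 0.5 ** (n - k)  # Binomial(100, 0.5) probability of 99 heads
lik_double = 1.0 if k == n else 0.0                # any tail rules out the double-headed coin

posterior_double = prior_double * lik_double / (
    prior_double * lik_double + prior_fair * lik_fair
)
print(posterior_double)   # 0.0: a single tail eliminates the double-headed hypothesis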
The pitfall is forgetting that a single tail essentially eliminates the double-headed coin possibility (assuming perfect observation). Some might erroneously still consider the double-headed coin scenario plausible even after a tail shows up, neglecting the fact that the coin physically cannot show tails if it really has two heads.
Could partial information about each toss (e.g., we only see the outcome of 8 flips reliably, the others are hidden) still be used for an update?
Yes. Bayesian inference can handle partial or missing data by computing the probability of the observed portion. For instance, if you only see 8 flips (all heads) and you have no information about 2 flips (they might be lost data), the likelihood for a fair coin is (1/2)^8, and for the double-headed coin it is 1 for those 8 flips. Because the remaining flips are unknown, you effectively marginalize over the possible outcomes of those 2 flips.
The potential pitfall is incorrectly ignoring the 2 unknown flips altogether or assuming they were tails or heads without justification. Instead, you must integrate over all possible results of the missing flips, weighted by their probabilities. Failing to do so can incorrectly skew your posterior belief about which coin you have.
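A minimal sketch that marginalizes explicitly over the two hidden flips (and confirms the fair-coin likelihood collapses to (1/2)^8, since the hidden outcomes sum to probability 1):

from itertools import product

prior_double, prior_fair = 1/1000, 999/1000

# Observed: 8 flips, all heads; the other 2 flips are missing.
lik_fair = 0.0
for hidden in product('HT', repeat=2):      # every possible outcome of the 2 hidden flips
    lik_fair += (0.5 ** 8) * (0.5 ** 2)     # P(8 observed heads) * P(this hidden outcome)

lik_double = 1.0                            # the 8 observed heads are certain for a double-headed coin
posterior_double = prior_double * lik_double / (
    prior_double * lik_double + prior_fair * lik_fair
)
print(lik_fair, posterior_double)           # lik_fair == (1/2)**8, posterior ≈ 0.20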
What if the coin tosses become correlated (e.g., due to real-world physics or an unstable flipping mechanism)?
A fundamental assumption in the basic analysis is independence of tosses for a fair coin. In reality, the way a coin is flipped can introduce dependencies (like a “soft toss” might yield heads more often). If the independence assumption is broken, the model P(10 heads | fair coin) = (1/2)^10 is no longer strictly correct.
To handle correlation, you need a more detailed model of how the coin flips lead to heads or tails. For instance, you might incorporate the notion that repeated “soft tosses” have a higher chance of landing heads. This changes the likelihood term drastically. A naive approach might overestimate or underestimate the probability of consecutive heads if it clings to the independence assumption.
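A minimal sketch of that idea, assuming a first-order Markov model where a head makes the next head more likely (the 0.6 transition probability is a made-up value for illustration):

def run_of_heads_prob_markov(num_heads, p_first_head=0.5, p_head_after_head=0.6):
    # Probability of num_heads consecutive heads under a simple first-order Markov model
    return p_first_head * p_head_after_head ** (num_heads - 1)

prior_double, prior_fair = 1/1000, 999/1000

for label, lik_fair in [("i.i.d.", 0.5 ** 10), ("Markov", run_of_heads_prob_markov(10))]:
    posterior = prior_double / (prior_double + prior_fair * lik_fair)
    print(f"{label} fair-coin model: posterior ≈ {posterior:.3f}")

Under the correlated model the same 10-head streak is far less surprising for an ordinary coin, so the posterior for the double-headed coin falls well below the 0.506 obtained under independence.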
The hidden pitfall is that many real-world processes break the i.i.d. (independent and identically distributed) assumption. Overlooking this can lead to inflated confidence in improbable sequences of outcomes or overshadow real biases in the flipping mechanism.
What if the jar is used multiple times by different experimenters, each doing a different number of flips?
When multiple experimenters pick coins from the same jar at different times, we can model each pick as a new random draw from the jar. If the coins are replaced, each experiment remains independent. If the coins are not replaced or if we keep track of how many heads/tails each coin yields across different trials, this updates our knowledge about the jar’s composition.
For instance, if an experimenter picks a coin, flips 10 heads, and suspects it might be double-headed, they might put it back into the jar (or set it aside). Over repeated experiments, you build a posterior distribution over how many double-headed coins remain in the jar (which might still be just one, but you become more confident about which physical coin that is). If you do not replace the coin, the jar composition changes: it now contains fewer coins, and possibly you already “found” the double-headed coin. This drastically alters the next experimenter’s prior probabilities.
A subtle pitfall is ignoring the sequential nature of repeated experiments. If the double-headed coin was presumably identified in a previous experiment (and set aside), then new experimenters should factor that into their prior. Failing to do so can lead to confusion about the posterior probabilities in subsequent trials.
What lessons apply if the question involved not 10 coin flips but a continuous measurement scenario?
Bayesian inference generalizes well beyond coin tosses. If you had a continuous measurement—say you suspect one jar sample might have a different chemical concentration than the rest—then after obtaining multiple observations (e.g., repeated tests of concentration), you would still apply the same Bayesian machinery but with a continuous likelihood function. Observing data that strongly deviates from the normal concentration distribution would significantly shift your belief that the sample is “different.”
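As a minimal sketch of the continuous analogue, suppose (all numbers invented for illustration) that ordinary samples have a concentration around 10.0, a rare "different" sample sits around 12.0, and measurement noise is Gaussian:

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

prior_different, prior_normal = 1/1000, 999/1000
measurements = [11.8, 12.1, 11.9]          # hypothetical repeated concentration readings

lik_normal = lik_different = 1.0
for x in measurements:
    lik_normal *= normal_pdf(x, mu=10.0, sigma=0.5)      # likelihood under the usual samples
    lik_different *= normal_pdf(x, mu=12.0, sigma=0.5)   # likelihood under the "different" sample

posterior_different = prior_different * lik_different / (
    prior_different * lik_different + prior_normal * lik_normal
)
print(posterior_different)   # close to 1: the readings overwhelm the small prior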
A pitfall in a continuous setting is incorrectly assuming discrete outcomes or using a likelihood function that does not reflect real variability. The logic remains the same: you start with a prior, gather evidence, and update your belief accordingly, ensuring your likelihood captures the distribution of real measurements.