ML Interview Q Series: Using Bayes' Rule to Calculate Posterior Coin Probability After n Heads.
A box contains r+1 coins labeled i = 0, 1, …, r. Coin i lands heads with probability i/r for i = 0, 1, …, r. You choose one coin at random from these r+1 coins. Use Bayes’ rule in odds form to find the probability of having chosen coin s given that the first n tosses all came up heads.
Short, compact solution
Using Bayes' rule in its odds form, we arrive at the posterior probability that the chosen coin is s (denote this event by H) given n heads in a row (denote this evidence by E) as:
P(H | E) = ( (s/r)^n ) / ( sum_{i=0 to r} (i/r)^n )
Comprehensive Explanation
Setting up the problem
We have r+1 coins. Coin i has a probability i/r of landing heads on any toss. The coins are:
Coin 0: lands heads with probability 0/r = 0
Coin 1: lands heads with probability 1/r
Coin 2: lands heads with probability 2/r
…
Coin r: lands heads with probability r/r = 1
We choose one coin uniformly at random, so each coin has prior probability 1/(r+1) of being chosen. After this random choice, we observe n consecutive heads. We want the posterior probability that the chosen coin is coin s, where 0 <= s <= r.
Applying Bayes’ rule
Bayes’ rule, in standard form for our scenario, says:
P(Chosen coin is s | E) = ( P(Chosen coin is s) * P(E | Chosen coin is s) ) / P(E)
Here:
P(Chosen coin is s) = 1/(r+1)
P(E | Chosen coin is s) = (s/r)^n, because each toss has probability s/r of heads, and we see n heads in a row.
To find P(E), we sum over all possible coins:
P(E) = sum over i=0 to r [ (1/(r+1)) * (i/r)^n ]
Hence,
P(Chosen coin is s | E) = ( (1/(r+1))(s/r)^n ) / ( sum_{i=0 to r} [ (1/(r+1))(i/r)^n ] )
After factoring out 1/(r+1), we get:
P(Chosen coin is s | E) = ( (s/r)^n ) / ( sum_{i=0 to r} (i/r)^n )
This matches our short, compact formula.
Odds form perspective
We can also look at it in “odds form,” which is what the problem asks for. The odds form of Bayes’ rule for an event H against its complement H' is:
P(H | E) / P(H' | E) = [P(H) / P(H')] * [P(E | H) / P(E | H')]
Here, H is “chosen coin = s” and H' is “chosen coin ≠ s.” The prior odds are (1/(r+1)) / (r/(r+1)) = 1/r. The likelihoods are P(E | H) = (s/r)^n and P(E | H') = (1/r) * sum_{i ≠ s} (i/r)^n, since given H' each of the other r coins is equally likely. Multiplying the prior odds by the likelihood ratio gives posterior odds of (s/r)^n / sum_{i ≠ s} (i/r)^n, and converting these odds back into a probability yields the same final result:
P(H | E) = ( (s/r)^n ) / ( sum_{j=0}^r (j/r)^n )
since each coin i ≠ s contributes a likelihood term (i/r)^n to the denominator.
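As a quick numerical sanity check, here is a minimal sketch (the function names posterior_direct and posterior_via_odds are illustrative, not from the original) that computes the posterior both ways and confirms they agree:
def posterior_direct(r, n, s):
    # Direct form: (s/r)^n / sum_i (i/r)^n
    return (s / r) ** n / sum((i / r) ** n for i in range(r + 1))

def posterior_via_odds(r, n, s):
    # Odds form: prior odds * likelihood ratio, then convert odds to a probability.
    # Assumes at least one coin other than s has nonzero likelihood.
    prior_odds = 1 / r                                   # (1/(r+1)) / (r/(r+1))
    like_H = (s / r) ** n                                # P(E | coin s)
    like_not_H = sum((i / r) ** n for i in range(r + 1) if i != s) / r
    post_odds = prior_odds * like_H / like_not_H
    return post_odds / (1 + post_odds)

print(posterior_direct(10, 3, 7), posterior_via_odds(10, 3, 7))  # both ~0.1134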
Interpreting the formula
The resulting posterior probability is driven by (s/r)^n. For large n, the term (s/r)^n dominates when s is close to r, because a coin with a high heads probability is far more likely to produce a long run of heads. Conversely, if s = 0, then (s/r)^n = 0 for n > 0, so the posterior probability of coin 0 (the coin that always shows tails) is 0 after even a single head is observed.
Example edge cases
If s=0 but you observed at least one head, the posterior is 0 for coin 0.
If s=r (the coin always shows heads), then for any number of observed heads, (s/r)^n = 1^n = 1. That term can be quite dominant in the denominator if n is large, making the posterior for coin r approach 1.
If n=0, meaning we have not observed any tosses, the posterior for any coin s is just 1/(r+1).
Practical code snippet
Below is a simple Python snippet showing how you might compute this posterior probability numerically for general r, n, and s:
def posterior_probability(r, n, s):
    """
    Returns the posterior probability that the chosen coin is s
    given that we observed n heads in a row.
    Assumes r >= 1 (for r = 0 the bias i/r is undefined).
    """
    # Numerator: likelihood of n heads under coin s (the common prior 1/(r+1) cancels)
    numerator = (s / r) ** n
    # Denominator: sum of (i/r)^n for i = 0, ..., r
    # (Python evaluates 0**0 as 1, so n = 0 correctly returns the prior 1/(r+1).)
    denominator = sum((i / r) ** n for i in range(r + 1))
    # Posterior
    return numerator / denominator
# Example usage
r = 10
n = 3
s = 7
post_prob = posterior_probability(r, n, s)
print("Posterior probability:", post_prob)
Possible Follow-up Questions
How would the result change if we had observed both heads and tails in our sequence of tosses?
In that scenario, the likelihood for coin i would no longer be (i/r)^n but rather (i/r)^(number_of_heads)*(1 - i/r)^(number_of_tails). The posterior probability for coin s would then be:
( (s/r)^(#heads) * (1 - s/r)^(#tails) ) / ( sum_{j=0 to r} [ (j/r)^(#heads) * (1 - j/r)^(#tails) ] )
Hence, the denominator is a sum of the likelihoods of heads-tails outcomes for all coins, each weighted by the same prior 1/(r+1).
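A minimal sketch of this extension, assuming we track head and tail counts separately (the function name posterior_with_tails is illustrative):
def posterior_with_tails(r, n_heads, n_tails, s):
    # Likelihood for coin i: (i/r)^heads * (1 - i/r)^tails; the uniform prior cancels.
    def likelihood(i):
        p = i / r
        return p ** n_heads * (1 - p) ** n_tails
    return likelihood(s) / sum(likelihood(i) for i in range(r + 1))

# Example: 3 heads and 1 tail observed with r = 10
print(posterior_with_tails(10, 3, 1, 7))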
What if we do not choose the coins with equal probability?
If the prior probability of choosing coin i is not uniform, say p_i, then the posterior formula becomes:
P(coin s | data) = [ p_s * L_s(data) ] / [ sum_{j=0 to r} p_j * L_j(data) ]
where L_i(data) is the likelihood (the probability of observing the data given coin i). All the same principles apply, but we weight each likelihood by its prior p_i.
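A short sketch of the non-uniform case, assuming the priors p_i are supplied as a list (the names posterior_with_prior, priors, and likelihoods are illustrative):
def posterior_with_prior(priors, likelihoods, s):
    # priors[i] = p_i, likelihoods[i] = L_i(data); both indexed by coin label i
    weighted = [p * lik for p, lik in zip(priors, likelihoods)]
    return weighted[s] / sum(weighted)

# Example: r = 2, coin 2 twice as likely a priori, data = 3 heads in a row
r, n = 2, 3
priors = [0.25, 0.25, 0.5]
likelihoods = [(i / r) ** n for i in range(r + 1)]
print(posterior_with_prior(priors, likelihoods, 2))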
How do we handle the case r=0 or n=0?
If r=0, the setup degenerates: there is only one coin (coin 0), and its stated heads probability 0/r is the undefined ratio 0/0, so the model implicitly requires r >= 1. If we follow the original description and treat this lone coin as always landing tails, then observing any heads is impossible.
If n=0, you have not observed any tosses, so the posterior probability of having chosen coin s remains the prior 1/(r+1).
Why is the coin labeled i=0 included if its probability of heads is 0?
In theory, coin 0 can still be drawn, but seeing heads from it has probability 0. Therefore, once you see any heads, the posterior probability of having chosen coin 0 goes to 0 for n>0. This is still a valid scenario, especially in theoretical or combinatorial settings.
Could numerical precision be a problem for large n?
Yes. When n is large, terms like (s/r)^n can become very small or very large. In practice, you may want to use log probabilities, i.e., store ln(i/r) * n and apply log-sum-exp techniques to avoid underflow or overflow issues.
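A minimal log-space sketch using math.log and a hand-rolled log-sum-exp (assuming n >= 1, so coin 0 can simply be assigned zero likelihood; the function name log_posterior is illustrative):
import math

def log_posterior(r, n, s):
    # Work entirely in log space: log-likelihood n * log(i/r) for each coin
    # (coin 0 gets -inf because (0/r)^n = 0 for n >= 1, which we assume here).
    log_liks = [n * math.log(i / r) if i > 0 else float("-inf") for i in range(r + 1)]
    # log-sum-exp: subtract the max before exponentiating so the sum cannot underflow.
    m = max(log_liks)
    log_norm = m + math.log(sum(math.exp(ll - m) for ll in log_liks))
    return log_liks[s] - log_norm

# (800/1000)^5000 underflows to 0.0 as a plain float, but its log-posterior stays finite:
print(log_posterior(1000, 5000, 800))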
Is there an intuitive interpretation of this result?
Yes. Each coin i corresponds to a different bias for heads i/r. After observing heads n times, coins with larger biases for heads get “favored” more heavily because they make seeing many heads more probable. The posterior formula simply reweights each coin’s prior by its likelihood of generating the observed outcome (n heads), normalizing over all coins.
Below are additional follow-up questions
How would the posterior probability be affected if the number of coins r+1 is very large, but only a small subset have high probability of heads?
When r+1 is large, you have a wide range of possible head probabilities i/r, from 0 up to 1 in very small increments. Intuitively, after observing many heads in a row, most likelihood weight will concentrate on coins with high bias for heads—particularly those i values close to r. However, if you have a large r, you might encounter numerical precision issues. For instance:
If n is large, (i/r)^n can become extremely small for moderate i. Those tiny values contribute very little to the sum in the denominator, effectively making the large-i terms dominate.
On a computational level, summing many small terms might introduce floating-point underflow or numerical instability. Using log-likelihood sums (log-sum-exp) is a robust way to handle this.
Conceptually, for practical finite sample sizes n, the posterior might still put non-negligible weight on coin i even if i/r is not extremely close to 1, but is significantly above 0. You should analyze how quickly (i/r)^n decays for values of i/r less than 1.
A subtle issue is that for extremely large r, i/r can capture very fine gradations of bias. Practically, you may see that among those many coins, only those with i/r near the observed frequency of heads get significant posterior weight.
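To see this concentration concretely, here is a small sketch (the function name posterior_over_coins and the 70-heads / 30-tails data are illustrative; it reuses the mixed heads/tails likelihood from the earlier follow-up):
def posterior_over_coins(r, n_heads, n_tails):
    # Posterior over all r+1 coins under a uniform prior and mixed heads/tails data
    liks = [(i / r) ** n_heads * (1 - i / r) ** n_tails for i in range(r + 1)]
    total = sum(liks)
    return [lik / total for lik in liks]

r = 1000
post = posterior_over_coins(r, n_heads=70, n_tails=30)
top = sorted(range(r + 1), key=lambda i: post[i], reverse=True)[:5]
print([(i / r, round(post[i], 4)) for i in top])  # mass clusters around i/r close to 0.7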
How do we extend this analysis if each coin might not have a fixed bias, but can change bias over time?
In the standard setup, coin i has a fixed, unchanging bias i/r. If we suspect the coin’s bias changes over time (non-stationary process), then our assumptions about i/r being constant during all tosses become invalid. Potential approaches:
Time-varying model: Suppose that at each toss, the bias might shift. This is far more complex because each coin i is no longer strictly “the coin with probability i/r” throughout all n tosses.
Adaptive Bayesian update: You might introduce a state-space model or hidden Markov model where you assume the coin’s bias can transition from i/r to j/r with some probability. Then you would apply a more advanced Bayesian inference technique (e.g., particle filtering) to track the evolving distribution over the bias.
Practical pitfall: If you ignore the possibility of changing bias, your inference might be misleading. After enough data, you might incorrectly converge on a coin with a high or low bias that does not match the changing frequency of heads.
What happens if we have partial or uncertain outcomes (e.g., a flip lands on the floor or is unreadable)?
If some flips are invalid or uncertain, you cannot treat them as definite heads or tails. For instance, suppose some fraction of flips produce an outcome that cannot be confidently classified as heads or tails. In that case:
Modeling partial information: You can incorporate the probability that a coin i would produce an “ambiguous” result. For instance, you might guess that a coin with bias i/r for heads has some small probability of producing an unreadable outcome. Or if the unreadable outcome is purely observational noise, you’d have to model the chance of not seeing the result correctly.
Likelihood computation: With uncertain outcomes, the likelihood for coin i becomes (i/r)^(#heads) * (1 - i/r)^(#tails) * (ambiguity term)^(#ambiguous), where the exact form of the ambiguity term depends on your assumptions about how ambiguous data is generated.
Posterior normalization: You would still sum over all i in the denominator, but each summand would account for the uncertain flips. This is more involved and can lead to a less obvious closed-form solution.
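As one concrete (assumed, not from the original) model: suppose every toss is unreadable with a coin-independent probability q, and readable tosses show the true face. A sketch:
def posterior_with_ambiguous(r, n_heads, n_tails, n_ambiguous, s, q=0.1):
    # Assumed model: each toss is unreadable with probability q regardless of the coin;
    # readable tosses show the true face with the coin's bias i/r.
    def likelihood(i):
        p = i / r
        return ((1 - q) * p) ** n_heads * ((1 - q) * (1 - p)) ** n_tails * q ** n_ambiguous
    return likelihood(s) / sum(likelihood(i) for i in range(r + 1))

print(posterior_with_ambiguous(10, 3, 1, 2, 7))
Because q does not depend on i in this particular model, the q and (1 - q) factors cancel between numerator and denominator, so coin-independent ambiguity leaves the posterior unchanged; a coin-dependent ambiguity model would genuinely alter it.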
Is it still valid to assume independence across tosses if the flipping mechanism might introduce correlation (e.g., coin flips physically influencing each other)?
In typical coin-flip models, we assume each flip is an independent Bernoulli trial. However, real-world coin flips might have correlations (for example, if the coin’s spin speed or initial orientation repeats, or the environment systematically biases flips). If flips are correlated:
Violation of independence: The probability of the nth head might depend not just on i/r but also on outcomes of previous flips (and how you physically flipped the coin).
Bayes’ rule in the correlated scenario: Instead of (i/r)^n, your likelihood P(E | coin i) must be replaced by the joint probability of observing the specific sequence of heads under coin i’s correlation structure. That could be significantly more complicated, requiring a detailed model of correlation.
Pitfall: Ignoring correlations might lead to overconfidence in your inference because you effectively “double count” or “triple count” evidence that might not be truly independent. This typically inflates the likelihood for coins with biases near the observed frequency, causing the posterior to converge more quickly than warranted.
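As one illustrative correlation model (an assumption for demonstration, not something specified in the original problem), consider a “sticky” flip: with probability c a toss simply repeats the previous outcome, otherwise it is a fresh Bernoulli(i/r) draw; c = 0 recovers the i.i.d. case. The likelihood must then be computed over the specific sequence:
def correlated_likelihood(seq, p, c=0.3):
    # Sticky-flip model: first toss ~ Bernoulli(p); each later toss repeats the
    # previous outcome with probability c, otherwise it is a fresh Bernoulli(p) draw.
    lik = p if seq[0] == "H" else 1 - p
    for prev, cur in zip(seq, seq[1:]):
        fresh = p if cur == "H" else 1 - p
        lik *= c * (cur == prev) + (1 - c) * fresh
    return lik

r, seq = 10, "HHHHH"
liks = [correlated_likelihood(seq, i / r) for i in range(r + 1)]
total = sum(liks)
post = [lik / total for lik in liks]
print(round(post[r], 3))  # compare with the i.i.d. posterior for coin r (set c = 0)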
How would the inference change if we let the coin bias be drawn from a continuous distribution (e.g., Beta distribution over [0,1]) rather than from the discrete set {0/r, 1/r, ..., r/r}?
In the discrete setup, we have r+1 specific biases. In a continuous setup:
Prior: Instead of having r+1 coins, each equally likely, you have a continuous prior distribution p(θ) over θ in [0,1]. A typical choice is Beta(α, β).
Posterior: Observing n heads (and 0 tails) would update the prior Beta(α, β) to Beta(α + n, β). The posterior distribution is continuous, so the probability of exactly θ is replaced by a density function.
Relation to the discrete approach: The discrete approach can be seen as an approximation to the continuous one if you place point masses at i/r. As r grows large, these points become denser, approximating the Beta distribution more closely.
Pitfall: If you try to blend discrete and continuous approaches incorrectly, you can get confusion in how to normalize probabilities. Always ensure that priors, likelihoods, and posteriors remain consistent in dimensional terms.
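A minimal sketch of the continuous analogue, assuming a Beta(1, 1) (uniform) prior and using scipy.stats for the Beta distribution:
from scipy import stats

alpha, beta, n_heads, n_tails = 1.0, 1.0, 5, 0   # Beta(1, 1) prior, 5 heads observed
posterior = stats.beta(alpha + n_heads, beta + n_tails)

# The posterior is a density over the bias theta rather than a set of point masses.
print(posterior.pdf(0.9), posterior.pdf(0.5))
# Posterior mean = (alpha + heads) / (alpha + beta + heads + tails); it is also the
# predictive probability that the next toss is heads.
print(posterior.mean())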
What if we care about the probability of observing future heads rather than identifying which coin was selected?
While the question focuses on “Which coin is chosen?”, in many applications we might care about the predictive distribution of the next toss. Once we have the posterior distribution P(coin i | E), the probability of seeing a head on the next toss is:
sum_{i=0 to r} [ P(coin i | E) * (i/r) ]
This sum effectively averages the biases of all coins, weighted by their posterior probabilities; a short sketch follows the points below. Key points:
Practical significance: If you only care about the next toss, you do not necessarily need to identify a single coin as “the one”; you just compute the average predicted heads probability.
Edge cases: If your posterior is heavily dominated by high i coins, the predicted next toss probability might be near 1. Conversely, if some moderate i values still retain some posterior weight, the predicted probability is less extreme.
Extension to multiple future tosses: You can replicate the same logic for each future toss or update the posterior step by step with each new outcome.
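Putting the predictive formula into code, reusing the posterior_probability function defined in the earlier snippet (the wrapper name predictive_heads is illustrative):
def predictive_heads(r, n):
    # P(next toss is heads | n heads so far) = sum_i P(coin i | E) * (i/r)
    return sum(posterior_probability(r, n, i) * (i / r) for i in range(r + 1))

print(predictive_heads(10, 3))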
Are there risks in overfitting the data if we flip the coin enough times?
Overfitting typically refers to scenarios where your model parameters become too tailored to the observed sample. In a Bayesian context:
Bayesian smoothing: Because you have a discrete set of coins (each with a fixed bias) and uniform prior 1/(r+1), the model has finite complexity. Overfitting in a classical sense is less of an issue, but you might end up with a posterior strongly favoring coin r (100% heads) if you see many heads in a row, even though a coin with bias slightly less than 1 could also produce that data.
Model mismatch: If the real coin has bias 0.95 but your discrete set jumps from 0.9 to 1.0, repeated heads might push the posterior heavily toward coin r (bias 1.0), incorrectly concluding the coin always lands heads. This is not overfitting per se, but a consequence of the finite granularity of your coin set.
Practical mitigation: Increase r to get finer resolution. Or adopt a continuous prior over [0,1], as in a Beta-Bernoulli model, to handle real-valued biases smoothly.
Does the order in which the heads appear matter (e.g., HHHHT vs. HTHHH) if we only care about the total counts of heads and tails?
If all tosses are i.i.d. (independent and identically distributed) with probability i/r of heads for coin i, then the order does not matter for the likelihood: it depends only on the number of heads and tails. A specific sequence with h heads and t tails has probability (i/r)^h * (1 - i/r)^t under coin i, regardless of where in the sequence the heads fall. Hence:
Exchangeability: Under the i.i.d. model the data is exchangeable; the counts of heads and tails are the key statistics. HHHHT and HTHHH both contain 4 heads and 1 tail and therefore have the same likelihood.
When order matters: If you introduce a different model where each flip can depend on the previous flip or on its position in the sequence, then the order might matter. But in the problem as stated, it does not.
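A tiny sketch confirming this order-invariance under the i.i.d. model (the helper name sequence_likelihood is illustrative):
def sequence_likelihood(seq, p):
    # i.i.d. model: the probability of a specific sequence is a product of per-toss terms
    lik = 1.0
    for toss in seq:
        lik *= p if toss == "H" else 1 - p
    return lik

p = 0.7
print(sequence_likelihood("HHHHT", p), sequence_likelihood("HTHHH", p))  # identical values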
Could the posterior be sensitive if we mistakenly believe each coin is equally likely, but in reality, some coins are more likely a priori?
Yes, if in reality some coins are more likely than others, using a uniform prior leads to a potentially incorrect posterior. Consider these pitfalls:
Mis-specified prior: If coin s has probability 0.5 of being chosen but you mistakenly assume each coin is 1/(r+1), your posterior might drastically underestimate or overestimate the chance of s relative to reality, especially for moderate or small data sets.
Robustness: Bayesian inference can be somewhat robust if you eventually collect a large number of flips. The likelihood might dominate the prior for large n. But for small or medium n, the prior is critical.
Real-world scenario: Suppose coins have different physical appearances or labels that might make you select some coins more often. If you ignore those biases, your posterior is systematically off.