ML Interview Q Series: Uniform Heads Count via Beta-Binomial: Tosses with Uniformly Random Probability p
You draw at random a number p from the interval (0,1). Next, you toss n times a coin with probability p of heads. What is the probability mass function of the number of heads that will appear?
Short Compact solution
By conditioning on p and integrating over its uniform distribution on (0,1), one obtains P(X = k) = 1/(n+1) for k = 0, 1, ..., n.
Comprehensive Explanation
To see why this result holds, we first write down the conditional probability that X = k given a particular p:
The coin is tossed n times.
Each toss has a probability p of landing heads.
Therefore, conditioned on p, the number of heads X follows a Binomial distribution with parameters n and p. Its probability is comb(n,k) * p^k * (1-p)^(n-k).
Since p itself is drawn from a Uniform(0,1) distribution, the unconditional probability P(X = k) is found by integrating out p:
P(X = k) = ∫_0^1 comb(n,k) * p^k * (1-p)^(n-k) dp.
Here n is the total number of tosses, and k is the observed number of heads. The factor binom(n,k) is the binomial coefficient counting how many ways to choose k heads out of n tosses, p^k is the probability of observing heads in exactly k of those tosses, (1 - p)^(n - k) is the probability of observing tails in the remaining n - k tosses, and dp indicates integration over all possible values of p.
Next, we use the known Beta integral identity
∫_0^1 x^{r-1} (1-x)^{s-1} dx = (r-1)! (s-1)! / (r+s-1)!,
where r and s are positive integers. In our scenario, we match p^k (1-p)^(n-k) with x^{r-1} (1-x)^{s-1} by letting r = k + 1 and s = n - k + 1. Substituting and evaluating, one obtains
(k+1 - 1)! = k!
(n-k+1 - 1)! = (n-k)!
(k + (n-k) + 1 - 1)! = (n+1 - 1)! = n!
So the integral evaluates to k! (n-k)! / (n+1)!, and the full expression becomes
comb(n,k) * [k! (n-k)! / (n+1)!] = [n! / (k! (n-k)!)] * [k! (n-k)! / (n+1)!] = n! / (n+1)!,
which simplifies neatly to 1/(n+1). Therefore, P(X = k) = 1/(n+1) for k = 0, 1, ..., n.
This means that the number of heads X is actually distributed uniformly on the set {0,1,2,...,n} whenever p is itself drawn from a Uniform(0,1) distribution.
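As a sanity check on this algebra, the integral can also be evaluated symbolically. The snippet below is a minimal sketch, assuming SymPy is installed, that integrates comb(n,k) * p^k * (1-p)^(n-k) over (0,1) for a few small n and k and confirms the answer is always 1/(n+1).

# Minimal symbolic check of the derivation; assumes SymPy is available.
from math import comb
import sympy as sp

p = sp.Symbol('p')
for n in range(1, 6):
    for k in range(n + 1):
        # Integrate the conditional Binomial(n, p) pmf against the Uniform(0,1) density (which is 1)
        prob = sp.integrate(comb(n, k) * p**k * (1 - p)**(n - k), (p, 0, 1))
        assert prob == sp.Rational(1, n + 1)
print("Verified: P(X = k) = 1/(n+1) for every k in {0, ..., n}.")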
Why does the final distribution turn out to be uniform on {0,...,n}?
A useful way to interpret this is through a Bayesian perspective. A Uniform(0,1) prior on p is mathematically equivalent to a Beta(1,1) prior. Then, the Beta-Binomial model tells us that integrating a Binomial(n,p) likelihood against a Beta(1,1) prior yields a simple closed-form result. The Beta(1,1) prior is “uninformative,” spreading probability equally over all possible values of p. Consequently, each possible count of heads (from 0 to n) becomes equally likely.
Could another prior on p change the distribution of X?
Yes. If instead of a Uniform(0,1) prior, we choose a Beta(a,b) distribution for p, then the number of heads follows a Beta-Binomial distribution with parameters n, a, and b:
P(X = k) = comb(n,k) * (Beta(k + a, n - k + b) / Beta(a,b)).
In general, that leads to a different (non-uniform) distribution for X. The uniform-on-{0,...,n} result is a special case of the Beta-Binomial when a = 1 and b = 1.
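As a quick illustration, the snippet below is a small sketch assuming SciPy is available (scipy.stats.betabinom, added in SciPy 1.4). It compares the Beta-Binomial pmf under a Beta(1,1) prior, which recovers the flat 1/(n+1) distribution, with the pmf under a more informative Beta(5,2) prior.

# Sketch comparing Beta-Binomial pmfs for two priors; assumes SciPy >= 1.4.
from scipy.stats import betabinom

n = 5
print("Beta(1,1) prior:", [round(betabinom.pmf(k, n, 1, 1), 4) for k in range(n + 1)])
# -> every mass is approximately 1/(n+1) = 0.1667
print("Beta(5,2) prior:", [round(betabinom.pmf(k, n, 5, 2), 4) for k in range(n + 1)])
# -> mass shifts toward larger head counts because this prior favors p > 0.5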
How can we verify this result via simulation in Python?
Below is a quick simulation that empirically confirms P(X = k) = 1/(n+1). On each trial we draw p from Uniform(0,1), toss the coin n times with probability p of heads, and record the number of heads; this is repeated many times.
import random

def simulate_distribution(n, trials=1_000_000):
    counts = [0] * (n + 1)
    for _ in range(trials):
        # Draw p uniformly from (0,1)
        p = random.random()
        # Toss the coin n times with probability p of heads
        heads = sum(random.random() < p for _ in range(n))
        counts[heads] += 1
    # Convert counts to empirical probabilities
    return [c / trials for c in counts]

# Example usage:
n = 5
empirical_probs = simulate_distribution(n)
print("Empirical distribution of number of heads:", empirical_probs)
print("Theoretical distribution:", [1 / (n + 1)] * (n + 1))
If you run this code with a sufficiently large number of trials, you will observe that the empirical probabilities for X = 0, 1, 2, ..., n cluster very close to 1/(n+1).
How does this relate to Bayesian updating?
When you observe k heads out of n tosses, the posterior distribution of p (starting with a Uniform(0,1) prior) becomes Beta(k+1, n-k+1). But here, the question only asks about the marginal distribution of X itself before seeing the outcome. That marginal distribution, derived from a uniform prior, is discrete uniform over {0,...,n}.
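For concreteness, here is a minimal sketch (assuming SciPy) of that posterior update: after observing k heads in n tosses with a Uniform(0,1) prior, the posterior for p is Beta(k+1, n-k+1), and its mean and credible interval can be read off directly.

# Sketch of the posterior for p after observing k heads in n tosses; assumes SciPy.
from scipy.stats import beta

n, k = 10, 7                          # example data: 7 heads out of 10 tosses
posterior = beta(k + 1, n - k + 1)    # Uniform(0,1) prior -> Beta(k+1, n-k+1) posterior
print("Posterior mean of p:", posterior.mean())          # (k+1)/(n+2) = 8/12 ≈ 0.667
print("95% credible interval for p:", posterior.interval(0.95))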
Are there any intuitive interpretations for small n?
If n=1, then X takes values 0 or 1. Because p is uniform on (0,1), the probability of heads is ∫_0^1 p dp = 1/2, and the probability of tails is likewise 1/2.
If n=2, then X can be 0, 1, or 2 heads, each with probability 1/3.
In general, for any n, each integer outcome from 0 to n occurs with the same probability 1/(n+1).
Hence, the uniform distribution on p in (0,1) “balances” out the binomial variability so that, marginally, X is discrete uniform from 0 to n.
Below are additional follow-up questions
What if the coin tosses are not independent?
When we derived the probability P(X = k) = 1/(n+1), we relied on the assumption that each toss is an independent Bernoulli trial conditioned on p. If there is temporal dependence or correlation between tosses—for instance, if the outcome of one toss influences the next—then the binomial distribution no longer holds for a fixed p. As a result, the integral over p would not simplify in the same neat manner. A common pitfall here is to try to apply the same Beta-binomial framework despite correlated data; this leads to inaccurate inference. In a real-world scenario, we might use a more complex model, such as a hidden Markov model with a Beta prior on the underlying probability of heads, or we might introduce correlation parameters and perform a joint Bayesian analysis. The key difficulty is that the closed-form Beta-binomial solution depends critically on the conditional independence of each trial.
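To make the effect of dependence concrete, the sketch below simulates a hypothetical "sticky" coin in which each toss repeats the previous outcome with probability stickiness and is otherwise an independent Bernoulli(p) trial; this correlation structure is an illustrative assumption, not part of the original problem.

# Hypothetical simulation of correlated ("sticky") tosses, for illustration only.
import random

def simulate_sticky(n, stickiness=0.8, trials=200_000):
    counts = [0] * (n + 1)
    for _ in range(trials):
        p = random.random()                        # p is still drawn from Uniform(0,1)
        prev = random.random() < p                 # first toss is an ordinary Bernoulli(p)
        heads = int(prev)
        for _ in range(n - 1):
            if random.random() < stickiness:       # repeat the previous outcome...
                outcome = prev
            else:                                  # ...or toss independently with probability p
                outcome = random.random() < p
            heads += int(outcome)
            prev = outcome
        counts[heads] += 1
    return [c / trials for c in counts]

print(simulate_sticky(5))   # no longer close to [1/6] * 6; mass piles up at 0 and n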
How do we handle the case where p might be restricted or not strictly within (0,1)?
Sometimes in practice we encounter a “coin” with a probability that might be biased to never be extremely small or extremely large. For instance, we might have prior domain knowledge that p is in (0.2, 0.8) only. If you try to integrate the binomial probabilities from 0.2 to 0.8 instead of 0 to 1, you no longer get the simple 1/(n+1) result. Instead, you must integrate from p=0.2 to p=0.8 and renormalize appropriately. This can be done numerically, or if you adopt a Beta(a,b) prior truncated to (0.2, 0.8), you can apply the Beta-binomial integral but with limits 0.2 to 0.8. The primary subtlety is ensuring that your probability distribution over p is valid (i.e., integrates to 1) and that you apply the binomial likelihood correctly. Failing to renormalize can lead to probabilities that do not sum to 1 over the entire set of outcomes.
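A minimal numerical sketch of that renormalization is shown below, assuming SciPy's quad for the integration and using the illustrative truncation limits 0.2 and 0.8: the prior density on p is 1/(0.8 - 0.2) on (0.2, 0.8), and each P(X = k) is obtained by integrating the binomial pmf against it.

# Sketch: P(X = k) under a uniform prior truncated to (0.2, 0.8), computed numerically.
# Assumes SciPy; the limits 0.2 and 0.8 are purely illustrative.
from math import comb
from scipy.integrate import quad

def truncated_uniform_pmf(n, lo=0.2, hi=0.8):
    density = 1.0 / (hi - lo)   # renormalized prior density on (lo, hi)
    probs = []
    for k in range(n + 1):
        integrand = lambda p, k=k: comb(n, k) * p**k * (1 - p)**(n - k) * density
        value, _ = quad(integrand, lo, hi)
        probs.append(value)
    return probs

probs = truncated_uniform_pmf(5)
print(probs)        # no longer flat at 1/6
print(sum(probs))   # still sums to 1 because the prior was renormalized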
Is there a possibility of degenerate cases for n=0 or n=1?
Yes, these edge cases can trip up an implementation if not handled carefully.
For n=0 (no tosses), the random variable X is always 0 because we have zero trials, so P(X=0) = 1. A direct application of the Beta-binomial formula still works in principle, but we need to interpret comb(n,k) carefully.
For n=1, we have P(X=0) = 1/2 and P(X=1) = 1/2 when p is uniform(0,1). This is a trivial check of the formula 1/(1+1) = 1/2. In real code or practical settings, these corner cases might need a special branch or at least thorough unit tests to ensure no division-by-zero or unexpected behaviors.
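A few unit-test style assertions make these corner cases explicit; this is a minimal sketch that reuses the simulate_distribution function defined earlier.

# Minimal edge-case checks, reusing simulate_distribution from the earlier snippet.
from math import comb, isclose

# n = 0: zero tosses, so X = 0 with probability 1; note that comb(0, 0) == 1.
assert comb(0, 0) == 1
assert simulate_distribution(0, trials=10_000) == [1.0]

# n = 1: the closed form gives 1/(1+1) = 1/2 for both outcomes.
probs = simulate_distribution(1, trials=200_000)
assert isclose(probs[0], 0.5, abs_tol=0.01) and isclose(probs[1], 0.5, abs_tol=0.01)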
Could measurement errors in the coin toss outcomes affect the result?
In some real-world experiments, you might not be entirely sure if a “head” was truly a head. For instance, if the detection mechanism is noisy or there’s a possibility of mislabeling heads and tails, your observed count k might be off. If you treat the observed k as exact, but in reality there’s classification error, the Beta-binomial inference is no longer correct. This can lead to biased estimates of p and inaccurate predictions about the distribution of X. One way to address it is by introducing a misclassification model, effectively modeling the probability that you see heads given it was actually heads and the probability you see heads given it was actually tails. Then, you would integrate over both p and the error parameters. This drastically complicates the closed-form integration, but captures real-world conditions where measurement is imperfect.
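One way to sketch the effect of such noise is below, with hypothetical error rates fp (a true tail recorded as a head) and fn (a true head recorded as a tail); under these assumptions the recorded count is no longer uniform on {0, ..., n}.

# Hypothetical sketch of observation noise: each recorded outcome may be mislabeled.
import random

def simulate_noisy(n, fp=0.05, fn=0.10, trials=200_000):
    counts = [0] * (n + 1)
    for _ in range(trials):
        p = random.random()
        observed_heads = 0
        for _ in range(n):
            true_head = random.random() < p
            if true_head:
                observed_heads += random.random() >= fn   # recorded as a head unless a false negative occurs
            else:
                observed_heads += random.random() < fp    # a tail is occasionally recorded as a head
        counts[observed_heads] += 1
    return [c / trials for c in counts]

print(simulate_noisy(5))   # deviates from the flat 1/6 pattern, especially at the extreme counts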
How does one proceed if the coin toss experiment is part of a larger hierarchical model?
In larger Bayesian models (e.g., hierarchical or multilevel models), p might itself be drawn from a higher-level distribution that depends on other unknown parameters. For example, you could have multiple coins (or multiple groups) each with its own p_i, and these p_i are drawn from some Beta distribution with hyperparameters a and b that are also unknown. The uniform prior on each p might itself be replaced by a Beta(a,b) whose parameters in turn have their own prior. Then, you do a fully Bayesian joint inference over all p_i and the hyperparameters. The Beta-binomial formula is still a building block, but you will typically need more advanced sampling or variational techniques to perform inference at scale. The pitfall here is that ignoring the hierarchical structure when it is truly present can lead to overconfident estimates and less robust predictions (since partial pooling is not accounted for).
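A small forward-simulation sketch of such a hierarchy is shown below; it is purely illustrative, with hypothetical hyperparameters a and b and NumPy for sampling, and it only generates data. Full inference over a and b would require MCMC or variational methods, which are beyond this snippet.

# Illustrative forward simulation of a hierarchical Beta-Binomial model (not an inference routine).
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                 # hypothetical hyperparameters of the shared Beta prior
num_coins, n = 8, 20

p_i = rng.beta(a, b, size=num_coins)   # each coin gets its own p_i ~ Beta(a, b)
heads = rng.binomial(n, p_i)           # k_i heads out of n tosses for each coin
print("per-coin p_i:", np.round(p_i, 3))
print("per-coin head counts:", heads)
# Inference would place priors on a and b and sample the joint posterior over all p_i, a, and b.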
Does the number of draws n have to be fixed beforehand?
In standard binomial settings, we assume the number of coin tosses n is fixed. However, there are scenarios where you continue tossing the coin until a certain stopping criterion is reached. That yields a different distribution (for example, a negative binomial if you stop after a set number of heads). In that scenario, if p is random and has a uniform prior, we do not get a Beta-binomial distribution for the total count of heads in the usual sense because the sample size itself is random. Instead, you have to model the stopping rule explicitly. If your stopping rule is purely time-based (like a maximum number of tosses), you are good to use the Beta-binomial. But if it depends on the number of heads seen so far, the posterior distribution for p might be biased by the stopping rule, leading to nontrivial analysis. It is crucial to incorporate that into the likelihood when performing the integration over p.
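As an illustration of a heads-dependent stopping rule, the hedged sketch below keeps tossing until r heads have appeared (a negative-binomial-style rule) with p still drawn from Uniform(0,1); the number of tosses is now random, so the fixed-n Beta-Binomial formula no longer applies directly.

# Hypothetical simulation of a heads-dependent stopping rule: toss until r heads appear,
# with p ~ Uniform(0,1). Very small p values lead to extremely long runs, so the toss
# count is heavy-tailed; a cap keeps the sketch well-behaved.
import random
import statistics

def simulate_stop_after_r_heads(r=3, trials=50_000, max_tosses=100_000):
    total_tosses = []
    for _ in range(trials):
        p = random.random()
        tosses, heads = 0, 0
        while heads < r and tosses < max_tosses:
            tosses += 1
            heads += random.random() < p
        total_tosses.append(tosses)
    return total_tosses

tosses = simulate_stop_after_r_heads()
# The sample size is random here, so the median is a more stable summary than the mean.
print("Median number of tosses to reach 3 heads:", statistics.median(tosses))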
How should we interpret the result in practical scenarios where the coin might not be truly random but rather adversarial?
In adversarial or strategic settings, an entity controlling the coin might adaptively change p as the experiment proceeds, contrary to the assumption that p is a single, fixed parameter drawn from a distribution. If you treat p as a static random draw while an adversary is actively trying to manipulate outcomes, you can drastically underestimate how often heads appear under some conditions or how tails might be forced under others. In such adversarial contexts—like repeated games or market scenarios—you might need game-theoretic or robust statistical methods rather than a straightforward Bayesian approach. Simply integrating out p under a uniform prior will misrepresent the actual risk or variability. A key pitfall is to blindly apply Beta-binomial logic while ignoring that p could be nonstationary or even determined by an adaptive opponent.