ML Interview Q Series: Sum of Binomials is Binomial, Conditional Distribution is Hypergeometric.
Question: Let X and Y be independent random variables, where X is binomial with parameters n and p, and Y is binomial with parameters m and p. (a) Explain, in terms of Bernoulli experiments, why X+Y is binomial with parameters n+m and p, and then give a formal proof. (b) Show that for a fixed k, the probabilities P(X = j | X + Y = k) for j = 0,…,k form a hypergeometric distribution.
Short Compact solution
One way to see that X + Y follows a binomial distribution with parameters n + m and p is to imagine a total of n + m independent Bernoulli trials, each with probability p of success. The random variable X counts the number of successes in the first n trials, and Y counts the number of successes in the remaining m trials. By definition, X + Y then counts the total number of successes across all n + m trials, so X + Y is binomial(n + m, p).
Formally, because X and Y are independent, the probability P(X + Y = k) is the sum of P(X = r) P(Y = k − r) over r = 0,…,k. Substituting the binomial PMFs and applying Vandermonde's identity, that summation simplifies to the Binomial(n + m, p) PMF. For the conditional distribution, P(X = j | X + Y = k) can be written as the ratio of P(X = j, Y = k − j) to P(X + Y = k). After simplification, those probabilities match a hypergeometric distribution with parameters n, m, and k.
Comprehensive Explanation
Intuitive Reasoning with Bernoulli Trials
We can think of a total of n + m independent Bernoulli trials, each having success probability p. The variable X counts how many successes occur in the first n trials, while Y counts how many successes occur in the last m trials. Because each trial is independent and has the same probability p of success, if we sum up successes over all n + m trials, that total X + Y must be binomial with parameters n + m and p.
Formal Proof That X + Y Is Binomial(n + m, p)
Because X is binomial with parameters n and p, it follows that for any integer r:
P(X = r) = (n choose r) p^r (1 − p)^(n − r)
Similarly, Y is binomial with parameters m and p, so for integer s:
P(Y = s) = (m choose s) p^s (1 − p)^(m − s)
Given X and Y are independent, the joint probability P(X = r, Y = s) is the product:
P(X = r, Y = s) = P(X = r) P(Y = s)
Hence, if we want P(X + Y = k), we sum over all ways X = r and Y = k − r can occur:
P(X + Y = k) = sum_{r=0..k} P(X = r) P(Y = k − r)
Substituting the binomial PMFs for X and Y, we get:
P(X + Y = k) = sum_{r=0..k} [ (n choose r) p^r (1 − p)^(n − r) * (m choose (k − r)) p^(k − r) (1 − p)^(m − (k − r)) ]
Combine powers of p and (1 − p):
= sum_{r=0..k} [ (n choose r) (m choose (k − r)) ] p^k (1 − p)^(n + m − k)
By Vandermonde's identity,
sum_{r=0..k} (n choose r) (m choose (k − r)) = (n + m choose k)
Therefore,
P(X + Y = k) = (n + m choose k) p^k (1 − p)^(n + m − k)
Hence, X + Y follows a binomial distribution with parameters n + m and p.
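As a quick numerical sanity check (a minimal sketch in Python; the helper name binom_pmf and the values n = 5, m = 7, p = 0.3 are arbitrary choices), we can confirm that the convolution of the two binomial PMFs equals the Binomial(n + m, p) PMF for every k:

from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p = 5, 7, 0.3  # arbitrary example parameters
for k in range(n + m + 1):
    # Convolution: sum over r of P(X = r) * P(Y = k - r)
    conv = sum(binom_pmf(r, n, p) * binom_pmf(k - r, m, p)
               for r in range(max(0, k - m), min(n, k) + 1))
    assert abs(conv - binom_pmf(k, n + m, p)) < 1e-12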
Conditional Distribution Is Hypergeometric
We now look at P(X = j | X + Y = k). By definition,
P(X = j | X + Y = k) = P(X = j, Y = k − j) / P(X + Y = k).
Because X and Y are independent:
P(X = j, Y = k − j) = P(X = j) P(Y = k − j).
Thus:
P(X = j, Y = k − j) = (n choose j) p^j (1 − p)^(n − j) (m choose (k − j)) p^(k − j) (1 − p)^(m − (k − j)).
Meanwhile, we have already shown:
P(X + Y = k) = (n + m choose k) p^k (1 − p)^(n + m − k).
Hence,
$$ P(X = j \mid X + Y = k) = \frac{\binom{n}{j}\,\binom{m}{k-j}\,p^{k}\,(1 - p)^{\,n+m-k}}{\binom{n+m}{k}\,p^{k}\,(1 - p)^{\,n+m-k}}. $$
Canceling out the common p^k and (1 − p)^(n + m − k), we get:
P(X = j | X + Y = k) = (n choose j) (m choose (k − j)) / (n + m choose k).
This is precisely the hypergeometric distribution with parameters n, m, and k. In other words, given the total number of successes k across n + m trials, the probability that j of those successes occurred in the first n trials matches the hypergeometric probability formula.
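A similar quick check (again a sketch with arbitrary example values; the helper is mine, not part of any particular library) computes P(X = j | X + Y = k) from the binomial PMFs and confirms it matches the hypergeometric formula above:

from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p, k = 5, 7, 0.3, 4  # arbitrary example values
for j in range(max(0, k - m), min(n, k) + 1):
    # Conditional probability computed from the joint and marginal binomial PMFs
    conditional = binom_pmf(j, n, p) * binom_pmf(k - j, m, p) / binom_pmf(k, n + m, p)
    # Hypergeometric PMF: C(n, j) C(m, k - j) / C(n + m, k)
    hypergeom = comb(n, j) * comb(m, k - j) / comb(n + m, k)
    assert abs(conditional - hypergeom) < 1e-12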
Follow-up Question 1: Interpretation via Sampling Without Replacement
An interviewer might ask how the hypergeometric distribution, usually introduced via sampling without replacement, arises here. The binomial setting itself corresponds to independent trials (sampling with replacement). However, once we condition on X + Y = k total successes, every placement of those k successes among the n + m trial positions is equally likely. Asking how many of the k successes fall in the first n trials is then equivalent to drawing k tickets without replacement from an urn of n + m tickets, of which n are labeled “first block” and m are labeled “second block”, and counting the first-block tickets drawn. That is exactly the hypergeometric setup, so the conditional distribution P(X = j | X + Y = k) is hypergeometric.
Follow-up Question 2: Boundary and Edge Cases
Interviewers might also test understanding by asking about boundary cases:
When p = 0, both X and Y are almost surely 0, so X + Y is 0. This is consistent with binomial parameters n + m and p = 0.
When p = 1, X = n and Y = m almost surely, so X + Y = n + m. This again matches the binomial(n + m, 1) distribution.
When k = 0, then X + Y = 0 implies X = 0 and Y = 0 with probability 1. Similarly, when k = n + m, that means all trials succeeded. These edge outcomes also align with standard binomial properties.
Follow-up Question 3: Practical Implementation with Python
An interviewer may ask how to simulate or compute these probabilities in code. For instance, using Python:
from math import comb
import random

def binomial_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * (p**k) * ((1 - p)**(n - k))

def hypergeom_pmf(j, n, m, k):
    # P(X = j | X + Y = k): probability that j of the k total successes
    # fall in the first n of the n + m trials
    return comb(n, j) * comb(m, k - j) / comb(n + m, k)

# Simulate X ~ Binomial(n, p) and Y ~ Binomial(m, p)
n, m, p = 5, 7, 0.3
num_samples = 1_000_000
count_XplusY = [0] * (n + m + 1)
for _ in range(num_samples):
    # Generate X and Y as counts of successes in independent Bernoulli(p) trials
    X = sum(random.random() < p for _ in range(n))
    Y = sum(random.random() < p for _ in range(m))
    count_XplusY[X + Y] += 1

# Empirical distribution of X + Y
empirical_probs = [cnt / num_samples for cnt in count_XplusY]
# Compare with the theoretical Binomial(n + m, p) PMF
theoretical_probs = [binomial_pmf(k, n + m, p) for k in range(n + m + 1)]
This code illustrates generating independent binomial samples for X and Y, and then checking how often their sum equals each integer k from 0 to n + m, comparing with the binomial(n + m, p) formula for verification.
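The hypergeom_pmf helper above is not exercised by that loop. A small extension (a sketch that reuses the same n, m, p, num_samples, and helpers from the code above, with an arbitrary choice of k) records X whenever X + Y equals a fixed k and compares the conditional frequencies with the hypergeometric PMF:

k_fixed = 4  # condition on X + Y = 4 (arbitrary choice between 0 and n + m)
cond_counts = [0] * (min(n, k_fixed) + 1)
total_at_k = 0
for _ in range(num_samples):
    X = sum(random.random() < p for _ in range(n))
    Y = sum(random.random() < p for _ in range(m))
    if X + Y == k_fixed:
        cond_counts[X] += 1
        total_at_k += 1

# Empirical conditional distribution of X given X + Y = k versus the hypergeometric PMF
for j in range(min(n, k_fixed) + 1):
    print(j, cond_counts[j] / total_at_k, hypergeom_pmf(j, n, m, k_fixed))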
Follow-up Question 4: Typical Pitfalls
An interviewer may probe whether a candidate knows common pitfalls. One mistake might be to assume X + Y could be something other than binomial just because X and Y are different binomial distributions (with same p but different n and m). Another subtlety is forgetting the independence condition. If X and Y were not independent, the result for X + Y might not remain binomial. Finally, not recognizing the hypergeometric structure when conditioning on X + Y = k is a common oversight.
All these nuances help emphasize the importance of understanding both combinatorial identities and independence in probability theory.
Below are additional follow-up questions
Follow-up Question 1: Behavior When n or m Is Random
Suppose instead of having fixed integers n and m, we let them be random variables themselves (still independent of each other and of any Bernoulli trials). If n and m are random (with some known distribution on the nonnegative integers) and conditional on n and m we define X ~ Binomial(n, p) and Y ~ Binomial(m, p), how would we characterize the distribution of X+Y?
Answer Explanation (Detailed)
Once n and m are random, X and Y become mixtures of binomial distributions. Each distribution for n or m, combined with the binomial distribution for X or Y, leads to a more complex compound distribution.
If n and m are independent and identically distributed discrete random variables, the sum X+Y still depends on the realized values of n and m, but we typically cannot collapse X+Y into a simple Binomial(n+m, p) unless we condition on n and m.
Unconditionally, X + Y is a mixture of binomials, weighted by the distribution of n + m. One clean special case: if n ∼ Poisson(λ1) and m ∼ Poisson(λ2) independently, the Poisson thinning property gives X ∼ Poisson(λ1 p) and Y ∼ Poisson(λ2 p) exactly, so X + Y ∼ Poisson((λ1 + λ2) p).
The main pitfall is to assume independence and identical distribution of n and m alone is enough to yield a straightforward binomial distribution for X+Y. One must carefully handle the distribution of n+m. Conditioned on n+m, it remains binomial. Unconditionally, it becomes a more general compound distribution.
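A minimal numerical sketch of that Poisson special case (assuming a single block with n ∼ Poisson(λ) and X | n ∼ Binomial(n, p); the truncation point and the values λ = 6, p = 0.3 are arbitrary) confirms that the unconditional PMF of X matches Poisson(λp):

from math import comb, exp, factorial

def poisson_pmf(j, lam):
    return exp(-lam) * lam**j / factorial(j)

lam, p = 6.0, 0.3  # arbitrary example parameters
for j in range(15):
    # P(X = j) = sum over n of P(N = n) * P(Binomial(n, p) = j), truncated at n = 150
    mixed = sum(poisson_pmf(n, lam) * comb(n, j) * p**j * (1 - p)**(n - j)
                for n in range(j, 150))
    # Thinning: the mixture should equal the Poisson(lam * p) PMF
    assert abs(mixed - poisson_pmf(j, lam * p)) < 1e-10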
Follow-up Question 2: Effect of Varying p Over Trials
Imagine the scenario where in the first n trials the success probability is p1, while in the next m trials the success probability is p2, with p1 ≠ p2. What is the distribution of X+Y?
Answer Explanation (Detailed)
If p1 ≠ p2, then X ∼ Binomial(n, p1) and Y ∼ Binomial(m, p2), and X+Y generally will not be a simple Binomial(n+m, p). We lose the uniform success probability assumption required for a binomial distribution with n+m trials.
We still have:
P(X = r) = (n choose r) p1^r (1 − p1)^(n−r)
P(Y = s) = (m choose s) p2^s (1 − p2)^(m−s)
Because X and Y remain independent, the PMF of X + Y is the convolution of two binomial PMFs with different success probabilities. The sum is a special case of the Poisson binomial distribution (a sum of independent Bernoulli trials with non-identical probabilities); with only two blocks of trials, it can still be written out as an explicit convolution sum (see the sketch below).
A common pitfall is to automatically label X+Y with a binomial distribution even though p1 ≠ p2. One must carefully handle the fact that each trial block uses a different probability of success.
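A short sketch of that convolution (the helper names and parameter values are illustrative) computes the exact PMF of X + Y when p1 ≠ p2 and checks that it is a valid distribution:

from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p1, p2 = 5, 7, 0.2, 0.6  # arbitrary example parameters

def sum_pmf(k):
    # Convolution of Binomial(n, p1) and Binomial(m, p2) at the point k
    return sum(binom_pmf(r, n, p1) * binom_pmf(k - r, m, p2)
               for r in range(max(0, k - m), min(n, k) + 1))

pmf = [sum_pmf(k) for k in range(n + m + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12
# The result is generally NOT Binomial(n + m, p) for any single p when p1 != p2.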
Follow-up Question 3: Moment Generating Functions (MGFs)
Explain how to use the moment generating functions of X and Y to prove that X+Y ∼ Binomial(n+m, p), assuming X ∼ Binomial(n, p), Y ∼ Binomial(m, p), and independence.
Answer Explanation (Detailed)
The MGF of a binomial(n, p) random variable X is M_X(t) = (1 − p + p e^t)^n.
For Y ∼ Binomial(m, p), its MGF is M_Y(t) = (1 − p + p e^t)^m.
If X and Y are independent, the MGF of their sum is the product of their individual MGFs: M_{X+Y}(t) = M_X(t) × M_Y(t) = (1 − p + p e^t)^n × (1 − p + p e^t)^m = (1 − p + p e^t)^(n+m).
This is exactly the MGF of a Binomial(n+m, p) random variable. Therefore X+Y follows Binomial(n+m, p).
A subtle real-world pitfall is mixing up MGF with the characteristic function or probability generating function. While all are valid approaches, one must ensure the algebra and conditions for using them hold, including independence.
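A small numeric check of the MGF argument (a sketch; the evaluation points and parameters are arbitrary) computes M_X(t) and M_Y(t) directly from the binomial PMF and verifies that their product equals the closed-form Binomial(n + m, p) MGF:

from math import comb, exp

def binom_mgf_from_pmf(t, n, p):
    # E[e^{tX}] computed directly from the Binomial(n, p) PMF
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * exp(t * k) for k in range(n + 1))

n, m, p = 5, 7, 0.3  # arbitrary example parameters
for t in (-1.0, -0.2, 0.0, 0.5, 1.3):
    mx = binom_mgf_from_pmf(t, n, p)
    my = binom_mgf_from_pmf(t, m, p)
    closed_form = (1 - p + p * exp(t)) ** (n + m)
    # M_X(t) * M_Y(t) should equal the Binomial(n + m, p) MGF
    assert abs(mx * my - closed_form) < 1e-9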
Follow-up Question 4: Handling Dependence Between X and Y
What happens if X and Y are not independent? For instance, X counts successes in n trials, and Y counts successes in those same n trials plus an additional m new trials (so the m new ones are disjoint, but the first n are common). Is X+Y still binomial(n+m, p)?
Answer Explanation (Detailed)
In that scenario, X is the number of successes in the first n trials, but Y includes successes from the first n trials (shared) and from additional m trials. Hence Y is no longer counting entirely disjoint trials; the first n are overlapping with X.
This creates dependence. Specifically, if X was high, then Y is more likely to be high (because part of Y is the same random outcome as the first n trials).
As a result, you cannot simply do P(X+Y = k) = sum P(X=r)P(Y=k−r) because they’re not independent. The sum X+Y in this overlapping scenario is not binomial(n+m, p). The distribution is more complicated and typically does not have a simple closed-form.
A major pitfall is to overlook correlation introduced by partial overlap in the trials. Real-world experiments can sometimes have partial overlap if the second “sample” reuses data from the first sample.
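A quick simulation sketch of this overlapping-trials scenario (with arbitrary n, m, p) shows that the variance of X + Y clearly exceeds the Binomial(n + m, p) variance (n + m)p(1 − p), so the sum cannot be Binomial(n + m, p):

import random

n, m, p = 5, 7, 0.3        # arbitrary example parameters
num_samples = 200_000
totals = []
for _ in range(num_samples):
    shared_successes = sum(random.random() < p for _ in range(n))  # first n trials, shared
    new_successes = sum(random.random() < p for _ in range(m))     # m additional trials
    X = shared_successes
    Y = shared_successes + new_successes  # Y reuses the outcomes of the first n trials
    totals.append(X + Y)

mean = sum(totals) / num_samples
var = sum((t - mean) ** 2 for t in totals) / num_samples
print("empirical variance of X + Y:", var)             # roughly 4*n*p*(1-p) + m*p*(1-p)
print("Binomial(n + m, p) variance:", (n + m) * p * (1 - p))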
Follow-up Question 5: Large n+m and Central Limit Theorem (CLT) Perspective
If n+m is large, how does the distribution of X+Y behave, and when does a normal approximation become relevant?
Answer Explanation (Detailed)
For large n+m, the binomial(n+m, p) distribution can be well-approximated by a normal distribution with mean (n+m)p and variance (n+m)p(1 − p). This is a direct consequence of the Central Limit Theorem (CLT) or the De Moivre–Laplace theorem (a specific case of CLT for binomial distributions).
Specifically, as n+m → ∞, (X+Y − (n+m)p) / sqrt((n+m)p(1 − p)) converges in distribution to a standard normal random variable.
Pitfalls include using the normal approximation for small n+m or extreme values of p near 0 or 1. In those cases, the approximation can be poor. Another pitfall is forgetting the continuity correction factor if the interviewer wants a more accurate normal approximation to the binomial.
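The sketch below (arbitrary parameters; the standard normal CDF is built from math.erf) compares the exact binomial probability of an interval with the normal approximation, with and without the continuity correction:

from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n_total, p = 200, 0.3  # think of n_total as a large n + m
mu = n_total * p
sigma = sqrt(n_total * p * (1 - p))

a, b = 50, 70  # probability of a <= X + Y <= b
exact = sum(binom_pmf(k, n_total, p) for k in range(a, b + 1))
plain = norm_cdf((b - mu) / sigma) - norm_cdf((a - mu) / sigma)
corrected = norm_cdf((b + 0.5 - mu) / sigma) - norm_cdf((a - 0.5 - mu) / sigma)
print(exact, plain, corrected)  # the continuity-corrected value is typically closer to exact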
Follow-up Question 6: Relationship to Negative Binomial and Other Distributions
How does the distribution of X+Y relate to other well-known discrete distributions like the negative binomial? Could X+Y become negative binomial under certain transformations or conditions?
Answer Explanation (Detailed)
The negative binomial distribution typically arises when counting the number of trials needed to achieve a certain number of successes. That is a different process-based definition than simply adding two binomial counts that come from a fixed number of trials.
However, there is a relationship between sums of independent geometric or certain negative binomial variables. For instance, if X and Y were both geometric with parameter p, their sum might have a shifted negative binomial distribution. But that is conceptually different from X and Y being binomial(n, p) and binomial(m, p).
A pitfall is to conflate negative binomial with a sum of binomial distributions. While sums of binomial distributions with the same p remain binomial with a larger n, negative binomial does not typically appear as X+Y for binomial random variables unless there is a reinterpretation of how the random variables are defined.
Follow-up Question 7: Real-World Experimentation and Overdispersion
In real-world data, we often see that the empirical variance of the observed “success” counts exceeds the theoretical variance of the binomial model, (n + m)p(1 − p). How might this discrepancy occur, and does that mean the binomial model is invalid?
Answer Explanation (Detailed)
Overdispersion can arise if the probability of success p is not truly constant across all n+m trials, or if trials are not strictly independent. For example, external factors, batch effects, or correlation between individual trials can inflate the variance beyond the binomial assumption.
Overdispersion does not necessarily invalidate the binomial model altogether, but it signals that either the data might need a beta-binomial model (where p is itself random) or a mixed model that accommodates variations across trials.
A pitfall is to blindly apply the binomial distribution to data that show strong clustering or correlation of successes. The correct approach is to diagnose whether the binomial assumption is truly appropriate or if a more complex model is needed.
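A brief sketch of overdispersion (assuming p varies from experiment to experiment as a Beta(a, b) draw, which is the beta-binomial setup; all parameter values are illustrative) compares the empirical variance of the counts with the binomial variance implied by the average success rate:

import random

n_trials = 50          # trials per experiment (think n + m)
a, b = 2.0, 4.0        # Beta parameters controlling how much p varies between experiments
num_experiments = 50_000

counts = []
for _ in range(num_experiments):
    p_i = random.betavariate(a, b)  # p is random per experiment
    counts.append(sum(random.random() < p_i for _ in range(n_trials)))

mean = sum(counts) / num_experiments
var = sum((c - mean) ** 2 for c in counts) / num_experiments
p_bar = mean / n_trials
print("empirical variance:", var)
print("binomial variance at the average p:", n_trials * p_bar * (1 - p_bar))
# The empirical variance is noticeably larger: the classic signature of overdispersion.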
Follow-up Question 8: Partial Observability in Practice
In practice, one may only observe X+Y for some experiments but not observe X and Y individually. How does one infer p or test the binomial hypothesis in such partial observability cases?
Answer Explanation (Detailed)
If we only observe the total number of successes out of n+m trials (i.e., X+Y) but do not know how many came from the first n or the last m, we can still estimate p by MLE: the empirical fraction of successes in n+m trials is (X+Y)/(n+m).
However, testing whether the distribution truly splits into X ∼ Binomial(n, p) and Y ∼ Binomial(m, p) might require additional data or assumptions. We need to know that the first n and last m trials are truly independent and identically distributed.
A common pitfall is to assume independence and identical distribution of the two sets of trials without verifying the conditions. If the environment changes halfway through data collection, the first n trials might differ systematically from the last m.
Follow-up Question 9: Confidence Intervals for p and for X+Y
If we want to provide a confidence interval for p based on observed X+Y, how do we proceed, and what complexities can arise if we also want to bound the distribution of X+Y itself?
Answer Explanation (Detailed)
When we have a single binomial sample of size n+m, we can build a confidence interval for p using standard binomial-proportion confidence interval methods (e.g., Clopper–Pearson, Wilson, or normal approximation intervals).
Complexity arises if we want separate estimates of p for the two subsets of trials or if n and m are large but not necessarily the same order of magnitude. The intervals might need separate data to identify differences in success rates.
Another complexity is if X and Y each correspond to different underlying populations or treatments. Then forming a single estimate p from X+Y might average out meaningful differences. This leads to potential biases.
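As one concrete option, a Wilson score interval for p based only on the observed total of successes out of n + m trials can be computed as in the sketch below (z = 1.96 for an approximate 95% interval; the observed counts are made up for illustration):

from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    # Wilson score confidence interval for a binomial proportion
    phat = successes / trials
    denom = 1 + z**2 / trials
    center = (phat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * sqrt(phat * (1 - phat) / trials + z**2 / (4 * trials**2))
    return center - half_width, center + half_width

# Example: observed X + Y = 37 successes out of n + m = 120 trials
print(wilson_interval(37, 120))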
Follow-up Question 10: Multiple Comparisons with Several Binomial Sums
If we run multiple binomial experiments (e.g., X1 ∼ Binomial(n1, p), X2 ∼ Binomial(n2, p), …, Xk ∼ Binomial(nk, p)) and sum them all, how do we handle statistical inferences (confidence intervals or hypothesis tests) in a scenario with many sums?
Answer Explanation (Detailed)
In principle, if all Xi are independent Binomial(ni, p) with the same p, the sum S = X1 + X2 + … + Xk is Binomial(n1 + n2 + … + nk, p). In that sense, the combined inference for p could use S as a single count out of (n1 + n2 + … + nk) total trials.
However, if each Xi is used to make separate decisions or separate inferences, we face multiple-comparison challenges. One example is controlling the overall Type I error rate across k tests.
A subtle pitfall is ignoring correlation or differences among the k experiments. If p changes or if the trials for each Xi differ in crucial ways, we can overstate confidence. Multiple testing corrections (like Bonferroni or false discovery rate methods) may be needed.
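A sketch of the pooled-versus-per-experiment view (the exact two-sided binomial test here is implemented directly from the PMF; the counts and the null value p0 are made up for illustration) applies a Bonferroni-adjusted threshold to each experiment and also tests the pooled sum:

from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def exact_binom_pvalue(x_obs, n, p0):
    # Two-sided exact test: sum the probabilities of all outcomes
    # that are no more likely than the observed one under H0: p = p0
    p_obs = binom_pmf(x_obs, n, p0)
    return sum(binom_pmf(x, n, p0) for x in range(n + 1)
               if binom_pmf(x, n, p0) <= p_obs + 1e-12)

p0 = 0.3                                     # null hypothesis success probability
counts = [(40, 100), (28, 120), (55, 150)]   # (successes, trials) per experiment, illustrative
alpha = 0.05
bonferroni_alpha = alpha / len(counts)       # per-test threshold controlling family-wise error

for i, (x, n) in enumerate(counts):
    pval = exact_binom_pvalue(x, n, p0)
    print(f"experiment {i}: p-value {pval:.4f}, reject at Bonferroni level: {pval < bonferroni_alpha}")

# Pooled inference: under a common p, the sum of counts is Binomial(total trials, p)
x_pool = sum(x for x, _ in counts)
n_pool = sum(n for _, n in counts)
print("pooled p-value:", exact_binom_pvalue(x_pool, n_pool, p0))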