ML Interview Q Series: Modeling Tay-Sachs Inheritance Probability Using the Binomial Distribution
Tay-Sachs disease is a rare fatal genetic disease occurring chiefly in children, especially of Jewish or Slavic extraction. Suppose that we limit ourselves to families which have (a) exactly three children, and (b) which have both parents carrying the Tay-Sachs disease. For such parents, each child has an independent probability 1/4 of getting the disease. Let X be the random variable representing the number of children that will have the disease.
(a) Show (without using any knowledge you might have about the binomial distribution!) that the probability distribution for X is as follows: P(X=0) = 27/64, P(X=1) = 27/64, P(X=2) = 9/64, P(X=3) = 1/64.
(b) Show that E(X)=3/4 and that Var(X)=9/16.
Short Compact solution
We consider all possible ways that three children can be born healthy (H) or with the disease (D). There are 2^3 = 8 total sequences. Let H be the event a child is healthy (with probability 3/4) and D be the event a child has the disease (with probability 1/4). Each of the 8 sequences has a probability found by multiplying the appropriate factors of 3/4 and 1/4.
Only one sequence is H,H,H, and it has probability (3/4)(3/4)(3/4) = 27/64, which gives P(X=0). There are three sequences with exactly one D, each with probability (1/4)(3/4)(3/4) = 9/64, so together they sum to 27/64 for X=1. Similarly, there are three sequences with exactly two D’s, each with probability (1/4)(1/4)(3/4) = 3/64, adding to 9/64 for X=2. Finally, there is just one sequence D,D,D, with probability (1/4)(1/4)(1/4) = 1/64, which gives P(X=3).
From these probabilities, it follows that E(X)=3/4 and, by direct calculation or known binomial variance formulas, Var(X)=9/16. Indeed, this is a binomial distribution with parameters n=3 and p=1/4.
Comprehensive Explanation
To understand why this distribution is binomial and how we arrive at the probabilities 27/64, 27/64, 9/64, and 1/64, we can walk step by step through the reasoning:
Counting all possible outcomes
We have three children, each of whom can be Healthy (H) or Diseased (D), so there are 2^3 = 8 possible sequences. These sequences are not equally likely, because D and H do not each have probability 1/2: each child is diseased with probability 1/4 and healthy with probability 3/4. Since the children's outcomes are independent, the probability of each of the 8 sequences is the product of the corresponding 3/4 and 1/4 factors.
Example sequences:
H,H,H: probability (3/4)(3/4)(3/4) = 27/64
H,H,D: probability (3/4)(3/4)(1/4) = 9/64
D,D,D: probability (1/4)(1/4)(1/4) = 1/64
The remaining five sequences are handled the same way, covering all 8.
Summarizing outcomes by X, the number of diseased children
Let X be the number of D’s (diseased children). Then:
X=0 happens only in the single sequence H,H,H (probability 27/64).
X=3 happens only in the single sequence D,D,D (probability 1/64).
X=1 occurs in three sequences (H,H,D; H,D,H; D,H,H), each 9/64, summing to 27/64.
X=2 occurs in three sequences (H,D,D; D,H,D; D,D,H), each 3/64, summing to 9/64.
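The enumeration above can be checked mechanically. The following is a minimal sketch, not part of the original solution: it lists all 8 H/D sequences, multiplies the per-child probabilities, and groups the results by the number of D's. The names `P_DISEASED`, `P_HEALTHY`, and `enumerate_distribution` are illustrative choices, and exact fractions are used so the output can be compared directly with 27/64, 27/64, 9/64, 1/64.

```python
from fractions import Fraction
from itertools import product

P_DISEASED = Fraction(1, 4)  # probability that a child is D
P_HEALTHY = Fraction(3, 4)   # probability that a child is H

def enumerate_distribution(num_children=3):
    """Return {k: P(X=k)} by summing over every H/D sequence."""
    dist = {k: Fraction(0) for k in range(num_children + 1)}
    for seq in product("HD", repeat=num_children):
        # Independence: the sequence probability is a product of factors.
        prob = Fraction(1)
        for child in seq:
            prob *= P_DISEASED if child == "D" else P_HEALTHY
        dist[seq.count("D")] += prob
    return dist

distribution = enumerate_distribution()
for k, prob in distribution.items():
    print(f"P(X={k}) = {prob}")
```

No binomial formula is used here; the grouping by `seq.count("D")` is exactly the "summarize by X" step described above.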
Binomial distribution perspective
We can recognize X as a binomial random variable with n=3 and p=1/4, because “success” (child has the disease) occurs independently for each of n=3 children with probability p=1/4. Hence, the probability that exactly k children have the disease is given by the standard binomial formula:
P(X=k) = (3 choose k) (1/4)^k (3/4)^(3 - k),
where k can be 0, 1, 2, or 3.
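As a quick check, the binomial formula with n=3 and p=1/4, that is P(X=k) = (3 choose k) (1/4)^k (3/4)^(3 - k), can be evaluated for each k. This is a small sketch; the function name `binomial_pmf_three_children` is made up for illustration, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def binomial_pmf_three_children(k):
    """P(X=k) for n=3 children, each diseased with probability 1/4."""
    p = Fraction(1, 4)
    return comb(3, k) * p**k * (1 - p) ** (3 - k)

pmf_values = [binomial_pmf_three_children(k) for k in range(4)]
for k, prob in enumerate(pmf_values):
    print(f"P(X={k}) = {prob}")
```

The binomial coefficient (3 choose k) plays the role of "how many sequences have exactly k D's", which is 1, 3, 3, 1 for k = 0, 1, 2, 3.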
Calculating mean and variance
For a binomial random variable with parameters n and p, the mean is E(X)=n p, and the variance is Var(X)=n p (1 - p). Here n=3 and p=1/4.
Thus:
E(X) = 3 * (1/4) = 3/4,
Var(X) = 3 * (1/4) * (3/4) = 9/16.
These match the values obtained directly from the distribution found by enumerating the 8 sequences.
Potential Follow-Up Question: How do we derive E(X) and Var(X) by summation instead of using the binomial formula?
Even if we did not recall the binomial distribution, we could compute E(X) and Var(X) using the definitions:
E(X) = sum over k of (k * P(X=k))
E(X^2) = sum over k of (k^2 * P(X=k)), hence Var(X) = E(X^2) - [E(X)]^2
Explicitly, from P(X=0)=27/64, P(X=1)=27/64, P(X=2)=9/64, P(X=3)=1/64, we compute:
E(X) = 0*(27/64) + 1*(27/64) + 2*(9/64) + 3*(1/64) = (27/64) + (18/64) + (3/64) = 48/64 = 3/4.
Then E(X^2) = 0^2*(27/64) + 1^2*(27/64) + 2^2*(9/64) + 3^2*(1/64) = (27/64) + (36/64) + (9/64) = 72/64 = 9/8.
So Var(X) = E(X^2) - [E(X)]^2 = (9/8) - (3/4)^2 = (9/8) - (9/16) = (18/16 - 9/16) = 9/16.
This matches the result from the binomial formula.
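The summation above is easy to reproduce with exact arithmetic. The sketch below hard-codes the four probabilities from part (a) and applies the definitions of E(X), E(X^2), and Var(X); the variable names are illustrative.

```python
from fractions import Fraction

# Distribution of X from part (a).
pmf = {
    0: Fraction(27, 64),
    1: Fraction(27, 64),
    2: Fraction(9, 64),
    3: Fraction(1, 64),
}

mean = sum(k * prob for k, prob in pmf.items())              # E(X)
second_moment = sum(k**2 * prob for k, prob in pmf.items())  # E(X^2)
variance = second_moment - mean**2                           # Var(X)

print(mean, second_moment, variance)
```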
Potential Follow-Up Question: Why is Tay-Sachs inherited with probability 1/4 in this scenario?
When each parent is a carrier (heterozygous for the Tay-Sachs mutation), classical Mendelian genetics says each child has:
1/4 chance to inherit the mutant allele from both parents (becoming diseased),
1/2 chance to inherit one mutant and one normal allele (carrier but not diseased),
1/4 chance to inherit normal alleles from both parents (healthy, not a carrier).
Hence the probability that a child is actually diseased is 1/4. This is a direct application of simple autosomal recessive inheritance.
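The 1/4, 1/2, 1/4 split can be reproduced by enumerating the four equally likely allele combinations of a carrier-by-carrier cross. This is a minimal sketch under the simple Mendelian assumptions stated above; the allele labels "A" (normal) and "a" (mutant) are illustrative conventions, not taken from the original text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Each carrier parent passes on one of its two alleles with equal chance.
# "A" = normal allele, "a" = mutant allele (illustrative labels).
parent_1 = ("A", "a")
parent_2 = ("A", "a")

counts = Counter()
for allele_1, allele_2 in product(parent_1, parent_2):
    mutant_copies = (allele_1 + allele_2).count("a")
    counts[mutant_copies] += 1

total = sum(counts.values())
genotype_probs = {n: Fraction(c, total) for n, c in counts.items()}

print("non-carrier:", genotype_probs[0])
print("carrier:    ", genotype_probs[1])
print("diseased:   ", genotype_probs[2])
```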
Potential Follow-Up Question: Are there any assumptions we are making that might fail in real populations?
Several assumptions are implicit:
Independence of each birth event (i.e., one child having Tay-Sachs does not affect the next child’s probability). Real-world scenarios may have subtle correlations or different medical interventions after an affected birth.
Exact 1/4 inheritance risk from Mendelian genetics. In practice, there could be complexities like incomplete penetrance or variable expression, though for Tay-Sachs specifically, it is typically well-defined under autosomal recessive inheritance.
No selective factors altering family size or birth outcomes after a diseased child. In real life, parents’ decisions after an affected birth can influence whether they have further children, so conditioning on families with exactly three children may not leave the outcomes independent.
Despite these potential limitations, the binomial model is the standard theoretical treatment for such a problem under idealized assumptions.
Potential Follow-Up Question: How does this distribution generalize for different numbers of children and different probabilities?
If a family has n children, each with probability p of inheriting an autosomal recessive disease, then the number of diseased children follows Binomial(n, p). The general formulas are:
P(X=k) = (n choose k) p^k (1 - p)^(n - k),
E(X) = n p,
Var(X) = n p (1 - p).
For any n and p, these are standard results if each child’s outcome is an independent Bernoulli trial with success probability p.
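To illustrate the general case, the sketch below builds the Binomial(n, p) distribution for arbitrary n and p and confirms by direct summation that the mean and variance equal n p and n p (1 - p). The function names and the example values n=5, p=1/4 are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def mean_and_variance(pmf):
    """Compute E(X) and Var(X) from a list of probabilities by summation."""
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return mean, second_moment - mean**2

n, p = 5, Fraction(1, 4)  # example: a family with five children
pmf = binomial_pmf(n, p)
mean, variance = mean_and_variance(pmf)

print(mean, n * p)                # both 5/4
print(variance, n * p * (1 - p))  # both 15/16
```

Passing `p` as a `Fraction` keeps the comparison exact; with a float `p` the same code works, but equality would only hold up to rounding error.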
Potential Follow-Up Question: Could we code a simulation to empirically verify these probabilities and statistics?
Yes. In Python, we can run a simulation by repeatedly generating three independent Bernoulli(1/4) random variables and counting how many times the sum of these three is 0, 1, 2, or 3. For example:
```python
import random

def simulate_tay_sachs(num_families=10_000_000):
    counts = [0, 0, 0, 0]  # for X=0,1,2,3
    for _ in range(num_families):
        # Generate 3 independent diseased/healthy indicators
        x = sum(random.random() < 0.25 for _ in range(3))
        counts[x] += 1
    # Convert counts to probabilities
    probs = [count / num_families for count in counts]
    return probs

probs_estimated = simulate_tay_sachs()
print(probs_estimated)
```
If we run this, we should see approximate values near [27/64, 27/64, 9/64, 1/64]. We could also compute average and variance of X across the simulation and expect values close to 3/4 and 9/16 respectively.
This empirical check confirms the calculation: when the idealized assumptions hold, the simulated frequencies agree with the binomial model. It does not test whether those assumptions hold in real populations.
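The sample mean and variance can be estimated in the same way. The following is a minimal sketch, with a smaller default sample size and a fixed seed (both arbitrary choices made here for speed and reproducibility); the function name `simulate_moments` is illustrative.

```python
import random

def simulate_moments(num_families=200_000, seed=0):
    """Estimate E(X) and Var(X) for X = number of diseased children out of 3."""
    rng = random.Random(seed)  # local generator so the run is reproducible
    total = 0
    total_sq = 0
    for _ in range(num_families):
        x = sum(rng.random() < 0.25 for _ in range(3))
        total += x
        total_sq += x * x
    mean = total / num_families
    variance = total_sq / num_families - mean**2  # E(X^2) - [E(X)]^2
    return mean, variance

sample_mean, sample_variance = simulate_moments()
print(sample_mean, sample_variance)
```

The estimates should land close to 3/4 = 0.75 and 9/16 = 0.5625, with the gap shrinking as `num_families` grows.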