ML Interview Q Series: Shifted Geometric Branching Process: Calculating Extinction Probability via Generating Functions.
Browse all the Probability Interview Questions here.
In a branching process with one ancestor, the number of offspring of each individual has the shifted geometric distribution p_k = p(1-p)^k for k=0,1,... with parameter p in (0,1). What is the probability of extinction as a function of p?
Short Compact solution
The generating function of the offspring distribution is
We find the extinction probability by solving the fixed-point equation u = P(u). This yields
The two solutions are u = 1 and u = p/(1-p). The extinction probability is always the smallest solution that lies in [0, 1]. Thus, if p/(1-p) < 1 (i.e. p < 1/2), the extinction probability is p/(1-p). Otherwise, it is 1.
Comprehensive Explanation
Overview of Branching Processes and Extinction Probability
In a branching process (specifically a Galton–Watson process) starting with a single ancestor, each individual in the population produces a random number of offspring according to the same probability distribution. We want to find the probability that the entire population eventually dies out (i.e. becomes extinct) over an infinite time horizon.
To study the extinction probability, we use the probability generating function of the offspring distribution. If X is the number of offspring of a single individual, and p_k = P(X = k), then its generating function G(s) = E[s^X] = sum_{k=0 to infinity} p_k s^k. The extinction probability q is the smallest solution of q = G(q) in the interval [0,1].
Shifted Geometric Offspring Distribution
Here, each individual has a "shifted geometric" distribution for its offspring count, meaning:
p_k = p (1-p)^k for k = 0,1,2,…
where p in (0,1). Notice that k=0 is possible, meaning an individual can have zero offspring.
Deriving the Generating Function
For this distribution, the probability generating function P(u) is:
This series is a geometric series in u^k with ratio (1-p)u. Provided |(1-p)u| < 1, the sum of the series is:
Finding the Extinction Probability
The extinction probability, say q, satisfies q = P(q). Hence, we solve:
q = p / [1 - (1-p) q].
We can rearrange to get:
(1 - p) q^2 - q + p = 0.
This is a quadratic equation in q. The solutions are:
q = 1
q = p/(1-p)
Because an extinction probability must lie between 0 and 1, the relevant solution is chosen as the smallest root in the interval [0,1]. We check the value p/(1-p):
If p/(1-p) < 1, that implies p < 1/2. In that case, q = p/(1-p).
Otherwise, the only valid solution within [0,1] is q = 1, which holds if p >= 1/2.
Thus, the final extinction probability q is:
p/(1-p) when p < 1/2
1 otherwise
Interpretation in Terms of Subcritical, Critical, and Supercritical Regimes
In branching processes, one typically compares the mean offspring m to 1:
m < 1: subcritical (population usually dies out)
m = 1: critical (population has a high chance of dying out, but can also sustain indefinitely with small probability)
m > 1: supercritical (population may survive with some positive probability)
For a shifted geometric distribution p_k = p(1-p)^k, the expected number of offspring m is:
m = E[X] = sum_{k=0 to infinity} k * p(1-p)^k.
It can be shown that m = (p(1-p)) / p^2? Actually let's do it carefully inline without LaTeX: E[X] = (1-p)/p. (One can derive that using the fact that a "shifted geometric" with parameter p has mean (1-p)/p.)
If p < 1/2, then E[X] = (1-p)/p > 1, so it is a supercritical process, but the process might still become extinct with some probability less than 1.
If p = 1/2, then E[X] = 1, a critical regime.
If p > 1/2, then E[X] = (1-p)/p < 1, subcritical, hence extinction probability is 1.
Hence, it aligns perfectly with the root selection we found.
Potential Follow-up Questions
Why do we pick the smaller root of the quadratic?
In general for generating functions of branching processes, the extinction probability is the smallest fixed point of the generating function in the [0,1] interval. The function G(u) is increasing and convex in u, and the extinction probability is always the minimal solution in [0,1] to G(u) = u. Any solution larger than that minimal one does not correspond to the actual probability of extinction.
What happens if p approaches 1/2 from below or above?
Approaching p from below 1/2, the value of p/(1-p) approaches 1. Precisely at p=1/2, p/(1-p) = 1, so the smaller solution merges with 1.
Once p crosses 1/2, the smaller solution becomes greater than 1 and ceases to be valid within [0,1], so 1 is the only valid solution (complete extinction with probability 1).
Could we simulate this branching process in Python to empirically estimate extinction probability?
Yes. You can write a simulation that starts with a single ancestor and evolves generation by generation. Below is a simple illustration:
import numpy as np
def simulate_branching(p, num_simulations=10_000, max_generations=1000):
"""
p: float, parameter in (0, 1)
num_simulations: int, how many branching processes to simulate
max_generations: int, to cap the simulation (avoid infinite loops)
"""
extinction_count = 0
for _ in range(num_simulations):
# Start with one ancestor
population = 1
generation = 0
# Evolve until population is 0 or max_generations reached
while population > 0 and generation < max_generations:
offspring = 0
# Each individual spawns number of children with distribution p_k = p(1-p)^k
for _ in range(population):
k = 0
# You can use geometric sampling or direct approach
# We'll sample from a geometric distribution shifted by 0
# np.random.geometric(p) - 1 would produce a distribution with mean (1-p)/p
# but let's do a direct while-based approach:
while np.random.rand() > p:
k += 1
offspring += k
population = offspring
generation += 1
if population == 0:
extinction_count += 1
return extinction_count / num_simulations
# Example usage:
p_value = 0.45
estimate = simulate_branching(p_value, num_simulations=10000, max_generations=1000)
print(f"Empirical extinction probability for p={p_value} is approximately {estimate:.3f}")
For p < 1/2, you should see an empirical extinction probability close to p/(1-p). For p >= 1/2, it should be close to 1.
Are there any edge cases or boundary conditions?
p very close to 0: Then with high probability you have k=0. Indeed, if p is close to 0, (1-p) is near 1, so the distribution is heavily skewed towards large k, but that might not be stable because p is the probability of terminating the geometric count. In fact, if p is extremely small, the average offspring can be extremely large, and the extinction probability is less than 1.
p very close to 1: Then with high probability k=0. The process almost always produces zero offspring, so extinction is 1.
Exactly p=1/2: Then the mean offspring is 1, so it is a critical case, and the extinction probability is 1.
What if we interpret p as the parameter of a geometric distribution in some alternative forms?
Sometimes, the geometric distribution is defined differently in different texts (for example, support k=1,2,3,...). However, in this problem we have a specific “shifted geometric” on k=0,1,2,... with parameter p in (0,1). The reasoning and final formula remain the same as long as we consistently use the correct probability generating function.
All these clarifications confirm that the extinction probability for the given branching process is p/(1-p) for p < 1/2 and 1 otherwise.
Below are additional follow-up questions
If the parameter p is unknown, how can we estimate it from observed branching process data?
When dealing with real-world scenarios, it is common that the true p (the probability that we “stop” adding more offspring in the geometric count) is unknown. We may observe multiple realizations or partial observations of the process (e.g., generation sizes) and want to infer p. One approach is to observe the total number of offspring produced by many individuals and then use methods like Maximum Likelihood Estimation (MLE). Concretely:
In a single generation, if we observe N individuals, and each has a certain number of offspring X_i, then each X_i has the shifted geometric distribution p_k = p(1-p)^k for k=0,1,2,...
The log-likelihood for a single observation X = k is log L(p) = log p + k log(1-p).
Summing over all observations from multiple individuals, we get the total log-likelihood as the sum of these terms.
We can then take the derivative of the sum w.r.t. p, set it to zero, and solve for p.
In practice, we might also employ Bayesian methods with a prior over p, or rely on numerical optimization libraries if the sum is large.
A subtle pitfall is if the process is observed only after it has gone extinct in certain runs. In that case, the data may be biased toward smaller generation sizes, leading to biased estimates of p. One must carefully handle censored or truncated observations to get an accurate parameter estimate.
What if the branching process starts with more than one initial ancestor?
Though the derivation we showed assumes exactly one initial ancestor, in many practical situations one may start with multiple ancestors. In that scenario:
We can think of the process as the sum of identical independent Galton–Watson processes, each starting from a single ancestor.
The extinction probability of the combined process is the probability that all of these independent lines die out. If the probability of extinction for a single line is q, then for M initial ancestors the overall extinction probability is q^M, provided each lineage is independent.
This is straightforward for processes in which offspring distributions are identical and independent across individuals. However, if any coupling or dependency exists between lineages, this independence-based result no longer strictly applies.
A real-world challenge can arise if the genealogies of the multiple ancestors somehow overlap (e.g., offspring from different ancestors interact). Then the standard formula must be modified, and the process is no longer a simple sum of independent branching processes.
How is the extinction time (i.e., the random time until the population dies) distributed, and what factors influence it?
Besides knowing if the process eventually goes extinct, one might want to know how long until extinction typically occurs. Key observations include:
In subcritical or critical regimes (where E[X] <= 1), extinction is certain or almost certain, and the typical time to extinction might not be too large.
In supercritical regimes (where E[X] > 1), the population still might die out, but it can also “explode” rapidly. If extinction does occur, it may happen relatively quickly; if it doesn’t, the population tends to grow.
To analyze extinction time, we would typically look at P(T <= t), where T is the time (in discrete generations) at which the population hits zero for the first time. This distribution is more involved than just the extinction probability. For a shifted geometric distribution, one can theoretically derive the generating function of T, but it can get mathematically intricate.
Simulation is often a practical tool: we can run many replicates, record the time to extinction for each replicate, and empirically approximate its distribution.
A pitfall is that if p < 1/2 (supercritical), there is a positive chance that the process never goes extinct, so T = infinity in some realizations. In that case, the “conditional” distribution of T (conditional on extinction actually happening) is finite, but we must carefully handle those realizations where the process “survives forever.”
How do correlation or dependence among offspring affect the standard result?
The standard Galton–Watson framework assumes each individual’s offspring distribution is independent of the number of offspring of other individuals. In reality, there may be correlations—e.g., if individuals share resources, if there is some environmental feedback, or if there’s a genetic component that makes large families more likely in certain lineages. Once you introduce dependence, the classic approach using P(u) = E[u^X] and q = P(q) is no longer strictly valid. Common pitfalls include:
Overestimating or underestimating extinction probability: if offspring counts are positively correlated, the population is more prone to large bursts and might have a lower extinction probability than predicted under independence.
Analytical challenges: deriving a neat formula for the extinction probability becomes significantly more complicated. Approximations or specialized techniques (like coupling arguments) may be needed.
In such scenarios, simulation or more advanced branching process models (e.g., multi-type branching or branching processes in a random environment) are often used.
Could environmental variation over time alter the extinction probability?
Yes. The classical derivation presumes a fixed distribution for offspring across all generations. But real-world environments (e.g., availability of resources, seasonal factors) can cause the offspring distribution to change with time. This leads to the concept of a branching process in a random (or varying) environment:
In a random environment, each generation might have its own parameter p, say p_1, p_2, …, making the offspring distribution for generation t different from that of generation t+1.
The extinction probability then depends on the sequence p_1, p_2, …, and might be derived from more advanced theorems that analyze products of generating functions or average growth rates over time.
A key pitfall is incorrectly assuming a single p for all generations, thus overlooking the possibility that in many real-life systems, “bad years” or “good years” occur and drastically change the survival dynamics.
What if there is a maximum allowed number of offspring for each individual?
In practice, certain processes might cap the number of offspring at some maximum K for resource reasons or by design (e.g., in hardware duplication systems). This is no longer an infinite-support shifted geometric distribution. Instead, p_k could be truncated for k > K. Such truncation can:
Reduce the variance of the distribution and typically increase the probability of extinction, because extremely large offspring counts are impossible.
Affect the generating function, as it would be a finite sum instead of an infinite series. We would have something like P(u) = sum_{k=0}^K p_k u^k.
Require re-solving the equation q = P(q) with the revised distribution. The basic principle remains the same, but the function P(u) changes.
A subtlety is that if K is large enough relative to typical population size, you might not see the effect in small-sample simulations. However, in rare events, that cutoff could drastically alter the long-term survival chance.
Could partial observation of the process mislead the estimate of extinction probability?
Often, we may not observe every generation precisely. We might only have snapshots at certain time intervals or partial data about the population size. Examples of pitfalls:
If we only see the generation sizes at certain time steps (e.g., every 2 or 5 generations), we might miss information about near-extinction dips that occurred in the intermediate steps.
We might incorrectly conclude that the population is on a path to indefinite growth when it is actually on a path to extinction if we do not track the process carefully enough.
Model-based inference (e.g., hidden Markov models for partial observation) can adjust for these data gaps, but any mismatch between model assumptions and real data can lead to erroneous conclusions.
In large-scale scenarios, how does the branching process approach compare to continuous-time birth–death processes?
In certain real applications (biology, network growth models, chemical reactions), time might be better treated as continuous, with each individual reproducing (or dying) at random times. This leads to continuous-time Markov branching processes, often called birth–death processes, where each individual lives for some exponential waiting time, then splits or dies. The discrete-time approach with the shifted geometric distribution is a special case of a more general phenomenon:
In continuous-time models, one defines rates for birth and death, and the extinction probability emerges from the ratio of those rates.
The “shifted geometric in discrete time” roughly corresponds to a scenario in continuous time where the probability of producing more offspring declines geometrically in small increments of time, but the exact translation is not straightforward.
For advanced modeling, we compare the mean reproduction rate per unit time to the death rate. If the reproduction rate is higher, the process might be supercritical with a positive probability of infinite survival. Otherwise, extinction is certain.
A pitfall arises when trying to map parameters from a discrete-time model directly onto a continuous-time model without carefully matching rates. The eventual extinction result could differ significantly if we handle time incorrectly.