ML Interview Q Series: Predicting Rare Tunnel Vehicle Breakdowns Using Poisson Approximation
Question
Given that 0.04% of vehicles break down when driving through a certain tunnel, find the probability of (a) no breakdowns and (b) at least two breakdowns in an hour when 2,000 vehicles enter the tunnel.
Short Compact solution
The number of breakdowns X can be modeled by a Binomial distribution with n=2000 and p=0.0004. Because n is large and p is small, we approximate X with a Poisson distribution having rate lambda=2000*0.0004=0.8. Using this approximation:
Probability that X=0 is approximately 0.4493
Probability that X>=2 is 1 - P(X<=1), which is 1 - 0.8088=0.1912
Comprehensive Explanation
Binomial Distribution Perspective
When we have n independent trials, each with success probability p (where a "success" here means "vehicle breaks down"), the total number of successes X follows a Binomial distribution. The Binomial pmf (probability mass function) for X=k is:
P(X=k) = C(n, k) * p^k * (1 - p)^(n - k)
where C(n, k) = n! / (k! * (n - k)!) is the binomial coefficient, n is the total number of trials (vehicles), p is the probability of success (breakdown) for each trial, and k is the number of observed successes (breakdowns).
In this problem, n=2000 and p=0.0004. Strictly speaking, one could directly compute:
P(X=0) = (1 - 0.0004)^(2000)
P(X=1) = 2000 * (0.0004) * (1 - 0.0004)^(1999)
etc.
However, because n is large (2000) and p is quite small (0.0004), we frequently use a Poisson approximation for convenience and speed of computation.
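For comparison, the exact Binomial values can be computed directly. The following is a minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later); the helper name `binomial_pmf` is an illustrative choice, not part of the original solution.

```python
import math

n = 2000      # vehicles entering the tunnel in an hour
p = 0.0004    # per-vehicle breakdown probability (0.04%)

def binomial_pmf(k, n, p):
    """Exact Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

exact_p0 = binomial_pmf(0, n, p)            # no breakdowns
exact_p1 = binomial_pmf(1, n, p)            # exactly one breakdown
exact_p_ge_2 = 1 - (exact_p0 + exact_p1)    # at least two breakdowns

print(f"Exact P(X=0)  = {exact_p0:.6f}")
print(f"Exact P(X=1)  = {exact_p1:.6f}")
print(f"Exact P(X>=2) = {exact_p_ge_2:.6f}")
```

These exact values should agree with the Poisson figures used in this article to within roughly 1e-4.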
Poisson Approximation
For large n and small p, the Binomial(n, p) distribution can be approximated by a Poisson distribution with rate lambda = n*p. Here,
lambda = 2000 * 0.0004 = 0.8
Hence we approximate X by Poisson(0.8). The pmf of a Poisson random variable X with rate lambda is:
P(X=k) = (lambda^k * e^(-lambda)) / k!
where lambda is the expected number of events (in this context, breakdowns), and k is the number of occurrences.
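As a brief sketch of where the Poisson pmf comes from, substitute p = lambda/n into the Binomial pmf and let n grow with k held fixed. This is the standard limiting argument, shown here for intuition rather than as a rigorous proof:

```latex
\begin{aligned}
P(X = k) &= \binom{n}{k} p^k (1-p)^{n-k}, \qquad p = \frac{\lambda}{n} \\
&= \frac{n(n-1)\cdots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!}
   \cdot \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k} \\
&\xrightarrow[n \to \infty]{} 1 \cdot \frac{\lambda^k}{k!} \cdot e^{-\lambda} \cdot 1
 = \frac{\lambda^k e^{-\lambda}}{k!}
\end{aligned}
```

The first and last factors tend to 1 because k is fixed, and the middle factor tends to e^(-lambda) by the usual limit definition of the exponential.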
(a) Probability of no breakdowns
Using Poisson(0.8), the probability that X=0 is:
P(X=0) = (0.8^0 * e^(-0.8)) / 0! = e^(-0.8)
Numerically, e^(-0.8) is approximately 0.4493.
(b) Probability of at least two breakdowns
We can compute P(X>=2) as 1 - P(X<=1). This means:
P(X>=2) = 1 - [P(X=0) + P(X=1)]
We already have P(X=0)=0.4493. Then:
P(X=1) = (0.8^1 * e^(-0.8)) / 1! = 0.8 * e^(-0.8)
This is approximately 0.3595, so:
P(X=0) + P(X=1) ~ 0.4493 + 0.3595 = 0.8088
Therefore,
P(X>=2) = 1 - 0.8088 = 0.1912
Practical Implementation in Python
```python
import math

# Poisson parameter
lmbda = 0.8

# Probability of no breakdowns
p0 = math.exp(-lmbda)  # e^(-0.8)

# Probability of exactly 1 breakdown
p1 = lmbda * math.exp(-lmbda)  # 0.8 * e^(-0.8)

# Probability of at least 2 breakdowns
p_ge_2 = 1 - (p0 + p1)

print("P(X=0) =", p0)
print("P(X>=2) =", p_ge_2)
```
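The same calculation generalizes to "at least k breakdowns" for any k. Below is a small sketch with two hypothetical helpers, `poisson_pmf` and `poisson_at_least`, which are names chosen here for illustration:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_at_least(k, lam):
    """P(X >= k), computed as 1 minus the sum of P(X = j) for j < k."""
    return 1.0 - sum(poisson_pmf(j, lam) for j in range(k))

lam = 2000 * 0.0004  # expected breakdowns per hour

for k in range(4):
    print(f"P(X>={k}) = {poisson_at_least(k, lam):.4f}")
```

Summing the lower terms and subtracting from 1 is convenient for small k; for large k or large lambda, a library routine such as a survival function would usually be more numerically robust.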
Follow-up Questions
Why is the Poisson approximation valid here?
The Poisson approximation to the Binomial distribution works well when n is large and p is small, with np staying moderate (a common rule of thumb is np of about 10 or less). In this scenario, n=2000 and p=0.0004, giving np=0.8, so these conditions hold comfortably. The size of the approximation error is driven mainly by how small p is, and here p is very small.
Additionally, the Poisson assumption requires that each trial is effectively independent, and the probability of a breakdown does not change significantly across trials. This can be reasonable in many real-world cases, but if conditions in the tunnel changed over time, or if vehicles influenced each other’s breakdown probability, the assumption might become less accurate.
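One way to quantify "well-approximated" is to compute the total variation distance between the two distributions and compare it with Le Cam's bound, which for identical trials is n*p^2. Le Cam's inequality is not part of the original solution; it is brought in here as a standard result, and the sketch below simply checks it numerically for this problem.

```python
import math

n, p = 2000, 0.0004
lam = n * p

def binomial_pmf(k):
    """Exact Binomial(n, p) probability of k breakdowns."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    """Poisson(lam) probability of k breakdowns."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Total variation distance: half the sum of absolute pmf differences.
# Terms beyond k = 30 are negligible for lambda = 0.8, so the sum is truncated there.
K_MAX = 30
tv_distance = 0.5 * sum(abs(binomial_pmf(k) - poisson_pmf(k)) for k in range(K_MAX + 1))

# Le Cam's bound on the total variation distance for n identical trials.
le_cam_bound = n * p**2

print(f"Total variation distance ~ {tv_distance:.2e}")
print(f"Le Cam bound n*p^2       = {le_cam_bound:.2e}")
```

Because the bound is 2000 * 0.0004^2 = 3.2e-4, no event probability computed from Poisson(0.8) can differ from its exact Binomial value by more than that amount.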
When should I use the Binomial distribution instead of the Poisson?
If p is not very small, if n is not sufficiently large, or if you need precise estimates in the tails of the distribution, the Poisson approximation may no longer be suitable. In such cases, computing the Binomial probabilities exactly is preferable, and with modern software the exact computation is rarely expensive.
Could the Poisson approximation lead to underestimation or overestimation in certain ranges?
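Yes. The Poisson(np) distribution has variance np, slightly larger than the Binomial variance np(1-p), so it tends to put a little more mass on counts far from the mean and a little less on counts near it. The Poisson also assigns positive probability to counts above n, which the Binomial cannot produce. However, for very small p (such as 0.0004) and under standard conditions, the difference is usually negligible for practical purposes.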
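To see the direction of the error at each count, the two pmfs can be tabulated side by side. This is a small illustrative sketch for this problem's n and p; the column layout is an arbitrary choice.

```python
import math

n, p = 2000, 0.0004
lam = n * p

print(" k   binomial     poisson      poisson - binomial")
differences = []
for k in range(6):
    b = math.comb(n, k) * p**k * (1 - p) ** (n - k)    # exact Binomial pmf
    q = lam**k * math.exp(-lam) / math.factorial(k)    # Poisson pmf
    differences.append(q - b)
    print(f"{k:2d}   {b:.8f}   {q:.8f}   {q - b:+.2e}")
```

For these parameters the Poisson value should come out slightly above the exact one at k=0 and slightly below it at k=1, with every difference well under 1e-3.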
How to interpret the result in a real-world scenario?
A probability of roughly 0.4493 for zero breakdowns in an hour indicates that in about 45% of hours with this traffic level, you’ll see no breakdowns at all.
A probability of 0.1912 for at least two breakdowns suggests that on around 19% of the hours, there will be two or more breakdowns, potentially causing more significant traffic disruptions.
What if the breakdown probability changes over time or depends on other factors?
If breakdown probabilities vary, a single parameter p may not capture the changes accurately. You might need more complex models, such as non-homogeneous Poisson processes or time-dependent parameters. Also, if there is correlation among vehicle breakdowns (for example, one breakdown triggering others), the independence assumption would be violated, potentially invalidating both the Binomial and standard Poisson approaches.
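As one simple illustration of dependence on other factors, suppose the breakdown probability differs by vehicle class. If vehicles are still independent and every probability is small, the total count is approximately Poisson with rate equal to the sum of the individual probabilities. The class names, counts, and probabilities below are hypothetical values chosen for illustration, not data from the problem.

```python
import math

# Hypothetical vehicle classes: (vehicles per hour, breakdown probability).
# These numbers are illustrative assumptions only.
vehicle_classes = {
    "cars":   (1700, 0.0003),
    "trucks": (300,  0.0010),
}

# Rate of the approximating Poisson: sum of all individual breakdown probabilities.
lam = sum(count * prob for count, prob in vehicle_classes.values())

p0 = math.exp(-lam)               # P(no breakdowns)
p_ge_2 = 1 - p0 * (1 + lam)       # P(at least two) = 1 - P(0) - P(1)

print(f"lambda   = {lam:.2f}")
print(f"P(X=0)   = {p0:.4f}")
print(f"P(X>=2)  = {p_ge_2:.4f}")
```

This sketch handles unequal probabilities only; it does not address correlated breakdowns, which would need a different model.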
Could we use confidence intervals or hypothesis testing here?
In many real-world ML or data science tasks, you might also want to construct confidence intervals for the probability p if it is estimated from data. For large n, one might use a normal approximation to the Binomial or specialized methods to build a confidence interval around p. You might also test hypotheses about whether p is equal to some expected value, or about the tunnel environment changing the probability over time.
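As a sketch of one such specialized method, the Wilson score interval tends to behave better than the plain normal approximation when p is close to 0. The observation counts below are hypothetical, and `wilson_interval` is an illustrative helper written for this example.

```python
import math

# Hypothetical historical data (assumed for illustration only).
vehicles_observed = 50_000
breakdowns_observed = 20

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(breakdowns_observed, vehicles_observed)
print(f"Estimated p = {breakdowns_observed / vehicles_observed:.5f}")
print(f"Approximate 95% interval: ({low:.5f}, {high:.5f})")
```

The endpoints can then be multiplied by the hourly vehicle count to get a plausible range for lambda, which shows how uncertainty in p carries through to the breakdown probabilities.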
Are there any edge cases to watch out for?
If p were drastically overestimated or underestimated (for instance, due to poor sampling of breakdowns in previous hours), the calculated probabilities might be far from reality.
If the tunnel experiences seasonal or time-of-day effects, a constant rate assumption could be misleading.
Very large changes in traffic volume shift lambda = n*p directly, so the hourly probabilities change with it. Heavy congestion may also challenge the original assumptions about independence and a constant p, even though the Poisson approximation itself mainly requires p to stay small.
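A quick sensitivity check makes these edge cases concrete. The scenarios below are hypothetical what-if values, not figures from the problem, and the helper `tail_at_least_two` is named here for illustration.

```python
import math

def tail_at_least_two(n, p):
    """Poisson-approximate P(X >= 2) with lambda = n * p."""
    lam = n * p
    return 1 - math.exp(-lam) * (1 + lam)

baseline = tail_at_least_two(2000, 0.0004)

# Hypothetical what-if scenarios: (vehicles per hour, breakdown probability).
scenarios = {
    "baseline":          (2000, 0.0004),
    "p underestimated":  (2000, 0.0008),
    "rush-hour traffic": (4000, 0.0004),
    "off-peak traffic":  (500,  0.0004),
}

for name, (n, p) in scenarios.items():
    lam = n * p
    print(f"{name:18s} lambda={lam:.2f}  P(X>=2)={tail_at_least_two(n, p):.4f}")
```

Doubling either n or p doubles lambda to 1.6, which more than doubles the probability of at least two breakdowns, so modest errors in the inputs can matter a good deal for planning.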
These considerations are crucial in real-world ML and data science interviews to show depth of understanding of when assumptions hold and when they might break.