ML Interview Q Series: Normal Distribution: Probability within Standard Deviations Using Z-Score Standardization.
Suppose X is N(μ, σ²). For a = 1, 2, 3, find P(|X - μ| < aσ).
Short, Compact Solution
We have P(|X - μ| < aσ) = P(-a < Z < a) = 2Φ(a) - 1. Substituting a = 1, 2, 3, we get approximately 0.682, 0.954, and 0.997, respectively.
Comprehensive Explanation
Here,
X is a normal random variable with mean μ and variance σ².
Z is a standard normal random variable with mean 0 and variance 1.
a is the number of standard deviations we are considering around the mean.
Φ(a) is the cumulative distribution function (CDF) for the standard normal distribution evaluated at a.
In more detail:
We standardize the original variable X: if X ~ N(μ, σ²), then Z = (X - μ)/σ follows N(0, 1).
We want the probability that X is within aσ of the mean μ. This translates to: -a < (X - μ)/σ < a which is -a < Z < a in terms of Z.
The probability that a standard normal variable Z lies between -a and a is 2Φ(a) - 1.
For a = 1, 2, 3, these probabilities are very well-known:
a=1 ⇒ 2Φ(1) - 1 ≈ 2×0.8413 - 1 ≈ 0.6826
a=2 ⇒ 2Φ(2) - 1 ≈ 2×0.9772 - 1 ≈ 0.9544
a=3 ⇒ 2Φ(3) - 1 ≈ 2×0.99865 - 1 ≈ 0.9973
These numbers align closely with the classic 68–95–99.7 empirical rule, which states that approximately 68%, 95%, and 99.7% of the data lies within 1, 2, and 3 standard deviations of the mean for a normal distribution.
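To spell out the step behind the formula: by symmetry of the standard normal distribution, Φ(-a) = 1 - Φ(a), so
P(|X - μ| < aσ) = P(-a < Z < a) = Φ(a) - Φ(-a) = Φ(a) - (1 - Φ(a)) = 2Φ(a) - 1.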
Potential Follow-up Questions
How would one compute these probabilities in a practical setting using Python?
You can compute these probabilities by accessing the standard normal CDF from libraries such as scipy.stats. For instance:
from scipy.stats import norm
# Suppose we want the probability that X lies within a*sigma of its mean, for a = 1.
a = 1
prob = 2 * norm.cdf(a) - 1
print(prob)  # approximately 0.6826894921
If you know μ and σ from real-world data, you can similarly standardize any observation x via z = (x - μ)/σ and use norm.cdf(z) to estimate the probability or percentile of that observation.
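A minimal sketch of that step, where mu, sigma, and the observation x are made-up numbers purely for illustration:
from scipy.stats import norm
mu, sigma = 50.0, 10.0    # hypothetical population mean and standard deviation
x = 63.0                  # hypothetical observation
z = (x - mu) / sigma      # standardize the observation
percentile = norm.cdf(z)  # fraction of the distribution at or below x
print(z, percentile)      # z = 1.3, percentile ≈ 0.9032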
Why do these probabilities match the well-known 68–95–99.7 rule?
The 68–95–99.7 rule (also sometimes called the empirical rule) is a concise way of stating that for a perfectly normal distribution, about 68% of the observations lie within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. These approximate percentages come directly from the standard normal distribution table, which is exactly what we use when we compute 2Φ(a) - 1.
What happens if the distribution is not perfectly normal?
If the underlying distribution deviates from normality (for example, if it is skewed or heavy-tailed), then P(|X - μ| < aσ) might differ significantly from the standard normal-based values. The 68–95–99.7 rule would not necessarily hold, especially if there are outliers or a distinctly non-Gaussian shape. In practice, for large sample sizes (by the Central Limit Theorem), sample means often approach normality, but individual data points might not.
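As a small illustration of how far off the rule can be, the sketch below compares one-standard-deviation coverage for a skewed exponential distribution against the normal value; the exponential is just an assumed example, not something implied by the original question:
from scipy.stats import norm, expon
# For an exponential with scale=1, mean = std = 1, so |X - mean| < std means 0 < X < 2.
p_expon = expon.cdf(2) - expon.cdf(0)  # = 1 - exp(-2) ≈ 0.8647
p_norm = 2 * norm.cdf(1) - 1           # ≈ 0.6827 under normality
print(p_expon, p_norm)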
How do we handle the case where μ and σ are not known and need to be estimated from data?
In real scenarios, you might not know the true mean μ or the true standard deviation σ. Instead, you would estimate them from your sample:
Estimated mean (the sample mean): m = (1/n) * Σ x_i
Estimated standard deviation (the sample standard deviation): s = sqrt( (1/(n-1)) * Σ (x_i - m)² )
Then you replace μ with m and σ with s in the expressions. Strictly speaking, once the sample standard deviation s is used in place of σ, the standardized quantity is no longer exactly standard normal: for normally distributed data, (X - m)/(s·sqrt(1 + 1/n)) follows a t-distribution with n - 1 degrees of freedom. However, for large n the t-distribution is very close to the standard normal distribution, so 2Φ(a) - 1 remains a good approximation.
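A short sketch of that large-n point, using a small simulated sample (the data and sample size here are assumptions made purely for illustration):
import numpy as np
from scipy.stats import norm, t
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=30)  # hypothetical data
m = sample.mean()        # sample mean
s = sample.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)
n = len(sample)
a = 2
normal_coverage = 2 * norm.cdf(a) - 1    # ≈ 0.9545
t_coverage = 2 * t.cdf(a, df=n - 1) - 1  # slightly smaller for small n
print(m, s, normal_coverage, t_coverage)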
Are there edge cases or pitfalls?
One edge case is if σ = 0, which would happen if all data points are identical. In that degenerate situation, the normal distribution assumption breaks down completely, because you have no variation in the data. Another issue arises if the data is heavily outlier-prone or has a strong skew. In such cases, the normal-based approximation might mislead you about how much data lies near the mean.
How does this relate to confidence intervals?
The expression P(|X - μ| < aσ) can be interpreted in terms of a confidence region for X when X is normally distributed. For instance, 2Φ(a) - 1 is the coverage probability that the random variable X lies within aσ of the true mean. The concept is similar to confidence intervals for the mean itself, except that intervals for the mean often use the Central Limit Theorem and standard error (σ / sqrt(n)) when dealing with sample means. Nonetheless, the logic that relies on the standard normal distribution is similar.
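For contrast, here is a minimal sketch of a normal-approximation 95% confidence interval for the mean itself; the sample mean, standard deviation, and size below are made-up numbers:
from math import sqrt
from scipy.stats import norm
m, s, n = 10.2, 3.1, 200               # hypothetical sample mean, sample std, sample size
z = norm.ppf(0.975)                    # ≈ 1.96 for a two-sided 95% interval
half_width = z * s / sqrt(n)           # critical value times the standard error
print(m - half_width, m + half_width)  # approximate 95% CI for the true mean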
Could we generalize to any continuous distribution?
While these specific numerical probabilities (0.682, 0.954, 0.997) rely on normality, one can consider a more general question: “What fraction of data lies within a certain number of standard deviations of the mean?” Without normality assumptions, Chebyshev’s inequality guarantees that, for any distribution with finite mean and variance, at least a fraction 1 - 1/a² of the probability mass lies within a standard deviations of the mean, for any a > 1. However, that bound is usually far looser than the normal-based values, as the quick comparison below shows.
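A quick numeric comparison of the two, simply evaluating both formulas:
from scipy.stats import norm
for a in (2, 3):
    chebyshev_bound = 1 - 1 / a**2      # minimum guaranteed for any finite-variance distribution
    normal_value = 2 * norm.cdf(a) - 1  # exact value under normality
    print(a, chebyshev_bound, normal_value)
# a=2: 0.75 vs ≈ 0.9545;  a=3: ≈ 0.889 vs ≈ 0.9973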
Why is 2Φ(a) - 1 a “two-sided” probability?
The formula 2Φ(a) - 1 calculates the probability of lying in the interval [-a, a] under the standard normal curve. It is “two-sided” because it accounts for deviations both above and below the mean. In z-space, “above the mean” corresponds to positive z-values, and “below the mean” corresponds to negative z-values, so the probability is essentially the total probability from -a up to +a.
Are these values exact or approximations?
Mathematically, 2Φ(a) - 1 is an exact expression for P(-a < Z < a) if Z is exactly standard normal. However, the decimal values 0.682, 0.954, and 0.997 are rounded figures; more precise computation (for example, 2Φ(3) - 1 ≈ 0.9973002039) simply refines the decimal places.
How might you verify this rule on a dataset?
If you have a large dataset and believe it is approximately normally distributed, you could:
Compute the sample mean m and sample standard deviation s.
For each point x, compute the z-score (x - m)/s.
Count how many points lie within ±1, ±2, and ±3 z-scores.
Compare those empirical fractions to the theoretical values (0.682, 0.954, 0.997).
If the data truly resembles a normal distribution, you should see percentages close to those theoretical benchmarks.
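A hedged sketch of that procedure, using simulated normal data as a stand-in for a real dataset:
import numpy as np
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # stand-in for a real dataset
m = data.mean()
s = data.std(ddof=1)
z = (data - m) / s                 # z-score of every observation
for a in (1, 2, 3):
    frac = np.mean(np.abs(z) < a)  # empirical fraction within ±a standard deviations
    print(a, frac)                 # should be close to 0.683, 0.954, 0.997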