ML Interview Q Series: Probabilistic Tank Sizing: Determining Capacity via CDF and Percentile Analysis.
10E-4. Liquid waste produced by a factory is removed once a week. The weekly volume of waste (in thousands of gallons) is a continuous random variable with probability density function
f(x) = 105 x^4 (1 - x)^2 for 0 < x < 1, and f(x) = 0 otherwise. How should you choose the capacity of a storage tank so that the probability of overflow during a given week is no more than 5%?
Short Compact solution
The cumulative distribution function is found by integrating the PDF from 0 to x. This yields
F(x) = x^5 (15 x^2 - 35 x + 21) for 0 <= x <= 1.
To ensure that the probability of overflow is at most 5%, we need 1 - F(x) = 0.05, which implies F(x) = 0.95. Solving that equation numerically gives x = 0.8712. Therefore, the storage tank must have a capacity of 0.8712 thousand gallons so that the probability of overflow is at most 5%.
Comprehensive Explanation
Understanding the Problem Setting
We have a random variable X that denotes the weekly volume of liquid waste produced, measured in thousands of gallons. The PDF, f(x) = 105 x^4 (1 - x)^2, applies over the domain 0 < x < 1. Outside of this interval, the PDF is 0, indicating that the waste volume cannot exceed 1 thousand gallons or drop below 0 thousand gallons.
Deriving the Cumulative Distribution Function
A cumulative distribution function (CDF), denoted F(x), is the probability that the random variable X is less than or equal to x. By definition, for a continuous random variable:
F(x) = ∫ from 0 to x of f(t) dt.
Given f(t) = 105 t^4 (1 - t)^2, we have F(x) = ∫ from 0 to x of 105 t^4 (1 - t)^2 dt.
To perform this integration, expand t^4 (1 - t)^2 = t^4 - 2 t^5 + t^6 and integrate term by term:
F(x) = 105 (x^5/5 - x^6/3 + x^7/7) = 21 x^5 - 35 x^6 + 15 x^7 = x^5 (15 x^2 - 35 x + 21).
For 0 <= x <= 1, this expression provides us with the probability that the waste volume stays at or below x thousand gallons.
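As a quick sanity check, the closed form can be verified symbolically. The following is a minimal sketch that assumes the sympy library is available:

import sympy as sp

x, t = sp.symbols("x t", positive=True)
pdf = 105 * t**4 * (1 - t)**2

# Integrate the PDF from 0 to x and compare with the stated closed form.
cdf = sp.integrate(pdf, (t, 0, x))
closed_form = x**5 * (15 * x**2 - 35 * x + 21)

print(sp.simplify(cdf - closed_form))  # prints 0, so the two expressions agree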
Determining the Required Tank Capacity
We want the probability of exceeding the tank’s capacity to be no more than 5%. Mathematically, we say:
P(X > capacity) <= 0.05.
But since P(X > capacity) = 1 - F(capacity), we need:
1 - F(capacity) <= 0.05 ⟹ F(capacity) >= 0.95.
So we solve the equation F(x) = 0.95 for x in the interval (0, 1). Substituting the derived expression for F(x), we get:
x^5 (15 x^2 - 35 x + 21) = 0.95.
This equation can be solved with numerical approximation methods or a symbolic solver. The solution in (0, 1) is x ≈ 0.8712. Interpreted back in the problem context, X is measured in thousands of gallons, so the tank capacity that ensures a 5% or lower chance of overflowing is 0.8712 thousand gallons (i.e., 871.2 gallons).
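A minimal sketch of such a numerical solution, assuming SciPy is available and applying Brent's method to the closed-form CDF:

from scipy.optimize import brentq

def F(x):
    # closed-form CDF derived above
    return x**5 * (15 * x**2 - 35 * x + 21)

# F(0) - 0.95 < 0 and F(1) - 0.95 > 0, so a root exists in (0, 1)
capacity = brentq(lambda x: F(x) - 0.95, 0.0, 1.0)
print(capacity)  # approximately 0.8712 (thousand gallons)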
Why This Is a 95th Percentile (Quantile)
Because we want only a 5% chance that the weekly waste volume will exceed the tank capacity, the capacity must correspond to the 95th percentile (i.e., the value where 95% of the distribution is below that point).
Practical Considerations
Safety Margins: In real-world settings, factories might include additional margins on top of the 95th percentile to account for uncertainties in measurement, changes in production processes, or maintenance issues.
Numerical Solution: While here the integral of the PDF provides a closed-form solution, for more complex PDFs (or distributions without straightforward antiderivatives), you might need numerical integration and root-finding techniques (e.g., Newton-Raphson, bisection method) to locate the capacity.
Interpretation of Units: Always be mindful that the problem statement works in “thousands of gallons.” A capacity of 0.8712 in those units corresponds to 871.2 gallons in standard units.
Potential Follow-up Question 1
Could we have recognized this distribution as a special case of the Beta distribution?
Yes. The PDF has the Beta form x^(α-1) (1 - x)^(β-1) with α = 5 and β = 3; that is, it is proportional to x^4 (1 - x)^2. The normalizing constant for Beta(α, β) is Γ(α+β)/[Γ(α)Γ(β)] = 1/B(α, β), which for α = 5 and β = 3 equals 7!/(4!·2!) = 5040/48 = 105. The factor 105 is therefore exactly the constant that makes the total integral equal 1; in other words, 105 = 1/B(5, 3). Recognizing the distribution as Beta(5, 3) means the CDF is the regularized incomplete Beta function, and the 95th percentile can be found from tables or standard computational routines for Beta distributions.
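A minimal check of this equivalence, assuming SciPy is available:

from scipy.stats import beta

# 95th percentile of Beta(5, 3); agrees with the value found above (about 0.8712)
print(beta.ppf(0.95, 5, 3))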
Potential Follow-up Question 2
What if the required overflow probability threshold changes?
If management decides that the probability of overflow should be less than 1% instead of 5%, you would solve:
F(x) = 0.99
instead of 0.95. The exact same principle applies; you just target a different percentile, known as the 99th percentile in that case.
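For instance, reusing the Beta(5, 3) identification from the previous follow-up (a minimal sketch, assuming SciPy is available):

from scipy.stats import beta

# 99th percentile instead of the 95th: capacity for at most a 1% weekly overflow probability
print(beta.ppf(0.99, 5, 3))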
Potential Follow-up Question 3
How would you solve for x if the integral did not simplify nicely?
In many practical situations, the integral of f(x) might not have a nice closed-form expression. In such cases, you can use:
Numerical integration to evaluate F(x).
Root-finding algorithms (bisection method, secant method, or Newton-Raphson) to solve F(x) = 0.95.
Python’s scipy library, for instance, provides scipy.integrate.quad for integration and scipy.optimize.fsolve or scipy.optimize.root_scalar for root-finding.
A straightforward approach would be:
from scipy.integrate import quad
from scipy.optimize import bisect

def pdf(x):
    # PDF of the weekly waste volume (in thousands of gallons)
    return 105 * x**4 * (1 - x)**2 if 0 < x < 1 else 0.0

def cdf(x):
    # numerical integration of the PDF from 0 to x
    val, _ = quad(pdf, 0, x)
    return val

def find_capacity(target=0.95):
    # we want cdf(x) = target, i.e., the root of cdf(x) - target
    def froot(x):
        return cdf(x) - target
    # use bisection on [0, 1]
    return bisect(froot, 0, 1)

x_sol = find_capacity(0.95)
print("Capacity for 95% coverage:", x_sol)
This approach works even if the PDF does not have a closed-form CDF.
Potential Follow-up Question 4
What happens if the factory’s production process changes and X might exceed 1 thousand gallons?
The PDF given is strictly 0 outside the interval (0,1). If the actual waste volume could become larger than 1 thousand gallons because of changes, then this model would no longer be accurate. In practice, you would:
Gather new data reflecting the updated production processes.
Re-estimate the distribution parameters or form.
Derive a new PDF or at least compute empirical or parametric estimates for the tail probabilities.
Re-calculate the capacity to ensure that overflow probability remains at or below 5%.
This step highlights the importance of continuous monitoring and re-estimating statistical models as processes evolve over time.
Below are additional follow-up questions
How would you handle uncertainty in the PDF parameters if they were estimated from historical data?
In real-world settings, the true PDF might not be exactly known and is typically estimated from sample data. When you estimate parameters (like the Beta distribution parameters) from a finite sample, there is statistical uncertainty around these estimates. This uncertainty can translate into uncertainty around the 95th percentile. A few methods to address this are:
Confidence Intervals for Quantiles: Instead of a single point estimate of the 95th percentile, you can compute a confidence interval to see how the estimate might vary. Methods include bootstrapping or asymptotic approximations. Then you can pick a capacity that safely covers the upper range of this interval to reduce the risk of underestimating the true 95th percentile.
Bayesian Approach: You can place prior distributions on the parameters of the PDF. After observing data, you update the priors to posteriors and then get a posterior distribution on the 95th percentile. This approach can incorporate prior knowledge and produce a full distribution of the quantile estimate.
Sensitivity Analyses: You might assume multiple plausible PDFs (e.g., different parameter sets) and see how the 95th percentile changes. This process helps you understand potential worst-case scenarios if your PDF assumptions are slightly off.
A potential pitfall is failing to account for the inherent sampling variability in the PDF estimation. If you base your capacity purely on a single best-fit PDF without considering parameter uncertainty, you might under-provision for random weeks that deviate from the historical pattern.
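As a rough illustration of the bootstrap approach from the first bullet above, here is a minimal sketch in which synthetic Beta(5, 3) draws stand in for real historical measurements (assuming NumPy is available):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "historical" weekly volumes; in practice these would be real measurements.
data = rng.beta(5, 3, size=200)

# Bootstrap the 95th percentile to gauge its sampling variability.
boot_q95 = np.array([
    np.quantile(rng.choice(data, size=len(data), replace=True), 0.95)
    for _ in range(2000)
])

lo, hi = np.quantile(boot_q95, [0.025, 0.975])
print("Point estimate of the 95th percentile:", np.quantile(data, 0.95))
print("Approximate 95% bootstrap interval:", (lo, hi))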
What if the factory experiences seasonal variations or weekly correlations in waste volume?
Real-world processes often exhibit seasonality or trends over time (e.g., higher production in certain months), and this can lead to variations that a single static PDF may fail to capture. Additionally, the waste volume produced in consecutive weeks might be correlated (e.g., a high production week may carry over certain conditions to the next week).
To address these complexities:
Seasonal Models: Rather than using a single PDF for all weeks, you can partition the data into seasonal segments (e.g., summer vs. winter) and estimate separate probability distributions for each segment. Then, for each season, compute the 95th percentile.
Time-Series Methods: If volumes from week to week show correlations, consider models such as ARIMA (AutoRegressive Integrated Moving Average) or state-space models, which explicitly capture autocorrelation and trends. You can forecast the next week’s volume and derive predictive intervals that incorporate the historical correlation structure.
Dynamic Capacity Planning: If the seasonality is strong, you might use a different capacity in different periods or augment your storage with a flexible arrangement (e.g., temporary tanks) during peak seasons.
A big pitfall is applying a single distribution year-round when in reality the waste volume follows different patterns depending on external factors (holidays, climate, supply chain issues). Ignoring time-dependent aspects can lead to chronic underestimation or overestimation of capacity needs in certain periods.
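A minimal sketch of the seasonal-split idea, using a hypothetical weekly log with simulated volumes and calendar quarters as a stand-in for seasons (assuming pandas and NumPy are available):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical two years of weekly volumes; real data would replace the simulated column.
df = pd.DataFrame({
    "week_start": pd.date_range("2023-01-01", periods=104, freq="W"),
    "volume": rng.beta(5, 3, size=104),
})

# Label each week with a "season" (calendar quarter here) and compute a per-season 95th percentile.
df["season"] = df["week_start"].dt.quarter
print(df.groupby("season")["volume"].quantile(0.95))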
How do you account for the financial trade-offs between tank capacity and the cost of overflow?
When deciding on capacity, there is usually a cost associated with having a bigger tank (capital expenses, maintenance, and space requirements). On the other hand, there is also a penalty or cost associated with overflowing (potential environmental fines, cleanup costs, and reputational damage). In an industrial context, you can formulate this as an optimization problem:
Expected Cost of Overflow: Suppose each overflow event costs a certain penalty. Multiply that penalty by the probability of overflow in a given week (and further multiply by the number of weeks in the tank’s lifetime).
Total Cost: Combine the capital cost of the tank (which increases with capacity) with the expected overflow cost. Then choose a capacity that minimizes total expected cost.
Risk Tolerance: Different companies have different tolerances for risk. Some might be willing to pay extra for a near-zero probability of overflow (e.g., 99.99th percentile) while others may accept a higher chance of overflow if the penalty is not too large.
An important pitfall here is failing to systematically quantify the cost trade-offs. If you only look at a single percentile constraint without also analyzing cost–benefit considerations, you might end up with either over-capacity or under-capacity relative to the company’s financial and operational priorities.
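A minimal sketch of this cost trade-off, in which the linear capital-cost model, the overflow penalty, and the planning horizon are purely hypothetical numbers and the Beta(5, 3) tail from earlier supplies the weekly overflow probability:

import numpy as np
from scipy.stats import beta

capacities = np.linspace(0.5, 1.0, 501)          # candidate capacities (thousand gallons)
overflow_prob = 1 - beta.cdf(capacities, 5, 3)   # weekly overflow probability at each capacity

capital_cost = 50_000 * capacities               # hypothetical: tank cost grows with capacity
overflow_penalty = 20_000                        # hypothetical cost per overflow event
weeks = 52 * 10                                  # hypothetical 10-year horizon

expected_total_cost = capital_cost + overflow_penalty * overflow_prob * weeks
best = capacities[np.argmin(expected_total_cost)]
print("Cost-minimizing capacity (thousand gallons):", best)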
Is it valid to apply Chebyshev’s or Markov’s inequality as a shortcut to estimate the required capacity?
Markov’s and Chebyshev’s inequalities can give (usually quite loose) bounds on tail probabilities without needing the exact distribution. For instance:
Markov’s Inequality: P(X >= a) <= E(X) / a if X >= 0.
Chebyshev’s Inequality: P(|X - mean| >= k * std) <= 1/k^2.
While these might provide quick estimates, they are often too conservative for practical engineering decisions. In this problem, we have an explicit PDF, so exact computations of the tail probability are far more accurate.
A common pitfall is relying on these inequalities to set capacity in complex situations when one does not have a precise distribution. It can lead to substantial overestimation of capacity requirements because these inequalities are upper bounds and not tight for many distributions.
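A concrete illustration of how loose these bounds are here, using the exact Beta(5, 3) mean and variance (a minimal sketch, assuming NumPy and SciPy are available):

import numpy as np
from scipy.stats import beta

mean, var = beta.stats(5, 3, moments="mv")   # mean = 0.625, variance ~ 0.026
std = np.sqrt(var)

# Markov: P(X >= a) <= mean / a. Forcing the bound down to 0.05 requires a = mean / 0.05.
print("Markov-based capacity:", mean / 0.05)                    # 12.5, far beyond the support [0, 1]

# Chebyshev: P(|X - mean| >= k*std) <= 1/k^2. Setting 1/k^2 = 0.05 gives k = sqrt(20).
print("Chebyshev-based capacity:", mean + np.sqrt(20) * std)    # ~1.35, also outside [0, 1]

print("Exact 95th percentile:", beta.ppf(0.95, 5, 3))           # ~0.8712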
How would you verify that your chosen PDF and computed quantile align with real operational data?
Even if the theoretical derivation is mathematically correct, it’s crucial to validate the PDF and the resulting quantile estimates with real-world data:
Back-testing: Collect weekly volumes over several months or years and compare the observed frequency of exceeding the chosen capacity with the targeted 5%. If the actual overflow rate is consistently above 5%, it indicates the model or parameters might be inaccurate.
Goodness-of-Fit Tests: Conduct statistical tests (e.g., Kolmogorov–Smirnov test, Anderson–Darling test) to see how well your chosen distribution fits the empirical data. If the p-values are too low, reconsider the distributional assumptions.
Residual Analysis: If you create a predictive time-series or regression model for X, analyze the residuals (difference between predicted and actual volumes) to see if they appear random with no obvious patterns over time.
A subtle pitfall is to assume that once a PDF is found to be a good fit for historical data, it will remain so indefinitely. Industrial processes change, and you need ongoing monitoring to detect drift in the distribution.
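A minimal goodness-of-fit sketch along these lines, where synthetic draws stand in for real weekly measurements (assuming SciPy is available):

import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(42)
observed = rng.beta(5, 3, size=150)   # hypothetical observed weekly volumes

# Kolmogorov-Smirnov test of the observations against the assumed Beta(5, 3) model
result = kstest(observed, beta(5, 3).cdf)
print("KS statistic:", result.statistic, "p-value:", result.pvalue)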
Could we address the question using a non-parametric approach instead of assuming a Beta-type PDF?
Yes. If you do not want to assume a specific functional form (like the Beta distribution), you can use:
Empirical CDF: Rank the observed values in historical data from smallest to largest, and estimate the 95th percentile directly from the empirical quantile. This is the simplest approach but can be limited by the size of your dataset.
Kernel Density Estimation (KDE): Use KDE to get a smooth approximation of the probability density function from historical data. Then numerically integrate to find the 95th percentile.
Quantile Regression: If you have explanatory variables (e.g., temperature, production shifts), you can directly model the 95th percentile as a function of those covariates, making no strong assumptions on the error distribution.
Pitfalls include needing a sufficiently large dataset for reliable empirical or non-parametric methods. You may also face issues if the data is highly skewed or if extreme values have not been observed (creating an underestimation of tail risk).
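A minimal non-parametric sketch of the first two ideas above, again with synthetic data standing in for real measurements (assuming NumPy and SciPy are available):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
volumes = rng.beta(5, 3, size=300)    # hypothetical historical weekly volumes

# Empirical 95th percentile: no distributional assumption at all.
print("Empirical 95th percentile:", np.quantile(volumes, 0.95))

# KDE-based estimate: smooth the empirical distribution, then invert its CDF numerically.
kde = gaussian_kde(volumes)
grid = np.linspace(0, 1, 2001)
cdf = np.cumsum(kde(grid))
cdf /= cdf[-1]
print("KDE-based 95th percentile:", grid[np.searchsorted(cdf, 0.95)])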
How would you generalize this approach if you wanted to plan multiple tanks for waste separation or different waste streams?
Sometimes factories have more than one waste stream with different chemical properties or disposal schedules:
Independent Streams: If each waste stream is independent, you might calculate the 95th percentile separately for each one. Then, design separate tanks each sized to keep that stream’s overflow probability below 5%.
Correlated Streams: If the volumes of different streams are correlated (e.g., if an upstream process surge creates a surge in multiple waste products simultaneously), you must consider the joint probability distribution. Instead of a single scalar capacity, you need a multidimensional approach or sum the correlated streams to get an overall capacity requirement.
Aggregate Risk Management: In some cases, you might route multiple waste streams to a single large tank, which changes the distribution of total volume. You would then model X = X1 + X2 + … + Xn, summing random variables from different waste streams, and compute the distribution’s 95th percentile for that sum.
A subtle challenge arises if the processes for each waste stream are not well understood or have different degrees of variability. Merging them incorrectly might result in either over-provisioning or an unexpected overflow if their peak times coincide.
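A minimal Monte Carlo sketch for the correlated-streams case; the two Beta marginals and the Gaussian-copula correlation of 0.6 below are purely illustrative assumptions:

import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(3)
n = 100_000

# Correlated uniforms via a Gaussian copula, then mapped to each stream's Beta marginal.
rho = 0.6
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
u = norm.cdf(z)
stream1 = beta.ppf(u[:, 0], 5, 3)     # the waste stream modeled in this question
stream2 = beta.ppf(u[:, 1], 4, 6)     # a hypothetical second waste stream

total = stream1 + stream2
print("95th percentile of the combined volume:", np.quantile(total, 0.95))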
How do you handle extremely rare but catastrophic overflow events (beyond the usual 95th percentile planning)?
Some industries must plan for low-probability but high-impact events. For example, chemical spills can have severe environmental and legal consequences. Even if the 95th percentile is well below capacity, there might be a 1% or 0.1% event that is catastrophic. To address these:
Extreme Value Theory (EVT): Instead of focusing on the bulk of the distribution, you specifically model the tail (upper extremes) with distributions like the Generalized Pareto Distribution. This approach is better suited for predicting very high thresholds (e.g., the 99.9th percentile).
Scenario Planning: Identify worst-case scenarios (equipment failures, supply chain disruptions, sudden surges in production). Estimate their probability or treat them as “worst-case deterministic” scenarios needing special emergency handling plans (secondary containment, overflow alarms, etc.).
Risk Appetite: Senior management might decide that certain catastrophic events, even at 0.1% probability, are unacceptable, prompting design changes or additional safety measures.
A pitfall is to over-focus on typical weekly variations while neglecting the possibility of rare outliers that exceed even the 99th percentile. Such an event, though extremely unlikely, may be too risky to ignore depending on regulatory and safety requirements.
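A minimal peaks-over-threshold sketch with the Generalized Pareto Distribution; the synthetic data and the choice of threshold are illustrative assumptions (assuming SciPy is available):

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(11)
volumes = rng.beta(5, 3, size=2000)      # hypothetical history of weekly volumes

# Peaks over threshold: model only the exceedances above a high threshold.
threshold = np.quantile(volumes, 0.90)
exceedances = volumes[volumes > threshold] - threshold

# Fit a Generalized Pareto Distribution to the exceedances (location fixed at 0).
shape, loc, scale = genpareto.fit(exceedances, floc=0)

# Tail-based estimate of the 99.9th percentile of the weekly volume:
# P(X > x) = P(X > threshold) * (1 - GPD_cdf(x - threshold)) = 0.001.
p_exceed = len(exceedances) / len(volumes)
q = threshold + genpareto.ppf(1 - 0.001 / p_exceed, shape, loc=0, scale=scale)
print("EVT-based 99.9th percentile estimate:", q)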