ML Interview Q Series: Expected Cost Calculation for Waste Overflow Using Probability Density Functions.
10E-5. Consider the following problem.
Liquid waste produced by a factory is removed once a week. The weekly volume of waste (in thousands of gallons) is a continuous random variable X with probability density function
f(x) = 105 x^4 (1 - x)^2 for 0 < x < 1, and f(x) = 0 otherwise. We wish to choose a storage tank capacity so that the probability of overflow during a given week is at most 5%. Assume the chosen capacity is 0.9 (in thousands of gallons). The weekly cost can then be written as a piecewise function of x (the volume of waste):
If the volume does not exceed 0.9 (i.e., 0 < x < 0.9), the cost is 1.25 + 0.5x.
If the volume exceeds 0.9 (i.e., 0.9 < x < 1), there is an additional overflow penalty of 5 + 10z, where z = x - 0.9 is the overflow volume.
Hence the cost function g(x) is:
$$g(x) =
\begin{cases}
1.25 + 0.5\,x, & 0 < x < 0.9,\\[6pt]
1.25 + 0.5 \times 0.9 + 5 + 10\,(x - 0.9), & 0.9 < x < 1,\\[6pt]
0, & \text{otherwise}.
\end{cases}$$
What are the expected value and the standard deviation of the weekly costs under this cost function?
Short Compact solution
The random variable for weekly cost is Y = g(X). Its expectation is computed via
E[Y] = ∫(0 to 1) g(x) f(x) dx.
A direct piecewise evaluation of the integral yields
Expected cost: E[Y] ≈ 1.69747.
To find the standard deviation, we first compute the second moment
E[Y²] = ∫(0 to 1) (g(x))² f(x) dx,
which evaluates to 3.62044. Then
Var(Y) = E[Y²] - (E[Y])² = 3.62044 - (1.69747)² ≈ 0.7390, and taking the square root yields a standard deviation of approximately 0.8597.
Comprehensive Explanation
Distribution of X
We have X ~ a continuous distribution on the interval (0,1) with probability density function (PDF)
f(x) = 105 x^4 (1 - x)^2 for 0 < x < 1, and f(x) = 0 otherwise.
This is exactly a Beta(5,3) distribution: the normalizing constant 105 = 1/B(5,3) = Γ(8)/(Γ(5)Γ(3)) ensures the PDF integrates to 1 over x in (0,1). One can verify quickly:
∫[0 to 1] 105 x^4 (1 - x)^2 dx = 1
ensuring that f(x) is valid.
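For a quick numerical check of this normalization, here is a minimal sketch using SciPy's quad:

from scipy.integrate import quad

# The density should integrate to 1 over (0, 1), up to quadrature error.
total, _ = quad(lambda x: 105 * x**4 * (1 - x)**2, 0, 1)
print(total)  # ~1.0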
Cost Function g(x)
We define Y = g(X) as the weekly cost. The capacity of the tank is 0.9 (thousands of gallons). Hence:
For 0 < x < 0.9, there is no overflow, and the cost is 1.25 + 0.5 x.
For 0.9 < x < 1, there is an overflow of (x - 0.9). The cost includes both the usual removal cost (1.25 + 0.5 × 0.9) and the overflow penalty (5 + 10(x - 0.9)). Thus, for x > 0.9, g(x) = 1.25 + 0.45 + 5 + 10(x - 0.9) = 10x - 2.3.
For x outside (0,1), the cost is zero, but X never falls outside (0,1) under this PDF.
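For example, a typical week with x = 0.5 costs g(0.5) = 1.25 + 0.5(0.5) = 1.50, while an overflow week with x = 0.95 costs g(0.95) = 1.25 + 0.45 + 5 + 10(0.05) = 7.20, which is the same as 10(0.95) - 2.3.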
Computing the Expectation E[Y]
We use the integral E[Y] = ∫(0 to 1) g(x) f(x) dx.
Because f(x) = 105 x^4 (1 - x)^2 for 0<x<1, we write
E[Y] = 105 ∫(0 to 1) g(x) x^4 (1 - x)^2 dx.
However, g(x) is piecewise:
g₁(x) = 1.25 + 0.5 x for 0<x<0.9,
g₂(x) = 10x - 2.3 for 0.9<x<1.
Hence, E[Y] = 105 [ ∫(0 to 0.9) (1.25 + 0.5 x) x^4 (1 - x)^2 dx + ∫(0.9 to 1) (10x - 2.3) x^4 (1 - x)^2 dx ].
Evaluating these two integrals (whether by analytical methods or numerical integration) yields approximately 1.69747.
Computing the Second Moment E[Y²]
We also need E[Y²] = ∫(0 to 1) (g(x))² f(x) dx = 105 ∫(0 to 1) (g(x))² x^4 (1 - x)^2 dx.
Again split the interval:
For 0<x<0.9, (g₁(x))² = (1.25 + 0.5 x)².
For 0.9<x<1, (g₂(x))² = (10x - 2.3)².
Numerical or symbolic integration yields E[Y²] ≈ 3.62044.
Variance and Standard Deviation
Variance(Y) = E[Y²] - (E[Y])² = 3.62044 - (1.69747)². Numerically,
(1.69747)² ≈ 2.8814, so Var(Y) ≈ 3.62044 - 2.8814 = 0.7390, and the standard deviation is √0.7390 ≈ 0.8597, consistent with the value reported in the compact solution.
Hence,
E[Y] ≈ 1.69747,
std(Y) ≈ 0.8597.
Potential Follow-up Questions
1) How would you confirm that the PDF f(x) is valid?
You could verify ∫(0 to 1) 105 x^4 (1 - x)^2 dx = 1. This can be done by recognizing it as a Beta function form or by straightforward integration. Concretely,
∫(0 to 1) 105 x^4 (1 - x)^2 dx = 105 × ( (Γ(5)Γ(3)) / Γ(8) ) in Beta function terms, which indeed equals 1.
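The same check in code, as a one-line sketch using scipy.special.beta:

from scipy.special import beta

# B(5, 3) = Γ(5)Γ(3)/Γ(8); multiplying by the normalizing constant 105 should give 1.
print(105 * beta(5, 3))  # ~1.0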
2) Could you have computed these integrals analytically rather than numerically?
Yes, for piecewise polynomials times x^4(1-x)^2, one could expand and integrate term by term. It is somewhat tedious, but it can be done exactly via symbolic integration. In practice, most interviewees would do a numeric approach or rely on a symbolic tool to handle the integrals.
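As an illustration, a minimal SymPy sketch of the exact computation (variable names are illustrative) might look like this:

import sympy as sp

x = sp.symbols('x')
f = 105 * x**4 * (1 - x)**2
g1 = sp.Rational(5, 4) + x / 2        # 1.25 + 0.5x on (0, 0.9)
g2 = 10 * x - sp.Rational(23, 10)     # 10x - 2.3 on (0.9, 1)
c = sp.Rational(9, 10)

E_Y  = sp.integrate(g1 * f, (x, 0, c)) + sp.integrate(g2 * f, (x, c, 1))
E_Y2 = sp.integrate(g1**2 * f, (x, 0, c)) + sp.integrate(g2**2 * f, (x, c, 1))

print(sp.N(E_Y), sp.N(sp.sqrt(E_Y2 - E_Y**2)))  # ~1.69747 and ~0.8597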
3) How might you implement this solution in Python?
You could perform numeric integration with libraries like scipy.integrate.quad:
import numpy as np
from scipy.integrate import quad

def f(x):
    # Density of the weekly waste volume: Beta(5, 3), i.e. 105 x^4 (1 - x)^2 on (0, 1).
    return 105 * x**4 * (1 - x)**2 if 0 < x < 1 else 0.0

def g(x):
    # Piecewise weekly cost: base removal cost below capacity, overflow penalty above it.
    if 0 < x <= 0.9:
        return 1.25 + 0.5 * x
    elif 0.9 < x < 1:
        return 10 * x - 2.3  # same as: 1.25 + 0.5*0.9 + 5 + 10*(x - 0.9)
    else:
        return 0.0

def integrand_for_EY(x):
    return g(x) * f(x)

def integrand_for_EY2(x):
    return g(x)**2 * f(x)

# points=[0.9] tells quad where the integrand jumps, improving accuracy.
E_Y, _ = quad(integrand_for_EY, 0, 1, points=[0.9])
E_Y2, _ = quad(integrand_for_EY2, 0, 1, points=[0.9])

var_Y = E_Y2 - E_Y**2
std_Y = np.sqrt(var_Y)
print("E[Y] =", E_Y)      # ~1.69747
print("Std(Y) =", std_Y)  # ~0.8597
4) How does changing the storage tank capacity affect these costs?
If you reduce capacity below 0.9, overflow happens more frequently, so the penalty portion (5 + 10z) is triggered more often, increasing the expected cost (though the base removal cost might be smaller if the waste volume is typically large).
Increasing the capacity above 0.9 might reduce overflow penalty but could change other logistical costs. The problem’s statement fixes capacity at 0.9, likely from the earlier requirement that overflow probability must not exceed 5%.
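To make the capacity trade-off concrete, the expectation can be recomputed for other capacities c; the sketch below assumes the same base-cost and penalty structure as the original problem, which would need to be confirmed for a real re-design:

from scipy.integrate import quad

def expected_cost(c):
    # Expected weekly cost if the tank capacity is c (thousands of gallons), assuming
    # the same cost structure: 1.25 + 0.5x below capacity, plus 5 + 10(x - c) on overflow.
    f = lambda x: 105 * x**4 * (1 - x)**2
    def g(x):
        if x <= c:
            return 1.25 + 0.5 * x
        return 1.25 + 0.5 * c + 5 + 10 * (x - c)
    val, _ = quad(lambda x: g(x) * f(x), 0, 1, points=[c])
    return val

for c in (0.8, 0.85, 0.9, 0.95):
    print(c, round(expected_cost(c), 5))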
5) Could you directly verify that overflow probability is ≤ 5% at capacity 0.9?
Yes, check P(X > 0.9) = ∫(0.9 to 1) 105 x^4 (1-x)^2 dx. The CDF of X is F(x) = 21x^5 - 35x^6 + 15x^7 on (0,1), so P(X > 0.9) = 1 - F(0.9) ≈ 1 - 0.9743 = 0.026, comfortably below 5%. That is how we confirm that a capacity of 0.9 keeps the chance of overflow at or below 5%.
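In code, a quick sketch checking the tail either by direct integration or via the Beta(5, 3) tail (regularized incomplete beta function):

from scipy.integrate import quad
from scipy.special import betainc

# Direct integration of the density's tail beyond the 0.9 capacity.
p_tail, _ = quad(lambda x: 105 * x**4 * (1 - x)**2, 0.9, 1)

# Equivalent: 1 minus the Beta(5, 3) CDF at 0.9.
p_beta = 1 - betainc(5, 3, 0.9)

print(p_tail, p_beta)  # both ~0.026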
6) What if the cost function or distribution changed?
You would redo the same steps:
Adjust g(x) to reflect the new cost structure.
Recompute E[Y] and E[Y²] via integrals with the updated PDF.
The method remains identical; only the piecewise function or the PDF’s parameters differ.
These considerations confirm the depth of your understanding of continuous distributions, expected values, and piecewise cost functions in real-world industrial processes.
Below are additional follow-up questions
How would you handle a situation where the actual waste production distribution changes over time, potentially invalidating the assumption that X follows 105 x^4 (1-x)^2?
One subtle but realistic scenario is when the industrial process evolves or seasonal fluctuations change the waste volume distribution. If the actual distribution drifts away from 105 x^4 (1-x)^2, your computed expectation and overflow probability estimates become unreliable.
In practice, you would:
Collect new sample data on waste volumes periodically (e.g., every few months).
Fit a new distribution or re-estimate the parameters of your assumed model.
Re-evaluate the 5% overflow capacity requirement and the expected/variance of the weekly cost based on the updated distribution.
If significant drift is detected, you might need to adjust the storage capacity or re-negotiate service contracts.
A common pitfall is failing to monitor whether the distribution’s parameters remain stable, leading to inaccurate cost or risk estimates over time.
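As one concrete (and purely illustrative) way to re-estimate the model, you could refit a Beta distribution to recent weekly volumes with scipy.stats; the data values below are hypothetical:

import numpy as np
from scipy import stats

recent_volumes = np.array([0.62, 0.71, 0.55, 0.83, 0.64, 0.77, 0.58, 0.69])  # hypothetical observations

# Fit a Beta distribution on (0, 1), fixing loc=0 and scale=1 since volumes lie in (0, 1).
a_hat, b_hat, _, _ = stats.beta.fit(recent_volumes, floc=0, fscale=1)

# Re-check the overflow probability at the current capacity of 0.9 under the refitted model.
p_overflow = stats.beta.sf(0.9, a_hat, b_hat)
print(a_hat, b_hat, p_overflow)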
What if the penalty cost function changes to a different structure, for instance a fixed cost penalty only beyond a certain threshold, or a nonlinear relationship in overflow?
In many real-world settings, cost structures are not always linear in the overflow volume. You might have step functions (e.g., a flat surcharge whenever there is any overflow, plus some more complex penalty for large overflows) or concave/convex nonlinear surcharges.
To adapt, you simply redefine g(x) accordingly and recalculate:
The new cost function piecewise for different intervals of x.
Integrate g(x) f(x) for E[Y].
Integrate (g(x))^2 f(x) for E[Y²].
Derive Var(Y) and std(Y).
A key pitfall is forgetting to re-implement those piecewise transitions correctly in code or symbolic integration, especially if there are multiple “breakpoints.” One must carefully stitch together each segment to capture the correct cost in every overflow scenario.
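As one hypothetical example of such a redefinition, consider a flat surcharge on any overflow plus a quadratic penalty for larger overflows; only g changes, and the two integrals are recomputed exactly as before:

from scipy.integrate import quad

def g_alt(x, capacity=0.9, flat_fee=5.0, quad_rate=25.0):
    # Hypothetical alternative cost: flat surcharge on any overflow plus a quadratic penalty.
    base = 1.25 + 0.5 * min(x, capacity)
    if x <= capacity:
        return base
    z = x - capacity
    return base + flat_fee + quad_rate * z**2

f = lambda x: 105 * x**4 * (1 - x)**2
E_Y,  _ = quad(lambda x: g_alt(x) * f(x), 0, 1, points=[0.9])
E_Y2, _ = quad(lambda x: g_alt(x)**2 * f(x), 0, 1, points=[0.9])
print(E_Y, E_Y2 - E_Y**2)  # expected cost and variance under the alternative structure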
Could there be numerical stability issues or edge cases in the integration when x is very close to 0 or 1?
Indeed, x^4(1-x)^2 can become very small near 0 or 1, and if you are integrating numerically, floating-point precision might be a concern. While the PDF itself is well-behaved on (0,1), you may still encounter:
Very small floating-point values near the boundaries (0 or 1).
Rounding errors that can accumulate if the integrator algorithm or the step sizes are not chosen carefully.
To mitigate such issues:
Use adaptive quadrature methods in libraries like scipy.integrate.quad, which handle the endpoints carefully.
If performing manual numerical integration (e.g., Riemann sums), choose a mesh that is finer near the boundaries or apply a transformation (such as the logit transform) to reduce boundary effects.
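For this particular problem, the more practically relevant issue is the jump in g(x) at x = 0.9 rather than the endpoints; a simple safeguard is to pass the breakpoint to quad via its points argument, as in this sketch:

from scipy.integrate import quad

# Cost times density; the cost has a jump at x = 0.9.
integrand = lambda x: (10 * x - 2.3 if x > 0.9 else 1.25 + 0.5 * x) * 105 * x**4 * (1 - x)**2

# points=[0.9] tells the adaptive routine where the integrand is non-smooth.
E_Y, err = quad(integrand, 0, 1, points=[0.9])
print(E_Y, err)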
How would you estimate the expected cost if you only had sampled data of weekly waste volumes (rather than knowing an analytical PDF)?
If you lack an explicit formula for f(x) but have empirical samples x_1, x_2, ..., x_n from the true distribution, you can estimate E[Y] using the sample mean of g(x_i). Concretely:
Estimate of E[Y] = (1/n) Σ from i=1 to n of g(x_i),
where g(x_i) is the cost function evaluated at sample point x_i. Similarly, you can estimate E[Y²] = (1/n) Σ of (g(x_i))^2, then get Var(Y) = E[Y²] - (E[Y])².
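A minimal sketch of this plug-in estimate, assuming samples is an array of observed weekly volumes (the values below are hypothetical):

import numpy as np

def g(x):
    # Vectorized weekly cost under the original structure with capacity 0.9.
    return np.where(x <= 0.9, 1.25 + 0.5 * x, 10 * x - 2.3)

samples = np.array([0.55, 0.72, 0.91, 0.64, 0.87, 0.95, 0.60, 0.78])  # hypothetical observations

costs = g(samples)
E_Y_hat = costs.mean()
E_Y2_hat = (costs**2).mean()
std_hat = np.sqrt(E_Y2_hat - E_Y_hat**2)
print(E_Y_hat, std_hat)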
Potential pitfalls include:
Not having a sufficiently large sample, leading to high variance in estimates.
Sampling bias if the data comes from an unusual time period (e.g., a maintenance shutdown period).
If you suspect the distribution is non-stationary, older data might not reflect current production behavior.
How would you account for maintenance periods or outlier weeks where production might be extremely low or extremely high?
Real-world production can occasionally have outliers or special events (like equipment failures, unusual demand surges, or planned downtime). One approach is to treat these rare weeks as separate events, each with a small probability but potentially large waste volume. Alternatively, you can incorporate a mixture model:
The main distribution capturing typical weekly volumes.
A secondary distribution component for “rare event” weeks with much higher or lower volumes.
When deriving E[Y], you then integrate over both components. The main pitfall is ignoring or underestimating the probability of these “tail” events, resulting in an unexpectedly high chance of overflow. Companies often run stress tests or scenario analyses to see how a few outlier weeks might drastically increase the overall cost due to large penalty charges.
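In symbols, with a small probability p of a rare week, the density becomes f_mix(x) = (1 - p) f_typical(x) + p f_rare(x), and by linearity E[Y] = (1 - p) E[Y | typical week] + p E[Y | rare week]. A sketch with a purely illustrative rare-week component:

from scipy.integrate import quad

p_rare = 0.02                                    # hypothetical probability of a rare week
f_typ  = lambda x: 105 * x**4 * (1 - x)**2       # usual Beta(5, 3) density
f_rare = lambda x: 9 * x**8                      # hypothetical Beta(9, 1) density concentrated near 1
g      = lambda x: 1.25 + 0.5 * x if x <= 0.9 else 10 * x - 2.3

E_typ,  _ = quad(lambda x: g(x) * f_typ(x),  0, 1, points=[0.9])
E_rare, _ = quad(lambda x: g(x) * f_rare(x), 0, 1, points=[0.9])
print((1 - p_rare) * E_typ + p_rare * E_rare)    # mixture expected cost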
How would you incorporate a confidence interval around the expected weekly cost or around the probability of overflow?
Even after computing E[Y] from the integral (or from sample-based estimation), decision-makers often want an uncertainty measure. If you rely on the analytical PDF, you can consider:
Bootstrapping simulations of X from the known distribution f(x).
Calculating the sample distribution of E[Y] across many Monte Carlo runs.
This yields an empirical confidence interval for E[Y]. Similarly, you can compute the fraction of runs that overflow to construct a confidence band for P(X > 0.9).
A pitfall is to assume that because we have an exact formula for f(x), there is no uncertainty. However, in real operations, the PDF parameters may themselves be estimates, or the waste generation process may not perfectly follow the theoretical distribution.
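A Monte Carlo sketch of such an interval, drawing X from the Beta(5, 3) distribution and summarizing the spread of per-run estimates (run counts are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.where(x <= 0.9, 1.25 + 0.5 * x, 10 * x - 2.3)

n_runs, n_weeks = 1000, 52                      # illustrative numbers of runs and weeks per run
avg_cost = np.empty(n_runs)
overflow = np.empty(n_runs)
for i in range(n_runs):
    x = rng.beta(5, 3, size=n_weeks)            # one simulated "year" of weekly volumes
    avg_cost[i] = g(x).mean()
    overflow[i] = (x > 0.9).mean()

print(np.percentile(avg_cost, [2.5, 97.5]))     # ~95% band for the average weekly cost
print(np.percentile(overflow, [2.5, 97.5]))     # ~95% band for the overflow frequency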
How would capacity planning differ if there is a possibility of storing partial waste across weeks (i.e., carryover) rather than emptying the tank fully each week?
If the tank is not fully emptied each week, the current volume in the tank plus the new waste X must not exceed capacity. The cost function might then involve additional terms for partial removals or incremental fees. You would need to redefine:
The random variable that tracks the new total volume = leftover + new production.
The cost function that may penalize only if total volume exceeds capacity.
This complicates computations because you must consider the distribution of carryover from prior weeks. A Markov chain approach can track the distribution of the leftover volume, updating each week. The main pitfall is ignoring these state-based dependencies; a simple one-week analysis might underestimate overflow probability if the tank is rarely emptied completely, leading to repeated partial carryovers.
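A simulation sketch of these dynamics under one simple, purely illustrative removal policy (a fixed amount removed each week, which is an assumption not given in the problem):

import numpy as np

rng = np.random.default_rng(1)

capacity, removal, n_weeks = 0.9, 0.7, 100_000   # illustrative capacity, weekly removal, horizon
leftover, overflows = 0.0, 0

for _ in range(n_weeks):
    x = rng.beta(5, 3)                           # new weekly waste volume
    total = leftover + x
    if total > capacity:
        overflows += 1
        total = capacity                         # excess is assumed lost to overflow (simplification)
    leftover = max(0.0, total - removal)         # fixed weekly removal

print("overflow frequency:", overflows / n_weeks)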