ML Interview Q Series: Bayesian Inference for Loaded Die Probability using a Truncated Prior
Your friend has fabricated a loaded die. In doing so, he has first simulated a number at random from the interval (0.1, 0.4). He tells you that the die is loaded so that any roll of the die results in the outcome 6 with probability equal to this simulated number. Next, the die is rolled 300 times and you are informed that the outcome 6 has appeared 75 times. What is the posterior density of the probability that a single roll of the die gives a 6?
Short Compact solution
The prior density is uniform over the interval (0.1, 0.4), with constant value 10/3. The likelihood of observing 75 sixes in 300 rolls is proportional to θ^75 (1−θ)^225, and the posterior density is proportional to the product of prior and likelihood. Therefore, for 0.1 < θ < 0.4, the posterior density is:
f(θ | data) = [ θ^75 (1−θ)^225 ] / [ ∫(0.1 to 0.4) x^75 (1−x)^225 dx ]
and f(θ | data) = 0 otherwise.
Comprehensive Explanation
Prior and Likelihood
Because the loaded die's probability of rolling a 6, denoted by θ, was drawn uniformly from (0.1, 0.4), the prior density for θ is constant on that interval:
Prior density f₀(θ) = 10/3 for 0.1 < θ < 0.4
f₀(θ) = 0 otherwise
Once we roll the die 300 times and observe 75 occurrences of the outcome 6, the likelihood function for θ (treating each roll as an independent Bernoulli trial) is:
Likelihood L(data | θ) = θ^(75) (1−θ)^(225)
Posterior Derivation
Bayes’ theorem tells us the posterior density is proportional to the prior times the likelihood. Explicitly, for 0.1 < θ < 0.4:
f(θ | data) ∝ f₀(θ) × L(data | θ) = (10/3) × θ^75 (1−θ)^225
However, we must normalize this over the interval (0.1, 0.4) so that the posterior integrates to 1. Hence, the normalizing constant is the integral of the unnormalized function over 0.1 < θ < 0.4:
(10/3) × ∫(0.1 to 0.4) θ^75 (1−θ)^225 dθ
But notice that the (10/3) factor in the numerator cancels with the same factor in the denominator. So the final posterior density on (0.1, 0.4) (and 0 outside) can be written as:

f(θ | data) = θ^75 (1−θ)^225 / ∫(0.1 to 0.4) x^75 (1−x)^225 dx
In words, the posterior density is just a truncated Beta-like function—namely θ^(75)(1−θ)^(225)—restricted to the interval (0.1, 0.4), with the denominator acting as the normalizing constant.
Posterior Mode
A common point estimate is the posterior mode. For a Beta(α, β) distribution on (0,1), the mode is (α−1)/(α+β−2). Our posterior is essentially truncated, but if the mode from the unconstrained Beta falls strictly inside the interval (0.1, 0.4), that will also be the mode of the truncated version.
Here, we can think of the non-truncated posterior as Beta(76, 226). That distribution’s mode would be (76−1)/(76+226−2) = 75/300 = 0.25. Since 0.25 is within (0.1, 0.4), the posterior mode remains 0.25.
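As a quick numerical sanity check, one could evaluate the unnormalized log posterior on a grid over (0.1, 0.4) and locate its maximum; this is only a sketch, and the grid resolution is arbitrary (assumes NumPy is available).

import numpy as np

# Grid search for the mode of the truncated posterior (sanity check).
# Use the log of the unnormalized density; the normalizing constant
# does not affect where the maximum sits.
theta = np.linspace(0.1, 0.4, 100001)[1:-1]              # interior points of (0.1, 0.4)
log_post = 75 * np.log(theta) + 225 * np.log(1 - theta)
print(theta[np.argmax(log_post)])                        # ≈ 0.25, matching (76−1)/(76+226−2)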
Posterior Credible Intervals
If one wants a 95% credible interval, we would look for the interval (θ_lower, θ_upper) in (0.1, 0.4) such that:
∫(0.1 to θ_lower) f(θ | data) dθ = 0.025
∫(θ_upper to 0.4) f(θ | data) dθ = 0.025
(and 0.95 probability in between). Numerically, these work out to about (0.2044, 0.3020). That is a straightforward but potentially involved numerical integration task in practice.
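One possible sketch of that computation, assuming SciPy is available, builds the truncated posterior CDF by quadrature and then uses root finding to locate the 2.5% and 97.5% quantiles.

from scipy.integrate import quad
from scipy.optimize import brentq

# Equal-tailed 95% credible interval for the truncated posterior on (0.1, 0.4).
unnorm = lambda t: t**75 * (1 - t)**225
Z, _ = quad(unnorm, 0.1, 0.4)                     # normalizing constant
cdf = lambda t: quad(unnorm, 0.1, t)[0] / Z       # truncated posterior CDF

lower = brentq(lambda t: cdf(t) - 0.025, 0.1, 0.4)
upper = brentq(lambda t: cdf(t) - 0.975, 0.1, 0.4)
print(lower, upper)                               # roughly (0.204, 0.302)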
Follow Up Question 1: Why is the posterior a truncated Beta distribution rather than a standard Beta distribution?
When we say “Beta distribution,” we typically assume the entire support is (0, 1). However, this problem specifies a prior that is uniform only on (0.1, 0.4). Consequently, instead of spanning (0, 1), the support is restricted to (0.1, 0.4). This causes the posterior to be a Beta-like form but truncated to that interval. The normalization must then be performed only over (0.1, 0.4) rather than over (0, 1).
Follow Up Question 2: How would the posterior change if the prior interval was wider, say (0, 1)?
If we had started with a standard Beta(1, 1) prior (which is uniform over the entire interval (0, 1)), the posterior would become a standard Beta(1 + 75, 1 + 225) = Beta(76, 226) distribution on (0, 1). The primary difference is that you would not truncate to (0.1, 0.4). Instead, the entire distribution would be on (0,1), and the normalizing constant would be the Beta(76, 226) normalization.
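In that untruncated case the posterior summaries come straight from standard libraries; for example, a minimal sketch with SciPy's beta distribution:

from scipy.stats import beta

# Untruncated case: a uniform Beta(1, 1) prior on (0, 1) yields a Beta(76, 226) posterior.
posterior = beta(76, 226)
print(posterior.mean())              # 76/302 ≈ 0.2517
print(posterior.interval(0.95))      # equal-tailed 95% credible interval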
Follow Up Question 3: How do we compute the integral in practice?
To find the normalizing constant, you must numerically compute:
∫(0.1 to 0.4) x^75 (1−x)^225 dx
This can be done with numerical integration libraries. For instance, in Python:
import mpmath as mp

# Unnormalized posterior density; the 10/3 prior constant cancels and is omitted.
def integrand(x):
    return x**75 * (1 - x)**225

norm_const = mp.quad(integrand, [0.1, 0.4])   # adaptive quadrature over (0.1, 0.4)
We would then use this computed value as the normalizing denominator in f(θ | data).
Follow Up Question 4: Can we obtain a closed-form expression for the normalizing constant?
For a typical Beta(α, β) distribution over (0,1), the normalizing constant is expressed in terms of the Beta function, B(α, β). Since here we integrate only from 0.1 to 0.4 rather than from 0 to 1, the integral does not have a simple closed-form expression in elementary functions. It instead involves the incomplete Beta function:
∫(0.1 to 0.4) x^75 (1−x)^225 dx = B(0.4; 76, 226) − B(0.1; 76, 226)
where B(x; α, β) = ∫(0 to x) t^(α−1) (1−t)^(β−1) dt is the (lower) incomplete Beta function. While this does not reduce to elementary functions, it is well supported in statistical software for direct numerical computation.
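For example, SciPy exposes the regularized incomplete Beta function I_x(α, β) = B(x; α, β)/B(α, β), so the truncated integral can be assembled as follows (a sketch; the magnitudes involved are tiny but still representable in double precision):

from scipy.special import betainc, beta as beta_fn

# Normalizing constant via the regularized incomplete Beta function:
# integral over (0.1, 0.4) = (I_0.4(a, b) − I_0.1(a, b)) * B(a, b).
a, b = 76, 226
norm_const = (betainc(a, b, 0.4) - betainc(a, b, 0.1)) * beta_fn(a, b)
print(norm_const)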
Follow Up Question 5: What is the posterior predictive probability of rolling a 6 on the next toss?
The posterior predictive probability of rolling a 6 on the next toss, given data, is the expected value of θ under the posterior distribution. For an untruncated Beta(α, β), this is α/(α + β). For the truncated version, it becomes:
E[θ | data] = ∫(0.1 to 0.4) θ f(θ | data) dθ
One would numerically approximate this integral given the truncated posterior. If the prior had been a full Beta(1, 1), then you would compute (1 + 75)/(1 + 75 + 1 + 225) = 76/302. But because of truncation, you need an integral from 0.1 to 0.4 with the posterior density above, which slightly modifies the final answer compared to the pure Beta(76, 226) formula.
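A small sketch of that truncated expectation, assuming SciPy; the result stays very close to 76/302 because the Beta(76, 226) shape places almost no mass outside (0.1, 0.4).

from scipy.integrate import quad

# Posterior predictive P(next roll is a 6 | data) = E[theta | data] under truncation.
unnorm = lambda t: t**75 * (1 - t)**225
num, _ = quad(lambda t: t * unnorm(t), 0.1, 0.4)
den, _ = quad(unnorm, 0.1, 0.4)
print(num / den)      # close to 76/302 ≈ 0.2517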
Below are additional follow-up questions
Follow Up Question 6: What if the data were extremely different, say 295 sixes out of 300, potentially pushing the posterior outside the prior bounds (0.1, 0.4)?
If we observed 295 sixes out of 300, the empirical fraction of sixes would be 295/300 = 0.9833..., which is far outside the prior support (0.1, 0.4). In a Bayesian framework, the posterior cannot place any mass outside the region where the prior is nonzero. Consequently, the posterior would be forced to concentrate near the upper boundary θ = 0.4.
Because the prior excludes values above 0.4, even extremely strong data indicating a probability near 0.98 would be effectively “overridden” in the sense that we cannot update outside the prior’s support. This scenario can expose a potential pitfall: if the true generating parameter is actually beyond 0.4, our prior assumptions would be incorrect, and the resulting posterior would be heavily biased. Practically, this suggests that whenever we choose a prior with restricted support, we must be sure that the true parameter is unlikely to lie outside that region. Otherwise, a truncated prior may distort posterior inferences if the data strongly contradicts the assumed support.
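To see the boundary effect numerically, one could tabulate the unnormalized posterior for this hypothetical 295-out-of-300 data set; on (0.1, 0.4) the density is strictly increasing, so essentially all the mass piles up against the 0.4 boundary (sketch assumes NumPy).

import numpy as np

# Hypothetical extreme data: 295 sixes in 300 rolls, with the same truncated prior.
theta = np.linspace(0.1, 0.4, 1001)[1:-1]
log_post = 295 * np.log(theta) + 5 * np.log(1 - theta)   # log-space to avoid underflow
post = np.exp(log_post - log_post.max())                  # unnormalized, rescaled for display
print(theta[np.argmax(post)])                             # the grid point nearest the 0.4 boundary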
Follow Up Question 7: Could the maximum-likelihood estimate (MLE) ever lie outside the interval (0.1, 0.4)? If so, how does the posterior handle it?
In a binomial setting, the MLE for θ is the observed fraction of successes, i.e., 75/300 = 0.25 in the example. That MLE (0.25) lies comfortably inside (0.1, 0.4). But if the number of sixes had been large enough, the empirical fraction might lie outside that interval (e.g., 0.05 or 0.45).
When the MLE is outside the prior support, the Bayesian posterior will still remain within (0.1, 0.4), because the posterior density is always zero outside that range by assumption. In effect, the prior “overrides” any likelihood-based push to move outside the supported region. Mathematically, the posterior is proportional to the likelihood times the prior, and if the prior is zero outside (0.1, 0.4), then the posterior must be zero there too. This can be seen as an intentional modeling choice to disallow parameter values beyond the chosen boundaries.
Follow Up Question 8: How would we conduct a formal hypothesis test of fairness (θ = 1/6) under this truncated prior?
A standard Bayesian approach to hypothesis testing might involve comparing models (e.g., a point-mass prior at θ = 1/6 vs. a continuous prior on (0.1, 0.4)). Although our prior excludes anything below 0.1, it still contains 1/6 ≈ 0.167, so the fair value lies inside the prior support.
A simple approach is to check the posterior density at θ = 1/6, or to compute a posterior probability region that covers 1/6. We might do the following:
Compute the posterior probability that θ < 1/6 or θ > 1/6.
If a large proportion of the posterior mass is concentrated away from 1/6, that is evidence against fairness.
Alternatively, one could compute a Bayes factor comparing:
Model 1: θ is a fixed 1/6 (point mass).
Model 2: θ is drawn from the truncated prior (0.1, 0.4).
In that case, the marginal likelihoods of each model would be compared. Either way, the existence of a restricted prior support can complicate formal hypothesis testing if the tested values are near or outside the boundaries.
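A rough sketch of that Bayes-factor computation, assuming SciPy: Model 1's marginal likelihood is the binomial probability of the data at θ = 1/6, while Model 2's is the binomial probability averaged over the truncated uniform prior.

from scipy.stats import binom
from scipy.integrate import quad

# Marginal likelihoods for the point-null model and the truncated-uniform model.
k, n = 75, 300
m1 = binom.pmf(k, n, 1/6)                                       # Model 1: theta fixed at 1/6
m2, _ = quad(lambda t: (10/3) * binom.pmf(k, n, t), 0.1, 0.4)   # Model 2: averaged over the prior
print(m1 / m2)                                                  # Bayes factor for Model 1 over Model 2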
Follow Up Question 9: What if we want a different shape for the prior, say Beta(a, b) truncated to (0.1, 0.4)? How would that change the posterior?
Instead of having a uniform prior on (0.1, 0.4), one might choose a Beta distribution restricted to that same interval. For instance, if we believe smaller values of θ are more likely than larger ones, we might pick something akin to Beta(2, 5) but then truncate it to (0.1, 0.4).
The unnormalized prior in that region would be p(θ) = c × θ^(a−1) (1−θ)^(b−1) for θ in (0.1, 0.4).
We would compute the appropriate normalization by integrating over 0.1 to 0.4.
Once we observe the data, the posterior would be proportional to θ^(a−1 + number_of_sixes) (1−θ)^(b−1 + number_of_failures), restricted and renormalized in (0.1, 0.4).
This approach is more flexible because it encodes a belief about the shape of θ’s distribution within the allowed interval. In practice, one would pick a, b to reflect domain knowledge or any prior beliefs about how heavily loaded the die might be.
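A minimal sketch under an assumed Beta(2, 5) prior truncated to (0.1, 0.4): the prior's own normalizing constant cancels, so relative to the uniform-prior case only the exponents change.

from scipy.integrate import quad

# Posterior for a Beta(a, b) prior truncated to (0.1, 0.4), with 75 sixes in 300 rolls.
a, b = 2, 5                     # assumed prior shape parameters (illustrative only)
k, n = 75, 300
unnorm_post = lambda t: t**(a - 1 + k) * (1 - t)**(b - 1 + n - k)
Z, _ = quad(unnorm_post, 0.1, 0.4)
posterior = lambda t: unnorm_post(t) / Z if 0.1 < t < 0.4 else 0.0
print(posterior(0.25))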
Follow Up Question 10: How do we address numerical instability if the exponents in θ^75 (1−θ)^225 become very large in an actual implementation?
When exponents like 75 or 225 appear, direct computation of θ^75 (1−θ)^225 can cause numerical underflow for floating-point arithmetic if θ is small or large. To mitigate this:
Work in log-space: compute log(θ^75 (1−θ)^225) = 75 log(θ) + 225 log(1−θ).
Use library functions that handle the incomplete Beta function or gamma functions.
Normalize carefully so that intermediate results remain in a more stable range.
These numeric strategies are vital for robust posterior estimation. If not addressed, naive implementations can silently fail by producing underflow to zero or overflow to infinity.
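A short sketch of the log-space strategy, assuming NumPy and SciPy: evaluate the log of the unnormalized posterior on a grid and normalize with logsumexp so that no intermediate quantity underflows.

import numpy as np
from scipy.special import logsumexp

# Log-space evaluation of the unnormalized posterior, normalized via logsumexp.
theta = np.linspace(0.1, 0.4, 10001)[1:-1]
log_unnorm = 75 * np.log(theta) + 225 * np.log(1 - theta)
log_weights = log_unnorm - logsumexp(log_unnorm)     # log of normalized grid weights
post_mean = np.sum(np.exp(log_weights) * theta)      # stable posterior-mean estimate on the grid
print(post_mean)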
Follow Up Question 11: What if the prior was improperly specified so that the interval (0.1, 0.4) does not reflect reality at all?
In Bayesian analysis, the prior should reflect actual beliefs or constraints. If it is discovered (e.g., through domain expertise or the data itself) that (0.1, 0.4) is an unrealistic support, then the entire modeling assumption needs to be revisited.
Possible steps include:
Expanding the prior domain to (0, 1) or to some other interval that better covers potential values.
Using a weakly informative or flat prior to avoid overconstraining the parameter.
Performing a sensitivity analysis to see how different priors might change the posterior.
If the data strongly suggests a value outside (0.1, 0.4), the truncated prior will force the posterior to remain within that region, possibly introducing bias. Detecting such a conflict between prior and data is an important part of Bayesian modeling.
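One simple form of sensitivity analysis is to recompute a posterior summary (here the posterior mean) under uniform priors with different assumed supports; the alternative bounds below are purely illustrative.

import numpy as np
from scipy.integrate import quad

# Prior-sensitivity check: posterior mean under uniform priors with different supports.
# The +170 shift rescales the integrand in log space so quad works with O(1) numbers.
unnorm = lambda t: np.exp(75 * np.log(t) + 225 * np.log(1 - t) + 170)
for lo, hi in [(0.1, 0.4), (0.05, 0.5), (0.01, 0.99)]:
    num, _ = quad(lambda t: t * unnorm(t), lo, hi)
    den, _ = quad(unnorm, lo, hi)
    print((lo, hi), num / den)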
Follow Up Question 12: What happens to the concept of conjugacy in this truncated scenario?
Usually, with a Beta(α, β) prior and a binomial likelihood, the posterior is Beta(α + number_of_successes, β + number_of_failures). That’s a well-known conjugate relationship. However, once we truncate the prior to (0.1, 0.4), the direct Beta-binomial conjugacy no longer holds in the classical sense. We end up with a posterior that resembles a Beta(α + number_of_successes, β + number_of_failures) distribution, but it is renormalized only over (0.1, 0.4).
Thus, strictly speaking, we lose the simple closed-form Beta update formula because we must compute truncated integrals. The posterior is sometimes referred to as a “truncated Beta” distribution, but we can no longer treat it as a standard Beta distribution. Conjugacy is broken by truncation, so we rely on numerical methods to find normalizing constants and credible intervals.
Follow Up Question 13: How might we generalize this approach if we had multiple dice, each with potentially different load parameters?
For multiple dice (say we have D dice, each with its own probability θ_d of rolling a 6), we might place a joint prior that either factors across dice or uses a hierarchical structure to model correlations among dice. For example:
Independent approach: For each die d, assume θ_d ~ Uniform(0.1, 0.4). Then, for data from each die, update the posterior independently.
Hierarchical approach: Suppose θ_d ~ Beta(α, β), but now truncated to (0.1, 0.4), and α, β themselves have a hyperprior. One might attempt to pool information across dice to get a more stable estimate of the typical loading.
In either case, once data are observed for each die, you would multiply or combine the likelihoods subject to the chosen prior constraints, possibly leading to a more complex posterior that would likely require MCMC or advanced sampling techniques for practical inference.
Follow Up Question 14: How do posterior mean, median, and mode compare under this truncated distribution, and why might they differ for finite sample sizes?
Posterior mode: For a non-truncated Beta(76, 226), the mode is (75)/(300) = 0.25. With truncation, if that mode is within (0.1, 0.4), it remains the same. Otherwise, it would be pushed to a boundary.
Posterior mean: For a standard Beta(76, 226), the mean is 76/(76+226)=76/302≈0.2517. Under truncation, we would compute ∫(0.1 to 0.4) θ f(θ|data) dθ, which might shift slightly compared to the full Beta distribution.
Posterior median: This is the 50th percentile of the posterior distribution. Unlike the mode or mean, the median is found by solving ∫(0.1 to m) f(θ|data)dθ = 0.5. That might require numerical methods.
For finite samples, these three measures can differ appreciably, especially when the posterior is skewed or truncated. Over many trials, with large sample sizes, the distribution of θ narrows and these statistics converge, but the truncation can still pull the posterior away from what an unconstrained Beta update might have suggested.
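All three summaries can be computed numerically from the truncated posterior with a few lines of SciPy; for this particular data set they nearly coincide, since the posterior is narrow and almost untruncated.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# Mean, median, and mode of the truncated posterior, computed numerically.
unnorm = lambda t: t**75 * (1 - t)**225
Z, _ = quad(unnorm, 0.1, 0.4)
mean = quad(lambda t: t * unnorm(t) / Z, 0.1, 0.4)[0]
median = brentq(lambda m: quad(unnorm, 0.1, m)[0] / Z - 0.5, 0.1, 0.4)
grid = np.linspace(0.1, 0.4, 100001)[1:-1]
mode = grid[np.argmax(75 * np.log(grid) + 225 * np.log(1 - grid))]
print(mean, median, mode)       # all close to 0.25 for 75 sixes in 300 rolls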