ML Interview Q Series: Bayesian Estimation of Loaded Die Probability with Discrete Priors
Your friend has fabricated a loaded die. In doing so, he has first simulated a number at random from {0.1, 0.2, 0.3, 0.4}. He tells you that the die is loaded in such a way that any roll of the die results in the outcome 6 with a probability that equals the simulated number. Next, the die is rolled 300 times and you are informed that the outcome 6 has appeared 75 times. What is the posterior distribution of the probability that a single roll of the die gives a 6?
Short Compact solution
The prior distribution assigns each of the probabilities 0.1, 0.2, 0.3, and 0.4 a prior probability 0.25. The likelihood of observing 75 sixes out of 300 rolls is proportional to theta^(75)*(1 - theta)^(225). The posterior probability is then found by multiplying the likelihood by the prior for each theta value and normalizing. Numerically,
p(theta=0.1 | data) = 3.5 × 10^(-12)
p(theta=0.2 | data) = 0.4097
p(theta=0.3 | data) = 0.5903
p(theta=0.4 | data) = 1.2 × 10^(-6)
Hence, the die’s probability of rolling a six is most likely 0.3 or 0.2, with their posterior probabilities 0.5903 and 0.4097, respectively.
Comprehensive Explanation
Overview of Bayesian Framework
We are given a random variable theta that denotes the probability of rolling a 6 on this loaded die. Initially, theta can be one of four discrete values: 0.1, 0.2, 0.3, or 0.4. Each value is assumed to have an equal prior probability of 0.25.
When we observe 75 sixes out of 300 rolls, we update our prior beliefs using Bayes’ Theorem. In simpler terms, we calculate the likelihood of seeing these 75 sixes given each possible theta value, multiply by the prior, and normalize to ensure the resulting probabilities sum to 1.
Likelihood Function
Because each theta is a possible fixed probability of rolling a 6, the total number of sixes X out of 300 follows a Binomial distribution with parameters n = 300 and p = theta. The probability of seeing exactly 75 sixes given theta is:
likelihood(data | theta) = C(300, 75) * theta^(75) * (1 - theta)^(225)
where C(300, 75) is the binomial coefficient “300 choose 75.” However, since C(300, 75) is a constant for all theta values, we can ignore that factor when comparing and normalizing among different theta candidates.
Posterior Distribution
Let p_0(theta) = 0.25 for each of the four possible values of theta in {0.1, 0.2, 0.3, 0.4}. Denote the data (i.e., the outcome of 75 sixes in 300 rolls) as D. By Bayes’ Theorem, the posterior distribution is:
p(theta | D) = [ theta^(75) * (1 - theta)^(225) * p_0(theta) ] / [ sum over theta' in {0.1, 0.2, 0.3, 0.4} of theta'^(75) * (1 - theta')^(225) * p_0(theta') ]
Here, theta takes on the values 0.1, 0.2, 0.3, and 0.4.
The numerator is the product of the likelihood and the prior for the specific theta under consideration.
The denominator is the sum of these products across all four possible theta values, ensuring the posterior probabilities sum to 1.
Numerical Values
Plugging in each theta = 0.1, 0.2, 0.3, and 0.4, and noting that p_0(theta)=0.25 for each, we compute:
For theta=0.1: 0.1^(75) * 0.9^(225) * 0.25
For theta=0.2: 0.2^(75) * 0.8^(225) * 0.25
For theta=0.3: 0.3^(75) * 0.7^(225) * 0.25
For theta=0.4: 0.4^(75) * 0.6^(225) * 0.25
We then divide each term by the sum of all four terms to get the posterior probabilities. The calculations lead to:
p(theta=0.1 | D) ≈ 3.5 × 10^(-12)
p(theta=0.2 | D) ≈ 0.4097
p(theta=0.3 | D) ≈ 0.5903
p(theta=0.4 | D) ≈ 1.2 × 10^(-6)
These four posterior probabilities sum to 1.
Interpretation
The result indicates that after observing 75 sixes in 300 rolls, the die’s probability of landing on six is very unlikely to be 0.1 or 0.4. The data (75 out of 300) aligns more strongly with probabilities around 0.2 or 0.3, with the highest posterior being at 0.3. The difference between 0.2 and 0.3 is enough that 0.3 becomes the most probable parameter, though 0.2 still has substantial posterior weight.
Potential Follow-Up Questions
How does one interpret the fact that the posterior probabilities for 0.1 and 0.4 are nearly zero?
When we observe 75 sixes in 300 trials, the empirical frequency of sixes is 0.25, which is relatively far from 0.1 and 0.4, so the likelihood at those values is much lower than at 0.2 or 0.3. Because the likelihood multiplies evidence across 300 independent rolls, a gap between a candidate theta and the empirical frequency is penalized exponentially, leaving 0.1 and 0.4 with essentially no posterior weight.
Why do we ignore the binomial coefficient in the posterior calculation?
The binomial coefficient C(300, 75) is the same for every candidate theta. In Bayesian updating for discrete parameter choices, any multiplicative constant that does not depend on theta cancels out in the normalization factor. We only need the terms that differ among theta candidates, namely theta^(75)*(1 - theta)^(225).
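A quick numerical check of this cancellation (assuming SciPy is available; scipy.stats.binom gives the full pmf): the normalized posteriors are identical whether or not C(300, 75) is included.
import numpy as np
from scipy.stats import binom

thetas = np.array([0.1, 0.2, 0.3, 0.4])
prior = np.full(4, 0.25)
x, n = 75, 300

with_coeff = binom.pmf(x, n, thetas) * prior            # includes C(300, 75)
without_coeff = thetas**x * (1 - thetas)**(n - x) * prior

# Both versions normalize to the same posterior
print(np.allclose(with_coeff / with_coeff.sum(),
                  without_coeff / without_coeff.sum()))  # True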
Could we have used a Beta prior instead of a discrete uniform prior?
Yes. A common approach in Bayesian inference with binomial data is to use a Beta distribution as a prior for a parameter representing a probability. Here, however, the problem explicitly states a discrete prior with values {0.1, 0.2, 0.3, 0.4}. If we had chosen a continuous prior Beta(alpha, beta), the resulting posterior would also be Beta(alpha + 75, beta + 225), but that is a different scenario than the discrete one described.
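For contrast, a minimal sketch of that continuous-prior scenario (a different model from the discrete one in this problem), assuming a uniform Beta(1, 1) prior and using scipy.stats.beta for the conjugate update:
from scipy.stats import beta

alpha0, beta0 = 1.0, 1.0       # Beta(1, 1) is the uniform prior on [0, 1]
x, n = 75, 300

# Conjugate update: the posterior is Beta(alpha0 + x, beta0 + n - x)
posterior = beta(alpha0 + x, beta0 + n - x)

print("Posterior mean:", round(posterior.mean(), 4))
print("95% credible interval:", [round(q, 4) for q in posterior.interval(0.95)])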
How might one implement this approach in Python?
Below is a simple example of how one could compute these posterior values for discrete priors:
import numpy as np

# Candidate values of theta and a uniform discrete prior
thetas = np.array([0.1, 0.2, 0.3, 0.4])
prior = np.full(4, 0.25)

data_sixes = 75
data_trials = 300

# Unnormalized posterior: likelihood (up to the binomial coefficient) times prior
likelihood = thetas**data_sixes * (1 - thetas)**(data_trials - data_sixes)
unnormalized = likelihood * prior

# Normalize so the posterior probabilities sum to 1
posterior = unnormalized / unnormalized.sum()

for t, p in zip(thetas, posterior):
    print(f"Theta={t}, Posterior={p}")
This code snippet calculates the posterior for each theta by multiplying the likelihood by the prior and then normalizing.
Are there edge cases to be concerned about?
Extreme Observations: If we had an extreme observation like 300 sixes in 300 rolls, then the posterior for theta=0.4 would dominate the distribution (though 0.4 is still quite far from 1.0). In fact, with discrete values of theta, if none of them can exactly match the empirical frequency, the most closely aligned will dominate.
Numerical Underflow: When dealing with large exponents such as theta^75 and (1 - theta)^225, extremely small floating-point numbers can occur. In practice, you might work in log space (with a log-sum-exp normalization) to avoid underflow; a minimal sketch appears below.
These considerations are important in large-scale or repeated computations.
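As a minimal sketch of that log-space computation (assuming SciPy is available; scipy.special.logsumexp handles the normalization), the posterior from above can be reproduced without ever forming the tiny raw products:
import numpy as np
from scipy.special import logsumexp

thetas = np.array([0.1, 0.2, 0.3, 0.4])
x, n = 75, 300

# Log of the unnormalized posterior: x*log(theta) + (n - x)*log(1 - theta) + log(prior)
log_unnorm = x * np.log(thetas) + (n - x) * np.log(1 - thetas) + np.log(0.25)

# Subtract the log normalizing constant, then exponentiate back to probabilities
posterior = np.exp(log_unnorm - logsumexp(log_unnorm))
print(dict(zip(thetas.tolist(), posterior.round(6))))
The result matches the direct computation; the log-space route only matters once the exponents are large enough to underflow double precision.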
Below are additional follow-up questions
What if the die’s true probability changes over time? How would that affect the Bayesian inference?
If the probability of rolling a 6 is not fixed but can drift or change over time, the entire inference procedure changes significantly. The standard binomial likelihood approach assumes independent and identically distributed (i.i.d.) trials with a fixed probability theta. If that assumption is violated, our posterior might be misleading.
Possible Approach: One could adopt a dynamic Bayesian model, such as a hidden Markov model or a state-space model, where theta is allowed to change from one roll to the next according to some transition distribution; a rough forward-filtering sketch appears below.
Consequences of Not Accounting for It: If the die is actually changing its probability but we assume a single fixed theta, the posterior estimates for any single theta will likely produce large predictive errors in future rolls.
Practical Pitfall: In a real-world scenario, parameters can indeed drift due to wear or changes in conditions. Failing to account for this time variation can lead to overly confident but incorrect estimates.
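As a rough sketch of that forward-filtering idea (not part of the original problem): the code below tracks a distribution over the four candidate theta values roll by roll, assuming a hypothetical "sticky" transition matrix (95% chance theta stays put between rolls) and a simulated roll sequence, both of which are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.1, 0.2, 0.3, 0.4])

# Hypothetical transition matrix: theta stays the same with probability 0.95,
# otherwise jumps uniformly to one of the other three values
stay = 0.95
T = np.full((4, 4), (1 - stay) / 3)
np.fill_diagonal(T, stay)

# Simulated outcomes (1 = six, 0 = not six) just to exercise the filter
rolls = rng.binomial(1, 0.25, size=300)

belief = np.full(4, 0.25)        # prior over theta before the first roll
for y in rolls:
    belief = T.T @ belief        # predict: theta may have drifted since the last roll
    belief *= thetas**y * (1 - thetas)**(1 - y)   # update with the Bernoulli likelihood
    belief /= belief.sum()       # normalize

print("Filtered distribution over theta after the final roll:", belief.round(4))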
How do we handle the possibility of measurement or labeling errors in the observed outcomes?
Real-world data can contain mislabeled or incorrectly recorded observations. Suppose some of the “sixes” reported are actually another outcome, or vice versa. This kind of error biases the count of successes and can invalidate the direct binomial likelihood approach.
Robust Bayesian Model: One can introduce an error rate parameter that models the probability of incorrectly labeling an outcome as a six (or vice versa). This modifies the likelihood function to incorporate labeling error.
Sensitivity Analysis: Even a small mislabeling probability can drastically shift the posterior distribution. One should do a sensitivity analysis to see how robust the posterior is to different assumptions about error rates.
Practical Example: If the process for logging results has a known 2% error rate, you might incorporate that 2% chance that an outcome is flipped from “six” to “not six.”
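Building on the 2% example above, here is a hedged sketch of how a known labeling error rate e could enter the likelihood: a roll is recorded as a six with probability theta*(1 - e) + (1 - theta)*e (a true six is flipped with probability e, and so is a true non-six), and the discrete update uses this effective probability in place of theta. The 2% value is illustrative, not a measured rate.
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
prior = np.full(4, 0.25)
x, n = 75, 300
e = 0.02  # assumed label-flip probability, in either direction

# Probability that a roll is *recorded* as a six, given the true theta
theta_eff = thetas * (1 - e) + (1 - thetas) * e

unnorm = theta_eff**x * (1 - theta_eff)**(n - x) * prior
posterior = unnorm / unnorm.sum()
print(dict(zip(thetas.tolist(), posterior.round(4))))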
Could we combine results from multiple dice if they have the same set of possible theta values?
If we have multiple such dice, each fabricated under the same process (drawing theta from {0.1, 0.2, 0.3, 0.4}), we might want to combine data from all dice to better estimate the distribution of possible theta values. This leads to hierarchical modeling considerations.
Shared Prior: We might specify that each die has a prior over {0.1, 0.2, 0.3, 0.4}, but not necessarily the same posterior after observing their respective outcomes.
Pooling Information: If we suspect the dice are identical, we could pool the outcomes to form a single posterior. However, if there is heterogeneity among dice, a hierarchical model would be more appropriate, treating each die’s theta as an unknown from the same discrete set but allowing separate inferences.
Pitfall: If one erroneously assumes complete homogeneity across all dice when they actually differ, we risk oversimplifying and incorrectly forcing them all to share the same posterior distribution.
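As a small sketch of the pooled-versus-separate distinction (the counts for the two dice below are invented for illustration): pooling adds the counts before a single update, while the heterogeneous treatment gives each die its own posterior over the same discrete set.
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
prior = np.full(4, 0.25)

def discrete_posterior(x, n):
    """Posterior over the four candidate theta values after x sixes in n rolls."""
    unnorm = thetas**x * (1 - thetas)**(n - x) * prior
    return unnorm / unnorm.sum()

die_a = (75, 300)   # hypothetical counts for die A
die_b = (55, 300)   # hypothetical counts for die B

post_a = discrete_posterior(*die_a)                    # separate inference per die
post_b = discrete_posterior(*die_b)
post_pooled = discrete_posterior(75 + 55, 300 + 300)   # assumes a single shared theta
print(post_a.round(4), post_b.round(4), post_pooled.round(4))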
What about posterior predictive checks for future rolls?
A key question in Bayesian statistics is: Given the posterior distribution of theta, what is the distribution of future observations?
Posterior Predictive Distribution: We can calculate the probability of observing x sixes in future n rolls by summing over the possible theta values, weighted by their posterior probabilities. Specifically, if we let X_future be the random variable denoting the number of sixes in n future rolls, then:
P(X_future = x | data) = sum over theta in {0.1, 0.2, 0.3, 0.4} of [ C(n, x) * theta^(x) * (1 - theta)^(n - x) * p(theta | data) ]
This is essentially a mixture of binomial distributions, each weighted by the posterior p(theta | data).
Checks for Model Adequacy: We can simulate new data under the posterior predictive distribution and compare it to the actual new outcomes. If there is a large discrepancy, we may need to refine our model or question our assumptions.
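A minimal sketch of this mixture-of-binomials predictive distribution, assuming SciPy is available (scipy.stats.binom supplies the binomial pmf) and picking 60 future rolls purely for illustration:
import numpy as np
from scipy.stats import binom

thetas = np.array([0.1, 0.2, 0.3, 0.4])
unnorm = thetas**75 * (1 - thetas)**225 * 0.25
posterior = unnorm / unnorm.sum()          # posterior from the 300 observed rolls

n_future = 60
x_grid = np.arange(n_future + 1)

# Posterior predictive pmf: binomial pmfs averaged with posterior weights
predictive = sum(p * binom.pmf(x_grid, n_future, t) for t, p in zip(thetas, posterior))

print("P(exactly 15 sixes in 60 future rolls) ≈", round(predictive[15], 4))
print("Expected sixes in 60 future rolls ≈", round((x_grid * predictive).sum(), 2))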
How do we deal with the possibility that the true theta value is not in {0.1, 0.2, 0.3, 0.4}?
Realistically, the die might land on six with a probability different from these four discrete points. If the set {0.1, 0.2, 0.3, 0.4} is just an approximation or guess, it could exclude the correct probability.
Model Mismatch: In Bayesian terms, we are restricting our hypothesis space to only four discrete parameter values. If the true theta is, say, 0.25, none of the model’s possible values precisely match the data-generating process.
Practical Consequence: This mismatch can lead to suboptimal posterior estimates, typically favoring whichever value is closest to the observed frequency of 0.25.
Solution: A more flexible approach would be to use a continuous prior over [0,1], such as a Beta distribution, thus not restricting ourselves to only four discrete points.
Is it ever acceptable to combine probabilities 0.2 and 0.3 into a single “aggregate” if they both seem plausible?
One might be tempted to say, “the posterior mass for 0.2 and 0.3 is high, so let’s treat them as a single composite probability.”
Statistical Implications: Combining them changes the interpretation of theta entirely, effectively collapsing two distinct hypotheses into one. This can be done, but we lose fine-grained detail about which specific value is more likely.
When Might You Combine?: If an application only cares about whether theta < 0.25 or theta >= 0.25, it may make sense to aggregate. But for many cases, the distinction matters (like expected payoff calculations, risk assessments, or additional data collection to discriminate more precisely).
Pitfall: Aggregation can hide important differences between the two probabilities (e.g., for large-scale predictions, a 0.1 difference in probability might have major cost or risk implications).
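If only the threshold in the example above matters, the aggregation is just a sum of posterior mass on either side of 0.25; a minimal sketch:
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
unnorm = thetas**75 * (1 - thetas)**225 * 0.25
posterior = unnorm / unnorm.sum()

# Collapse the four hypotheses into two events around the 0.25 threshold
p_below = posterior[thetas < 0.25].sum()    # theta in {0.1, 0.2}
p_above = posterior[thetas >= 0.25].sum()   # theta in {0.3, 0.4}
print(f"P(theta < 0.25 | D) ≈ {p_below:.4f}, P(theta >= 0.25 | D) ≈ {p_above:.4f}")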
How would sequential updating be handled if we didn’t receive all 300 rolls at once?
In some practical scenarios, we might receive data in batches. Rather than a single dataset of 300 rolls, we might get partial updates over time.
Bayesian Updating: We can update the posterior after each batch of data. If at step i we have a posterior p_i(theta), and then we observe the next batch’s outcomes, we update p_i(theta) to p_{i+1}(theta) using the new likelihood.
Tracking Posterior Over Time: This would yield a trajectory of posterior distributions, typically converging toward a mode that best fits the observed frequencies as the data accumulates.
Benefit: If something changes or if we notice anomalies early, sequential updating allows us to detect them sooner rather than after all 300 rolls are completed.
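A minimal sketch of batch-wise updating, assuming the 300 rolls arrive in three hypothetical batches (the split below is invented); because the binomial likelihood factorizes over batches, the final posterior matches the all-at-once update:
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
posterior = np.full(4, 0.25)   # start from the uniform prior

# Hypothetical batches of (sixes, rolls) summing to 75 sixes in 300 rolls
batches = [(28, 100), (24, 100), (23, 100)]

for sixes, rolls in batches:
    like = thetas**sixes * (1 - thetas)**(rolls - sixes)
    posterior = posterior * like   # the previous posterior acts as the new prior
    posterior /= posterior.sum()
    print(posterior.round(4))      # trajectory of beliefs as data accumulates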
What if we prefer a decision-theoretic viewpoint rather than pure posterior inference?
In many real-world contexts, we’re not just inferring parameters but deciding on an action (e.g., whether to use this die in a game or not). In decision theory, we consider expected utility or cost of decisions based on the posterior distribution.
Decision Rule: We might choose an action that maximizes the expected payoff, integrating over the posterior of theta. For example, if a certain threshold of probability for rolling a 6 is beneficial or harmful, we use the posterior distribution to compute expected outcomes.
Bayes Risk: The formal approach calculates the risk function under each possible theta, weights it by p(theta | data), and picks the action minimizing expected loss.
Pitfall: Relying on a single “point estimate” (like the posterior mode at 0.3) might overlook significant probability mass around 0.2. A robust decision might factor in the full distribution, especially if the cost of a misjudgment is high.
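As an illustration with an entirely made-up payoff structure (the win/loss numbers below are hypothetical), a decision can be scored by averaging its payoff over the full posterior rather than plugging in the posterior mode:
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
unnorm = thetas**75 * (1 - thetas)**225 * 0.25
posterior = unnorm / unnorm.sum()

# Hypothetical game: using the die wins 10 per six and loses 3 per non-six
# over 100 rolls; declining the game pays 0.
def expected_payoff_if_used(theta, n=100, win=10.0, loss=-3.0):
    return n * (theta * win + (1 - theta) * loss)

eu_use = np.sum(posterior * expected_payoff_if_used(thetas))  # average over the posterior
eu_decline = 0.0
print(f"E[payoff | use die] ≈ {eu_use:.1f} vs decline = {eu_decline}")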
What happens if we have strong reasons to believe some probabilities are more likely a priori?
A uniform prior of 0.25 for each of the four values may not always be justified. Perhaps the person who fabricated the die is known to prefer a probability close to 0.3, so we might give that option a higher prior weight.
Impact on Posterior: A nonuniform prior will shift the posterior distribution, increasing the weight of whichever value had a higher prior. Observed data can still override the prior if strongly contradictory, but with moderate data, the prior can heavily influence the outcome.
Elicitation: In practice, you might interview the die’s creator or observe the manufacturing process to refine your belief about which value is most likely before seeing any rolls.
Pitfall: Overconfident or incorrect priors can bias the posterior. Sometimes a robust or weakly informative prior can mitigate this risk.
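A short sketch contrasting the uniform prior with a hypothetical informative prior that favors 0.3 (the weights below are invented for illustration):
import numpy as np

thetas = np.array([0.1, 0.2, 0.3, 0.4])
like = thetas**75 * (1 - thetas)**225

def posterior_with(prior):
    unnorm = like * np.asarray(prior)
    return unnorm / unnorm.sum()

uniform_prior = [0.25, 0.25, 0.25, 0.25]
informative_prior = [0.10, 0.20, 0.50, 0.20]   # hypothetical belief favoring 0.3

print("Uniform prior:    ", posterior_with(uniform_prior).round(4))
print("Informative prior:", posterior_with(informative_prior).round(4))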
Could we approximate the discrete model with a continuous Beta prior for ease of computation or modeling?
Sometimes a discrete prior is used for conceptual or theoretical reasons, but in practice, using a Beta prior with certain hyperparameters might be simpler or more standard.
Computational Approach: A Beta-Binomial model has a closed-form posterior Beta(alpha + x, beta + n - x), which is easy to implement.
Difference in Interpretation: A discrete prior is more interpretable if we’re certain theta must be one of a few specific values. A continuous prior is more flexible but loses the strict assumption that theta is restricted to {0.1, 0.2, 0.3, 0.4}.
Edge Case: If the problem truly constrains theta to a known finite set, using a Beta prior is no longer precisely correct. But if that constraint is only approximate, Beta-based modeling might be more natural and powerful.