ML Interview Q Series: Analyzing Product of Dice Rolls: PMF, Expected Value, and Standard Deviation.
Browse all the Probability Interview Questions here.
You roll a fair die twice. Let the random variable X be the product of the outcomes of the two rolls. What is the probability mass function of X? What are the expected value and the standard deviation of X?
Short Compact solution
The random variable X can take any product i*j where i and j range from 1 to 6. Its possible values are 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 30, and 36; the probability of each value equals the number of (i, j) pairs that yield that product, divided by the 36 equally likely outcomes. The probability mass function is therefore:
X=1 with probability 1/36
X=2 with probability 2/36
X=3 with probability 2/36
X=4 with probability 3/36
X=5 with probability 2/36
X=6 with probability 4/36
X=8 with probability 2/36
X=9 with probability 1/36
X=10 with probability 2/36
X=12 with probability 4/36
X=15 with probability 2/36
X=16 with probability 1/36
X=18 with probability 2/36
X=20 with probability 2/36
X=24 with probability 2/36
X=25 with probability 1/36
X=30 with probability 2/36
X=36 with probability 1/36
The expected value of X is 12.25, and the standard deviation of X is approximately 8.94.
Comprehensive Explanation
Defining the Random Variable
When rolling a fair six-sided die twice, there are 36 equally likely outcomes (i, j) with i ranging from 1 to 6 and j ranging from 1 to 6. The random variable X is defined as the product of the two outcomes, so X = i*j.
Probability Mass Function (PMF)
Each outcome (i, j) has probability 1/36. To find P(X = x), we count how many pairs (i, j) produce the product x and then divide by 36.
For example, to get X = 6, we look for all pairs (i, j) such that i*j = 6. These pairs are (1,6), (2,3), (3,2), (6,1). There are 4 such pairs, so P(X=6) = 4/36.
In general, for each distinct product value x, we count the number of (i, j) pairs with i*j = x. Because i and j each range from 1 through 6, all product values are in {1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 30, 36}. The result is the PMF listed in the short solution.
Expected Value of X
The expected value E(X) is the sum of x * P(X = x) over all possible x. Equivalently, we can average i*j over all 36 equally likely pairs (i, j):
E(X) = (1/36) * sum(i=1..6) sum(j=1..6) i*j.
Because the summand factorizes, the double sum equals (sum(i=1..6) i) * (sum(j=1..6) j). Since 1+2+3+4+5+6 = 21, the double sum is 21 * 21 = 441.
Dividing 441 by 36 gives 441/36 = 12.25.
Hence E(X) = 12.25.
Variance and Standard Deviation of X
To compute the variance, you need E(X^2). Once we have E(X^2), we use Var(X) = E(X^2) - [E(X)]^2 and then take the square root to get the standard deviation.
We can compute E(X^2) the same way:
E(X^2) = (1/36) * sum(i=1..6) sum(j=1..6) (i*j)^2 = (1/36) * (sum(i=1..6) i^2) * (sum(j=1..6) j^2).
Since 1+4+9+16+25+36 = 91, this gives E(X^2) = (91 * 91)/36 = 8281/36. Then
Var(X) = 8281/36 - (49/4)^2,
because E(X) = 49/4 = 12.25. Evaluating this difference yields 11515/144, approximately 79.97. The standard deviation is the square root of that, which is about 8.94.
Practical Python Snippet
import math

# Count how many of the 36 equally likely (i, j) pairs give each product,
# then normalize to obtain the PMF.
probabilities = {}
for i in range(1, 7):
    for j in range(1, 7):
        product = i * j
        probabilities[product] = probabilities.get(product, 0) + 1
for k in probabilities:
    probabilities[k] /= 36.0

# Expected value E(X)
E = sum(k * probabilities[k] for k in probabilities)
# Second moment E(X^2)
E2 = sum((k ** 2) * probabilities[k] for k in probabilities)
variance = E2 - E ** 2
std_dev = math.sqrt(variance)

print("PMF:", probabilities)
print("E(X):", E)
print("Std Dev of X:", std_dev)
This code calculates the same PMF, expected value, and standard deviation. It confirms E(X) ≈ 12.25 and standard deviation ≈ 8.94.
Potential Follow-up Questions
How would the solution change if the dice were not fair?
If the dice were biased, each (i, j) pair would no longer have probability 1/36. Instead, the probability of rolling i on the first die might be p(i), and the probability of rolling j on the second die might be q(j). Then the probability of the pair (i, j) would be p(i)q(j) (assuming independence). The formula for E(X) would become the sum over all i, j of (ij)*p(i)*q(j), and similarly for E(X^2). Because each outcome is not equally likely, the PMF would be different.
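As a minimal sketch of that computation, assuming two hypothetical bias vectors p and q (any face probabilities summing to 1 would do):

# Hypothetical face probabilities for two biased dice (assumed for illustration).
p = [0.10, 0.10, 0.15, 0.15, 0.20, 0.30]  # P(first die shows i)
q = [0.25, 0.20, 0.15, 0.15, 0.15, 0.10]  # P(second die shows j)

pmf = {}
for i in range(1, 7):
    for j in range(1, 7):
        # Independence: P(i, j) = p(i) * q(j)
        pmf[i * j] = pmf.get(i * j, 0.0) + p[i - 1] * q[j - 1]

E = sum(x * prob for x, prob in pmf.items())
E2 = sum(x ** 2 * prob for x, prob in pmf.items())
print("E(X):", E, "Var(X):", E2 - E ** 2)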
What if the two rolls were dependent?
If the two dice outcomes somehow influenced each other (which is rare in a physical scenario, but might happen in a theoretical or contrived example), the joint probability P(i, j) = P(i)*P(j|i). We would need the conditional probabilities P(j|i) to compute P(X = x). The summations for E(X) and E(X^2) would involve those joint probabilities. The concept is the same, but the distribution would need to be carefully recalculated from the conditional or joint distribution.
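A minimal sketch under an assumed (made-up) conditional table P(j | i); the same moment formulas apply once the joint probabilities are known:

# Hypothetical dependence: after rolling i, the second die is slightly more
# likely to repeat the same face (values assumed purely for illustration).
def p_second_given_first(j, i):
    return 0.30 if j == i else 0.14  # 0.30 + 5 * 0.14 = 1.0 for each i

pmf = {}
for i in range(1, 7):
    p_i = 1.0 / 6.0                                  # first die still fair
    for j in range(1, 7):
        joint = p_i * p_second_given_first(j, i)     # P(i, j) = P(i) * P(j | i)
        pmf[i * j] = pmf.get(i * j, 0.0) + joint

E = sum(x * prob for x, prob in pmf.items())
print("E(X) under this dependence:", E)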
How do you generalize this to multiple dice?
If you extend X to be the product of N fair dice, then X = X1 * X2 * ... * XN, where each Xi is an independent roll of a fair die. There are 6^N equally likely outcomes for (X1, X2, ..., XN). One can still compute the PMF by counting how many outcomes produce each possible product. The expected value is the product of individual expected values (if you treat E(Xi)=3.5, then E(X)=3.5^N), but the PMF becomes more involved because the set of distinct products grows rapidly as N increases.
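A quick sanity check of the E(X) = 3.5^N claim by brute-force enumeration (here for N = 3, which is still only 216 outcomes):

from itertools import product as cartesian

N = 3
total, count = 0, 0
for faces in cartesian(range(1, 7), repeat=N):
    prod = 1
    for f in faces:
        prod *= f
    total += prod
    count += 1

print("E(X) by enumeration:", total / count)   # 42.875
print("3.5 ** N           :", 3.5 ** N)        # 42.875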
Are there any real-world implications for random products?
In some probability modeling tasks, the product of random variables appears in scenarios like random growth processes or random interest rates compounding. The distribution can become heavily skewed, and large values may appear with small probability. Understanding the PMF, mean, variance, and potential tail behavior can be crucial. In the case of dice, it’s mostly a discrete, bounded example, so it’s simpler than many real-world analogs, but it’s still a helpful illustration of how random products can behave.
Could we use moment-generating functions?
Yes, with a caveat. The PGF of a single fair die roll is the average of z^i for i = 1..6, i.e., (z + z^2 + z^3 + z^4 + z^5 + z^6)/6, and PGFs multiply cleanly for sums of independent rolls. For a product they do not combine so simply; the easier route is independence itself, which gives E(X^n) = E(i^n) * E(j^n) (for products of positive random variables, the Mellin transform plays the role that the MGF plays for sums). But for a small case like 2 dice, direct enumeration is simpler.
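A short check of the moment factorization mentioned above (E(X^n) = E(i^n) * E(j^n), not a PGF manipulation):

faces = range(1, 7)

def moment(n):
    # n-th moment of a single fair die roll
    return sum(f ** n for f in faces) / 6.0

E_X = moment(1) ** 2            # 3.5^2 = 12.25
E_X2 = moment(2) ** 2           # (91/6)^2 = 8281/36
print("E(X)  :", E_X)
print("E(X^2):", E_X2, "Var(X):", E_X2 - E_X ** 2)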
These considerations help demonstrate deeper understanding and ensure readiness for challenging interviews that probe both conceptual understanding and the ability to generalize.
Below are additional follow-up questions
Could we analyze the distribution of X purely by combinatorial arguments instead of enumerating all pairs?
Yes. One can avoid explicit enumeration by counting the number of ways a product can be formed from the factors of 1 to 6. Specifically, for each possible product x in {1, 2, 3, …, 36}, you look at all factor pairs (i, j) where i and j each belong to {1, 2, 3, 4, 5, 6} and see how many distinct ways i and j can be chosen (taking order into account if necessary). The idea is to apply combinatorial factorization logic:
For a candidate product x, list all divisors d of x such that d is in {1..6}.
For each valid divisor d, check whether x/d is also in {1..6}.
Each valid divisor d corresponds to exactly one ordered pair (d, x/d) with product x, so iterating d over all divisors of x in {1..6} with x/d also in {1..6} counts every ordered outcome exactly once; the pairs (d, x/d) and (x/d, d) are counted separately when d != x/d. Dividing that count by 36 gives P(X = x). This combinatorial approach is a structured counting method rather than brute-force enumeration.
A potential pitfall here is to forget that (d, x/d) and (x/d, d) are distinct outcomes if d ≠ x/d. Another oversight might be to include divisors outside {1..6}. By carefully applying the constraints (1 ≤ d ≤ 6, 1 ≤ x/d ≤ 6), the correct probability mass function emerges without explicitly listing all 36 pairs.
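A small sketch of that divisor-counting logic; it reproduces the same PMF as the brute-force count:

def pmf_by_divisors():
    pmf = {}
    for x in range(1, 37):
        # Count ordered pairs (d, x // d) with both factors in {1..6}.
        ways = sum(1 for d in range(1, 7) if x % d == 0 and 1 <= x // d <= 6)
        if ways:
            pmf[x] = ways / 36.0
    return pmf

print(pmf_by_divisors())  # matches the enumeration-based PMF above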
What is the skewness and kurtosis of the random variable X, and why might they matter?
Skewness is a measure of the asymmetry of a distribution. Kurtosis measures the "tailedness," i.e., how heavy the tails are relative to a normal distribution. For the product of two fair dice, X is discrete and ranges from 1 to 36, with moderate values such as 6 and 12 occurring more often than the extremes (1 or 36). Qualitatively, one expects a positive skew: most of the probability mass sits at small-to-moderate products, while a thin right tail stretches out toward large products like 36, which can only be obtained from (6,6).
In practical data science, knowing skewness and kurtosis helps in deciding whether standard modeling assumptions like normality are appropriate. If X had a large positive skew, methods that assume a symmetric distribution around the mean might give misleading results. Also, heavy tails (high kurtosis) increase the likelihood of extreme outcomes, which can be crucial in risk-sensitive applications.
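If a number is wanted rather than a qualitative statement, both quantities can be read straight off the PMF; a minimal sketch using the standard definitions skewness = E[(X - mu)^3]/sigma^3 and kurtosis = E[(X - mu)^4]/sigma^4:

# Rebuild the PMF (product -> probability) so the snippet is self-contained.
pmf = {}
for i in range(1, 7):
    for j in range(1, 7):
        pmf[i * j] = pmf.get(i * j, 0.0) + 1.0 / 36.0

mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = var ** 0.5
skew = sum((x - mu) ** 3 * p for x, p in pmf.items()) / sigma ** 3
kurt = sum((x - mu) ** 4 * p for x, p in pmf.items()) / sigma ** 4
print("skewness:", skew, "kurtosis:", kurt)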
Could the PMF be symmetric around its mean if we changed something about the problem?
Typically, the product distribution of two independent discrete uniform variables will not be symmetric because the product grows in a non-linear way. Some modifications that might bring a measure of symmetry could include transformations (e.g., taking a log). Even then, discrete distributions of such products are rarely symmetric about any point. Another theoretical angle would be to consider a scenario where the dice faces are custom-labeled in a specific, contrived manner to balance out products in a symmetrical fashion, but this would deviate from the usual 1 to 6 labeling.
Pitfalls often arise if an analyst assumes or tries to treat the distribution as if it were symmetric. This can lead to underestimating tail probabilities or incorrectly using symmetric confidence intervals.
What if we apply the transformation Y = ln(X)? How might that affect our analysis?
When you transform X to Y = ln(X), you convert multiplicative relationships into additive ones. Specifically, ln(i*j) = ln(i) + ln(j). Because each i and j is from {1..6}, ln(X) will take values {ln(1), ln(2), ln(3), ln(4), ln(5), ln(6), ln(8), …, ln(36)}, with corresponding probabilities. This can sometimes make theoretical derivations or approximations easier (because sums of random variables often have more tractable distributions, or can be approximated by the Central Limit Theorem under certain conditions).
However, since i and j are each discrete, Y still remains a discrete set of possible values (although not integer values). You would typically lose the simplicity of integer arithmetic. Another pitfall: the distribution of ln(X) might look “more normal” in some sense, but it’s still quite discrete and finite in domain. So while it can help in certain theoretical contexts (e.g., analyzing multiplicative processes), it might not produce the same continuous, well-behaved distribution that arises when dealing with large-sample log transformations in other contexts.
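A small sketch of the transformed variable Y = ln(X), built directly from the PMF of X; it also checks the additive identity E[ln X] = 2 * E[ln(die)]:

import math

pmf_x = {}
for i in range(1, 7):
    for j in range(1, 7):
        pmf_x[i * j] = pmf_x.get(i * j, 0.0) + 1.0 / 36.0

# Y = ln(X): same probabilities, log-transformed support.
pmf_y = {math.log(x): p for x, p in pmf_x.items()}
E_Y = sum(y * p for y, p in pmf_y.items())
print("E[ln X]       :", E_Y)
print("2 * E[ln(die)]:", 2 * sum(math.log(f) for f in range(1, 7)) / 6.0)  # same value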
Could we approximate X by a simpler well-known distribution for large face counts?
Imagine the die generalized to a fair n-sided die labeled 1 through n, and let X be the product of two independent rolls of that die. As n grows large, ln(i) is a discrete uniform variable over the points {ln(1), ln(2), …, ln(n)}; that set becomes dense, but its points are not equally spaced, so it is only loosely analogous to a continuous uniform on [0, ln(n)]. Summing two such variables gives a distribution reminiscent of a triangular shape in the log domain, though it is still discrete.
When you exponentiate back to get the distribution of X, it becomes skewed, and might not conform well to standard “named” distributions (like normal or Poisson). A lognormal shape might be a rough guess, but it’s typically not exact. The major pitfall is incorrectly assuming normality or another common distribution because that can produce large errors in tail probability estimates. Approximations must be tested carefully with either simulations or bounding arguments.
How does the distribution of X differ from the sum of the dice outcomes, and why could that be important?
When rolling two dice, the sum S = i + j ranges from 2 to 12 and has a classic triangular-shaped distribution, with 7 being the most likely sum. By contrast, the product X = i*j ranges from 1 to 36 over 18 distinct values, with most of its mass on small-to-moderate products. The sum distribution is symmetric about 7, while the product distribution is not symmetric about its mean (12.25). Furthermore, the product distribution can generate extreme values (like 36) with only one pair, while sums generate extremes (2 or 12) with one pair each.
In many game designs or probabilistic analyses, the difference between using sums and products can drastically affect the probability of large or small outcomes. A key pitfall is conflating the behavior of sums with the behavior of products. For example, “7” is a central and quite probable sum, but “12.25” as the mean product is not matched with a single outcome. It’s just an average. This difference in distribution shape influences how risk or payout might be calculated if the underlying random variable was a product rather than a sum.
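A side-by-side computation of the two distributions makes the contrast concrete:

sum_pmf, prod_pmf = {}, {}
for i in range(1, 7):
    for j in range(1, 7):
        sum_pmf[i + j] = sum_pmf.get(i + j, 0.0) + 1.0 / 36.0
        prod_pmf[i * j] = prod_pmf.get(i * j, 0.0) + 1.0 / 36.0

print("most likely sum     :", max(sum_pmf, key=sum_pmf.get))  # 7
print("most likely products:",
      [x for x, p in prod_pmf.items() if p == max(prod_pmf.values())])  # 6 and 12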
How do we handle computational challenges if we wanted to extend this to, say, the product of multiple dice?
Extending to three or more dice means enumerating 6^n outcomes for n dice, which quickly becomes computationally large (6^3=216, 6^4=1296, etc.). A direct count of factor combinations can be done, but the number of distinct product values grows, and enumerating them explicitly may become expensive for large n. One approach is to keep track of the distribution in a dictionary or array that accumulates frequencies for each product. Each time you add an extra die, you convolve the existing distribution with the factor possibilities from the new die.
The pitfall is that the range of possible products grows exponentially in n, so memory or computation can become infeasible for large n. Strategies to mitigate this include using more sophisticated data structures (like sparse dictionaries) or random sampling to approximate the distribution.
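A minimal sketch of the dictionary-based accumulation described above; each added die multiplies every existing product by each face value 1..6:

def product_pmf(n_dice):
    # Start with the "empty product": value 1 with probability 1.
    dist = {1: 1.0}
    for _ in range(n_dice):
        new_dist = {}
        for value, prob in dist.items():
            for face in range(1, 7):
                new_dist[value * face] = new_dist.get(value * face, 0.0) + prob / 6.0
        dist = new_dist
    return dist

pmf3 = product_pmf(3)
print("distinct products for 3 dice:", len(pmf3))
print("E(X) for 3 dice:", sum(x * p for x, p in pmf3.items()))  # 3.5 ** 3 = 42.875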
If we considered the median or mode of X, how would we compute them, and why might they be different from the mean?
The mode of a discrete random variable is the most frequently occurring value. For two dice, you can look at which product has the highest count of factor pairs. By enumerating the factor counts, you find that 6 and 12 both appear 4 times each (i.e., they each have 4 favorable (i, j) pairs). Thus 6 and 12 share the status of mode with probability 4/36 = 1/9 each.
The median is the value m such that at least 50% of the probability lies at or below m and at least 50% lies at or above m. You sum the probabilities in ascending order of x until the running total reaches 0.5. For two dice, P(X ≤ 9) = 17/36 and P(X ≤ 10) = 19/36, so the cumulative probability first crosses 1/2 at X = 10; the median is 10, noticeably below the mean of 12.25.
The pitfall: sometimes people equate mean, median, and mode in distributions that are not symmetric. They can be very different, especially in skewed distributions like the product of dice. Knowing each measure can clarify “typical” outcomes versus average outcomes and can help in designing or interpreting fair payouts, risk measures, or threshold-based decisions.
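Both summaries drop out of the PMF in a few lines:

pmf = {}
for i in range(1, 7):
    for j in range(1, 7):
        pmf[i * j] = pmf.get(i * j, 0.0) + 1.0 / 36.0

# Mode(s): value(s) with the highest probability.
top = max(pmf.values())
modes = sorted(x for x, p in pmf.items() if p == top)

# Median: smallest x whose cumulative probability reaches 1/2.
cumulative = 0.0
for x in sorted(pmf):
    cumulative += pmf[x]
    if cumulative >= 0.5:
        median = x
        break

print("modes:", modes, "median:", median)  # modes [6, 12], median 10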
Could partial information about (i, j) change the distribution of X? For example, if you knew one of the dice results?
Yes, if you knew that the first die (i) is, say, 4, then X = 4*j for j in {1..6}, each of which is now equally likely with probability 1/6. The distribution of X then collapses to the set {4, 8, 12, 16, 20, 24} with probability 1/6 each. Conditioned on i=4, E(X) is 4 * E(j) = 4 * 3.5 = 14. Meanwhile, Var(X) would also be computed under that conditioning. This is a typical conditional probability scenario.
A subtle pitfall is failing to recalculate the distribution of X given partial information and continuing to use the unconditional distribution (like 1/36) for each pair (i, j). Conditioned on knowledge of i or j, the PMF changes significantly, affecting all subsequent calculations like expected payout or confidence intervals.
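A tiny sketch of the conditional calculation for the example above (first die known to be 4):

# Conditional on i = 4: X = 4 * j with j uniform on {1..6}.
cond_pmf = {4 * j: 1.0 / 6.0 for j in range(1, 7)}
E_cond = sum(x * p for x, p in cond_pmf.items())        # 14.0
E2_cond = sum(x ** 2 * p for x, p in cond_pmf.items())
print("E(X | i=4):", E_cond, "Var(X | i=4):", E2_cond - E_cond ** 2)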
How might we simulate experiments if we had custom dice with unusual faces (e.g., prime-number faces only)?
Custom dice might have faces like {2, 3, 5, 7, 11, 13}. You can still define X as the product of the outcomes, but the set of possible products changes. If you want an exact PMF, you must enumerate all pairs from the set of custom faces. If each face is still equally likely, the probability of each (i, j) pair remains 1/36 for six-faced dice, and the distribution is found by counting how many pairs produce each product of those prime faces.
The pitfall is to assume you can still rely on the same 1..6 distribution or that the mean remains 12.25. Once the faces deviate from standard {1..6}, both E(X) and the distribution shape will change drastically. Many real-world dice or random number generators aren’t standard and may require careful re-derivation of the PMF.
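The earlier enumeration carries over with only the face set swapped out; a minimal sketch for the prime-faced example (faces assumed equally likely):

faces = [2, 3, 5, 7, 11, 13]  # hypothetical prime-faced die

pmf = {}
for a in faces:
    for b in faces:
        pmf[a * b] = pmf.get(a * b, 0.0) + 1.0 / 36.0

E = sum(x * p for x, p in pmf.items())
print("E(X) for prime faces:", E)  # equals (sum(faces)/6) ** 2, no longer 12.25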