ML Interview Q Series: Markov Chain Model for Expected Steps to a Uniform Roll-and-Relabel Die.
Question: Suppose you roll a fair 6-sided die six times and note each result. You then label a new die’s faces with exactly those six outcomes (including any repeats). Afterward, you roll this new die six more times, record those results, use them to label yet another die, and repeat this cycle. Eventually, you’ll end up with a die whose faces are all the same number. How would you find the expected number of these “roll-and-relabel” steps required to reach that uniform-face die?
Comprehensive Explanation
The situation describes a sequential process of rolling a fair die 6 times, using the observed outcomes to label another die, and repeating. Over successive iterations, the new die’s faces become progressively more homogeneous until eventually all faces match. The question focuses on computing the average number of these roll-and-relabel iterations needed to end up with a die containing identical faces.
Understanding the Process
Initial step: Begin with a standard fair 6-sided die and roll it 6 times. Each roll is equally likely to be 1, 2, 3, 4, 5, or 6. Suppose the recorded outcomes are (3, 5, 3, 6, 1, 2), just as an example. These six values label the faces of a new unlabeled die. That means this new die could have faces like 3, 3, 5, 6, 1, 2.
Subsequent steps: Now roll the newly created die 6 times. Because its faces may contain repeats, the probability of each outcome is no longer necessarily 1/6. For instance, if a new die had three faces labeled “3” and one face each labeled “5,” “6,” and “1,” the probability of rolling a “3” would be 3/6 = 1/2, while each of the other numbers would have probability 1/6.
Labeling yet another die: Take the 6 rolls from the new die and use those values to label a fresh unlabeled die. The process continues, generating a sequence of dice, each with faces that reflect the outcomes of rolling the previous die. Eventually, as these outcomes may cluster more and more around certain values, you will get a die whose 6 faces are all the same. That is your absorbing or terminal event.
Goal: Find the average (expected) number of iterations it takes from the start (a fair die) to the end (a completely uniform die).
Markov Chain Perspective
A rigorous way to handle “expected time to reach a uniform-face die” is via Markov chains. Each state in the chain can be thought of as the distribution of faces on the current die. Specifically, a die can be described by the counts of each possible face value. For example, a state might be “face 1 occurs 3 times, face 3 occurs 2 times, face 5 occurs 1 time.”
Because the next die is labeled from 6 independent rolls of the current die, the transition probabilities depend on sampling from the current die’s probabilities.
One state, for instance, might correspond to (1,1,1,1,1,1), meaning the die is fair with each face having 1 occurrence. Another state might be (2,2,1,1,0,0), meaning two faces appear twice and two faces appear once, while two faces do not appear at all, and so on.
The “absorbing” states of interest are (6,0,0,0,0,0), (0,6,0,0,0,0), etc., representing a die with all faces the same.
The classical formula for the expected time to absorption in a finite Markov chain involves building the transition matrix, identifying absorbing states, and then using the concept of the fundamental matrix. Let’s label the set of transient states as everything that is not uniform, and the set of absorbing states as the ones with a single face repeated 6 times.
If we arrange the transition probabilities so that:
Q is the submatrix of transition probabilities among the transient states.
I is the identity matrix of appropriate dimension.
Then the fundamental matrix N is given by
N = (I - Q)^{-1},
where each entry N(i,j) represents the expected number of times the process is in transient state j, given that it started in transient state i. The expected number of steps before absorption, starting from transient state i, is the row sum of N for that state.
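To make this concrete, below is a minimal sketch of that exact computation, assuming we lump permutation-equivalent dice into a single state (helper names such as canon and transition are illustrative, not standard). It enumerates the count vectors, builds the lumped transition probabilities from the multinomial formula discussed later in this article, and solves (I - Q)t = 1 with exact rational arithmetic:

from fractions import Fraction
from itertools import product
from math import factorial

# All count vectors (k1,...,k6) with k1+...+k6 = 6: one per way the 6 rolls
# can distribute over the 6 face values.
vectors = [k for k in product(range(7), repeat=6) if sum(k) == 6]

def canon(counts):
    """Lump permutation-equivalent dice into one state: counts sorted descending."""
    return tuple(sorted(counts, reverse=True))

states = sorted({canon(v) for v in vectors}, reverse=True)
absorbing = (6, 0, 0, 0, 0, 0)
transient = [s for s in states if s != absorbing]   # 10 lumped transient states

def transition(c, k):
    """Multinomial probability of rolling counts k from a die with face counts c."""
    p = Fraction(factorial(6))
    for ci, ki in zip(c, k):
        p = p * Fraction(ci, 6) ** ki / factorial(ki)   # (ci/6)^ki, with 0^0 = 1
    return p

# Lumped transition probabilities among transient states (the rows of Q).
Q = {s: {} for s in transient}
for s in transient:
    for k in vectors:
        t = canon(k)
        if t != absorbing:
            Q[s][t] = Q[s].get(t, Fraction(0)) + transition(s, k)

# Solve (I - Q) t = 1 by Gaussian elimination over the rationals; t[i] is the
# expected number of steps to absorption starting from transient state i.
n = len(transient)
A = [[Fraction(int(i == j)) - Q[transient[i]].get(transient[j], Fraction(0))
      for j in range(n)] + [Fraction(1)] for i in range(n)]
for col in range(n):
    piv = next(r for r in range(col, n) if A[r][col] != 0)
    A[col], A[piv] = A[piv], A[col]
    A[col] = [x / A[col][col] for x in A[col]]
    for r in range(n):
        if r != col and A[r][col] != 0:
            A[r] = [x - A[r][col] * y for x, y in zip(A[r], A[col])]

start = transient.index((1, 1, 1, 1, 1, 1))   # begin from the fair die
print("Exact expected steps:", A[start][n], "=", float(A[start][n]))

A simulation, as described next, should reproduce this exact value to within sampling error.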
Practical Approach: Simulation and Approximation
While the Markov chain technique is mathematically exact, enumerating every possible distribution state can be cumbersome because there are many ways to partition 6 outcomes among the numbers 1 through 6. A simpler approach in many practical or interview settings is a Monte Carlo simulation:
Simulation Algorithm
1. Represent a die configuration as a list of its 6 faces (e.g., [3,3,5,6,1,2]).
2. Roll this configuration 6 times by sampling from the list with equal probability for each face. Collect these 6 new outcomes into another list.
3. Replace the old die configuration with the newly obtained list of outcomes.
4. Check whether the new list has 6 identical faces. If yes, record the iteration count; if not, repeat from step 2.
5. Repeat the entire experiment many times (e.g., 100,000 simulations) to approximate the average number of iterations to reach a uniform-face die.
Why this works: By the law of large numbers, the sample average over enough simulated runs converges to the true expected number of steps. While it might not give an exact symbolic result, it gives a strong numerical estimate that can be made as accurate as necessary by increasing the number of simulation runs.
Below is a potential Python snippet showing how a simulation might be structured:
import random
import statistics

def simulate_one_sequence():
    """Simulate one entire run from a fair 6-sided die to a uniform-face die."""
    # Start with a fair die: faces = [1,2,3,4,5,6]
    current_die = [1, 2, 3, 4, 5, 6]
    steps = 0
    while True:
        steps += 1
        # Roll the current die 6 times
        new_faces = [random.choice(current_die) for _ in range(6)]
        # Check if we have uniform faces
        if len(set(new_faces)) == 1:
            return steps
        # Otherwise, set the current die to the newly generated one
        current_die = new_faces

def estimate_expected_steps(num_trials=100000):
    """Estimate the expected number of steps over many trials."""
    results = [simulate_one_sequence() for _ in range(num_trials)]
    return statistics.mean(results)

if __name__ == "__main__":
    print("Estimated Expected Steps:", estimate_expected_steps(100000))
This simulation would return an empirical approximation of how many iterations (roll-and-relabel cycles) one typically needs.
Explanation of Key Mathematical Elements
State representation: In the Markov chain method, each transient state can be denoted by a 6-tuple (c1, c2, c3, c4, c5, c6) such that c1 + c2 + c3 + c4 + c5 + c6 = 6, where each ci is the count of how many times face i appears. Such a state also implies the probabilities when rolling that die next time: face i appears with probability ci/6.
Transitions: From a given configuration with probabilities p1, p2, …, p6 (each pi = ci/6), rolling that die 6 times yields counts that follow a multinomial distribution. Specifically, the probability of obtaining a new configuration (k1, k2, k3, k4, k5, k6) is:
p = (6! / (k1! k2! k3! k4! k5! k6!)) * p1^k1 * p2^k2 * p3^k3 * p4^k4 * p5^k5 * p6^k6
where k1 + k2 + k3 + k4 + k5 + k6 = 6.
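As a quick worked instance of this formula (the helper multinomial_prob is ad hoc, written just for this check): starting from a fair die, the probability that one round of 6 rolls lands in any of the six absorbing states is 6 * (1/6)^6 = 1/7776.

from fractions import Fraction
from math import factorial

def multinomial_prob(counts, probs):
    """6!/(k1!...k6!) * p1^k1 * ... * p6^k6 for one target count vector."""
    p = Fraction(factorial(sum(counts)))
    for k, pr in zip(counts, probs):
        p = p * pr ** k / factorial(k)
    return p

fair = [Fraction(1, 6)] * 6
# (6,0,0,0,0,0) and its 5 permutations are the absorbing targets.
print(6 * multinomial_prob((6, 0, 0, 0, 0, 0), fair))   # 1/7776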
Absorbing states: Once you get a configuration such that all faces are the same number (i.e., (6,0,0,0,0,0) or permutations), you remain in that state forever, because rolling a die whose faces are all identical always yields the same result. This completes the absorption event.
Possible Follow-up Questions
How does one handle large state spaces efficiently in practice?
Large state spaces can be made tractable via several routes (a quick illustration of the first follows this list):
Symmetry reduction: Group states that share the same counts of identical faces. For instance, all permutations of (3,2,1,0,0,0) are equivalent.
Simulation: As shown, approximate the expected time via Monte Carlo.
Sparse matrix techniques: If transitions are sparse, store and manipulate Q in a sparse format.
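For instance, the symmetry reduction shrinks the raw state space considerably; a quick check (variable names are arbitrary):

from itertools import product

raw = [k for k in product(range(7), repeat=6) if sum(k) == 6]
lumped = {tuple(sorted(k, reverse=True)) for k in raw}
print(len(raw), "count vectors ->", len(lumped), "lumped states")   # 462 -> 11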
Could the chain get “stuck” in a non-uniform state?
No. From any non-uniform configuration there is a nonzero probability of eventually transitioning to a uniform die. In particular, any face present on the current die has probability at least 1/6 per roll, so the chance that it comes up in all 6 rolls, instantly producing a uniform die, is at least (1/6)^6 on every step. That probability is small, but it is bounded away from zero, ensuring that the chain almost surely reaches an absorbing state eventually.
Would the expected number of steps change if we used a different number of rolls per iteration?
Yes, it heavily depends on the number of samples (here, 6) you use to generate the new die. If you used a different number (like rolling 12 times to form a new 12-faced die, in a hypothetical scenario), the probabilities and the overall chain characteristics would differ. More samples might reduce randomness in each step, potentially changing the average time to absorption.
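A hedged sketch of this generalization, treating the number of faces and rolls n as a single free parameter (the function name and trial count are arbitrary choices, not part of the original question):

import random

def mean_steps(n, trials=20000):
    """Estimate expected roll-and-relabel cycles for an n-faced die relabeled from n rolls."""
    total = 0
    for _ in range(trials):
        die = list(range(n))                 # start with n distinct faces
        steps = 0
        while len(set(die)) > 1:
            steps += 1
            die = [random.choice(die) for _ in range(n)]
        total += steps
    return total / trials

for n in (2, 3, 6):
    print(n, mean_steps(n))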
How does the multinomial distribution help in computing exact probabilities?
When you roll a die (with probability distribution p1, p2, …, p6) 6 times, the probability that exactly k1 times face 1 appears, k2 times face 2 appears, etc., is captured by the multinomial expression:
p = (6! / (k1! k2! … k6!)) * p1^k1 * p2^k2 * … * p6^k6
This formula determines the transition probability from one distribution state to another. Summing such probabilities over all possible new distributions yields the corresponding row in the Q matrix.
How do you interpret the fundamental matrix (I - Q)^{-1}?
Q is the submatrix of transition probabilities among transient states.
I is the identity matrix of the same dimension.
(I - Q)^{-1} expands as the series I + Q + Q^2 + …; its (i, j) entry is the expected number of visits to transient state j before absorption, starting from transient state i.
By summing the appropriate row of this inverse matrix, you obtain the expected number of steps before finally entering an absorbing state (the uniform-face configuration) for the first time.
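A small numeric sanity check of that series view, using NumPy (an added dependency) and an arbitrary made-up substochastic Q for a toy two-state transient block:

import numpy as np

Q = np.array([[0.5, 0.2],
              [0.1, 0.6]])                    # toy transient block (rows sum < 1)
N = np.linalg.inv(np.eye(2) - Q)              # fundamental matrix
series = sum(np.linalg.matrix_power(Q, k) for k in range(200))
print(np.allclose(N, series))                 # True: N = I + Q + Q^2 + ...
print(N.sum(axis=1))                          # expected steps to absorption per state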
The final takeaway is that while a direct combinatorial/Markov chain solution is precise, a straightforward simulation can also provide an accurate estimate of the average number of steps until the die is uniform on all faces.
Below are additional follow-up questions
How would the analysis change if the first die was not fair initially?
If the original die you start with is biased (i.e., the faces do not each have probability 1/6), the probability distribution of the outcomes in the very first set of 6 rolls changes. This changes the distribution of possible states that the second die can occupy. Once you move beyond the first die, though, the subsequent dice are always formed by rolling whatever the current distribution of faces is, so the process evolves similarly in principle, just starting from a different “initial state.”
In Markov chain terms, rather than beginning in the fair-die state (1,1,1,1,1,1), you begin in some other distribution that reflects the observed counts from rolling a biased die 6 times. The rest of the chain logic is unaffected: the transition probabilities among states remain governed by multinomial draws from each new die’s faces. However, because the initial state might already favor certain outcomes, it may either speed up or slow down the path to uniform faces.
A tricky pitfall is incorrectly assuming you can just adjust the final probability or treat it like a fair scenario. In reality, you need to either track or simulate that biased start. In a Monte Carlo simulation, you could handle this by simulating the initial 6 rolls with the real bias, then proceeding normally. In a Markov chain formulation, you modify the initial distribution vector to reflect the biased draw.
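In code, the only change is how the first labeled die is produced; one possible sketch (the weights below are invented for illustration):

import random

def simulate_biased_start(weights, faces=(1, 2, 3, 4, 5, 6)):
    """First labeling comes from a biased die; the rest of the process is unchanged."""
    current_die = random.choices(faces, weights=weights, k=6)   # biased first step
    steps = 1
    while len(set(current_die)) > 1:
        current_die = [random.choice(current_die) for _ in range(6)]
        steps += 1
    return steps

# Example: a die that favors "6" three-to-one over each other face.
trials = [simulate_biased_start([1, 1, 1, 1, 1, 3]) for _ in range(10000)]
print(sum(trials) / len(trials))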
What if we only relabel some subset of the faces of the new die at each step?
A variation could be that instead of fully replacing the die’s faces with the 6 new outcomes, you replace only a portion (say half) of the faces each time. The dynamic then becomes more complex because each iteration is partially preserving older faces. This can slow convergence significantly because you might carry over some faces that don’t appear in the 6 new rolls, preventing the die from homogenizing quickly.
In Markov chain terms, this partial relabeling means the transition from one state to the next is no longer purely determined by 6 samples from the old distribution; some fraction of faces remain “locked in” from the previous iteration. The transition probabilities have to reflect the combination of old faces that remain plus the new faces that get introduced.
One subtle issue arises if you misunderstand the distribution each time: the newly labeled portion of the die may be strongly biased in favor of the repeated face, but the leftover faces might remain more varied. This typically prolongs the time to uniformity because you need enough cycles for all older faces to eventually get overwritten by the same repeated face.
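A sketch of this variant (the replace parameter and the rule for which faces survive are modeling assumptions; here the surviving faces are chosen uniformly at random from the old die):

import random

def simulate_partial_relabel(replace=3, trials=10000):
    """Each cycle, only `replace` of the 6 faces are overwritten by fresh rolls."""
    total = 0
    for _ in range(trials):
        die = [1, 2, 3, 4, 5, 6]
        steps = 0
        while len(set(die)) > 1:
            steps += 1
            new_faces = [random.choice(die) for _ in range(replace)]
            survivors = random.sample(die, 6 - replace)   # faces kept from the old die
            die = survivors + new_faces
        total += steps
    return total / trials

print(simulate_partial_relabel())   # compare with the full-relabel estimate above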
How is the average time to uniformity affected if we label the new die with more than 6 rolls?
If at each iteration you roll the current die N times (with N > 6) and then label a new N-faced die, the state space and probabilities change because you no longer have a 6-outcome dataset. For the original question’s 6-faced scenario, rolling 6 times is symmetrical and matches the number of faces. If you increase that sample size, each iteration produces a more “statistically stable” estimate of the underlying distribution, which may make convergence to a single repeated face slower or faster depending on how the distribution evolves.
For instance, with more rolls (like 12 or 18) to decide the next labeling, the chance of getting exactly the same face all 6 times (for a 6-faced die) is lower, relatively speaking, than if you only took 6 rolls. On the other hand, if you were constructing a new die with 12 faces from 12 rolls, the dimension is different, so the random process has a new set of absorbing states (all 12 faces the same) and a different transition structure.
One pitfall is believing that more samples always speed up reaching a uniform distribution. It can actually reduce variance in each iteration, which might make transitions to heavily repeated faces less likely at each step. A thorough Markov chain or simulation analysis is needed to see which effect dominates.
Could random chance result in a quick convergence to all faces matching very early?
Yes. There is always a nonzero probability that, purely by luck, you roll the same face 6 times early on, instantly creating a uniform die; from a fair die this happens with probability 6 * (1/6)^6 = 1/7776 on the very first step. Once your new die has 6 identical faces, you are absorbed in that state.
A common misconception is to assume that convergence will always take many steps because it seems improbable for all faces to match. But across many trials, some runs will converge in just one or two steps. This lowers the average relative to a scenario where every sequence took a “typical” number of steps. Hence, the distribution of the time to convergence can be quite wide, featuring a heavy tail for sequences that take a long time and a small cluster of short-run convergences due to happenstance.
Does the order in which you write down the 6 outcomes when labeling matter?
No. The labeling step doesn’t care about the order in which the 6 results are observed. What matters is the multiset of outcomes—how many times each face appears among the 6 rolls. Whether you label the new die as (3,3,6,1,5,2) or (3,6,3,2,5,1) is irrelevant; it’s the same distribution of faces.
The subtlety is that if someone tries to track states in a naive way, they might incorrectly count distinct permutations as different states. In Markov chain modeling, you want to group all permutations of the same face counts into a single state to simplify the analysis. This kind of “lumping” is key in reducing the dimensionality of the state space.
What if the absorption condition changes to “having at least four faces the same”?
Changing the absorption condition to something like “4 out of the 6 faces are the same” introduces new candidate stopping states. You could reach a distribution (for example, 4 copies of face 3 plus 2 copies of face 5) and declare the process finished there. However, you would have to confirm whether such a state is truly absorbing in a dynamic sense, because even with 4 out of 6 faces labeled the same, future rolls can replace that distribution.
A key point is that for a distribution to be absorbing, it must be impossible to leave. If you only require 4 faces the same, you can still roll outcomes that generate a new distribution with fewer than 4 copies of that face. So it’s not actually an absorbing state in the strict Markov chain sense, unless you artificially decide to stop once that configuration is reached.
If you do stop intentionally after you first observe “at least 4 faces the same,” the expected time to that event will be shorter than the time to all faces matching. But from a Markov chain standpoint, that’s not “absorption” in the original random process; it’s an early-stopping rule you impose.
Can extreme skew in intermediate distributions paradoxically slow down reaching uniform faces?
Yes, in some configurations, although extreme skew usually helps. Imagine a die that becomes extremely skewed but not fully uniform, say 5 faces labeled “2” and 1 face labeled “4.” In that configuration, the probability of rolling “2” is 5/6 and the probability of rolling “4” is 1/6. If “4” never appears in the next 6 rolls, which happens with probability (5/6)^6, roughly 0.33, the new die becomes all “2”s. That occurs quite often, so this kind of dominance tends to speed up uniform convergence.
However, consider some intermediate states that might oscillate, such as (4,1,1,0,0,0) spread among different faces. Sometimes, partial dominance can linger because you keep rolling just enough occurrences of the minority face(s) to preserve them in the new distribution. In a large sample scenario (if you rolled many more times per step), that minority face might appear often enough to stay alive, prolonging the approach to uniformity.
This highlights a potential pitfall: not recognizing that a near-uniform state might still endure multiple transitions without becoming fully uniform if there is a small but consistent probability of preserving the outlier face.
How might we validate a closed-form solution if we claim to have one?
If someone tries to derive a closed-form expression for the expected time, a prudent approach is to:
Compare with simulation results: Run large-scale simulations and see if the analytical formula’s numeric value lines up with the empirical average.
Check smaller cases: If the problem is in a simpler dimension (e.g., a 3-faced die, rolling 3 times each iteration), then you can sometimes enumerate all states and transitions to see if the closed-form lines up exactly.
Evaluate boundary scenarios: For instance, how quickly do you converge if the number of faces is 1 or 2? Does the formula reflect near-trivial conditions correctly?
Pitfalls occur if certain states or transitions are overlooked. Small mistakes in enumerating the states or in the multinomial probabilities can lead to incorrect closed-form solutions. Thorough cross-checking with simulation is essential for confidence.
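For instance, the 3-faced analogue (a 3-faced die relabeled from 3 rolls) is small enough to solve by hand and then confirm by simulation; the algebra below lumps states as (2,1) and (1,1,1), and all code names are ad hoc:

import random
from fractions import Fraction

# Hand-derived lumped equations for a 3-faced die relabeled from 3 rolls:
#   t(2,1)   = 1 + (2/3) * t(2,1)                      ->  t(2,1)   = 3
#   t(1,1,1) = 1 + (2/3) * t(2,1) + (2/9) * t(1,1,1)   ->  t(1,1,1) = 27/7
exact = Fraction(27, 7)

def sim():
    die = [1, 2, 3]
    steps = 0
    while len(set(die)) > 1:
        steps += 1
        die = [random.choice(die) for _ in range(3)]
    return steps

runs = 200000
estimate = sum(sim() for _ in range(runs)) / runs
print(float(exact), estimate)   # both should be close to 3.857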
Why is the time to reach uniform faces random and not deterministic?
Each iteration’s labeling depends on probabilistic rolls from the current distribution. There is inherent randomness in how many times each face appears in the next set of 6 rolls. Consequently, some sequences might converge after just 2 iterations (if you get lucky and roll the same face 6 times early on), while others might wander through semi-random distributions for many iterations. That makes the time to absorption a random variable, which is why we speak of its “expected” (mean) value.
A subtle but important point is that even in states with high repetition of a single face, there’s still a chance to roll enough of a minority face to keep the distribution from collapsing to full uniform. This leads to variability in the number of steps.
Could a partially malfunctioning die or measurement errors disrupt the Markov chain assumption?
Yes. The entire Markov chain perspective relies on each new die being labeled exactly based on the outcomes of rolling the old die. If in a real-world setting you have measurement or labeling errors (e.g., you wrote the wrong outcome, or you mislabeled a face), the chain model breaks down. You then have transitions that don’t strictly follow the multinomial distribution from the current configuration.
In practice, if errors are small and sporadic, it might just be a slight perturbation to your Markov chain. But if systematic errors occur, you effectively have a different or even non-Markov process. Real-world experiments thus require careful attention to how you record and label faces, ensuring the assumptions remain valid.