ML Interview Q Series: Expected Flips for Consecutive Heads: A State-Based Calculation
If a fair coin is flipped repeatedly, how many flips are expected on average until two heads appear consecutively?
Short Compact Solution
Define X to be the total number of flips until two heads in a row occur. We analyze it by conditioning on the initial outcomes. If the first flip is tails (probability 1/2), one flip has been spent and we are effectively back at the start, so the remaining expectation is again E[X]. If the first flip is heads, we look at the second flip: with probability 1/2 it is also heads and we are done after two flips; otherwise it is tails, two flips have been spent, and we must start over. Putting this together,

$E[X] = \tfrac{1}{2}\bigl(1 + E[X]\bigr) + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\bigl(2 + E[X]\bigr),$

which simplifies to $\tfrac{1}{4}E[X] = \tfrac{3}{2}$, so $E[X] = 6$.
Hence, on average, six flips are needed to see two consecutive heads.
Comprehensive Explanation
Intuitive Overview
We are looking for the expected (average) count of fair coin tosses until we achieve a specific pattern: “Heads followed by another Heads.” One way to approach such a problem is with a state-based perspective, where each state represents how far we have progressed toward our goal of consecutive heads.
Detailed State Analysis
We can define the following states:
A “start” or “no progress” state, meaning we have not just flipped a head.
A “one head” state, meaning the most recent flip was a head, and we are one step away from achieving two consecutive heads.
A “success” state, meaning we have just flipped two heads in a row.
Let E[X] represent the expected flips to reach the success state from scratch. Let E[X∣H] represent the expected flips to finish once we have already seen one head. We treat the success state as having zero additional flips required because we are done at that point.
From the start state, if we flip tails (probability 1/2), we remain in the “start” state, adding 1 flip plus E[X] to keep trying. If we flip heads (probability 1/2), we move to the “one head” state, adding 1 flip plus E[X∣H] from there.
From the “one head” state, the next flip is critical:
With probability 1/2, it is heads, and we reach success, requiring 0 more flips.
With probability 1/2, it is tails, which knocks us back to the “start” state.
Therefore, mathematically:

$E[X] = 1 + \tfrac{1}{2}E[X] + \tfrac{1}{2}E[X \mid H], \qquad E[X \mid H] = 1 + \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}E[X].$

Substituting the second equation into the first gives $E[X] = \tfrac{3}{2} + \tfrac{3}{4}E[X]$, hence $E[X] = 6$ and $E[X \mid H] = 4$.
Markov Chain Perspective
Another equally valid viewpoint is to construct a small Markov chain with the following states: S0 (no heads yet), S1 (one head), and S2 (two heads in a row). The transition probabilities help form a set of linear equations from which we can solve for the expected time to go from S0 to S2. This approach leads to the same numerical result, 6 flips.
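For concreteness, the hitting-time equations for this chain reduce to a 2x2 linear system. The sketch below solves it with numpy (using numpy.linalg.solve is just one convenient choice of tooling, not part of the original solution; any linear solver or hand algebra works equally well):

import numpy as np

# Expected flips to reach S2: h0 from S0, h1 from S1.
#   h0 = 1 + 0.5*h0 + 0.5*h1
#   h1 = 1 + 0.5*h0 + 0.5*0
# Rearranged into A @ h = b form:
A = np.array([[0.5, -0.5],
              [-0.5, 1.0]])
b = np.array([1.0, 1.0])
h0, h1 = np.linalg.solve(A, b)
print(h0, h1)  # 6.0 4.0 -> six flips from scratch, four once we already hold one head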
Simulation Approach
In practice, or to check correctness, you could simulate flipping a coin many times in code and measure how many flips it takes on average to see HH:
import random
import statistics

def simulate_two_consecutive_heads(num_trials=1_000_000):
    """Average number of flips until HH, estimated over num_trials independent runs."""
    results = []
    for _ in range(num_trials):
        count = 0
        consecutive = 0  # how many heads in a row so far
        while True:
            count += 1
            flip = random.choice([0, 1])  # 0 for Tails, 1 for Heads
            if flip == 1:
                consecutive += 1
                if consecutive == 2:
                    results.append(count)
                    break
            else:
                consecutive = 0  # a tail resets the streak
    return statistics.mean(results)

print(simulate_two_consecutive_heads())
This simulation should converge to a value close to 6 as the number of trials grows large.
What if the Coin is Biased?
If the coin is not fair, we can still use a similar state-based approach or Markov chain logic, except the transition probabilities for heads and tails become p (for heads) and 1−p (for tails). The resulting equations change accordingly, but the methodology remains the same. The final answer will differ and typically becomes larger than 6 if p<1/2 and smaller if p>1/2.
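To make this concrete, with heads-probability $p$ the two conditioning equations become $E[X] = 1 + pE[X \mid H] + (1-p)E[X]$ and $E[X \mid H] = 1 + (1-p)E[X]$, which solve to

$E[X] = \dfrac{1+p}{p^2}.$

At $p = 1/2$ this recovers 6; at $p = 0.4$ it is already $8.75$, and it grows without bound as $p \to 0$.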
Could We Generalize to n Consecutive Heads?
Yes. The concept extends to finding the expected flips for n consecutive heads by defining states that track how many consecutive heads have been accumulated so far. When a tail appears, we reset to an appropriate state (often the start if it is simpler to see it as losing progress completely, but advanced patterns can require partial resets in certain substring matching problems). The resulting linear system is solvable, though the algebra becomes more extensive for larger n.
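A convenient way to compute the general answer is the recurrence $E_n = (E_{n-1} + 1)/p$: once a run of $n-1$ heads has been achieved, one more flip either completes the run (probability $p$) or forces a complete restart. A minimal sketch (the function name is illustrative):

def expected_flips_for_run(n, p=0.5):
    # E_1 = 1/p, and E_k = (E_{k-1} + 1) / p: pay one more flip, restart on tails
    e = 0.0
    for _ in range(n):
        e = (e + 1) / p
    return e

print(expected_flips_for_run(2))  # 6.0
print(expected_flips_for_run(3))  # 14.0, i.e. 2^(n+1) - 2 for a fair coin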
What is the Variance?
The variance can also be computed by enumerating additional equations that account for the squares of the states’ expectations. Alternatively, one can systematically form equations for Var(X) based on the same conditioning logic, or use a Markov chain approach that handles second-moment calculations. While not strictly required to answer the original question, interviewers might ask about it to test deeper knowledge of how to handle second moments.
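As a sketch of that second-moment bookkeeping for the fair coin, let $T_0$ and $T_1$ denote the remaining flips from the start and one-head states. Conditioning on the next flip and expanding the square gives

$E[T_1^2] = 1 + E[T_0] + \tfrac{1}{2}E[T_0^2], \qquad E[T_0^2] = 1 + E[T_0] + E[T_1] + \tfrac{1}{2}E[T_0^2] + \tfrac{1}{2}E[T_1^2].$

Plugging in $E[T_0] = 6$ and $E[T_1] = 4$ yields $E[T_0^2] = 58$, so $\mathrm{Var}(X) = 58 - 36 = 22$.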
Edge Cases in Real-World Scenarios
In some practical contexts, a coin might not only be biased but also might shift probabilities over time (e.g., a coin that accumulates wear or is subject to environmental effects). In that case, the assumption of identical independent flips fails, and we would need more sophisticated models to track changing probabilities. The same Markov chain can be extended to nonstationary transitions, though the solution typically no longer has a closed-form expression, and simulation or advanced approximations are used.
Practical Implementation Details
When coding a solution in production, especially if the coin flips are part of a large simulation with many random processes, it is crucial to manage random seeds correctly for reproducibility. Also, performance considerations arise if n is large (e.g., 10, 15, or more consecutive heads), since naive simulations might require many iterations to reach the pattern. Methods like importance sampling or dynamic programming can help optimize the computation under certain circumstances.
Below are additional follow-up questions
Could there be a purely combinatorial approach instead of a Markov chain or state-based method?
A combinatorial approach attempts to calculate the expected value by summing over all possible sequences that yield two consecutive heads for the first time. Specifically, one might consider enumerating sequences of flips that do not contain two consecutive heads until the final pair that concludes the sequence. However, because such sequences can have arbitrary length before the stopping condition, you end up dealing with an infinite sum. This can be turned into a more direct counting process—sometimes invoking the negative binomial perspective or a generating function.
A major pitfall arises when trying to directly sum probabilities over all sequences. It is easy to double-count sequences, or incorrectly include sequences that contain two consecutive heads earlier than the final occurrence. These mistakes can result in flawed or divergent series. The Markov chain or state-based approach is typically more systematic and less error-prone.
If you do pursue a combinatorial route, keep in mind:
You must ensure each valid sequence you count contains no two consecutive heads until exactly the end.
The infinite sum can be transformed into a finite expression using algebraic manipulations of geometric series.
It is crucial to track boundary conditions properly.
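For a fair coin the counting works out neatly: the number of length-$k$ sequences in which HH appears for the first time on flip $k$ follows the Fibonacci numbers, giving $P(\text{first HH on flip } k) = F_{k-1}/2^k$ with $F_1 = F_2 = 1$. A small truncated-sum sketch (function name is illustrative) to check that this matches the answer of 6:

def expected_via_counting(max_k=200):
    # P(first HH on flip k) = F_{k-1} / 2**k for a fair coin, with F_1 = F_2 = 1
    f_prev, f_curr = 1, 1              # F_1, F_2
    total = 2 * f_prev / 2**2          # k = 2 term: 2 * (1/4)
    for k in range(3, max_k + 1):
        total += k * f_curr / 2**k     # k-th term of the expectation sum
        f_prev, f_curr = f_curr, f_prev + f_curr
    return total

print(expected_via_counting())         # converges to 6 as max_k grows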
How could generating functions be employed to find the expected time?
Generating functions offer a powerful alternative to straightforward state-based equations. One might define a probability generating function (PGF) or probability generating series where the coefficient of a certain term corresponds to the probability that the first occurrence of two consecutive heads happens on a specific flip count. Summing all coefficients times their index (i.e., the flip count) yields the expectation.
However, constructing the generating function can be tricky:
Each new flip can either continue partial progress toward “HH,” reset progress, or complete the pattern.
One must carefully encode these transitions into a formal power series.
Combining or simplifying these series sometimes becomes quite complex, and it is easier to lose track of boundary conditions.
This approach is conceptually elegant but demands strong comfort with series expansions and manipulations. Interviewers might probe whether you can set it up correctly and interpret it without confusion.
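For the HH pattern specifically, the state transitions translate into a compact closed form. Writing $q = 1 - p$ and letting $T$ be the flip on which HH first appears,

$E[s^T] = \dfrac{p^2 s^2}{1 - qs - pqs^2},$

where the $qs$ term encodes a tail that keeps the search alive and the $pqs^2$ term a head wasted by the tail that follows it. Differentiating and evaluating at $s = 1$ gives the expectation, which comes out to 6 at $p = 1/2$.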
How do we handle correlated coin flips rather than independent flips?
If the coin tosses are not independent—say each flip depends on the previous outcomes—then the simple state-based model with probabilities fixed at 1/2 for heads or tails no longer applies. You would need a more general Markov chain that encodes the probability of heads or tails based on the full history (or some partial memory).
Pitfalls when dealing with correlated flips:
It may not be enough to track just “how many heads in a row” because the transition probability from tails to heads might depend on prior flips.
You might require a larger set of states to capture correlation. For instance, if there is a dependency that extends two flips back, states must include information about the last two flips.
In some pathological correlation structures, the probability of getting two consecutive heads might be drastically changed (either much more likely or less likely), which can significantly alter the expected time.
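As an illustration of that larger state space, here is a sketch for a hypothetical first-order Markov coin, where the heads probability depends only on the previous flip (the parameters a, b, q0 below are made up for illustration):

import numpy as np

# Hypothetical dependence: P(H | previous H) = a, P(H | previous T) = b,
# and the very first flip lands heads with probability q0.
a, b, q0 = 0.6, 0.4, 0.5   # illustrative values only

# Unknowns: E_T = expected remaining flips when the last flip was tails,
#           E_H = expected remaining flips when the last flip was heads.
#   E_T = 1 + b*E_H + (1 - b)*E_T
#   E_H = 1 + (1 - a)*E_T        (with probability a the next flip completes HH)
A = np.array([[b, -b],
              [-(1 - a), 1.0]])
rhs = np.array([1.0, 1.0])
E_T, E_H = np.linalg.solve(A, rhs)

expected_total = 1 + q0 * E_H + (1 - q0) * E_T
print(expected_total)   # reduces to 6.0 when a = b = q0 = 0.5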
What if we need to estimate not just the expectation but the entire distribution of the stopping time?
Sometimes an interviewer might ask about the probability that two consecutive heads appear by the $n$th flip (the cumulative distribution function), not just the mean. That requires either enumerating or systematically calculating, for each k≤n, the probability that the first occurrence of HH happens on exactly the $k$th flip. You can do this with:
Recursive formulas tracking “no heads in a row,” “exactly one head in a row,” etc., and summing up probabilities for each flip count.
Markov chain methods that compute transition matrices and measure hitting times.
A subtle pitfall is that you can overcount sequences if you do not carefully restrict yourself to sequences where two consecutive heads first appear on the exact flip k. Another subtlety is how to handle partial progress at the boundary between flips (for example, if you just saw a head on flip k−1).
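A minimal sketch of the recursive bookkeeping (fair coin by default; first_hh_distribution is an illustrative name): track the probability of still being in play with zero versus one head of current progress, and peel off the probability mass that finishes on each flip.

def first_hh_distribution(n, p=0.5):
    # a = P(still running, last flip not a head); b = P(still running, last flip a head)
    a, b = 1.0, 0.0
    pmf = {}
    for k in range(1, n + 1):
        pmf[k] = p * b                        # a head now, on top of a head from the previous flip
        a, b = (1 - p) * (a + b), p * a       # survivors after flip k
    return pmf

dist = first_hh_distribution(30)
print(dist[2], dist[3], dist[4])   # 0.25 0.125 0.125
print(sum(dist.values()))          # the CDF at n; tends to 1 as n grows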
Could the expected time to reach two heads in a row be infinite under certain pathological conditions?
For a fair coin with independent flips, the expected time is always finite (6). However, if we had a sequence of flips with probability of heads approaching 0 over time, or some adversarially generated sequence with diminishing heads-likelihood, it might drive the expectation arbitrarily high.
Real-world pitfalls:
Nonstationary processes: if p (the chance of heads) decreases with each flip, eventually it might become extremely unlikely to see two heads in a row, causing a very large expected time.
Adversarial generation: if flips are not random but determined by some process that actively avoids HH, theoretically you might never see it, though in a purely adversarial setting, “expected value” can become ambiguous.
What if we want to ensure that after we get one head, we allow only a certain number of attempts before resetting?
In some contrived variations, you might reset the entire process if you fail to get the second head within a fixed number of flips after seeing the first head. This changes the Markov structure because once you have a head, you now have a limited “window” to get another head before the system resets to zero progress.
Possible pitfalls:
Failing to account for the possibility that the second head might appear in the allowed window or not.
Mixing up the condition that a tail triggers an immediate reset versus waiting until the window expires.
Overcomplicating the state machine, since you must add a dimension representing “how many flips remain in the window.”
How should we interpret or handle partial knowledge of the coin toss results if some flips are hidden or missed?
In many real experiments, you might not observe every flip clearly (perhaps some flips fell out of sight). If your observation process is incomplete, your state-of-knowledge about whether you are on zero heads or one head is uncertain. You might have a probabilistic belief instead of a definite state.
Pitfalls:
Maintaining a belief distribution across states (e.g., there is a certain probability we are in the “one head” state, a certain probability we are in the “zero heads” state).
Needing a Bayesian update step each time you do see a flip outcome.
The resulting expected time can be higher because missing an observation can cause you to “wait” longer for a guaranteed HH.
How do we test or validate the correctness of the 6-flip result in a real or experimental setup with limited samples?
If you physically flip a real coin, you might do a finite number of experiments and compute the average number of flips it took to see two heads in a row. With limited samples, your empirical mean might deviate from the theoretical 6. Confirming the result demands:
Sufficiently large sample size.
Checking randomization quality (for instance, ensuring the coin is not biased, ensuring flips are well mixed).
Computing confidence intervals to see how close your experimental average is to 6 and whether the discrepancy can be attributed to randomness.
A hidden pitfall is to misinterpret early or small-sample results, concluding the theoretical calculation is wrong. In reality, 6 is the limit of the average over many repeated trials, and small-sample experiments can deviate substantially.
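A practical sketch of such a check: pin the random seed for reproducibility and report a normal-approximation 95% confidence interval around the sample mean (the helper name flips_until_hh is illustrative).

import math
import random
import statistics

def flips_until_hh(rng):
    count, streak = 0, 0
    while streak < 2:
        count += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return count

rng = random.Random(42)                      # fixed seed for reproducibility
samples = [flips_until_hh(rng) for _ in range(100_000)]
mean = statistics.mean(samples)
sem = statistics.stdev(samples) / math.sqrt(len(samples))
print(f"{mean:.3f} +/- {1.96 * sem:.3f}")    # the interval should cover 6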
What if we want to extend to a scenario where consecutive heads must appear a certain number of times before stopping?
Sometimes the objective is not just to see the first instance of “HH,” but to see that pattern occur multiple separate times. For example, you keep flipping until you have seen “HH” five different times. This is fundamentally different from requiring “HHHHHH...” (six heads in a row). Instead, you want the pattern “HH” to appear on five disjoint or possibly overlapping occasions.
Key complexities:
Overlaps: you might have a run of heads like HHH, which counts as two occurrences of “HH.”
States would need to track both the immediate streak of heads and how many times you have already completed “HH.”
An alternative approach might break it down into repeated trials of “HH,” but that can double-count scenarios with overlap.
Managing these subtleties carefully is crucial to getting the correct expected count. If an interviewer brings this up, they are likely probing whether you understand partial pattern overlaps and how to systematically account for them.
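A quick simulation sketch for the overlapping-count variant (the function name is illustrative). Note that once an HH completes, the chain is already sitting in the one-head state, so by the earlier equations each additional overlapping occurrence costs $E[X \mid H] = 4$ more expected flips, for a total of $6 + 4(m - 1)$; the simulation should approach that value.

import random
import statistics

def flips_until_m_overlapping_hh(m, rng, p=0.5):
    # Keep flipping until the pattern HH has been observed m times, overlaps allowed:
    # a run such as HHH contributes two occurrences.
    count, prev_head, seen = 0, False, 0
    while seen < m:
        count += 1
        head = rng.random() < p
        if head and prev_head:
            seen += 1
        prev_head = head
    return count

rng = random.Random(0)
trials = [flips_until_m_overlapping_hh(5, rng) for _ in range(50_000)]
print(statistics.mean(trials))   # should be close to 6 + 4*(5 - 1) = 22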