ML Interview Q Series: Expected Value and Distribution of Nested Uniform Random Variables
Imagine there is a function, X, that produces an integer randomly in the range [N, M]. Then, using X’s output as the upper bound for a second function Y (with the same lower limit N), the second function returns a random integer in [N, X]. How would Y’s distribution appear, and what is its expected value?
Comprehensive Explanation
When we draw a number X uniformly between N and M, each integer in that interval is equally likely. Now, using the random value X as the upper limit for another uniform draw Y in [N, X], the resulting distribution of Y will no longer be uniform over [N, M]. Instead, lower values of Y will appear more frequently than higher values. Here is the reasoning:
X is chosen from the discrete set of integers N, N+1, ..., M with probability 1/(M-N+1).
Conditional on X taking a specific value x, the variable Y is then uniformly drawn from the integers N, N+1, ..., x with probability 1/(x-N+1).
Therefore, to get the probability that Y equals a particular integer y in [N, M], we sum over all x values from y up to M, because for Y to be y, we need X to be at least y.
Below is the core probability mass function for Y:
P(Y = y) = sum over x from y to M of [ 1/(M-N+1) ] * [ 1/(x-N+1) ], for y = N, N+1, ..., M
Parameters in the above expression:
N is the minimum integer in the initial range.
M is the maximum integer in the initial range.
X is an integer chosen uniformly from N to M.
Y is then uniformly chosen from N to X.
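Before moving on, here is a minimal sketch (exact arithmetic with Python's fractions module, no simulation) that tabulates this PMF by the summation above and confirms that it sums to 1 and reproduces the expected value derived in the next section.

from fractions import Fraction

def pmf_Y(N, M):
    # P(Y = y) = sum over x from y to M of 1/(M-N+1) * 1/(x-N+1)
    return {
        y: sum(Fraction(1, (M - N + 1) * (x - N + 1)) for x in range(y, M + 1))
        for y in range(N, M + 1)
    }

pmf = pmf_Y(2, 10)
print(sum(pmf.values()))                    # exactly 1
print(sum(y * p for y, p in pmf.items()))   # exactly 4 = (3*2 + 10)/4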
Expected Value of Y
We can leverage the law of total expectation, which states E(Y) = E( E(Y | X) ). Since, given X=x, Y is uniformly distributed between N and x, we know that E(Y | X=x) = (N + x)/2 (using the standard mean of a discrete uniform distribution). Hence:
E(Y) = E( (N + X)/2 ) = (N/2) + (1/2)*E(X)
Because X is uniform in [N, M], E(X) = (N + M)/2. Substituting:
E(Y) = (N/2) + (1/2)*( (N + M)/2 ) = (N/2) + (N + M)/4 = (2N + (N + M))/4 = (3N + M)/4
In more explicit form:
E(Y) = (3N + M)/4
Here N is the lower bound, M is the upper bound, and (3N + M)/4 is the mean of Y under this two-step sampling process.
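For instance, with N = 2 and M = 10, the midpoint of the range is 6, but E(Y) = (3*2 + 10)/4 = 4, noticeably closer to the lower bound.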
Interpretation
The distribution tilts toward smaller values because Y cannot exceed X: a small value y can be produced by any X ≥ y, whereas a large value y requires X itself to be large.
The expected value (3N + M)/4 is below the midpoint (N + M)/2 of the entire range. This indicates Y is biased toward the lower side of [N, M].
Code Example in Python
Below is a simple simulation to illustrate how Y is distributed and to verify the empirical mean aligns with the derived analytical expectation.
import numpy as np

def simulate_Y(N, M, num_samples=1_000_000):
    # Generate X uniformly from [N, M]
    X_samples = np.random.randint(N, M + 1, size=num_samples)
    # For each X_sample, generate Y uniformly in [N, X_sample]
    Y_samples = np.array([np.random.randint(N, x + 1) for x in X_samples])
    return Y_samples

# Set parameters
N = 2
M = 10
num_samples = 1_000_000

# Simulate
Y_samples = simulate_Y(N, M, num_samples)
empirical_mean = np.mean(Y_samples)

# Print results
theoretical_mean = (3 * N + M) / 4
print("Empirical Mean of Y:", empirical_mean)
print("Theoretical Mean of Y:", theoretical_mean)
If you run this code multiple times, you should see that the empirical mean of Y gets closer and closer to (3N + M)/4 as num_samples grows large.
What the Distribution Looks Like
The probability that Y = N is relatively high, since any X between N and M includes N as a possible outcome for Y.
The probability that Y = M is much lower, because X must be equal to M for Y to reach M.
As y increases from N to M, the number of x values satisfying x >= y shrinks, so the probability mass decreases. This creates a monotonically decreasing shape for P(Y = y) across y from N up to M.
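For a concrete feel, take N = 2 and M = 10: P(Y = 2) = (1/9)(1/1 + 1/2 + ... + 1/9) ≈ 0.314, while P(Y = 10) = (1/9)(1/9) ≈ 0.012, so the lowest value is roughly 25 times as likely as the highest.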
Why Is the Distribution Not Uniform?
Because the second draw depends on the first draw’s outcome, Y cannot take any value in [N, M] with equal probability. The chance of Y taking a higher value (close to M) is limited by the requirement that the first draw X is also large enough. That conditional structure imposes a higher likelihood of smaller values.
Potential Follow-up Questions
What if N = M?
In that scenario, X must always be M. Then Y must also be M (since Y is uniform on [M, M], which means only one possible integer). The distribution of Y collapses to a single point, and the expected value of Y would be M. This is consistent with the formula (3N + M)/4 = (3M + M)/4 = M when N = M.
How Does the Result Change If X Is Drawn from a Continuous Distribution?
If X were chosen continuously in [N, M], then Y would be chosen uniformly in [N, X] on a continuous interval. The resulting distribution for Y would still skew toward lower values. You could compute the density function by an integral instead of a sum, leading to E(Y) = (3N + M)/4 under similar uniform assumptions in the continuous domain.
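Sketching that continuous case explicitly (X uniform on [N, M], then Y uniform on [N, X]):
f_Y(y) = integral over x from y to M of [ 1/(M-N) ] * [ 1/(x-N) ] dx = ln( (M-N)/(y-N) ) / (M-N), for N < y < M
E(Y) = E( (N + X)/2 ) = ( N + (N+M)/2 ) / 2 = (3N + M)/4
The density grows (logarithmically) without bound as y approaches N and falls to 0 at y = M, which is the continuous counterpart of the decreasing discrete PMF.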
Can This Procedure Be Extended Repeatedly?
One might consider taking the random output Y as the new max for a next stage Z, and so forth. Each added stage narrows the distribution further toward smaller numbers. The more times you repeat the process, the more the distribution skews to the lower boundary, and the expected value drifts closer to that boundary.
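A quick simulation sketch of this repeated nesting (the stage counts and sample size below are arbitrary choices for illustration):

import numpy as np

def nested_mean(N, M, num_stages, num_samples=100_000):
    # Start every sample at the upper bound M, then repeatedly redraw a uniform
    # integer in [N, current] and use it as the next stage's upper bound.
    current = np.full(num_samples, M)
    for _ in range(num_stages):
        current = np.array([np.random.randint(N, c + 1) for c in current])
    return current.mean()

for k in (1, 2, 3, 4):
    print(k, nested_mean(2, 10, k))
# Stage 1 is X itself (mean near 6), stage 2 is Y (mean near 4),
# and further stages drift toward the lower bound N = 2.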
How Can We Demonstrate the Distribution Formally?
A direct discrete approach is to use the summation for P(Y = y). Another approach is to observe that Y is a composition of uniform random variables: Y = N + (X - N)*U, where U is a random fraction in [0, 1] (for a continuous analog). In the discrete case, we similarly track that Y - N is uniform in [0, X - N], but X itself is uniform in [N, M], leading to the decreasing, non-uniform shape described above.
All these details confirm that the resulting samples are not uniform and that the expected value emerges cleanly as (3N + M)/4.
Below are additional follow-up questions
How does the variance of Y compare to that of X?
When sampling X uniformly from [N, M], we know that X has variance ( (M - N + 1)² - 1 ) / 12 if we treat X as discrete uniform. For large M-N, one can approximate it by (M - N)² / 12. By contrast, Y is drawn uniformly from [N, X], so its variance is different and generally smaller than that of X because Y is biased toward N.
To see why, recall that:
E(Y | X=x) = (N + x)/2
E(Y² | X=x) is the mean of squares for a uniform integer from N to x. For x >> N, it is approximately ( x² + xN + N² ) / 3.
Hence:
Var(Y) = E(Y²) - [E(Y)]²
where E(Y²) = E( E(Y² | X) ) and E(Y) = (3N + M)/4.
A subtlety is that the discrete uniform variance formula (for any random integer U from a to b) is ((b-a+1)² - 1)/12. Because we condition on X in each realization, the unconditional variance of Y is a weighted combination of these conditional variances. In practical computing terms, if M - N is large, sampling enough pairs (X, Y) to estimate Var(Y) empirically can be done in code; you would see the result consistently smaller than Var(X).
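A quick empirical check of that claim (a sketch, reusing the same two-step sampling as in the earlier simulation):

import numpy as np

N, M, num_samples = 2, 10, 500_000
X = np.random.randint(N, M + 1, size=num_samples)
Y = np.array([np.random.randint(N, x + 1) for x in X])

print("Var(X):", X.var())   # near ((M - N + 1)**2 - 1) / 12 = 80/12 ≈ 6.67
print("Var(Y):", Y.var())   # consistently smaller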
Potential pitfalls:
For degenerate or very small ranges (e.g., M = N), code that divides by M - N or by x - N can hit a division by zero if not handled explicitly.
The discrete uniform variance formula is sometimes mixed up with the continuous version.
In real-world scenarios, approximate or truncated sampling might further alter the variance.
If X Is Not Drawn from a Uniform Distribution, How Does That Affect Y?
Many real random processes do not produce uniform outcomes. Suppose X is drawn from a distribution with heavier tails toward M (e.g., a skewed or exponential-like distribution). Then Y is more likely to come from a larger range, thus pushing its distribution toward higher values. Conversely, if X is more often near the lower bound N, Y shifts even more to the lower side.
The key steps:
Identify the probability p(x) = P(X=x).
Condition on X=x to get Y ~ Uniform(N, x).
Combine these via summation/integration over x to find the overall distribution of Y.
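Here is an illustrative sketch with one possible non-uniform choice for X (weights proportional to x - N + 1, i.e. skewed toward M; this particular weighting is just an assumption for demonstration):

import numpy as np

N, M, num_samples = 2, 10, 300_000
support = np.arange(N, M + 1)

# P(X = x) proportional to (x - N + 1): more mass near M than near N
p = (support - N + 1).astype(float)
p /= p.sum()

X = np.random.choice(support, size=num_samples, p=p)
Y = np.array([np.random.randint(N, x + 1) for x in X])

print("Mean of Y with X skewed toward M:", Y.mean())   # above 4
print("Mean of Y with uniform X:", (3 * N + M) / 4)    # exactly 4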
Pitfalls:
Overlooking that Y’s distribution heavily depends on how X’s probabilities accumulate near the larger end or the smaller end.
If X has a tiny probability of being near M but a large probability of being near N, then almost all Y values will cluster near N.
Some random processes might produce correlated sequences of X values (e.g., a Markov chain), which further complicates Y’s distribution across multiple samples.
Are X and Y Independent? How Might One Measure Their Relationship?
X and Y are clearly not independent. Y is constrained to be at most X, so whenever X is small, Y is also forced to be small. One way to quantify their dependence is to compute the correlation coefficient Corr(X, Y). In code, you could generate many (X, Y) pairs, then use a standard correlation measure.
Because Y depends monotonically on X, the correlation is positive but not 1.
Another measure of dependence is mutual information, which captures more nuanced relationships.
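A minimal numerical check of that dependence (sketch):

import numpy as np

N, M, num_samples = 2, 10, 300_000
X = np.random.randint(N, M + 1, size=num_samples)
Y = np.array([np.random.randint(N, x + 1) for x in X])

print("Corr(X, Y):", np.corrcoef(X, Y)[0, 1])   # positive, well below 1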
Pitfalls:
Some might incorrectly assume Y and X are independent just because each is “uniformly” chosen in some sense.
If you measure correlation on too few samples, you might get misleading estimates.
For large integer ranges, you need many samples to converge to a stable correlation estimate.
Would the Distribution Change If We First Sample X and Then Round It Before Drawing Y?
Sometimes real-world processes produce continuous values for X but then round or clamp them to an integer. If you pick X from a continuous interval [N, M], then round X to the nearest integer, and finally pick Y in [N, rounded(X)], you introduce discontinuities that may shift the distribution in subtle ways:
If rounding is upward for all halves (or downward), you may bias X’s integer distribution.
Y’s distribution might overrepresent certain ranges if X frequently lands near boundary points and then gets pushed outward by rounding.
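A small sketch of this variant (np.round, i.e. round-half-to-even, is just one possible rounding rule):

import numpy as np

N, M, num_samples = 2, 10, 300_000
X_cont = np.random.uniform(N, M, size=num_samples)   # continuous draw on [N, M]
X_int = np.round(X_cont).astype(int)                 # round to nearest integer

# The endpoint integers N and M receive only half the mass of interior integers,
# because only a half-width interval of continuous values rounds to each of them.
counts = np.bincount(X_int)[N:M + 1]
print(dict(zip(range(N, M + 1), (counts / num_samples).round(3))))

Y = np.array([np.random.randint(N, x + 1) for x in X_int])
print("Mean of Y under this rounding scheme:", Y.mean())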
Pitfalls:
Rounding large volumes of continuous samples can produce unexpected lumps at certain integers.
The actual distribution of Y may deviate significantly from the theoretical discrete uniform if the underlying continuous draw for X has internal skew or peaks.
What If N or M Is Negative or Zero?
Allowing N and/or M to be negative or zero does not break the logic but can change how we interpret the random process:
X might be drawn from negative integers up to some positive integer. You still have X≥N, so Y≥N as well, but now N might be negative.
The expected value for Y = (3N + M)/4 can itself be negative or positive depending on N and M.
If N < 0 < M, there is a possibility that X could be either negative or positive, affecting Y’s range accordingly.
Pitfalls:
Some random integer functions or libraries may behave unexpectedly with negative bounds (though most standard RNGs handle them properly).
Edge cases where N = -1 and M = 0 can lead to confusion, particularly in indexing logic or array-based sampling.
How Can We Directly Sample Y in One Step Without First Drawing X?
One might want a direct procedure to generate Y with the same distribution but skip the intermediate step X altogether. While you can mathematically derive P(Y=y) and sample from it, you must incorporate the probability weights from summation:
P(Y=y) = sum of [ (1/(M-N+1)) * (1/(x-N+1)) ] over x from y to M.
In principle, you could:
Build a lookup table for y = N...M based on the above probabilities.
Use a standard “inverse transform sampling” approach to draw Y in a single step.
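A sketch of that one-step approach for a modest range (inverse transform via np.random.choice):

import numpy as np

def direct_sample_Y(N, M, num_samples):
    support = np.arange(N, M + 1)
    # P(Y = y) = sum over x from y to M of 1/(M-N+1) * 1/(x-N+1)
    probs = np.array([
        sum(1.0 / ((M - N + 1) * (x - N + 1)) for x in range(y, M + 1))
        for y in support
    ])
    probs /= probs.sum()   # guard against floating-point drift
    return np.random.choice(support, size=num_samples, p=probs)

samples = direct_sample_Y(2, 10, 500_000)
print("Mean via one-step sampling:", samples.mean())   # near (3*2 + 10)/4 = 4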
Pitfalls:
Building this table might be computationally expensive for large M-N.
If M-N is very large (millions), storing the table and computing probabilities might be impractical, so using the two-step approach X then Y could be more efficient in code.
Could This Approach Be Useful in Sampling-Based Algorithms Like MCMC?
Yes, some Markov Chain Monte Carlo methods rely on hierarchical sampling steps. If you have a hierarchical model where X is a latent variable and Y is conditionally uniform given X, this structure is quite common. For example, in a Bayesian model with discrete latent variables, you might repeatedly sample X from its posterior, then sample Y from P(Y|X).
Pitfalls:
If you try to treat Y as if it were drawn from a simpler distribution, you lose the correlation structure with X. That can produce wrong parameter updates.
In MCMC, ignoring the dependence between X and Y might lead to slower convergence or incorrect inferences if you do not preserve the conditional distribution accurately.
How Could We Validate Our Random Seed Usage Does Not Generate Undesirable Patterns in Y?
If you use a fixed random seed for reproducibility, you could inadvertently create a pattern in the pairs (X, Y). Repeatedly calling the same RNG for X and then Y might produce correlated sequences that appear uniform but have subtle repetition cycles.
Check the distribution of Y over multiple runs with different seeds to see if the distribution shifts or remains stable.
Use tests like the Kolmogorov-Smirnov or chi-square test to ensure Y’s empirical distribution matches the predicted distribution.
For cryptographic applications, you generally want unpredictable seeds and might require hardware-based RNG to avoid any pattern.
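One concrete version of such a check (a sketch using scipy.stats.chisquare against the exact PMF derived earlier; scipy is assumed to be available):

import numpy as np
from scipy.stats import chisquare

N, M, num_samples = 2, 10, 200_000
X = np.random.randint(N, M + 1, size=num_samples)
Y = np.array([np.random.randint(N, x + 1) for x in X])

observed = np.bincount(Y)[N:M + 1]
expected = num_samples * np.array([
    sum(1.0 / ((M - N + 1) * (x - N + 1)) for x in range(y, M + 1))
    for y in range(N, M + 1)
])

# A healthy RNG should give a large p-value most of the time
print(chisquare(observed, f_exp=expected))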
Pitfalls:
Relying on a single fixed seed can hide systematic RNG bias or periodicity.
Some older PRNGs have short periods, so repeated usage can reveal cyclical behavior in Y.
What If We Insert an Acceptance-Rejection Criterion for Y?
An acceptance-rejection strategy might come into play if, after generating Y, we impose some condition “Reject Y if it doesn’t satisfy a property.” For instance, “Reject if Y is below some threshold.” In that case:
The unconditional distribution of Y changes, because we discard some of the draws.
The acceptance-rejection steps create a truncated or biased sample that no longer corresponds to the simple formula for P(Y=y).
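A small sketch of such a filter (the threshold below is an arbitrary illustrative choice):

import numpy as np

N, M, threshold, num_samples = 2, 10, 5, 300_000
X = np.random.randint(N, M + 1, size=num_samples)
Y = np.array([np.random.randint(N, x + 1) for x in X])

accepted = Y[Y >= threshold]
print("Acceptance rate:", accepted.size / num_samples)
print("Mean of accepted Y:", accepted.mean())   # no longer (3N + M)/4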
Pitfalls:
If you naively keep generating X and Y until you get an acceptable Y, you might waste substantial computation if the acceptance region is small.
The correlation between X and Y might be further distorted, making it even more difficult to analyze or replicate the distribution.