ML Interview Q Series: Calculating Expected Box Occupancy with Binomial Probability and Linearity of Expectation.
You randomly distribute 25 apples over 10 boxes. Each apple is placed in a box chosen uniformly at random and independently of all other apples. What is the expected value of the number of boxes that will contain exactly k apples, for k = 0, 1, …, 25?
Short Compact solution
Let Xi be the random variable that is 1 if box i has exactly k apples and 0 otherwise. Then the probability of any given box having exactly k apples is given by the binomial expression

P(Xi = 1) = choose(25, k) (1/10)^k (9/10)^(25−k)
Using linearity of expectation, the expected number of boxes that contain exactly k apples is 10 times that probability:

E[X1 + … + X10] = 10 × choose(25, k) (1/10)^k (9/10)^(25−k)
Comprehensive Explanation
To understand why the probability that a single box has exactly k apples is a binomial expression, consider that each of the 25 apples is placed in one of the 10 boxes uniformly at random. For a particular box, say box i, the chance that a single apple lands in that box is 1/10, and the chance it does not land in that box is 9/10. Since there are 25 independent trials of placing apples, the probability that exactly k of them land in box i is the binomial coefficient choose(25, k) multiplied by (1/10)^k multiplied by (9/10)^(25−k).
Once we confirm that probability for a single box, we let Xi be an indicator random variable that takes the value 1 if box i contains exactly k apples, and 0 otherwise. By definition, E[Xi] = P(Xi = 1). Because there are 10 boxes and each one has the same probability P(Xi = 1) of containing exactly k apples, the expected total number of boxes with exactly k apples is 10 × E[Xi]. This leads directly to 10 × choose(25, k) (1/10)^k (9/10)^(25−k).
There is no need to worry about overlap between boxes in computing the expected value, because linearity of expectation applies regardless of dependencies. Even though the numbers of apples going into the different boxes are linked through the fixed total of 25, E[X1 + … + X10] is simply the sum of the individual expectations E[Xi].
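As a quick sanity check on this formula, note that summing the expected counts over all k from 0 to 25 must give exactly 10, since every box holds exactly one value of k. A minimal sketch in Python (math.comb requires Python 3.8+):

from math import comb

# Expected number of boxes with exactly k apples, for each k
expected = {k: 10 * comb(25, k) * (1/10)**k * (9/10)**(25 - k) for k in range(26)}

# Every box contains exactly one value of k, so these must sum to 10
print(sum(expected.values()))    # 10.0, up to floating-point error
print(expected[2], expected[3])  # the most likely counts, near 25/10 = 2.5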
What if the distribution is not uniform?
If instead of placing each apple in any box with probability 1/10, we assign it to box i with some probability pi (where p1 + … + p10 = 1), the probability that a specific box i ends up with exactly k apples becomes choose(25, k) pi^k (1 − pi)^(25−k). The linearity-of-expectation argument remains the same, so the expected count of boxes that have exactly k apples is the sum of these probabilities over all boxes. If pi is the same for all i, this reduces to 10 × choose(25, k) pi^k (1 − pi)^(25−k); if the pi differ from box to box, you simply compute each box's probability separately and add them up, as sketched below. The key takeaway is that the binomial formula's success probability is replaced by the appropriate pi, and the expectation sums accordingly.
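A minimal sketch of this general case, assuming a hypothetical probability vector p that sums to 1:

from math import comb

def expected_boxes_exactly_k(k, p, n=25):
    # Add up each box's own binomial probability of holding exactly k apples
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for pi in p)

p = [0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15]  # hypothetical weights
print(expected_boxes_exactly_k(3, p))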
How do we interpret the result when k = 0?
When k = 0, we want to compute the expected number of boxes that contain no apples. For one box, the probability that no apple lands in it is (9/10)^25, by the same reasoning as before but with k = 0. Thus the expected number of boxes that end up empty is 10 × (9/10)^25 ≈ 0.72. In other words, when 25 apples are randomly distributed among 10 boxes, on average fewer than one box ends up empty.
Why does linearity of expectation work even with dependencies?
One might wonder whether we can simply add up the E[Xi], since the placements in different boxes depend on each other: if an apple goes to one box, it cannot go to another. The powerful result is that linearity of expectation holds regardless of such dependencies, because E[X + Y] = E[X] + E[Y] for any random variables X and Y, independent or not. This means that even if the event of one box containing k apples is linked to whether another box contains k apples, the expected total is unaffected by those correlations.
How can we simulate this to verify?
It is straightforward to write a Python simulation. For each trial, you place 25 apples at random among 10 boxes and then count how many boxes end up with exactly k apples. After many trials, you average the counts to approximate the expectation. A simple Python implementation:
import random
import statistics

def simulate_expected_boxes(k, num_boxes=10, num_apples=25, trials=100_000):
    results = []
    for _ in range(trials):
        boxes = [0] * num_boxes
        # Place each apple in a uniformly chosen box
        for _ in range(num_apples):
            chosen_box = random.randint(0, num_boxes - 1)
            boxes[chosen_box] += 1
        # Count boxes that ended up with exactly k apples in this trial
        count_k = sum(1 for box in boxes if box == k)
        results.append(count_k)
    return statistics.mean(results)

k_value = 5  # Example k
estimate = simulate_expected_boxes(k_value)
print(f"Estimated expected number of boxes with exactly {k_value} apples:", estimate)
Comparing the result to the closed-form formula 10 × choose(25, k_value) (1/10)^k_value (9/10)^(25−k_value) will show close agreement, especially as the number of trials grows.
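For reference, the closed form itself is a one-liner (math.comb, Python 3.8+):

from math import comb

k_value = 5
exact = 10 * comb(25, k_value) * (1/10)**k_value * (9/10)**(25 - k_value)
print(f"Exact expected number of boxes with exactly {k_value} apples:", exact)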
What if we want the expected number of boxes containing at least k apples?
This scenario differs because it involves the probability that a box has k or more apples. For one box, you sum choose(25, j) (1/10)^j (9/10)^(25−j) over j = k to 25. The expected number of boxes is then 10 times that sum, by the same reasoning with linearity of expectation. A single indicator random variable can represent whether box i has at least k apples; its expected value is the tail sum of probabilities from j = k to 25, and multiplying by 10 gives the total expected number of boxes satisfying the condition.
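A short sketch of that tail sum:

from math import comb

def expected_boxes_at_least_k(k, num_boxes=10, num_apples=25):
    p = 1 / num_boxes
    # P(a given box holds at least k apples): tail sum of the binomial pmf
    tail = sum(comb(num_apples, j) * p**j * (1 - p)**(num_apples - j)
               for j in range(k, num_apples + 1))
    return num_boxes * tail

print(expected_boxes_at_least_k(3))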
Below are additional follow-up questions
What if each box can hold only a limited number of apples (capacity constraints)?
One interesting twist is when each box can accommodate at most M apples. For instance, if M < 25, then there is a physical constraint preventing any box from having more than M apples.
In the unconstrained (original) case, each apple goes into a box with probability 1/10, without limit. Once we impose a capacity M, the distribution logic changes. If a box is “full” (i.e., it has already accumulated M apples), then subsequent apples can no longer go to that box.
Potential pitfalls:
Non-trivial probabilities: The probability of a box ending up with exactly k apples under capacity constraints requires more complicated combinatorial analysis or simulation. It is no longer a simple binomial distribution for each box, because the outcomes are dependent in a more complex way (once one box is full, future apples must choose among the remaining boxes).
Order matters: If you place apples sequentially and stop filling a box once it hits capacity, the final counts are strongly path-dependent. This contrasts with the original scenario, where the final counts did not depend on the order in which apples were placed.
Simulation complexities: Implementing a simulation can still be straightforward, but you must carefully handle the scenario of a box reaching capacity mid-simulation.
Edge cases: If M < k, obviously the probability that a box ends up with exactly k apples is zero.
To handle this analytically, you could use combinatorial or Markov chain approaches, but it becomes more involved than the classic binomial setting.
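For illustration, here is a simulation sketch under one assumed redirection rule (the problem statement does not fix one): each apple chooses uniformly among the boxes that are still below capacity M.

import random
import statistics

def simulate_capacity(k, M, num_boxes=10, num_apples=25, trials=10_000):
    assert M * num_boxes >= num_apples, "total capacity must fit all apples"
    results = []
    for _ in range(trials):
        boxes = [0] * num_boxes
        for _ in range(num_apples):
            # Assumed rule: only boxes below capacity are eligible
            open_boxes = [i for i, c in enumerate(boxes) if c < M]
            boxes[random.choice(open_boxes)] += 1
        results.append(sum(1 for c in boxes if c == k))
    return statistics.mean(results)

print(simulate_capacity(k=3, M=4))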
How do we calculate the variance of the number of boxes containing exactly k apples?
So far, the focus has been on the expected value (mean). Another natural question is the variance of X = X1 + … + X10, where Xi is 1 if box i has exactly k apples, and 0 otherwise.
Outline of the approach:
Recall the definition: Var(X) = E(X²) − [E(X)]².
Compute E(X²): Expand X² = (X1 + … + X10)² = ∑ Xi² + 2 ∑_{i<j} Xi Xj.
Simplify: Since Xi is an indicator variable (Xi² = Xi), we get E(X²) = ∑ E(Xi) + 2 ∑_{i<j} E(Xi Xj).
Correlations: E(Xi Xj) = P(box i has exactly k apples AND box j has exactly k apples). These events are not independent, because the number of apples that go to one box influences how many are left for the others. One must evaluate that joint probability carefully via combinatorial methods.
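In this particular problem the joint term does have a closed form: for two given boxes to each hold exactly k apples (possible only when 2k ≤ 25), a multinomial count gives E(Xi Xj) = choose(25, k) choose(25−k, k) (1/10)^(2k) (8/10)^(25−2k). A sketch assembling the exact variance from these pieces:

from math import comb

def variance_boxes_exactly_k(k, B=10, N=25):
    # P(a given box has exactly k apples)
    p = comb(N, k) * (1/B)**k * (1 - 1/B)**(N - k)
    if 2 * k <= N:
        # P(two given boxes each have exactly k): multinomial over (k, k, rest)
        p2 = comb(N, k) * comb(N - k, k) * (1/B)**(2 * k) * (1 - 2/B)**(N - 2 * k)
    else:
        p2 = 0.0
    mean = B * p
    second_moment = B * p + 2 * comb(B, 2) * p2  # E(X²) from the expansion above
    return second_moment - mean**2

print(variance_boxes_exactly_k(2))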
Potential pitfalls:
Ignoring dependencies: A common mistake is to assume independence among Xi’s, leading to an incorrect variance calculation.
Overly complex combinatorics: The exact derivation might be tedious, and practitioners often resort to simulation or approximations.
How does the distribution change if we place apples in boxes sequentially but with changing probabilities?
In the original setup, each apple has probability 1/10 to go into each box, independently of what happened before. A variation is that the probability might adapt based on how many apples are already in each box or some external factor.
Detailed considerations:
Adaptive probabilities: Suppose after i apples have been placed, the probability that the (i+1)-th apple goes to box j depends on the current counts. For instance, you might want to keep the boxes as “balanced” as possible or follow some heuristic.
Loss of a straightforward binomial model: The distribution for each box is no longer binomial, because each event is not a fixed p = 1/10. Instead, p can change dynamically, often making closed-form solutions very difficult or impossible.
Markov chain perspective: One way to study such a process is as a Markov chain, where the state is (n1, …, n10) indicating how many apples each box currently holds. You can theoretically find exact probabilities of final states, but it can quickly become intractable for large numbers.
Real-world tie-ins: In practical scenarios, you might have a “load balancer” that redirects apples (or tasks) to the least loaded box. Expectation of exact counts can then deviate significantly from the uniform random distribution’s results.
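To make this concrete, here is a simulation sketch under one assumed adaptive heuristic: each apple joins a uniformly chosen box among those currently holding the fewest apples (a least-loaded balancer). With 25 apples and 10 boxes this rule forces every box to end with 2 or 3 apples, drastically different from the uniform-random outcome.

import random
import statistics

def simulate_least_loaded(k, num_boxes=10, num_apples=25, trials=10_000):
    results = []
    for _ in range(trials):
        boxes = [0] * num_boxes
        for _ in range(num_apples):
            m = min(boxes)
            # Assumed heuristic: pick uniformly among the least-loaded boxes
            candidates = [i for i, c in enumerate(boxes) if c == m]
            boxes[random.choice(candidates)] += 1
        results.append(sum(1 for c in boxes if c == k))
    return statistics.mean(results)

print(simulate_least_loaded(2), simulate_least_loaded(3))  # 5.0 and 5.0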
What if we allow the possibility that some apples are defective or “unplaceable”?
Imagine that some fraction of the apples (a proportion α of the total) cannot be placed in any box. Equivalently, you might say we discard some random portion of apples before placing them. Let Y be the number of "valid" apples that actually get placed.
Random number of placed apples: Instead of distributing exactly 25 apples, you distribute Y apples, where Y is a random variable. A simple model is that each apple independently has probability q = 1 − α of being valid, so Y follows a Binomial(25, q) distribution.
Conditional distribution: Condition on Y = y. Then you place those y apples uniformly at random over 10 boxes. The probability that box i gets exactly k apples given Y = y is choose(y, k) (1/10)^k (9/10)^(y−k).
Expected value: The overall expected count of boxes with exactly k apples is then E_Y[10 × choose(Y, k) (1/10)^k (9/10)^(Y−k)], which one evaluates by summing over the possible values of Y.
Pitfalls:
Double binomial layering: We now have a binomial random variable Y inside another binomial process for the distribution of Y apples into boxes. It can become cumbersome to derive a closed-form solution.
Misinterpretation: Failing to distinguish between the unconditional probability (where Y is random) and the conditional probability (given Y = y) can lead to confusion.
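A sketch that computes the expectation by summing over Y, plus a cross-check: each apple independently survives with probability q and then lands in a given box with probability 1/10, so the count in a fixed box is Binomial(25, q/10) by binomial thinning, which gives a direct formula to compare against.

from math import comb

def expected_via_sum_over_y(k, q, B=10, N=25):
    total = 0.0
    for y in range(k, N + 1):
        p_y = comb(N, y) * q**y * (1 - q)**(N - y)          # P(Y = y)
        p_box = comb(y, k) * (1/B)**k * (1 - 1/B)**(y - k)  # P(box has k | Y = y)
        total += p_y * B * p_box
    return total

def expected_via_thinning(k, q, B=10, N=25):
    # Each apple lands in a fixed box with probability q/B, independently
    return B * comb(N, k) * (q/B)**k * (1 - q/B)**(N - k)

print(expected_via_sum_over_y(2, q=0.8), expected_via_thinning(2, q=0.8))  # equal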
How can we approximate the distribution for large numbers of apples and boxes?
If both the number of apples N and the number of boxes B are large, the exact binomial formula can become computationally expensive. In many large-scale problems (e.g., distributing tasks to servers), a Poisson approximation is often used.
Poisson approximation rationale: When N is large and the probability p = 1/B is small (but Np = λ is moderate), the number of apples in any given box can be approximated by a Poisson(λ) random variable, where λ = N/B.
Implication for "exactly k": The probability that a box has exactly k apples is approximately e^(−λ) λ^k / k!. Then the expected count of boxes with exactly k apples becomes B × e^(−λ) λ^k / k!.
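A quick numerical comparison of the exact binomial expectation against the Poisson approximation, using the original numbers (N = 25, B = 10, so λ = 2.5):

from math import comb, exp, factorial

def exact_expected(N, B, k):
    return B * comb(N, k) * (1/B)**k * (1 - 1/B)**(N - k)

def poisson_expected(N, B, k):
    lam = N / B
    return B * exp(-lam) * lam**k / factorial(k)

for k in range(6):
    print(k, round(exact_expected(25, 10, k), 4), round(poisson_expected(25, 10, k), 4))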
Pitfalls:
The accuracy of the Poisson approximation depends on how large N is, how large B is, and whether N/B remains a reasonable constant.
Extreme tail events might be under- or over-estimated by the Poisson model.
What if we want the entire distribution of “number of boxes containing exactly k apples,” not just the expectation?
You might need not only the mean but the full probability that exactly x boxes end up with k apples, for x = 0, 1, 2, …, 10. This is a more nuanced question:
Hypergeometric-like dependencies: The event “box i has k apples” and “box j has k apples” are interdependent in complicated ways since the total number of apples is fixed at 25.
Multinomial perspective: You can frame the entire distribution of the 10 boxes as a single multinomial distribution with parameters (n=25, p1=1/10, …, p10=1/10). Then consider how many boxes in that distribution have a cell count of k. This can be approached but leads to a sum over partitions or a distribution known as the occupancy distribution.
Potential approach: For each partition of 25 apples among 10 boxes, check how many have exactly k. Calculate probabilities from the multinomial formula for each partition. Summing them up across all partitions that produce exactly x boxes with k apples yields the probability in question. However, this is usually not trivial to compute by hand, and might require dynamic programming or direct enumeration for moderate problem sizes.
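For moderate sizes, the enumeration can be organized as a recursion instead of a raw sum over partitions. Conditioned on r apples remaining for b equally likely boxes, the count in the next box is Binomial(r, 1/b), so peeling off one box at a time yields the full distribution. A sketch:

from functools import lru_cache
from math import comb

def distribution_boxes_with_k(N=25, B=10, k=5):
    @lru_cache(maxsize=None)
    def rec(r, b):
        # Distribution (tuple indexed by x) of the number of boxes holding
        # exactly k apples, when r apples are spread uniformly over b boxes
        if b == 0:
            return (1.0,)  # r is necessarily 0 here
        out = [0.0] * (b + 1)
        for c in range(r + 1):  # c = apples in the next box ~ Binomial(r, 1/b)
            p_c = comb(r, c) * (1 / b)**c * (1 - 1 / b)**(r - c)
            hit = 1 if c == k else 0
            for x, q in enumerate(rec(r - c, b - 1)):
                out[x + hit] += p_c * q
        return tuple(out)

    return rec(N, B)

dist = distribution_boxes_with_k(k=2)
print(sum(dist))                               # ≈ 1.0
print(sum(x * p for x, p in enumerate(dist)))  # matches 10 × choose(25, 2) (1/10)^2 (9/10)^23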
How do rounding or fractional issues arise in real-world data?
If “25 apples” is a theoretical concept but in reality each “apple” might represent a fraction of a resource (e.g., partial tasks or partial items that can be split across boxes), the assumption of integer distribution no longer strictly applies.
Fractional resources: Instead of counting how many apples went to each box, you might track how many total “fractions of an apple” end up in each box. This generally leads you to a different distribution model (possibly a Dirichlet-multinomial or other continuous analog).
Rounding: In practice, you might need to round partial apples to the nearest integer. This rounding changes the probability that a box ends up with exactly k apples.
Pitfalls:
Overlooking the fact that partial apples are not truly “discrete.”
Mismatch between theoretical discrete distributions and the real system that might allow partial allocations.
How would collisions or collision-based constraints change the outcome?
Sometimes in hashing or load balancing problems, collisions are disallowed or strongly discouraged. Translating that to “apples in boxes,” you might say: “No two apples can go into the same box simultaneously,” or only one apple is allowed per box until all boxes have at least one apple, and so on.
Sequential constraints: For instance, if the apples arrive one at a time, each time we place an apple into any box that is not yet “occupied,” we get a classical occupancy problem with a changing number of available boxes. This yields a different probability structure altogether.
Uniform distribution no longer guaranteed: If you do not allow collisions, then after the first apple goes to a box, that box may become "unavailable" for the second apple if the rule is "exactly one apple per box." By the time the 10th apple is placed, all boxes might be full (if we have only 10 boxes). In that scenario, the probability that a box holds k ≥ 2 apples is obviously 0.
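A small sketch of the "one apple per box until every box is occupied" variant mentioned above (an assumed rule, for illustration): the first 10 apples land in distinct boxes, after which placement reverts to uniform over all boxes.

import random

def place_with_occupancy_rule(num_boxes=10, num_apples=25):
    boxes = [0] * num_boxes
    for _ in range(num_apples):
        empty = [i for i, c in enumerate(boxes) if c == 0]
        if empty:
            # Collision-avoidance phase: only empty boxes are eligible
            boxes[random.choice(empty)] += 1
        else:
            # Every box occupied: revert to uniform placement
            boxes[random.randint(0, num_boxes - 1)] += 1
    return boxes

print(place_with_occupancy_rule())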
Pitfalls:
Misinterpreting partial constraints (like “at most two apples per box”) can inadvertently lead to complicated counting arguments.
Solutions that assume fully independent placements do not apply when collisions or capacity rules are enforced.