ML Interview Q Series: Expected Unique Items in Sampling: Solved with Linearity of Expectation.
Suppose you have n distinct integers labeled 1 through n. If you randomly pick one integer at a time, uniformly and with replacement, for a total of n picks, what is the average (expected) number of unique integers observed?
By linearity of expectation, the expected total number of distinct integers is [ \huge E[X] = n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^n\Bigr]. ]
Comprehensive Explanation
Underlying Probability Framework
When you draw an integer from 1 to n uniformly with replacement, each draw is an independent event. For each specific integer ( i ), we examine the probability that it does not appear in any of the ( n ) draws. Because each draw is uniform, the chance of not picking that integer in a single draw is ( \frac{n-1}{n} ). Across ( n ) independent draws, the probability of consistently missing integer ( i ) is: [ \huge \Bigl(\frac{n-1}{n}\Bigr)^n. ] The complement of this event—namely that integer ( i ) does appear at least once—is therefore [ \huge 1 - \Bigl(\frac{n-1}{n}\Bigr)^n. ]
Indicator Variables and Linearity of Expectation
We define ( X_i ) as an indicator random variable that is 1 if integer ( i ) is drawn at least once and 0 otherwise. The expected value of this indicator is just the probability it is 1: [ \huge E[X_i] = p(X_i = 1) = 1 - \Bigl(\frac{n-1}{n}\Bigr)^n. ]
To find the total number of distinct integers, we sum all these indicators: [ \huge X = \sum_{i=1}^n X_i. ] By the property known as linearity of expectation, the expectation of the sum is simply the sum of the expectations: [ \huge E[X] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^n\Bigr]. ] Since each term is the same, this results in: [ \huge E[X] = n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^n\Bigr]. ]
Practical Interpretations and Limits
When ( n ) is small, the expression ( 1 - \bigl(\frac{n-1}{n}\bigr)^n ) is noticeably less than 1, so out of ( n ) total draws you will, on average, miss some of the ( n ) numbers. For example, with ( n = 3 ) the expected distinct count is ( 3\bigl[1 - (2/3)^3\bigr] \approx 2.11 ).
As ( n ) grows large, (\bigl(\frac{n-1}{n}\bigr)^n \approx e^{-1}). Thus the expected number of distinct integers in ( n ) draws approaches ( n \bigl(1 - e^{-1}\bigr) ), which is roughly ( 0.632n ).
What if we sample more than n times?
When the number of draws is ( m ) (not necessarily equal to ( n )), the probability an integer ( i ) does not get chosen in any draw is [ \huge \Bigl(\frac{n-1}{n}\Bigr)^m. ] Then [ \huge E[X_i] = 1 - \Bigl(\frac{n-1}{n}\Bigr)^m, ] and the expected number of distinct integers becomes [ \huge n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^m\Bigr]. ] So if you exceed ( n ) draws, simply replace ( n ) with ( m ) in the exponent to get the updated expectation.
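As a quick sanity check of this generalization, the following sketch (assuming the same uniform-with-replacement setup and arbitrarily chosen values of ( n ) and ( m )) compares the closed-form expression against a simulation:

import numpy as np

def expected_unique_m_draws(n, m, trials=50000):
    # Simulate m uniform draws with replacement from n items per trial
    # and average the number of distinct values observed.
    rng = np.random.default_rng(0)
    draws = rng.integers(0, n, size=(trials, m))
    return np.mean([len(set(row)) for row in draws])

n, m = 50, 120  # hypothetical values for illustration
theoretical = n * (1 - ((n - 1) / n) ** m)
print(f"Simulated:   {expected_unique_m_draws(n, m):.4f}")
print(f"Closed form: {theoretical:.4f}")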
How does this relate to the coupon collector problem?
A related concept is the coupon collector problem, which asks how many draws you need before you have seen all ( n ) unique items at least once. That problem computes the expected number of draws required to collect every item, whereas here we compute how many unique items appear in exactly ( n ) draws. The underlying mathematics is similar, but the two questions target different quantities.
Could we find the exact distribution of the number of distinct values?
Yes. The probability that exactly ( k ) distinct integers appear out of ( n ) draws can be derived using combinatorial arguments (e.g., using Stirling numbers of the second kind or occupancy distributions). However, it is often simpler to use indicator variables to compute the expected number of distinct integers, without having to fully characterize the entire distribution.
How does this result behave for very large n?
For large ( n ), the term (\bigl(\frac{n-1}{n}\bigr)^n) converges to ( e^{-1} ). Thus the expected fraction of distinct items seen is about ( 1 - e^{-1} \approx 0.6321. ) Therefore, no matter how large ( n ) gets, if you only draw ( n ) times, you can anticipate seeing around 63% of the distinct numbers on average.
Implementation Example in Python
Below is a short Python snippet that performs a simulation to illustrate the expected number of distinct values when ( n ) draws are taken from ( n ) unique items:
import numpy as np

def expected_unique_estimate(n, trials=100000):
    counts = []
    for _ in range(trials):
        draws = np.random.randint(0, n, size=n)
        counts.append(len(set(draws)))
    return np.mean(counts)

# Example usage
n = 100
empirical_estimate = expected_unique_estimate(n)
theoretical_estimate = n * (1 - ((n - 1) / n)**n)
print(f"Empirical estimate: {empirical_estimate:.4f}")
print(f"Theoretical estimate: {theoretical_estimate:.4f}")
In this code:
We generate an array of ( n ) draws, each an integer from 0 to ( n-1 ).
We take the size of the set of drawn numbers to see how many distinct values occurred.
We repeat multiple trials to get an empirical average.
We compare it with the derived theoretical value ( n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^n\Bigr]. )
This simulation confirms the theory in practice, with the empirical estimate converging to the closed-form expression as the number of trials increases.
Below are additional follow-up questions
What changes if each item does not have the same probability of being drawn?
When the probabilities are not uniform, each distinct integer ( i ) might have its own probability ( p_i ) of being selected on any given draw. Suppose you perform ( n ) draws independently, each time sampling integer ( i ) with probability ( p_i ). Then:
Indicator Definition Define ( X_i ) = 1 if integer ( i ) is drawn at least once, otherwise 0.
Probability ( X_i = 1 ) The probability that integer ( i ) is never drawn is ( (1 - p_i)^n ). So [ \huge p(X_i = 1) = 1 - (1 - p_i)^n. ]
Expected Value By linearity of expectation, [ \huge E\Bigl[\sum_{i=1}^n X_i\Bigr] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \Bigl[1 - (1 - p_i)^n\Bigr]. ] Thus, the expected number of distinct items is no longer a single expression with (\bigl(\frac{n-1}{n}\bigr)^n), but a sum over all different probabilities.
Pitfalls and Edge Cases
If some ( p_i ) are extremely small, there is a high chance those items won’t appear even once, reducing the overall distinct count.
If some ( p_i ) are very large, those items appear in almost every trial, which could skew the distribution of distinct items seen.
Ensuring the probabilities sum to 1 might introduce computational or normalization concerns, especially when ( n ) is huge.
This non-uniform scenario is common in real-world applications where certain outcomes are more likely than others (e.g., user preferences in a recommendation system). The formula for the expected count of unique draws is conceptually the same but depends on each item’s distinct probability.
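As a minimal sketch of the non-uniform case (the probability vector p below is generated at random purely for illustration), the sum-of-indicators formula can be checked against a simulation that samples with np.random.Generator.choice:

import numpy as np

n_items, n_draws, trials = 20, 20, 50000
rng = np.random.default_rng(1)

# Hypothetical non-uniform probabilities, normalized to sum to 1.
p = rng.random(n_items)
p /= p.sum()

# Closed form: sum_i [1 - (1 - p_i)^n_draws]
theoretical = np.sum(1 - (1 - p) ** n_draws)

# Simulation: draw with the given probabilities and count distinct values.
counts = [len(set(rng.choice(n_items, size=n_draws, p=p))) for _ in range(trials)]
print(f"Closed form: {theoretical:.4f}")
print(f"Simulated:   {np.mean(counts):.4f}")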
How does the result change if the draws are dependent rather than independent?
In many real-world processes, consecutive draws might not be independent. For example, once you draw a certain number, the probability of drawing it again might increase or decrease based on some external condition.
Impact on Probabilities If the draws are dependent, you can no longer say the probability of missing a particular item in all draws is simply (\bigl(\tfrac{n-1}{n}\bigr)^n). Each draw’s probability can be conditioned on previous outcomes.
Indicator Variables We can still define indicator variables ( X_i ) for each item ( i ). However, [ \huge p(X_i = 1) = 1 - p(X_i = 0), ] and ( p(X_i = 0) ) now must consider the joint distribution over all ( n ) draws.
Linearity of Expectation Even if the draws are dependent, linearity of expectation still holds: [ \huge E\Bigl[\sum_{i=1}^n X_i\Bigr] = \sum_{i=1}^n E[X_i]. ] The challenge becomes computing ( E[X_i] ) accurately when the draws are correlated.
Pitfalls
Overlooking correlation can lead to over- or under-estimation of ( E[X_i] ).
A naive assumption of independence might be strongly violated in practical systems (e.g., if the sampling mechanism changes based on prior samples).
If the correlation structure is complex, deriving a closed-form solution might be infeasible, requiring simulation or approximate methods (e.g., Markov chains or generative modeling).
Real-world scenarios (e.g., user interactions in a dynamic environment) often exhibit dependence, necessitating a deeper analysis or simulation-based approach to accurately measure the expected number of distinct items.
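To make the dependence point concrete, the sketch below simulates a deliberately correlated "sticky" sampler (an invented toy process, not a standard model) that repeats the previous value with some probability. Linearity of expectation still equates the mean distinct count with the sum of the per-item inclusion probabilities, but the independent-draw formula no longer applies:

import numpy as np

def sticky_draws(n, m, stickiness, rng):
    # Dependent sampler: with probability `stickiness` repeat the previous
    # value, otherwise draw uniformly from 0..n-1.
    out = [rng.integers(0, n)]
    for _ in range(m - 1):
        if rng.random() < stickiness:
            out.append(out[-1])
        else:
            out.append(rng.integers(0, n))
    return out

n, m, trials = 20, 20, 20000
rng = np.random.default_rng(2)
seen = np.zeros(n)          # how often each item appears at least once
distinct_counts = []
for _ in range(trials):
    s = set(sticky_draws(n, m, stickiness=0.5, rng=rng))
    distinct_counts.append(len(s))
    for i in s:
        seen[i] += 1

# Linearity of expectation: E[X] = sum_i P(X_i = 1), even with dependence.
print(f"Mean distinct count:            {np.mean(distinct_counts):.4f}")
print(f"Sum of empirical P(X_i = 1):    {np.sum(seen / trials):.4f}")
print(f"Independent-draw formula gives: {n * (1 - ((n - 1) / n) ** m):.4f}")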
What if the number of items ( n ) is extremely large or effectively infinite?
When ( n ) is very large, storing or tracking items explicitly can be computationally impractical. We might also face scenarios where ( n ) is unknown or unbounded (e.g., a streaming process with a huge universe of possible IDs).
Approximation Approaches
Using Bloom Filters / HyperLogLog: These data structures let you approximate the count of unique items without storing every element. They handle large-scale sets efficiently.
Asymptotic Analysis: For large ( n ), [ \huge \Bigl(\frac{n-1}{n}\Bigr)^n \approx e^{-1}, ] so the expected count becomes ( n (1 - e^{-1}) \approx 0.632\,n ).
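A quick numerical check of this approximation for a few values of ( n ):

import math

for n in [10, 100, 10_000, 1_000_000]:
    exact = n * (1 - ((n - 1) / n) ** n)
    asymptotic = n * (1 - math.exp(-1))
    print(f"n={n:>9}: exact={exact:.2f}, asymptotic={asymptotic:.2f}")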
Infinite (or Unknown) ( n )
If ( n ) is theoretically unbounded, you may model the process as a Poisson sampling problem. Alternatively, you might track how many unique items are found in a sample without knowing the total possible.
Some algorithms (like adaptive sampling or reservoir sampling) can provide unbiased estimates of how many unique items you might have in a massive set.
Edge Cases
Implementation details (e.g., floating-point errors) become amplified when ( n ) is huge.
Memory constraints can force approximate algorithms. Exact counting would be infeasible.
Real-world data might contain new items over time, essentially making ( n ) grow dynamically.
Hence, although the closed-form formula still exists, practical implementation for large-scale or unbounded ( n ) typically demands approximation and specialized data structures.
What if we only care about the first ( k ) draws and want to know the expected distinct count among those ( k ) draws?
Sometimes, you might do ( n ) total draws but only focus on the first ( k ) for some reason—maybe the first ( k ) draws are critical to your application, or you’re analyzing partial progress towards collecting distinct items.
Direct Subset of Draws If all draws are i.i.d. and uniform over ( n ) items, but you only consider the first ( k ) draws, then the formula changes to: [ \huge E[X_{\text{distinct in first } k}] = n \Bigl[1 - \Bigl(\frac{n-1}{n}\Bigr)^k \Bigr]. ] This is essentially the same logic as before, but with ( k ) in place of ( n ).
Partial Collection Insight
After ( k ) draws, you might want to keep sampling to see how many more distinct items you can collect.
This can help measure how quickly new items appear in a partial sample and can be extended to an incremental or dynamic setting.
Potential Pitfalls
You might confuse the total number of draws ( n ) with the number of draws you’re actively examining ( k ). If ( k < n ), you are intentionally ignoring some draws or perhaps they haven’t happened yet.
If the distribution or probability changes after some draws (e.g., a time-based or adaptive process), analyzing only the first ( k ) draws might not reflect the entire sampling process.
Real-world data scenarios often require analyzing partial segments of draws, especially when you only have partial data or are doing a real-time streaming analysis.
How might we compute or approximate higher moments (like variance) of the number of distinct values?
Beyond the expected number of distinct values, some applications want to quantify uncertainty or variability—how likely it is to see a big deviation from the mean.
Exact Variance Computation Recall the total distinct count is ( X = \sum_{i=1}^n X_i ) with ( X_i ) as indicators. For the variance, [ \huge \mathrm{Var}(X) = \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j). ] Each ( X_i ) has [ \huge \mathrm{Var}(X_i) = p(X_i=1)\,\bigl(1 - p(X_i=1)\bigr), ] and the covariance term (\mathrm{Cov}(X_i, X_j)) requires ( p(X_i=1, X_j=1) ).
Probability of ( X_i=1 ) and ( X_j=1 ) For uniform independent sampling, [ \huge p(X_i=1, X_j=1) = 1 - p(\text{miss } i) - p(\text{miss } j) + p(\text{miss both } i \text{ and } j), ] and [ \huge p(\text{miss both } i \text{ and } j) = \Bigl(\frac{n-2}{n}\Bigr)^n. ] Plugging these into the covariance formula can give an exact expression, but it quickly gets unwieldy.
Approximation Methods
For large ( n ), you could use asymptotic expansions or normal approximations.
Monte Carlo simulation can also be used: run repeated sampling and measure variance empirically.
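The sketch below (uniform independent draws, with ( n ) chosen arbitrarily) evaluates the exact variance via the covariance expression above and compares it with a Monte Carlo estimate:

import numpy as np

n = 50
q1 = ((n - 1) / n) ** n          # P(a given item is never drawn)
q2 = ((n - 2) / n) ** n          # P(two given items are both never drawn)
p1 = 1 - q1                      # P(X_i = 1)
p11 = 1 - 2 * q1 + q2            # P(X_i = 1 and X_j = 1)

# Var(X) = n * Var(X_i) + 2 * C(n, 2) * Cov(X_i, X_j)
var_exact = n * p1 * (1 - p1) + n * (n - 1) * (p11 - p1 ** 2)

# Monte Carlo estimate of the variance of the distinct count.
rng = np.random.default_rng(3)
counts = [len(set(rng.integers(0, n, size=n))) for _ in range(100_000)]
print(f"Exact variance:       {var_exact:.4f}")
print(f"Monte Carlo variance: {np.var(counts):.4f}")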
Pitfalls
Ignoring covariance might lead you to underestimate the total variance, especially because indicators are not independent (if you draw one integer often, it can slightly reduce the chance of drawing others).
For non-uniform probabilities or dependent draws, the calculations get even more complex.
Despite the complexity, having a handle on variance can be crucial if you need confidence intervals or want to understand worst-case deviations in how many distinct items appear.
What if the set of possible items can change during the process?
In certain real-world scenarios (e.g., streaming data or dynamic user bases), the number of possible items ( n ) might itself grow or shrink over time.
Evolving Universe of Items
If ( n ) increases after some draws, the previously used formula might no longer hold because the sample space expanded.
You may need a time-dependent model: ( n = n(t) ) at each draw ( t ).
Adaptive Indicator Variables
You cannot define ( X_i ) for a fixed ( i ) from 1 to ( n ) if ( n ) is always changing. Instead, you might define an indicator for “did we draw an item that appeared in the set at time ( t )?”
The probability of drawing a newly added item might differ from that of older items.
Pitfalls
Simply applying the original formula can significantly mislead you if the number of possible items changes mid-process.
If new items are introduced with non-trivial probability, the process of counting distinct items is no longer Markovian in the simpler sense—modeling becomes more complex.
Such dynamics appear, for instance, in continuously updating catalogs of products, or social networks where new entities are constantly introduced. Handling it rigorously can require a mixture of combinatorial logic and partial differential equations or Markov chain modeling, depending on the arrival rates of new items.
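For intuition, here is a purely illustrative simulation of a growing universe; the growth rule (a brand-new item joins the pool with some fixed probability before each draw) is an arbitrary assumption, not a standard model. It shows how the fixed-( n ) formula drifts away from what the dynamic process actually produces:

import numpy as np

def simulate_growing_universe(n0, m, p_new, rng):
    # Before each of the m draws, a new item joins the pool with
    # probability p_new; each draw is then uniform over the current pool.
    pool_size = n0
    seen = set()
    for _ in range(m):
        if rng.random() < p_new:
            pool_size += 1
        seen.add(int(rng.integers(0, pool_size)))
    return len(seen), pool_size

rng = np.random.default_rng(4)
results = [simulate_growing_universe(n0=50, m=50, p_new=0.2, rng=rng) for _ in range(20_000)]
mean_distinct = np.mean([r[0] for r in results])
mean_pool = np.mean([r[1] for r in results])
print(f"Mean distinct items seen: {mean_distinct:.2f}")
print(f"Mean final pool size:     {mean_pool:.2f}")
print(f"Fixed-n formula (n=50):   {50 * (1 - (49 / 50) ** 50):.2f}")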
Could collisions of values in hashing impact the count of distinct elements?
In practical implementations—especially large-scale ones—integer IDs might be hashed into smaller hash tables or data structures (e.g., Bloom filters). Collisions can cause distinct items to map to the same location in memory.
Bloom Filter False Positives
Bloom filters allow some probability of a false positive, meaning a newly observed item might appear as if it were already seen.
This undercounts the true number of distinct items because collisions artificially merge multiple distinct IDs into one “seen” slot.
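The toy Bloom filter below (a from-scratch sketch with a deliberately undersized bit array, not a production library) makes the undercounting effect concrete: once enough bits are set, false positives cause genuinely new items to be treated as already seen.

import hashlib

class TinyBloom:
    # Deliberately undersized Bloom filter so that collisions are visible.
    def __init__(self, n_bits=256, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

bloom = TinyBloom()
true_distinct, bloom_distinct = set(), 0
for item in range(1000):          # 1000 genuinely distinct items
    true_distinct.add(item)
    if not bloom.might_contain(item):
        bloom_distinct += 1       # counted as new only if the filter says "unseen"
        bloom.add(item)

print(f"True distinct count:      {len(true_distinct)}")
print(f"Bloom-filter-based count: {bloom_distinct}")

With only 256 bits, the reported count falls well short of 1000; real deployments size the bit array and number of hash functions to keep the false-positive rate acceptably low.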
HyperLogLog
HyperLogLog is a widely used algorithm for cardinality estimation with a known error bound. It still has collisions, but the error is statistically controlled.
In practice, you get an approximate count of distinct items with a small relative error (e.g., around 2% depending on memory usage).
Pitfalls
Relying on hashed structures can systematically under- or over-count the number of distinct items if collisions or false positives are frequent.
Checking collision probabilities is important to ensure that the approximate data structure meets the accuracy requirements.
For extremely large ( n ), even low collision probabilities can accumulate into noticeable estimation errors.
Hence, the mathematical formula for expected distinct values assumes perfect identification of distinct items. In real systems with hashing or approximate set membership data structures, collisions need to be carefully accounted for (or minimized) to keep track of the true distinct count accurately.
How might parallel or distributed computation affect these calculations?
In large-scale data processing frameworks (e.g., Spark, Hadoop, MapReduce), you might count distinct items across multiple distributed partitions.
Merging Distinct Counts
A naive approach collects distinct items in each partition separately and then merges them. If you store exact sets, you can deduplicate across partitions and get the global distinct count.
Approximate algorithms like HyperLogLog can produce sub-estimates in each partition, which are then merged using an algorithm-specific merge function.
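A minimal sketch of exact merging (pure Python, with an arbitrary synthetic partitioning of the draws) shows why per-partition distinct counts cannot simply be summed, while a union of per-partition sets yields the correct global count:

import numpy as np

rng = np.random.default_rng(5)
n, draws_per_partition, n_partitions = 100, 100, 4

# Each "partition" holds its own slice of the draws.
partitions = [rng.integers(0, n, size=draws_per_partition) for _ in range(n_partitions)]

per_partition_sets = [set(p) for p in partitions]
naive_sum = sum(len(s) for s in per_partition_sets)       # double-counts overlaps
global_distinct = len(set().union(*per_partition_sets))   # correct: deduplicate across partitions

print(f"Sum of per-partition distinct counts: {naive_sum}")
print(f"Global distinct count (set union):    {global_distinct}")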
Load Balancing Pitfalls
If partitions are uneven, or if draws are not truly distributed uniformly across machines, your partial distinct counts may bias the final result.
Communication overhead and the method of merging partial distinct sets can introduce errors or complexities (e.g., the same item might appear in multiple partitions, but you must ensure it’s counted only once overall).
Scalability
For extremely large data sets, distributed solutions are typically essential. The fundamental probability results still hold locally, but the implementation details around partitioning, merging, and approximate data structures will determine how accurately you can estimate the global distinct count.
In practice, these distributed scenarios heavily rely on algorithms designed to merge partial cardinality estimates accurately. Understanding how local computations of distinct items aggregate is crucial to scaling up.
Are there any real-world complications involving data cleanliness or label noise?
Yes. In industrial applications, you often face messy or incomplete data.
Duplicate Labels
Sometimes the data might contain mislabeled or duplicated entries which artificially increase or decrease the count of distinct items. For example, different IDs for the same real-world entity.
You might need data-cleaning steps (e.g., record linkage, deduplication) before counting distinct values truly makes sense.
Corrupted or Missing Data
Missing data can artificially reduce the number of distinct items. Corrupt data might appear to be entirely new items when they aren’t.
Probability-based counting might not directly account for how the data was lost or corrupted, leading to biases in the final tally.
Noise
In sensor networks or distributed systems, random noise in measurements can create “fake” new IDs or cause repeated items to appear unique.
De-noising or smoothing techniques are often required if you want an accurate unique count.
Hence, the nice theoretical formulas assume each label is pristine and consistent. Data ingestion pipelines often require extensive cleaning, normalization, and deduplication for the result to align with theoretical expectations.