ML Case-Study Interview Question: Using Monte Carlo Simulation to Find a Sample Size for Reliable Extension Quality Tracking
Case-Study Question
A large platform has a marketplace of many software extensions. Quality is monitored by checking how many requirements each extension violates. They cannot review every extension, so they collect a random sample each month and compute the average number of violations. That sample-based average is their key success metric. They also attempt to improve extension quality by auditing and enforcing new policies. They want to know how many extensions to sample so they can reliably detect month-to-month quality improvements in their marketplace.
Propose a solution. Specifically, outline how you would:
Use a Monte Carlo simulation to model different sample sizes and monthly improvement rates.
Suggest how to choose the best sample size given the cost of reviewing more extensions versus the need for a reliable metric trend.
Show the steps, code snippets, and any mathematical rationale needed to justify your approach.
Present all relevant formulas and reasoning.
Detailed Solution
Simulation Overview
A Monte Carlo simulation creates synthetic datasets under assumptions about the real distribution of requirement violations and the rate of monthly improvement. It samples data multiple times to quantify how often the observed sample metric aligns with the true underlying metric trend.
Distribution Choice
Requirement violations can often be modeled with a Poisson distribution, because each extension’s requirements can be treated as a series of independent checks. The Poisson distribution has a single parameter lambda, which is the mean violation rate across the population. Its probability mass function is:
P(X = k) = (lambda^k * e^(-lambda)) / k!
X is the random variable representing the number of violations, k is a nonnegative integer, and lambda is both the mean and variance of the distribution. In code, you can sample values from this Poisson for each extension under review.
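As a quick illustration (using an assumed mean rate of 4 violations per extension, an arbitrary value), the sample mean and variance of simulated Poisson draws should both land near lambda:
import numpy as np

rng = np.random.default_rng(42)
lam = 4.0                                 # assumed population mean violation rate
sample = rng.poisson(lam=lam, size=1000)  # 1,000 simulated extension reviews
print(sample.mean(), sample.var())        # both should be close to lambda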
Monthly Improvement
Assume an underlying percentage decrease in lambda each month (for example, 5 percent). This represents the true trend in the population. The simulation randomly draws violation counts from the decreasing Poisson distribution every month for a set number of months (for instance, 12 months).
Implementing the Simulation in Python
Below is a rough example. This code uses pandas and numpy. It generates a series of synthetic data for each month, based on a chosen initial lambda, monthly decrease, and sample size.
import numpy as np
import pandas as pd
def generate_time_series(initial_lambda, monthly_decrease, audits_per_month, months=12):
    data = []
    current_lambda = initial_lambda
    for month in range(1, months + 1):
        # Sample from Poisson
        sample_counts = np.random.poisson(lam=current_lambda, size=audits_per_month)
        data.append({
            'month': month,
            'sample_mean': sample_counts.mean(),
            'true_mean': current_lambda
        })
        # Decrease lambda
        current_lambda = current_lambda * (1 - monthly_decrease)
    return pd.DataFrame(data)
This function returns month-by-month rows of sampled means. It simulates what you would see if the true mean indeed dropped each month.
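For instance, a single run with an assumed starting rate of 10 violations per extension, a 5 percent monthly improvement, and 50 audits per month (all illustrative values) could be inspected like this:
df = generate_time_series(initial_lambda=10, monthly_decrease=0.05, audits_per_month=50)
print(df)
# sample_mean should roughly track true_mean, with noise that shrinks as audits_per_month grows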
Assessing Variability
Run the generation function multiple times, then track how frequently the observed sample trend matches the true decreasing trend. Also track the mean absolute percentage error (MAPE). One formula for MAPE over the simulated months is:
MAPE = (100% / n) * sum over t of |(A_t - F_t) / A_t|
A_t is the true mean for month t, F_t is the sampled mean for month t, and n is the number of months. Averaging across multiple simulation runs shows how stable the sampled metric is.
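A minimal sketch of this assessment, reusing generate_time_series above; the helper name assess_variability, the run count, and the default parameters are illustrative assumptions:
def assess_variability(initial_lambda, monthly_decrease, audits_per_month, n_runs=500):
    decrease_rates = []
    mapes = []
    for _ in range(n_runs):
        df = generate_time_series(initial_lambda, monthly_decrease, audits_per_month)
        month_over_month = df['sample_mean'].diff().dropna()
        decrease_rates.append((month_over_month < 0).mean())  # share of months showing a decrease
        ape = np.abs((df['true_mean'] - df['sample_mean']) / df['true_mean'])
        mapes.append(100 * ape.mean())  # MAPE in percent for this run
    return np.mean(decrease_rates), np.mean(mapes)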
Search Over Parameter Combinations
Vary parameters like sample size and monthly decrease. A grid search approach runs the simulation across these combinations. You then compute metrics like:
How often does the sample-based metric decrease from month to month (matching the true decrease)?
What is the average MAPE?
Larger sample sizes reduce variance but increase cost. Smaller sample sizes increase uncertainty but reduce cost. An ideal choice balances these factors.
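A sketch of the grid search, reusing the assess_variability helper sketched earlier; the candidate sample sizes, improvement rates, and starting lambda are illustrative assumptions:
sample_sizes = [25, 50, 100, 200]       # candidate audits per month
improvement_rates = [0.02, 0.05, 0.10]  # assumed true monthly decreases

results = []
for n_audits in sample_sizes:
    for rate in improvement_rates:
        detection_rate, avg_mape = assess_variability(
            initial_lambda=10, monthly_decrease=rate, audits_per_month=n_audits)
        results.append({'audits_per_month': n_audits,
                        'monthly_decrease': rate,
                        'detection_rate': detection_rate,
                        'avg_mape': avg_mape})
grid = pd.DataFrame(results)
print(grid.sort_values(['monthly_decrease', 'audits_per_month']))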
Follow-Up Questions
How do you handle unknown distributions if Poisson does not fit?
Test other distributions (for example, normal or negative binomial) and compare how well they model real data. In practice, you would collect an initial sample and plot the distribution of violations. If you see overdispersion relative to Poisson, you might use a negative binomial. You would implement the same Monte Carlo approach but change the random draws to match the alternative distribution.
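As a sketch, assuming NumPy's negative binomial parameterization and an arbitrary dispersion parameter r, the Poisson draw inside the simulation could be swapped out like this:
def sample_negative_binomial(mu, r, size):
    # NumPy parameterizes the negative binomial by r successes and success probability p;
    # choosing p = r / (r + mu) gives draws with mean mu and variance mu + mu**2 / r
    p = r / (r + mu)
    return np.random.negative_binomial(r, p, size=size)

# Inside generate_time_series, the Poisson line would become, for example:
# sample_counts = sample_negative_binomial(mu=current_lambda, r=2.0, size=audits_per_month)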
What if monthly improvements are not constant?
Include a function that models a nonlinear or stepwise monthly improvement. For example, set smaller improvements in some months and larger in others. Let the simulation sample from a time-varying lambda. You still rely on repeated sampling to see how well the observed trend tracks the true pattern.
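A small sketch, assuming an arbitrary stepwise schedule in which improvements are slower in the first half of the year:
def monthly_decrease_schedule(month):
    # Illustrative rates: 2 percent improvement through month 6, 8 percent afterwards
    return 0.02 if month <= 6 else 0.08

# Inside the monthly loop of generate_time_series, replace the constant decrease with:
# current_lambda = current_lambda * (1 - monthly_decrease_schedule(month))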
How do you ensure enough simulation runs?
Run enough iterations to produce stable estimates of the variability metrics. Measure the standard error of those estimates across iterations. Increase iterations until the estimates no longer change significantly.
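One way to sketch this convergence check, with an illustrative cap of 2,000 runs and the same assumed parameters as before:
estimates = []
for _ in range(2000):
    df = generate_time_series(initial_lambda=10, monthly_decrease=0.05, audits_per_month=50)
    estimates.append((df['sample_mean'].diff().dropna() < 0).mean())

estimates = np.array(estimates)
# Standard error of the detection-rate estimate after every 50 additional runs
running_se = [estimates[:k].std(ddof=1) / np.sqrt(k) for k in range(50, len(estimates) + 1, 50)]
# Stop adding runs once the standard error stabilizes at an acceptably small value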
How would you apply hypothesis testing instead of just descriptive metrics?
Construct a null hypothesis that the metric has not improved. Collect monthly samples and track the difference between successive months. Under the null, build a bootstrapped distribution of those differences by resampling. If the observed difference falls beyond the chosen confidence threshold of that null distribution (for example, the 95th percentile), reject the null and conclude there is a real improvement.
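A hedged sketch of such a bootstrap test, assuming the raw per-extension violation counts for two consecutive months are available as NumPy arrays; the iteration count and significance level are illustrative:
def bootstrap_improvement_test(earlier_counts, later_counts, n_boot=10000, alpha=0.05):
    rng = np.random.default_rng(0)
    observed_diff = earlier_counts.mean() - later_counts.mean()  # positive if quality improved
    # Under the null of no improvement, the two months share one distribution, so pool and resample
    pooled = np.concatenate([earlier_counts, later_counts])
    null_diffs = []
    for _ in range(n_boot):
        a = rng.choice(pooled, size=len(earlier_counts), replace=True)
        b = rng.choice(pooled, size=len(later_counts), replace=True)
        null_diffs.append(a.mean() - b.mean())
    p_value = np.mean(np.array(null_diffs) >= observed_diff)
    return observed_diff, p_value, p_value < alpha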
How do you communicate these results to non-technical teams?
Present numeric outcomes showing how many months reliably show a decrease. Show that with a sample size of X, you see a Y% chance of catching a real monthly improvement. Emphasize the cost-benefit tradeoff of more sampling versus more confidence. Use simple charts to visualize how the metric tracks the true trend under different scenarios.
What if the time frame is long (e.g., multiple years)?
Extend the simulation range and monthly loops. Or run a rolling simulation that updates parameters if you suspect changes over time. The same fundamental approach applies, but you simulate for more months. Continually compare the observed and true means to assess how sample size interacts with the length of observation.
How do you decide final sample size?
Look at simulation outputs for each tested sample size and monthly decrease assumption. Identify the point where MAPE drops to an acceptable level or month-to-month detection accuracy hits a desired threshold. If the cost of increasing sample size outweighs the benefit, choose the smaller size. Otherwise, collect more data to reduce uncertainty.
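As a final selection sketch, assuming the grid DataFrame from the grid-search sketch above and illustrative thresholds of a 90 percent detection rate and a 5 percent average MAPE:
acceptable = grid[(grid['detection_rate'] >= 0.90) & (grid['avg_mape'] <= 5.0)]
if not acceptable.empty:
    # Smallest sample size that still meets both thresholds
    print(acceptable.sort_values('audits_per_month').iloc[0])
else:
    print("No tested sample size meets the thresholds; expand the grid or relax the targets")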