ML Interview Q Series: Bayesian Inference for Basketball Free Throw Success with Discrete Priors.
Your friend is a basketball player. To find out how good he is at free throws, you ask him to shoot 10 throws. You assume three possible values, 0.25, 0.50, and 0.75, for the success probability of your friend's free throws. Before the 10 throws are taken, you believe these three values have the respective probabilities 0.2, 0.6, and 0.2. What is the posterior distribution of the success probability given that your friend scores 7 times out of the 10 throws?
Short Compact Solution
The prior distribution over the three possible values of the success probability is p0(0.25) = 0.2, p0(0.50) = 0.6, p0(0.75) = 0.2. Given 7 successes in 10 throws, the likelihood of observing the data is proportional to theta^7 (1−theta)^3. Multiplying each likelihood by its corresponding prior and normalizing, we get approximately:
p(0.25 | data) = 0.0051
p(0.50 | data) = 0.5812
p(0.75 | data) = 0.4137
Comprehensive Explanation
Bayesian inference in this setup begins with a discrete prior distribution over the parameter theta, which represents the probability that your friend makes a free throw. You have three candidate values for theta: 0.25, 0.50, and 0.75. These are assigned prior probabilities of 0.2, 0.6, and 0.2 respectively. The experiment is shooting 10 free throws and counting the number of successes (which is 7).
To update the prior distribution in light of the new data (7 successes out of 10), we apply Bayes' rule:

p(theta | data) = [(10 choose 7) * theta^7 * (1−theta)^3 * p0(theta)] / [sum over theta' in {0.25, 0.50, 0.75} of (10 choose 7) * theta'^7 * (1−theta')^3 * p0(theta')]
Explanation of terms in the formula:
p(theta | data) is the posterior probability that the true success probability is theta (i.e., 0.25, 0.50, or 0.75) given the observed data.
The numerator consists of the binomial likelihood (computed as the combination 10 choose 7 multiplied by theta^7 (1−theta)^3) multiplied by the prior p0(theta).
The denominator is simply the normalizing constant that ensures all posterior probabilities sum to 1. It is the sum of the numerators for each possible value of theta in {0.25, 0.50, 0.75}.
Since 10 choose 7 is the same for every theta, it cancels in the ratio; it is shown explicitly only for conceptual clarity. Hence, the posterior for each candidate theta is proportional to the prior multiplied by theta^7 (1−theta)^3.
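Written out, the unnormalized posterior weights are:

$$
\begin{aligned}
w(0.25) &= 0.2 \times 0.25^{7} \times 0.75^{3} \approx 5.15 \times 10^{-6} \\
w(0.50) &= 0.6 \times 0.50^{7} \times 0.50^{3} \approx 5.86 \times 10^{-4} \\
w(0.75) &= 0.2 \times 0.75^{7} \times 0.25^{3} \approx 4.17 \times 10^{-4}
\end{aligned}
$$

Dividing each weight by their sum (approximately 1.008 × 10^−3) reproduces the posterior probabilities below.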
Carrying out these calculations, we obtain the approximate posterior probabilities:
p(0.25 | data) = 0.0051
p(0.50 | data) = 0.5812
p(0.75 | data) = 0.4137
These values form a valid probability distribution that sums to 1. They show that, given the observed data of 7 successes in 10 attempts, the highest posterior probability falls on theta = 0.50, followed by theta = 0.75, with theta = 0.25 being very unlikely.
What if the friend made 7 shots out of 8 and then missed the last 2?
If the total numbers of successes and attempts change, the likelihood simply uses the binomial formula with the new counts. In this particular scenario, however, nothing changes: going 7 for 8 and then missing 2 in a row still gives 7 successes out of 10, so the final posterior is exactly the same as before. Under the independence assumption, the number of successes is a sufficient statistic: any specific ordered sequence with 7 makes and 3 misses has probability theta^7 (1−theta)^3, so Bayesian inference does not distinguish between orderings.
How does one implement this calculation in Python?
Below is a simple illustrative code snippet:
import math

# Candidate values of theta and their prior probabilities
thetas = [0.25, 0.5, 0.75]
priors = [0.2, 0.6, 0.2]

# Observed data: successes and failures
k = 7
n = 10
k_fail = n - k

# Unnormalized posterior weights: binomial likelihood times prior.
# The binomial coefficient is the same for every theta, so it cancels
# after normalization; it is included here for completeness.
numerators = []
for theta, p0 in zip(thetas, priors):
    likelihood = math.comb(n, k) * (theta ** k) * ((1 - theta) ** k_fail)
    numerators.append(likelihood * p0)

# Normalize so the posteriors sum to 1
denominator = sum(numerators)
posteriors = [num / denominator for num in numerators]
print("Posterior distribution:", posteriors)
The output of this snippet will be approximately [0.0051, 0.5812, 0.4137].
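Equivalently, if SciPy is available, the binomial likelihood can come from scipy.stats.binom.pmf, which already includes the binomial coefficient. A minimal sketch, assuming SciPy is installed:

from scipy.stats import binom

thetas = [0.25, 0.5, 0.75]
priors = [0.2, 0.6, 0.2]

# binom.pmf(7, 10, theta) is P(7 successes in 10 trials | theta)
numerators = [binom.pmf(7, 10, theta) * p0 for theta, p0 in zip(thetas, priors)]
total = sum(numerators)
print([num / total for num in numerators])  # same posterior as above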
What if the number of possible values of theta were larger?
In practice, we might consider a continuous prior on theta over the interval [0, 1]. Handling this analytically is easiest with a Beta prior, which is conjugate to the binomial likelihood and therefore yields a Beta posterior. In a discrete case with more possible values, we just sum over each discrete candidate in the denominator. The computational overhead grows with the number of candidate points, but it remains straightforward to implement in code.
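For illustration, here is a minimal sketch of the conjugate update: with a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n − k). The uniform Beta(1, 1) prior below is an arbitrary choice for the example:

from scipy.stats import beta

a, b = 1.0, 1.0   # Beta(1, 1) is the uniform prior on [0, 1]
k, n = 7, 10      # observed successes and attempts

# Conjugate update: Beta prior + binomial data -> Beta posterior
posterior = beta(a + k, b + (n - k))
print("Posterior mean:", posterior.mean())           # 8/12, about 0.667
print("95% credible interval:", posterior.interval(0.95))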
Could one incorporate prior information from past experience?
Yes. If you know more about your friend’s free-throw skill (for example, you have already seen them shoot hundreds of times), you could place a more concentrated prior, say around 0.70 if you believe your friend is fairly proficient. In the discrete setup, you would assign higher probabilities to values near 0.70. In a continuous Beta prior approach, you would choose hyperparameters to reflect your belief (e.g. Beta(α, β) with α/(α+β) close to 0.70).
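One simple recipe, sketched below with illustrative numbers, is to pick the Beta hyperparameters from a desired prior mean and a pseudo-count that controls how concentrated the prior is:

# Desired prior mean and an "effective number of prior shots" (illustrative values)
prior_mean = 0.70
pseudo_count = 20

a = prior_mean * pseudo_count         # alpha = 14
b = (1 - prior_mean) * pseudo_count   # beta = 6

# After observing 7 makes and 3 misses, the posterior is Beta(21, 9)
post_a, post_b = a + 7, b + 3
print("Posterior mean:", post_a / (post_a + post_b))  # 21/30 = 0.70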
What are potential pitfalls with discrete priors in real-world scenarios?
Since the real probability of making a free throw is not necessarily restricted to a handful of points like 0.25, 0.50, and 0.75, using just three discrete values can be limiting. If the true probability is, say, 0.65, your discrete prior might give inaccurate posterior estimates. Also, with very limited data (like just 10 shots), results can be sensitive to the chosen discrete points. A Beta prior (continuous) is more flexible because it allows us to represent a continuum of beliefs about the success probability.
How could this approach extend beyond basketball free throws?
The same Bayesian updating principle applies to any binomial-type process—like website click-throughs (success/failure), medical test outcomes (positive/negative), or manufacturing defect rates (defect/no defect). You start with a prior over the parameter (probability of a success) and update based on observed data. This is a fundamental concept in Bayesian statistics and can be implemented in a wide variety of domains where data can be modeled as Bernoulli trials.
Below are additional follow-up questions
If the friend’s shot attempts are not independent, how would that affect the posterior?
When we perform Bayesian updating under the binomial assumption, we rely on the independence of each shot attempt. In real scenarios, success on one shot can sometimes affect confidence or fatigue for the next shot. If independence does not hold, then the binomial likelihood might no longer be accurate. Instead, we would need a more complex model that accounts for correlation between shots.
For instance, if we suspect a “hot hand” phenomenon (where a recent success increases the odds of the next success), we could use a Markov chain or a hidden state model where the probability of success depends on the outcome of the previous shot(s). Bayesian updating would then involve the likelihood derived from that more complex model. A potential pitfall is that if we ignore these dependencies and apply a simple binomial model, we might incorrectly assume we have more or less evidence for a particular success probability. That can lead to a posterior that is either too concentrated or too diffuse. Testing for independence and validating the assumptions on real data is critical.
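As a minimal sketch of this idea, the likelihood of a specific shot sequence can be computed under a first-order Markov model in which the success probability depends on the previous outcome. The function, parameter names, and example sequence below are hypothetical illustrations, not a standard API:

def markov_likelihood(shots, p_first, p_after_make, p_after_miss):
    # Likelihood of a 0/1 shot sequence under a first-order Markov model
    prob = p_first if shots[0] == 1 else 1 - p_first
    for prev, cur in zip(shots, shots[1:]):
        p = p_after_make if prev == 1 else p_after_miss
        prob *= p if cur == 1 else 1 - p
    return prob

# Example: a streaky ordering of 7 makes and 3 misses
shots = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(markov_likelihood(shots, p_first=0.5, p_after_make=0.7, p_after_miss=0.4))

Bayesian updating would then place a prior over (p_after_make, p_after_miss) and use this likelihood in place of the binomial one.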
How would the posterior change if the prior was extremely skewed, say p0(0.25) = 0.9, p0(0.50) = 0.05, p0(0.75) = 0.05?
In such a scenario, before seeing any data, we are heavily biased toward believing the success probability is 0.25. After observing 7 successes in 10 shots, the likelihood factor multiplies these priors. Although the data strongly suggest a higher success probability, the prior still pulls the posterior toward 0.25 more than a balanced prior would. Specifically, theta = 0.25 remains the least likely of the three values (because 7 successes out of 10 is not very consistent with 0.25), but its posterior probability is far larger than it would be under the more balanced prior.
A key lesson is that a strongly skewed prior can dominate the posterior unless the data is overwhelmingly indicative of a different conclusion. This can be both a strength and a weakness of Bayesian methods: a strong prior can enforce domain knowledge but can also lead to underweighting of new data if chosen poorly.
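Rerunning the earlier three-point calculation with this skewed prior makes the effect concrete:

thetas = [0.25, 0.5, 0.75]
priors = [0.9, 0.05, 0.05]  # heavily skewed toward 0.25

numerators = [p0 * theta**7 * (1 - theta)**3 for theta, p0 in zip(thetas, priors)]
total = sum(numerators)
print([num / total for num in numerators])
# Approximately [0.131, 0.277, 0.592]: theta = 0.25 is still the least likely
# value, but far more probable than the 0.0051 found under the balanced prior.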
How would we update our prediction for the 11th shot using these posterior probabilities?
To predict the probability that the friend makes the next (11th) free throw, we can do a posterior predictive calculation. In the discrete case, we simply weight each candidate theta by its posterior probability and sum:
Posterior predictive for a new shot = p(0.25 | data)*0.25 + p(0.50 | data)*0.50 + p(0.75 | data)*0.75
With the posterior distribution {0.0051, 0.5812, 0.4137} for {0.25, 0.50, 0.75}, that prediction is (0.0051 * 0.25) + (0.5812 * 0.50) + (0.4137 * 0.75) ≈ 0.602.
This single number is the predictive probability of hitting the next shot. A pitfall here is that if we forget to average over all possible thetas (for example, by plugging in only the single highest-posterior theta), we might underestimate or overestimate the true predictive probability. Bayesian practice typically marginalizes over the parameter rather than plugging in a point estimate.
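In code, this marginalization is a one-liner:

thetas = [0.25, 0.5, 0.75]
posteriors = [0.0051, 0.5812, 0.4137]

# Posterior predictive: average the candidate thetas, weighted by their posteriors
p_next = sum(theta * p for theta, p in zip(thetas, posteriors))
print(p_next)  # approximately 0.602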
What if we had partial observability, such as not knowing all the shots taken or missing data?
In real data collection, it’s possible we only observe some subset of shots (e.g., missing data when our friend practices alone). A straightforward binomial likelihood assumes we know both successes and the total number of attempts. If the total is uncertain or only partially observed, the likelihood is no longer simply theta^k (1−theta)^(n−k). Instead, we might need to consider a model that accounts for missingness: for instance, we might only know the successes but not the exact number of attempts.
This complicates posterior updating because we now have to integrate or sum over the possible unknown data, or make assumptions about the missing data mechanism (e.g., whether data are missing at random, missing systematically, etc.). A pitfall is that ignoring missing attempts might bias the posterior, especially if the missingness is not random (for example, if your friend only shows you the best shots).
How do we handle the situation where the probability of a success evolves over time?
A basketball player’s skill might improve or degrade over time, or it might vary with factors like fatigue, pressure, or injuries. The assumption of a single fixed success probability theta for all 10 shots might then be unrealistic. In that case, a dynamic model would be more appropriate. For example, we could use a Bayesian updating scheme where theta itself changes from shot to shot according to a state-space model, or we could split the data into time segments (e.g., first five shots vs. last five shots) and estimate different probabilities.
A big subtlety here is deciding how quickly we allow theta to change. If we let theta change too frequently, we may overfit short-term fluctuations. If we keep it constant, we might fail to capture trends. The complexity of the model must match the complexity of the real process.
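One simple heuristic in this spirit, sketched below under the assumption that theta drifts only slowly, is to mix the belief back toward the original prior by a small forgetting factor after every shot, so that older evidence is gradually discounted (the factor 0.05 and the shot sequence are arbitrary illustrative choices):

thetas = [0.25, 0.5, 0.75]
prior = [0.2, 0.6, 0.2]
belief = prior[:]
forget = 0.05  # fraction of belief returned to the prior after each shot

shots = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]  # hypothetical sequence with 7 makes
for shot in shots:
    # Bayes step: multiply by the likelihood of this single outcome, normalize
    belief = [b * (t if shot == 1 else 1 - t) for b, t in zip(belief, thetas)]
    total = sum(belief)
    belief = [b / total for b in belief]
    # Forgetting step: drift back toward the prior so theta is allowed to change
    belief = [(1 - forget) * b + forget * p for b, p in zip(belief, prior)]

print(belief)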
What happens if we increase the total number of possible discrete values for theta?
Right now we have three discrete points (0.25, 0.50, 0.75). A natural question is what if we expand this to a finer grid, say [0.0, 0.01, 0.02, ..., 1.0]? In principle, this would give a more precise representation of our beliefs. The Bayesian updating remains essentially the same: we compute the likelihood for each candidate theta, multiply by its prior, and normalize.
However, with a finer grid, we have more computations to do (though still manageable for a typical computer). In practice, using a Beta prior in the continuous case is typically more straightforward. A possible pitfall is choosing too coarse or too fine of a grid. Too coarse and we lose accuracy; too fine and we might waste computation if there is no real improvement in the accuracy of the posterior for the question at hand.
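A grid version is only a few lines with NumPy; the 101-point grid and the uniform prior here are illustrative choices:

import numpy as np

thetas = np.linspace(0.0, 1.0, 101)              # grid of candidate values
prior = np.full_like(thetas, 1.0 / len(thetas))  # uniform prior over the grid

likelihood = thetas**7 * (1 - thetas)**3  # binomial coefficient cancels
posterior = likelihood * prior
posterior /= posterior.sum()

print("Posterior mean:", (thetas * posterior).sum())  # roughly 2/3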
Could outlier observations significantly affect the posterior?
Yes. If we see an unusually high (or low) number of successes that differs drastically from our prior expectation, the posterior can shift significantly. With only 10 shots, a string of unusual outcomes (like 10 makes in a row) can dramatically swing the posterior. There is also a risk of overconfidence: with 10 successes out of 10, we might push the posterior strongly toward 0.75 (or whichever is the highest discrete value available), even though 10 shots may be insufficient evidence for strong conclusions about the true long-run probability.
Hence, it is crucial to balance prior beliefs with the impact of outliers. If we suspect that unusual outcomes can occur for extraneous reasons (e.g., a particularly good or bad day), we might consider a more robust or hierarchical modeling approach that can accommodate potential anomalies.
How would sequential updating be performed after each shot, rather than after all 10 shots?
We can update the posterior shot by shot. After each attempt, we record whether it was a success or a failure, compute the new posterior using the old posterior as the prior for the next step, and repeat. For discrete theta values 0.25, 0.50, 0.75, the update at each step is proportional to old_posterior(theta) * theta after a make, and to old_posterior(theta) * (1 − theta) after a miss; we then normalize.
A subtlety is that after each shot, we might have to keep track of the entire updated distribution. This is straightforward computationally for a small discrete set but can be more cumbersome for a continuous distribution. A code-based approach is often simpler: after each shot, multiply the posterior distribution by the likelihood of the new outcome, normalize, and proceed. This is sometimes known as online learning or sequential Bayesian updating.
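Here is a minimal sketch of that loop for the discrete case. The particular ordering of makes and misses is hypothetical, but any ordering with 7 makes ends at the same posterior:

thetas = [0.25, 0.5, 0.75]
posterior = [0.2, 0.6, 0.2]  # start from the prior

shots = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # hypothetical order: 7 makes, 3 misses
for shot in shots:
    # Multiply by the likelihood of this single outcome, then normalize
    posterior = [p * (t if shot == 1 else 1 - t) for p, t in zip(posterior, thetas)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]

print(posterior)  # approximately [0.0051, 0.5812, 0.4137], as before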
What if we want to combine data from multiple friends, each possibly having different shooting probabilities?
If we have data from multiple friends, each with different unknown probabilities, we might consider a hierarchical Bayesian model. For example, we could assume each friend’s true shooting probability is drawn from some higher-level Beta distribution. Then, for each friend, we observe a certain number of shots made out of total attempts. We can jointly learn both the hyperparameters of the Beta distribution (which represent population-level parameters) and the individual probabilities for each friend.
The pitfall here is that if we just lump everyone together in a single binomial model, we ignore the fact that some friends might be significantly better or worse than others. We also lose the benefits of partial pooling, which can help us learn faster about each individual by leveraging information across the entire group—especially useful when the data on each individual alone is sparse.
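A sketch of such a model in PyMC, assuming PyMC is installed; the shot counts for the three friends are made-up illustrative data:

import pymc as pm

makes = [7, 12, 3]       # hypothetical makes per friend
attempts = [10, 20, 10]  # hypothetical attempts per friend

with pm.Model():
    # Population-level Beta hyperparameters with weakly informative priors
    alpha = pm.HalfNormal("alpha", sigma=10)
    beta = pm.HalfNormal("beta", sigma=10)
    # Each friend's shooting probability is drawn from the shared Beta
    theta = pm.Beta("theta", alpha=alpha, beta=beta, shape=len(makes))
    # Observed makes out of attempts for each friend
    pm.Binomial("obs", n=attempts, p=theta, observed=makes)
    idata = pm.sample()  # joint posterior over alpha, beta, and each theta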
How could the model be validated or checked for goodness of fit?
After obtaining the posterior distribution, one might want to check if the model’s predictions align well with real outcomes. For instance, we can perform posterior predictive checks:
We can simulate new sets of free throws from the posterior distribution of theta and compare the distribution of simulated successes (out of 10) with the actual observed data (see the sketch below).
We could use a Bayesian p-value or other metrics to see if the observed number of successes is plausible under our posterior.
A pitfall is that if we see systematic mismatches (like the real data shows clustering of successes that the model can’t explain), that might indicate we need a more complex or different model. Validation is critical because an apparently “correct” posterior might mask deeper modeling flaws.
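A minimal simulation-based check for the discrete model might look like this:

import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.25, 0.5, 0.75])
posterior = np.array([0.0051, 0.5812, 0.4137])

# Draw thetas from the posterior, then simulate 10 shots for each draw
theta_draws = rng.choice(thetas, size=10_000, p=posterior)
simulated_makes = rng.binomial(10, theta_draws)

# Is the observed count (7 makes) plausible under the posterior predictive?
print("P(simulated makes >= 7):", (simulated_makes >= 7).mean())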