ML Interview Q Series: Bayesian Estimation of True IQ from Test Score Using Normal Conjugate Model
A job applicant will have an IQ test. The prior density of his IQ is the N(μ₀,σ₀²) distribution with μ₀=100 and σ₀=15. If the true value of the IQ of the job applicant is x, then the test will result in a score that has the N(x,σ₁²) distribution with σ₁=7.5. The test results in a score of 123 points for the job applicant. What is the posterior density f(θ | data) of the IQ of the job applicant? Give the value of θ for which the posterior density is maximal and give a 95% Bayesian confidence interval for θ.
Short Compact Solution
The posterior density turns out to be a normal density with mean 118.4 and variance 45. Therefore, the standard deviation is about 6.708. The value of θ that maximizes the posterior density is 118.4 (the posterior mean for this conjugate normal model). A 95% Bayesian confidence interval, using the standard normal quantiles ±1.96 around the mean, is approximately (105.3, 131.5).
Comprehensive Explanation
Posterior Density Derivation
Because we have a normal prior and a normal likelihood (with known variances), the posterior distribution for θ (the candidate’s true IQ) is also normal. Specifically:
Prior: θ ~ N(μ₀, σ₀²), where μ₀=100 and σ₀=15, so σ₀²=225.
Likelihood for the observed test score t₁=123: t₁ | θ ~ N(θ, σ₁²), where σ₁=7.5, so σ₁²=56.25.
By standard Bayesian conjugacy results for Normal–Normal models, the posterior density f(θ | data) is itself a normal density,
$$f(\theta \mid \text{data}) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(\theta-\mu)^{2}}{2\sigma^{2}}\right)$$
where:
$$\mu \;=\; \frac{\sigma_{1}^{2}\,\mu_{0} \;+\; \sigma_{0}^{2}\,t_{1}}{\sigma_{0}^{2} \;+\; \sigma_{1}^{2}}, \qquad \sigma^{2} \;=\; \frac{\sigma_{0}^{2}\,\sigma_{1}^{2}}{\sigma_{0}^{2} \;+\; \sigma_{1}^{2}}$$
The parameters are interpreted as follows:
μ is the posterior mean of θ given the data.
σ² is the posterior variance of θ given the data.
μ₀=100 is the prior mean.
σ₀²=225 is the prior variance.
t₁=123 is the observed test score.
σ₁²=56.25 is the variance of the measurement process (the test).
Numerical Calculation of Posterior Mean and Variance
Posterior mean (μ):
μ = (σ₁²·μ₀ + σ₀²·t₁) / (σ₀² + σ₁²)
Substituting:
σ₀²=225
σ₁²=56.25
μ₀=100
t₁=123
So we compute:
numerator = 56.25·100 + 225·123 = 5625 + 27675 = 33300
denominator = 225 + 56.25 = 281.25
μ = 33300 / 281.25 = 118.4
Posterior variance (σ²):
σ² = (σ₀²·σ₁²) / (σ₀² + σ₁²)
Substituting:
σ² = (225·56.25) / 281.25 = 12656.25 / 281.25 = 45
Hence the posterior standard deviation σ is √45 ≈ 6.708.
Maximum a Posteriori (MAP)
For a normal posterior distribution, the mode (the value of θ that maximizes the posterior) is the same as the mean. Therefore, the posterior is maximized at θ=118.4.
95% Bayesian Confidence Interval
A 95% credible interval under a normal posterior distribution is typically given by:
[ μ - 1.96·σ , μ + 1.96·σ ]
Substituting μ=118.4 and σ≈6.708:
Lower bound = 118.4 - 1.96×6.708 ≈ 105.3
Upper bound = 118.4 + 1.96×6.708 ≈ 131.5
Thus, the 95% Bayesian confidence interval for θ is approximately (105.3, 131.5).
Practical Implementation Example in Python
Below is a brief illustration of how you might numerically compute these posterior parameters in Python:
import math
# Prior parameters
mu0 = 100.0
sigma0 = 15.0
sigma0_sq = sigma0 ** 2
# Likelihood (test) parameters
sigma1 = 7.5
sigma1_sq = sigma1 ** 2
# Observed test score
t1 = 123.0
# Posterior mean
mu_posterior = (sigma1_sq * mu0 + sigma0_sq * t1) / (sigma0_sq + sigma1_sq)
# Posterior variance
sigma_sq_posterior = (sigma0_sq * sigma1_sq) / (sigma0_sq + sigma1_sq)
sigma_posterior = math.sqrt(sigma_sq_posterior)
# MAP estimate
map_estimate = mu_posterior # same as posterior mean
# 95% CI
z_value = 1.96 # for ~95% coverage
lower_bound = mu_posterior - z_value * sigma_posterior
upper_bound = mu_posterior + z_value * sigma_posterior
print("Posterior Mean =", mu_posterior)
print("Posterior Variance =", sigma_sq_posterior)
print("Posterior Std Dev =", sigma_posterior)
print("MAP Estimate =", map_estimate)
print("95% CI = ({:.1f}, {:.1f})".format(lower_bound, upper_bound))
Follow-up Question 1: Why does the posterior mode coincide with the posterior mean for this model?
In a Normal–Normal conjugate setting with known variance, the posterior is itself a normal distribution in θ. For a normal distribution, the location of the maximum density (the mode) equals the mean. This property applies whenever we have a Gaussian prior and Gaussian likelihood, both with known variances.
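One can verify this directly by differentiating the log posterior density and setting it to zero:
$$\frac{d}{d\theta}\,\log f(\theta \mid \text{data}) \;=\; -\,\frac{\theta - \mu}{\sigma^{2}} \;=\; 0 \quad\Longrightarrow\quad \theta = \mu$$
The stationary point is a maximum because the second derivative, −1/σ², is negative, so the unique mode sits exactly at the posterior mean.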
Follow-up Question 2: How does changing the prior variance σ₀² affect the posterior?
If σ₀² is large, it means the prior is more diffuse (less certain about the true IQ). The posterior then leans more heavily on the data (the test score) because the prior is weak. As a result, the posterior mean shifts closer to the test result of 123.
If σ₀² is small, it indicates strong prior belief (high certainty) about the candidate’s IQ being near 100. The posterior then stays closer to 100, and the influence of the test score is diminished.
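A quick way to see this numerically is to recompute the posterior mean for a few prior standard deviations while keeping everything else fixed (a minimal sketch reusing the update formula above; the particular σ₀ values are arbitrary illustrations):

t1, mu0, sigma1_sq = 123.0, 100.0, 7.5 ** 2
for sigma0 in [3.0, 15.0, 60.0]:  # strong, original, and diffuse priors
    sigma0_sq = sigma0 ** 2
    mu_post = (sigma1_sq * mu0 + sigma0_sq * t1) / (sigma0_sq + sigma1_sq)
    print(f"sigma0 = {sigma0:5.1f} -> posterior mean = {mu_post:.1f}")

The printed means climb from about 103.2 (strong prior) through 118.4 (the original setting) to about 122.6 (diffuse prior), moving ever closer to the observed score of 123.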
Follow-up Question 3: How do we interpret the 95% Bayesian confidence interval here compared to a frequentist confidence interval?
Bayesian Credible Interval: We can interpret it as: “There is a 95% probability that the true IQ θ lies within (105.3, 131.5), given our prior and the observed test score.”
Frequentist Confidence Interval: Would typically say, if we were to repeat this experiment many times, 95% of such constructed intervals would contain the true IQ parameter. A frequentist interpretation avoids the language of probability on the parameter itself.
Follow-up Question 4: What if we had multiple test scores from the same person?
With multiple observed scores (t1, t2, …, tn), each with variance σ₁² (assuming independence), the likelihood becomes the product of normal densities centered on θ. The resulting posterior is still normal with updated formulas:
$$\text{Posterior mean} \;=\; \frac{\mu_{0}/\sigma_{0}^{2} \;+\; \sum_{i=1}^{n} t_{i}/\sigma_{1}^{2}}{1/\sigma_{0}^{2} \;+\; n/\sigma_{1}^{2}}, \qquad \text{Posterior variance} \;=\; \frac{1}{1/\sigma_{0}^{2} \;+\; n/\sigma_{1}^{2}}$$
Each additional observation pulls the posterior mean away from the prior mean and toward the average of the observed scores, while the posterior variance shrinks, reflecting both the prior uncertainty and the sampling variance of the tests.
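As a concrete sketch, this precision-weighted update can be coded directly (the repeated scores below are hypothetical):

import math

mu0, sigma0_sq, sigma1_sq = 100.0, 15.0 ** 2, 7.5 ** 2
scores = [123.0, 118.0, 125.0]  # hypothetical repeated test scores
n = len(scores)

# Posterior precision = prior precision + n * data precision
post_var = 1.0 / (1.0 / sigma0_sq + n / sigma1_sq)
post_mean = post_var * (mu0 / sigma0_sq + sum(scores) / sigma1_sq)
print(f"posterior mean = {post_mean:.1f}, posterior sd = {math.sqrt(post_var):.2f}")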
All these aspects highlight how powerful and convenient the Normal–Normal conjugate framework can be for updating beliefs about a parameter (like IQ) as more data comes in.
Below are additional follow-up questions
How sensitive is the posterior to the assumption that the data (IQ test score) is normally distributed?
In the Normal–Normal conjugate model, both the prior and likelihood distributions are assumed to be normal. If the actual data has heavier tails, skew, or other deviations from normality, the posterior might be overly confident or mis-specified. In real-world IQ testing, the normality assumption often holds reasonably well for moderate ranges but can become problematic for extreme scores.
Pitfall: If the actual score distribution is significantly non-Gaussian (e.g., large outliers, severe skew), a single extreme value can pull the posterior too far toward the outlying score.
Mitigation: One could model the test score errors with a more robust distribution (like a Student’s t distribution) or perform diagnostic checks (e.g., posterior predictive checks) to see if the normality assumption is reasonable.
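As an illustration, the posterior under a heavier-tailed likelihood can be approximated on a grid. This is a sketch in which the Student's t with 4 degrees of freedom is an arbitrary robust stand-in for the normal likelihood:

import numpy as np
from scipy.stats import norm, t

theta = np.linspace(50.0, 170.0, 2001)            # grid of candidate IQ values
prior = norm.pdf(theta, loc=100.0, scale=15.0)    # N(100, 15^2) prior
like = t.pdf(123.0, df=4, loc=theta, scale=7.5)   # robust t likelihood of the score
post = prior * like
post /= post.sum() * (theta[1] - theta[0])        # normalize numerically
print("robust posterior mean ~", (theta * post).sum() * (theta[1] - theta[0]))

With a heavy-tailed likelihood, an extreme score is partially discounted, so the posterior shifts less aggressively toward it than under the normal model.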
How would we proceed if we suspected that the test is systematically biased?
Sometimes IQ tests can have biases: either the test systematically overestimates (or underestimates) a group’s IQ, or there might be a constant offset. In such cases, the likelihood model t ~ N(θ + b, σ₁²) might include a bias term b.
Adjustment in the model: We could treat b as an unknown parameter with its own prior, or if we suspect a known offset (e.g., test systematically overstates IQ by 5 points), we’d shift observed scores by that offset before updating.
Pitfall: If b is neglected and there is a non-zero real bias, the posterior will systematically drift away from the true IQ.
Mitigation: Gather calibration data or external validation to estimate b, or use hierarchical modeling where the bias is itself a parameter to be inferred.
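If the offset is assumed known, the fix is a one-line shift of the observed score before the usual conjugate update (a sketch using, purely for illustration, a known upward bias of b = 5 points):

b = 5.0                      # assumed known upward bias of the test
t1_corrected = 123.0 - b     # de-biased score
mu0, sigma0_sq, sigma1_sq = 100.0, 225.0, 56.25
mu_post = (sigma1_sq * mu0 + sigma0_sq * t1_corrected) / (sigma0_sq + sigma1_sq)
print(f"bias-corrected posterior mean = {mu_post:.1f}")  # 114.4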
What if we do not know σ₁ but instead must estimate it from the data?
In many practical settings, the standard deviation of test scores around the true IQ may not be precisely known. Instead, we might have a prior for σ₁, or we might jointly estimate θ and σ₁ from the data. That changes the conjugate structure because Normal–Normal conjugacy assumes a fixed known variance.
Impact on posterior: The posterior distribution for θ will typically widen because of the extra uncertainty in σ₁. Instead of a simple normal posterior, we might have a Student’s t posterior if σ₁ is given an inverse-gamma prior (Normal–Inverse-Gamma conjugate model).
Pitfall: Underestimating σ₁ artificially narrows the posterior, giving a false sense of precision in the IQ estimate.
Mitigation: Use empirical estimates of σ₁ from multiple test administrations or adopt a hierarchical Bayesian model that captures uncertainty in σ₁.
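A minimal numerical sketch of the joint approach, assuming an inverse-gamma prior on σ₁² (the shape and scale values below are arbitrary choices that center the prior near the nominal 56.25), evaluates the joint posterior on a grid and marginalizes out σ₁²:

import numpy as np
from scipy.stats import norm, invgamma

theta = np.linspace(50.0, 170.0, 601)
sigma1_sq = np.linspace(5.0, 300.0, 400)
TH, S = np.meshgrid(theta, sigma1_sq)

prior_theta = norm.pdf(TH, loc=100.0, scale=15.0)
prior_var = invgamma.pdf(S, a=3.0, scale=2.0 * 56.25)   # prior mean = 56.25
like = norm.pdf(123.0, loc=TH, scale=np.sqrt(S))

joint = prior_theta * prior_var * like
marg = joint.sum(axis=0)                                # marginalize over sigma1^2
marg /= marg.sum() * (theta[1] - theta[0])
print("marginal posterior mean of theta ~",
      (theta * marg).sum() * (theta[1] - theta[0]))

The marginal posterior for θ comes out wider than the known-variance answer, reflecting the extra uncertainty about σ₁.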
How do we account for the possibility of cheating or random guessing in the test result?
If there is a possibility that the test taker has artificially inflated their score or guessed answers, the relationship between θ and t might not be strictly t ~ N(θ, σ₁²). For instance, cheating could produce a test score distribution centered on θ + c for some c>0, or possibly with an inflated variance.
Revised model: Introduce a mixture model: with some probability p the test is genuine, and with probability 1−p it is fraudulent (leading to inflated scores). The posterior then becomes a mixture of the normal posterior and some alternative distribution for the inflated score case.
Edge case: If the test score is suspiciously high, the mixture model might weight the “inflated” scenario more strongly, resulting in a more conservative posterior (i.e., less certain that the candidate’s true IQ is extremely high).
Pitfall: Failing to consider such cheating can lead to systematically biased inferences.
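A sketch of the two-component idea (the prior honesty probability p = 0.9 and inflation c = 15 points below are purely illustrative assumptions):

import math
from scipy.stats import norm

mu0, s0_sq, s1_sq, t1 = 100.0, 225.0, 56.25, 123.0
p_genuine, c = 0.9, 15.0      # assumed honesty probability and score inflation

# Marginal likelihood of the observed score under each scenario
sd_marg = math.sqrt(s0_sq + s1_sq)
m_genuine = norm.pdf(t1, loc=mu0, scale=sd_marg)
m_cheat = norm.pdf(t1, loc=mu0 + c, scale=sd_marg)

# Posterior weight on the genuine scenario (Bayes' rule)
w = p_genuine * m_genuine / (p_genuine * m_genuine + (1 - p_genuine) * m_cheat)

def cond_mean(effective_score):  # conjugate posterior mean given a score
    return (s1_sq * mu0 + s0_sq * effective_score) / (s0_sq + s1_sq)

post_mean = w * cond_mean(t1) + (1 - w) * cond_mean(t1 - c)
print(f"P(genuine | score) = {w:.2f}, mixture posterior mean = {post_mean:.1f}")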
How do we handle extremely high or low scores that are at the limit of the test?
Some IQ tests have upper ceilings (e.g., you cannot score above 160 on a standard test). When someone hits the maximum score, it introduces truncation or censoring. The same can happen at very low scores.
Censoring/truncation approach: Instead of modeling the observed value as t ~ N(θ, σ₁²), we note that t is only observed up to a maximum. This requires modifying the likelihood for these boundary cases.
Pitfall: If the candidate hits the upper bound, the naive model might incorrectly treat that as exactly t=160, while in reality their ability might be even higher. This underestimates θ.
Mitigation: Use truncated likelihoods: P(t≥160 | θ). The posterior then accounts for the fact that the test result could be any number above 160.
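A grid sketch of the censored update, assuming for illustration that the test has a ceiling of 160 and the candidate hit it:

import numpy as np
from scipy.stats import norm

theta = np.linspace(60.0, 220.0, 2001)
prior = norm.pdf(theta, loc=100.0, scale=15.0)
# Censored likelihood: we only know the score was at or above the ceiling
like = norm.sf(160.0, loc=theta, scale=7.5)        # P(t >= 160 | theta)
post = prior * like
post /= post.sum() * (theta[1] - theta[0])
print("censored posterior mean ~", (theta * post).sum() * (theta[1] - theta[0]))

Using the survival function rather than treating the score as exactly 160 lets the posterior place mass on abilities above the ceiling.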
Could prior-data conflict arise, and how would we detect it?
Prior-data conflict occurs if the observed data strongly contradicts the prior assumptions. For example, if the prior strongly centers around 100 with small variance, but the applicant’s score is exceptionally high (like 150 or more), there might be an internal inconsistency between prior and data.
Detection: We can compare how much the posterior shifts away from the prior. If the posterior is in a region that was initially considered highly unlikely under the prior, it might be a sign of conflict.
Pitfall: If the prior is too rigid (small σ₀) and real data is far outside that range, the posterior might still be relatively close to the prior, leading to substantial underestimation of the candidate’s true IQ.
Mitigation: Use more flexible or weakly informative priors so that truly extreme but plausible data can still meaningfully adjust the posterior.
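One simple diagnostic is the prior predictive distribution of the score, which in this model is N(μ₀, σ₀² + σ₁²); an observation far out in its tails signals conflict. A sketch, also checking the hypothetical extreme score of 150 mentioned above:

import math
from scipy.stats import norm

mu0, sd_pred = 100.0, math.sqrt(225.0 + 56.25)   # prior predictive sd ~ 16.77
for score in [123.0, 150.0]:
    z = (score - mu0) / sd_pred
    tail = 2 * norm.sf(abs(z))                   # two-sided tail probability
    print(f"score {score:.0f}: z = {z:.2f}, tail probability = {tail:.3f}")

The observed 123 is unremarkable under the prior predictive (z ≈ 1.37), whereas 150 (z ≈ 2.98) would indicate genuine tension between prior and data.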
How does sampling from the posterior help in more complex scenarios?
When extending beyond simple conjugate models or if we want to capture uncertainties in multiple parameters, we often resort to simulation-based methods (e.g., Markov Chain Monte Carlo). This allows us to approximate the posterior distribution numerically.
Utility: By drawing posterior samples of θ (and possibly other parameters like bias, unknown variance, etc.), we can characterize uncertainty more fully, generate predictive distributions, and handle more complex hierarchical models.
Pitfall: Poor choice of priors or insufficient MCMC convergence can lead to unreliable posterior estimates.
Edge case: If the posterior has multiple modes or is highly skewed, one must ensure that the sampling algorithm adequately explores these regions.
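Although overkill for this conjugate example, a random-walk Metropolis sampler shows the mechanics; this is a sketch in which the proposal scale and iteration counts are arbitrary tuning choices:

import math
import random

mu0, sigma0, sigma1, t1 = 100.0, 15.0, 7.5, 123.0

def log_post(theta):
    # log prior + log likelihood, up to an additive constant
    return (-(theta - mu0) ** 2 / (2 * sigma0 ** 2)
            - (t1 - theta) ** 2 / (2 * sigma1 ** 2))

theta, samples = 100.0, []
for i in range(60000):
    proposal = theta + random.gauss(0.0, 5.0)    # random-walk proposal
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal                         # accept the move
    if i >= 10000:                               # discard burn-in
        samples.append(theta)

print("MCMC posterior mean ~", sum(samples) / len(samples))  # ~118.4

Comparing the sampled mean and spread against the closed-form 118.4 and √45 is a useful sanity check on the sampler itself.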
How does correlation between multiple IQ tests for different individuals impact the posterior for a single individual?
In a simple model, we treat each individual’s IQ as independent. However, in reality, there might be correlations (e.g., shared environment, educational background). For a single candidate, if you use data from other test takers to inform the prior distribution, correlation can complicate how the prior is updated.
Hierarchical modeling: One might assume each individual’s IQ is drawn from a population-level distribution with hyper-parameters. Then, observations from all individuals help refine that population-level distribution, which feeds back into the individual-level posterior.
Pitfall: Ignoring correlations can lead to overly confident results if the data are not truly independent.
Mitigation: Incorporate correlation structures or hierarchical frameworks so that the posterior reflects both individual test results and population-level insights without incorrectly assuming independence.
What if we suspect a learning effect or fatigue effect if multiple tests are taken sequentially?
If the candidate takes multiple IQ tests in short succession, the first test might affect subsequent tests. This dependency violates the assumption that each test’s result is conditionally independent given θ.
Modeling approach: Introduce terms for learning or fatigue across multiple sequential tests. The likelihood might be tᵢ ~ N(θ + δᵢ, σ₁²), where δᵢ is a systematic effect for test i (perhaps increasing or decreasing with i).
Pitfall: Treating repeated tests as fully independent can lead to overly narrow posterior intervals if the candidate’s performance is consistently improving or deteriorating.
Mitigation: Estimate or set a prior on δᵢ. For example, δᵢ could come from a random effects distribution. Then the model recognizes the correlation of repeated measures on the same individual.
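For the simplest case, where the learning effect is assumed known, one can de-trend the scores and reuse the conjugate multi-score update (a sketch; the sequential scores and the per-retake gain of δ = 3 points are hypothetical):

import math

mu0, sigma0_sq, sigma1_sq = 100.0, 225.0, 56.25
scores = [118.0, 123.0, 127.0]   # hypothetical sequential test scores
delta = 3.0                      # assumed known practice gain per retake

# Remove the systematic learning trend before updating
adjusted = [t - delta * i for i, t in enumerate(scores)]

n = len(adjusted)
post_var = 1.0 / (1.0 / sigma0_sq + n / sigma1_sq)
post_mean = post_var * (mu0 / sigma0_sq + sum(adjusted) / sigma1_sq)
print(f"posterior mean = {post_mean:.1f}, posterior sd = {math.sqrt(post_var):.2f}")

If δᵢ is instead uncertain, it gets its own prior and is integrated out, which widens the posterior for θ accordingly.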