ML Interview Q Series: Calculating True Disease Probability After Positive Tests Using Bayes' Theorem
Bayesian Probability (Medical Test Scenario): You’re testing for a rare disease that affects 1% of the population with a test that has 90% sensitivity (true positive rate) and 5% false positive rate. If a person tests positive, what is the probability they actually have the disease? *Walk through how you arrive at the answer using Bayes’ Theorem.*
Understanding the Core Idea Using Bayes’ Theorem
Bayes’ Theorem provides a way to update our belief about an event (in this case, having the disease) based on new evidence (a positive test). When we say the disease affects 1% of the population, that is known as the prior probability of having the disease. The test’s sensitivity describes how likely it is to detect the disease if the person truly has it, and the false positive rate tells us how often the test incorrectly indicates the disease when it is not present.
Bayes’ Theorem states, in its fundamental form:

( P(D \mid +) = \dfrac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid \neg D) \, P(\neg D)} )

where:
( D ) is the event that the person has the disease.
( \neg D ) is the event that the person does not have the disease.
( P(D) ) is the prior probability of having the disease (1%).
( P(+ \mid D) ) is the sensitivity or true positive rate (90%).
( P(+ \mid \neg D) ) is the false positive rate (5%).
( P(\neg D) = 1 - P(D) = 0.99 ), i.e. 99%.
Detailed Walk-Through
To compute ( P(D \mid +) ), we need two main components in the denominator:
The probability of a true positive among those who have the disease.
The probability of a false positive among those who do not have the disease.
Plugging in the numbers:
( P(D) = 0.01 )
( P(\neg D) = 0.99 )
( P(+ \mid D) = 0.90 )
( P(+ \mid \neg D) = 0.05 )
Numerator: ( 0.01 \times 0.90 = 0.009 )
Denominator: ( 0.009 + (0.99 \times 0.05) = 0.009 + 0.0495 = 0.0585 )
Hence,

( P(D \mid +) = \dfrac{0.009}{0.0585} \approx 0.1538 )
So the probability that the person actually has the disease given a positive test is about 15.38%. This may seem counterintuitive: one might expect a 90% accurate test to yield a much higher chance of actually having the disease. However, the low prevalence (1%) combined with the 5% false positive rate causes many more false positives among healthy individuals than true positives among diseased individuals.
Illustration with a Python Code Snippet
p_disease = 0.01 # 1%
sensitivity = 0.90 # 90%
false_positive_rate = 0.05 # 5%
# Probability of testing positive overall
p_positive = (p_disease * sensitivity) + ((1 - p_disease) * false_positive_rate)
# Posterior probability: probability of disease given a positive test
posterior_probability = (p_disease * sensitivity) / p_positive
print(posterior_probability) # ~0.1538
Explanation of Why the Probability Is Only Around 15.38%
In real-world scenarios, if the underlying event (the disease) is relatively rare, even a test with a seemingly good sensitivity can yield a low probability of actually having the disease once a person tests positive. This phenomenon underscores the importance of considering prevalence (the prior probability) along with test accuracy measures when interpreting diagnostic results.
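The point above can be made concrete with natural frequencies: tally the outcomes for a hypothetical cohort of 10,000 people using the stated rates, and the ~15% figure falls out of simple counting.

```python
population = 10_000
diseased = int(population * 0.01)          # 100 people actually have the disease
healthy = population - diseased            # 9,900 people do not

true_positives = diseased * 0.90           # 90 diseased people test positive
false_positives = healthy * 0.05           # 495 healthy people test positive

total_positives = true_positives + false_positives  # 585 positives overall
print(true_positives / total_positives)    # 90 / 585 ≈ 0.1538
```

Seen this way, the false positives (495) swamp the true positives (90), which is exactly why the posterior is only about 15%.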
Subtleties and Practical Insights
False positives can easily outnumber true positives when the overall prevalence is low. That is why confirmatory tests, possibly with higher specificity or a different testing modality, are often used in medical diagnostics. Also, if the disease prevalence increases, the posterior probability ( P(D \mid +) ) can rise significantly, because more of the positives will be true positives.
How does the result change if the disease prevalence increases?
If the disease prevalence rises from 1% to, say, 5%, that changes the prior ( P(D) ). Using the same sensitivity and false positive rate:
( P(D) = 0.05 )
( P(\neg D) = 0.95 )
( P(+ \mid D) = 0.90 )
( P(+ \mid \neg D) = 0.05 )
Numerator: ( 0.05 \times 0.90 = 0.045 )
Denominator: ( 0.045 + (0.95 \times 0.05) = 0.045 + 0.0475 = 0.0925 )
Posterior: ( 0.045 / 0.0925 \approx 0.4865 ), or about 48.65%.
This demonstrates how a higher prevalence significantly increases the probability of having the disease given a positive test result.
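Both prevalence scenarios can be reproduced with a small helper function (a sketch; the function name is arbitrary):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(D | +) for a single positive test, via Bayes' Theorem."""
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive

for prevalence in (0.01, 0.05):
    print(f"prevalence={prevalence:.0%}: P(D|+) = {posterior(prevalence, 0.90, 0.05):.4f}")
# prevalence=1%:  P(D|+) = 0.1538
# prevalence=5%:  P(D|+) = 0.4865
```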
Why is it critical to distinguish between sensitivity and specificity?
Sensitivity is ( P(+ \mid D) ), the true positive rate. Specificity is ( P(- \mid \neg D) ), which is 1 minus the false positive rate. If the false positive rate is 5%, the specificity is 95%. These measures serve different roles:
Sensitivity answers: “If someone has the disease, how often do they test positive?”
Specificity answers: “If someone does not have the disease, how often do they test negative?”
Knowing both helps one understand how frequently the test might miss actual cases (false negatives) and how frequently it might incorrectly flag healthy individuals as diseased (false positives).
What are common pitfalls in applying Bayes’ Theorem in real-world medical testing?
One pitfall is misunderstanding the impact of low prevalence (also called prior probability). People often overlook how a small prior probability can dramatically reduce the chance that a positive test result corresponds to a true case of the disease. Another pitfall is the assumption that sensitivity and false positive rate (or specificity) are constant across different populations or testing contexts. In practice, these rates can vary based on demographics, test administration conditions, or biological differences.
Overconfidence in a test’s accuracy is another concern. A 90% sensitivity might sound excellent, but without considering specificity and prevalence, the final probability could be substantially lower than anticipated.
Why do we rely on priors, and can they change?
Priors represent our existing beliefs or knowledge before new data arrives. In a medical context, the prior probability might come from large-scale epidemiological data or population studies. These priors can change if new information is introduced—for instance, if the individual has certain symptoms, belongs to a high-risk demographic, or if new research shows changes in disease frequency. When priors change, the posterior probability must be recalculated, which is precisely what Bayes’ Theorem accommodates.
Are there alternative approaches if priors are difficult to estimate?
When priors are uncertain, some methods use broader ranges or distributions for the disease prevalence. For example, one might employ Bayesian hierarchical models or sensitivity analyses, allowing one to see how varying the prior probability influences the posterior. Another approach is to gather more data before administering the test (e.g., additional screening questions or preliminary tests) to refine the prior.
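A minimal form of the sensitivity analysis mentioned above is to sweep the prior across a plausible range and watch how the posterior responds (the prior values here are illustrative only):

```python
def posterior(prior, sens=0.90, fpr=0.05):
    """P(D | +), with the example's test characteristics as defaults."""
    return prior * sens / (prior * sens + (1 - prior) * fpr)

# Sweep the prior to see how strongly it drives the posterior
for prior in (0.001, 0.005, 0.01, 0.02, 0.05, 0.10):
    print(f"prior={prior:.3f} -> P(D|+) = {posterior(prior):.3f}")
```

If the posterior swings widely across the plausible prior range, the honest conclusion is that more information about prevalence is needed before the test result can be interpreted confidently.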
How can real-world performance of tests differ from theoretical measures?
Real-world performance can differ due to:
Imperfect conditions during sample collection or handling.
Differences between the studied population and real-world population.
Operator or machine variability.
Time-dependent factors, such as disease stage.
Hence, sensitivity and specificity from clinical trials might not exactly match real-world conditions. Post-marketing surveillance and external validation studies can clarify these performance metrics.
How do we reduce the impact of false positives when the disease is rare?
Medical practitioners may employ two-stage testing: a first, relatively inexpensive or widely administered test with high sensitivity (few false negatives), then a confirmatory test with higher specificity. This approach can reduce the number of healthy individuals being incorrectly identified as diseased. Also, prevalence-based screening programs often include risk-factor assessment (age, family history, lifestyle) to refine who gets tested.
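The two-stage idea can be sketched as a chained Bayesian update, where the posterior after the screening test becomes the prior for the confirmatory test. The screening and confirmatory parameters below (95%/10% and 85%/1%) are hypothetical numbers, not from the original example:

```python
def update(prior, sens, fpr):
    """One Bayesian update after a positive result."""
    return prior * sens / (prior * sens + (1 - prior) * fpr)

# Stage 1: broad screen with high sensitivity but modest specificity (hypothetical)
p_after_screen = update(0.01, 0.95, 0.10)
# Stage 2: confirmatory test with much higher specificity (hypothetical)
p_after_confirm = update(p_after_screen, 0.85, 0.01)

print(f"after screen:  {p_after_screen:.4f}")   # ~0.0876
print(f"after confirm: {p_after_confirm:.4f}")  # ~0.8908
```

The screen alone leaves the posterior below 10%, but a second positive on the higher-specificity test pushes it near 90%, which is the rationale for cascaded testing in low-prevalence settings.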
Is Bayes’ Theorem only useful in medical diagnostics?
While it is vital in medical testing scenarios, Bayes’ Theorem is applicable in any domain where you need to update your belief about an event based on new evidence. This includes spam detection in emails, reliability assessments in engineering, A/B testing in software experiments, and beyond. The unifying principle is that you always combine prior knowledge with new evidence to arrive at an updated probability.
Below are additional follow-up questions
If the test is repeated multiple times for the same individual, how can we combine the results in a Bayesian manner, and what are potential concerns?
When a test is administered repeatedly, each additional test result can be treated as new evidence. In Bayesian terms, the posterior from the first test becomes the prior for the second test, and so on. The fundamental step is:
Start with a prior probability, often based on the known prevalence or some refined personal risk factor.
After each test result, update that probability using Bayes’ Theorem.
A key assumption is the independence of test outcomes, conditional on the true disease status. If the tests are not fully independent (for instance, they share common biases or measurement errors), then standard sequential Bayesian updates may overestimate confidence. In practice:
Independence: If the tests rely on the same biological markers or the same methodology, their errors might be correlated. As a result, repeating the test may not provide as much additional evidence as a fully independent test would.
Diminishing Returns: Even if the tests are conditionally independent, once you accumulate sufficient evidence the posterior converges toward certainty (either very high or very low), and further testing won’t meaningfully change it.
Practical Constraints: Multiple tests might be costly, time-consuming, or impose patient discomfort. There must be a balance between thoroughness and practicality.
Confirmatory Tests: In medical settings, the second or third test is sometimes of a different modality with higher specificity. This approach can mitigate correlation in test errors and yield more reliable Bayesian updates.
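The sequential update described above can be sketched as follows, under the conditional-independence assumption, with the original test (90% sensitivity, 5% false positive rate) repeated three times and every result positive:

```python
def update(prior, sens, fpr, positive=True):
    """One sequential Bayesian update; handles both positive and negative results."""
    if positive:
        num = prior * sens
        den = prior * sens + (1 - prior) * fpr
    else:
        num = prior * (1 - sens)
        den = prior * (1 - sens) + (1 - prior) * (1 - fpr)
    return num / den

p = 0.01  # start from the population prevalence
for i in range(1, 4):
    p = update(p, 0.90, 0.05)
    print(f"after positive #{i}: P(D) = {p:.4f}")
# after positive #1: P(D) = 0.1538
# after positive #2: P(D) = 0.7660
# after positive #3: P(D) = 0.9833
```

Three consecutive positives drive the posterior from 1% to over 98%, but only if the tests really are conditionally independent; correlated errors would make this convergence overconfident.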
How do we handle scenario changes if the test is administered to high-risk groups rather than the general population?
When testing a high-risk group, the disease prevalence (prior) is likely higher than 1%. This modifies (P(D)). A larger prior increases the probability that a positive result reflects a true case, leading to a higher posterior probability (P(D \mid +)). Practical implications include:
Adjusting the Prior: The baseline prevalence is replaced by the prevalence within the high-risk group. This can come from epidemiological data for that subgroup.
Test Thresholds: Because the starting prior is higher, you might rely on different cutoffs or interpret borderline results differently. In some cases, the test might be recalibrated to reduce the false positive rate if the goal is to minimize overtreatment.
Heterogeneous Risk: If the “high-risk” group is still heterogeneous (e.g., patients with multiple comorbidities vs. those with just one risk factor), you may need multiple subgroup-specific priors. Combining them incorrectly can blur the accuracy of the Bayesian update.
In what ways might conditional dependence on additional factors (like age or symptoms) influence the calculation?
Bayes’ Theorem as originally presented uses a single prior for the disease. However, real-world disease likelihood often depends on age, gender, genetics, or symptoms:
Multivariate Priors: Instead of a single (P(D)), you might have a distribution conditioned on these additional factors, such as (P(D \mid \text{age}, \text{family history}, \ldots)).
Conditional Sensitivity and Specificity: Sensitivity could change with age or the presence of certain symptoms. The same test might be more sensitive in symptomatic individuals and less sensitive in asymptomatic ones, or vice versa.
Modeling Complexity: If these factors are correlated, you can adopt Bayesian network models or hierarchical Bayesian frameworks that more accurately capture dependencies. You no longer apply a single test characteristic to all individuals; instead, you refine your test parameters or your prior based on the patient’s subgroup characteristics.
Edge Cases: If certain subgroups are too small (e.g., extremely rare genetic profiles), you may not have enough data to estimate the test’s performance reliably for them. This introduces higher uncertainty in the posterior results.
How might the assumptions behind Bayes’ Theorem break down in real clinical applications, and what are the potential remedies?
Bayes’ Theorem rests on the idea of well-defined probabilities and independence assumptions where appropriate. Common issues include:
Mis-specified Priors: If you incorrectly estimate disease prevalence or risk factors, your posterior probabilities become skewed. Remedy: Regularly update prevalence estimates with recent epidemiological data and consider building robust or hierarchical priors.
Non-Stationary Disease Patterns: A disease’s prevalence may shift rapidly due to new strains, seasonality, or public health interventions. Remedy: Implement dynamic models that allow the prior probability to evolve over time (e.g., state-space models).
Test Condition Variability: A test’s sensitivity/specificity could fluctuate with different lab conditions or operator skills. Remedy: Calibrate test devices across sites, conduct periodic audits, and incorporate a measure of variability in sensitivity and specificity into the Bayesian model.
Violation of Independence: Often, multiple tests or repeated measures are correlated, especially if they use the same methodology. Remedy: Use models that explicitly capture correlation, like Bayesian hierarchical or multi-level models.
Can false negatives be more problematic than false positives in certain situations, and how does Bayes’ Theorem address this aspect?
Yes, a false negative means missing a diseased individual, potentially leading to delayed treatment or further spread of a contagious illness. By contrast, false positives may lead to emotional distress and unnecessary follow-up tests, but not necessarily immediate harm. Bayes’ Theorem itself is neutral; it simply updates probabilities based on provided rates. However, the interpretation and consequences can differ:
Cost Analysis: In many real-world implementations, one weighs the cost of false negatives versus false positives. A test with high sensitivity (few false negatives) but a somewhat higher false positive rate might be acceptable if the disease is serious. For example, you might prefer to incorrectly flag healthy people for further screening rather than miss actual diseased individuals.
Decision-Theoretic Extensions: Bayesian decision theory can incorporate loss functions, where missing a case is assigned a heavier penalty than a false alarm. This leads to an adjusted testing threshold or preference for certain test parameters (like maximizing sensitivity).
Contextualizing Posterior Probabilities: If the posterior probability that someone has the disease is not extremely high, but the cost of missing that case is huge (say, for an extremely contagious and lethal disease), additional confirmatory testing may still be justified.
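A toy expected-loss comparison illustrates the decision-theoretic point; the loss values below are purely hypothetical weights, not clinical quantities:

```python
# Hypothetical losses: missing a true case is 100x worse than a false alarm
loss_false_negative = 100.0
loss_false_positive = 1.0

posterior = 0.1538  # P(D | +) from the main example

expected_loss_treat = (1 - posterior) * loss_false_positive  # cost if patient is healthy
expected_loss_skip = posterior * loss_false_negative         # cost if patient is diseased

decision = "treat/confirm" if expected_loss_treat < expected_loss_skip else "no action"
print(decision)  # with these losses, follow-up is justified even at ~15%
```

Under this asymmetric loss, acting on a 15% posterior is clearly warranted, which formalizes the intuition that serious diseases justify follow-up even after a weakly informative positive.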
How do Bayesian credible intervals or intervals of uncertainty apply to the estimated posterior probability?
When working with Bayesian methods, you can derive not only a single point estimate of (P(D \mid +)) but also an interval reflecting the uncertainty in that posterior probability. For instance:
Credible Interval vs. Confidence Interval: Unlike a frequentist confidence interval, a Bayesian credible interval has a direct interpretation: there’s a specified percentage (e.g., 95%) probability that the true probability of having the disease lies within that interval.
Sources of Uncertainty: Uncertainty could come from limited data about the prevalence, uncertainty in sensitivity/specificity, or variations in sub-populations. If you treat these parameters as distributions rather than fixed values, your final posterior probability becomes a distribution as well.
Practical Significance: A wide credible interval might indicate that you don’t have enough data to be confident about the true posterior probability. A narrow interval suggests a more reliable estimate. Clinicians can use that interval when counseling patients about the likelihood of disease and the need for additional testing.
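One common way to obtain such an interval is Monte Carlo simulation: treat prevalence, sensitivity, and the false positive rate as distributions rather than fixed values, and propagate samples through Bayes’ Theorem. The Beta shape parameters below are illustrative assumptions only, chosen to center each distribution near the example’s point values:

```python
import random

random.seed(0)

def sample_posterior():
    """Draw one posterior P(D|+) with parameter uncertainty (hypothetical Betas)."""
    prevalence = random.betavariate(2, 198)    # centered near 1%
    sensitivity = random.betavariate(90, 10)   # centered near 90%
    fpr = random.betavariate(5, 95)            # centered near 5%
    return prevalence * sensitivity / (
        prevalence * sensitivity + (1 - prevalence) * fpr)

samples = sorted(sample_posterior() for _ in range(10_000))
lo, hi = samples[249], samples[9749]  # empirical 2.5% and 97.5% quantiles
print(f"95% credible interval for P(D|+): ({lo:.3f}, {hi:.3f})")
```

The width of the resulting interval directly reflects how uncertain the input parameters are; tightening the Beta distributions (i.e., collecting more data) narrows the interval.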
How could healthcare practitioners deal with changing or updated sensitivity and false positive rates as a test evolves over time?
Medical tests can improve or degrade due to hardware updates, refined lab procedures, or changes in reagents. Therefore, sensitivity or false positive rates might not remain static:
Periodic Recalibration: Regularly update the test parameters based on ongoing quality control data. As new data emerges, revise the distributions for sensitivity and false positive rate.
Adaptive Bayesian Methods: In an adaptive framework, new test performance data is consistently integrated, leading to an updated posterior distribution for the test parameters themselves.
Version Control for Tests: Each new version of the test might come with slightly different performance. It’s critical to track which version was used for each patient, so the correct parameters can be applied in the Bayesian update.
Communication to Clinicians: If there is a known drift in test performance, doctors should be informed that the older parameters might no longer be accurate. This transparency ensures correct interpretation of results.
What considerations arise if the cost or risk of the test itself is significant?
When the test is invasive, expensive, or carries its own health risks, a positive test result might not be worth the risk for low-probability cases. Bayes’ Theorem still applies, but practical decision-making balances the expected benefit of accurate detection against the downside of conducting the test:
Pre-Test Risk Assessment: Doctors might administer simpler or cheaper tests first, or use risk questionnaires, to estimate an individual’s risk. Only those above a certain threshold proceed to the more costly or invasive test.
Benefit-Risk Thresholds: Implement a threshold on the prior probability below which the test is deemed unjustifiable. This threshold can be informed by cost analyses or ethical concerns.
Dynamic Protocols: In an iterative approach, a first-tier screening test with high sensitivity but moderate specificity might rule out most low-risk individuals, while only high-risk individuals proceed to the more accurate (but riskier) diagnostic procedure. Each tier is a Bayesian update on the prior.
How might behavioral or psychological factors influence the interpretation of these Bayesian probabilities?
In medical practice, numbers alone do not always translate directly into patient decisions:
Risk Perception: Patients might interpret a 15% chance of having a disease as either trivial or devastating, depending on personal context and health beliefs.
Confirmation Bias: A clinician or patient might hold a strong belief about the presence/absence of disease, skewing the interpretation of test results away from the strict Bayesian posterior.
Communication Challenge: Explaining that a positive test result yields only a 15% chance of having the disease can be confusing. Healthcare providers must frame these probabilities in understandable terms (e.g., “15 out of 100 people with a positive result actually have the disease”).
Informed Consent: Patients often consent to tests without fully grasping the implications of false positives/negatives. This can result in shock or disbelief when subsequent tests contradict an initial result.
Could a conflicting test with different properties override the previous Bayes calculation?
If you have two distinct tests, each with different sensitivity and false positive rates, you can integrate both results:
Sequential Bayesian Updates: Take the posterior from the first test, treat it as the prior for the second test, and incorporate the second test’s likelihood. If the second test strongly disagrees with the first, it can shift the posterior significantly.
Simultaneous Inference: In some cases, you might consider both tests in one unified model. For instance, you have:
(P(+_1 \mid D)), (P(+_2 \mid D)) for the first and second test, respectively.
(P(+_1, +_2 \mid D) ) if they’re correlated.
(P(D)) as the prior.
Then you combine them in a single Bayesian framework. This can be computationally more involved but yields a richer understanding.
Conflict Resolution: If the two tests disagree profoundly, investigate potential reasons: Are the tests measuring different biomarkers? Is one test more prone to user error? Real-world resolution might involve a third, gold-standard test or additional clinical evidence (like patient symptoms, imaging results, etc.).
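Under the (strong) assumption that the two tests are conditionally independent given disease status, the simultaneous update has a simple closed form; the confirmatory test’s parameters below (95% sensitivity, 1% false positive rate) are hypothetical:

```python
def combined_posterior(prior, sens1, fpr1, sens2, fpr2):
    """P(D | both tests positive), assuming the tests are conditionally
    independent given disease status -- a strong assumption in practice."""
    num = prior * sens1 * sens2
    den = num + (1 - prior) * fpr1 * fpr2
    return num / den

# Original test plus a hypothetical confirmatory test
print(round(combined_posterior(0.01, 0.90, 0.05, 0.95, 0.01), 4))  # 0.9453
```

Under conditional independence this joint update is mathematically identical to applying the two sequential updates one after the other; when the tests share error sources, a model that captures their correlation is needed instead.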