ML Interview Q Series: Simpson's Paradox: How Changing Product Mix Can Lower Overall Approval Rates
Capital approval rates dropped from 85% last week to 82% this week, despite each product’s rate remaining the same or going up (Product 1: 84%→85%, Product 2: 77%→77%, Product 3: 81%→82%, Product 4: 88%→88%). What might explain this overall decrease?
Comprehensive Explanation
A common reason for the overall approval rate to go down, even though each product’s approval rate has stayed the same or slightly risen, is a change in the proportion of approvals coming from each product category. This is often linked to Simpson’s Paradox, where combining data from different groups yields a result that appears to conflict with what is observed within each individual group.
When there is a mix of different products, the total approval rate is a weighted average of the individual rates:

overall approval rate = w_1 * x_1 + w_2 * x_2 + ... + w_n * x_n

where w_i is the fraction of all applications corresponding to product i, and x_i is the approval rate for product i. If the product mix changes in a way that places more weight on products with lower approval rates (or reduces the weight on products with higher approval rates), the overall weighted average can drop even if every individual product's approval rate stayed the same or went up.
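For a quick illustration with made-up numbers: suppose two products have approval rates of 0.90 and 0.70. If 70% of last week's applications went to the first product, the overall rate was 0.7 * 0.90 + 0.3 * 0.70 = 0.84. If this week the mix flips to 30%/70% while both product rates stay exactly the same, the overall rate becomes 0.3 * 0.90 + 0.7 * 0.70 = 0.76.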
Shifts in product mix can happen for a variety of reasons. For example, there might have been a marketing push that attracted more applicants to a product with traditionally lower approval rates. Or perhaps a product with historically higher approval rates lost market share. Even slight changes in these proportions can cause the aggregate approval rate to decline.
It is also worth remembering that a drop can register as statistically significant simply because the dataset is large, so significance alone does not explain the cause. An actual distribution shift can be confirmed by segmenting the data by product type and examining application counts: if volume grew for a product with a relatively lower historical approval rate while it shrank for a product with a higher one, the overall approval rate can fall even though no individual product category saw a drop on its own.
Potential Follow-up Questions
Why does shifting the proportion of products cause the overall approval rate to go down even if none of the products individually decreased?
When you combine multiple groups, you are essentially creating a weighted average of each group’s metric. If more weight is placed on groups with lower approval rates, even a small increase within those individual groups may not be enough to offset the overall decrease. This phenomenon is precisely what underpins Simpson’s Paradox, where aggregated data trends can differ dramatically from segment-level trends.
How can we detect if a changing product mix caused the decrease?
A good approach is to break the data down by product category and inspect:
The absolute number of applications for each product week over week.
The relative proportion that each product contributes to the total set of applications.
The approval rates themselves within each product and how they have changed over time.
By examining the counts and proportions side by side with approval rates, you will see if the shift in composition is responsible for the overall metric's movement.
Are there any other factors besides distribution shift that could explain the drop?
One possibility is an external factor affecting the entire population, such as stricter underwriting rules that affect all products in a way that is not yet fully captured at the individual product level. Another possibility could be a delayed data recording or out-of-sync timing, making it appear as if approvals decreased when the data for the most recent week is still incomplete. However, the classical explanation, and the most common, is the shift in relative volumes across products.
How might we demonstrate this in Python?
Below is a small illustrative code snippet that simulates two weeks of product distributions and approval rates. Notice how even if the approval rates for each product stay the same or increase, the overall approval can still decrease when the distribution shifts.
import pandas as pd

# Suppose we have 4 products with last week's distribution and approval rates
data_week1 = {
    'Product': ['P1', 'P2', 'P3', 'P4'],
    'Applications': [1000, 1000, 1000, 1000],
    'ApprovalRate': [0.84, 0.77, 0.81, 0.88]
}

# Approvals can be computed as Applications * ApprovalRate
df_week1 = pd.DataFrame(data_week1)
df_week1['Approvals'] = df_week1['Applications'] * df_week1['ApprovalRate']
overall_week1 = df_week1['Approvals'].sum() / df_week1['Applications'].sum()

# In week 2, each product's approval rate is the same or slightly higher,
# but the distribution of applications has changed
data_week2 = {
    'Product': ['P1', 'P2', 'P3', 'P4'],
    'Applications': [800, 2000, 1200, 600],   # more applications for lower-approving P2
    'ApprovalRate': [0.85, 0.77, 0.82, 0.88]  # same or slightly higher rates
}

df_week2 = pd.DataFrame(data_week2)
df_week2['Approvals'] = df_week2['Applications'] * df_week2['ApprovalRate']
overall_week2 = df_week2['Approvals'].sum() / df_week2['Applications'].sum()

print("Overall approval week 1:", overall_week1)
print("Overall approval week 2:", overall_week2)
In this demonstration:
Each product’s approval rate in week 2 is equal to or higher than in week 1.
Despite that, the overall approval can still go down if more applications shift to a product with a lower approval rate (like product 2 in the example).
Could a scenario like this relate to Simpson’s Paradox?
Yes, this is a textbook example of Simpson’s Paradox: when data is partitioned into several groups, each group can show a particular trend, but when the data is aggregated, the direction of the overall trend can reverse or contradict the individual group trends. In this situation, the partition is by product category, and the contradictory trend is the overall approval rate going down while each product’s rate is stable or higher.
How can organizations safeguard against being misled by overall trends?
They can ensure that they:
Always segment the data by meaningful categories when analyzing overall metrics.
Observe changes in both proportions and absolute numbers over time.
Build data dashboards that automatically highlight significant shifts in distribution.
This avoids drawing false conclusions from a single aggregated metric and helps in pinpointing the root cause—such as distribution shifts—before reacting to the seemingly paradoxical drop in the overall approval rate.
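As a rough sketch of the kind of automated check such a dashboard might run, reusing the df_week1 and df_week2 frames from the earlier snippet (the 2-percentage-point threshold is an arbitrary illustrative choice, not a recommendation):

# Compare each product's share of total applications across the two weeks
share_w1 = df_week1.set_index('Product')['Applications'] / df_week1['Applications'].sum()
share_w2 = df_week2.set_index('Product')['Applications'] / df_week2['Applications'].sum()
share_change = (share_w2 - share_w1).rename('share_change')

# Flag products whose share of volume moved by more than 2 percentage points
flagged = share_change[share_change.abs() > 0.02]
print(share_change)
print("Products with notable mix shifts:")
print(flagged)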
Below are additional follow-up questions
What if the observed decrease is due to normal variance in small sample sizes rather than a real trend?
Even when a drop appears statistically significant, one pitfall is overlooking how the sample size in each product category, or for the overall pool, might be too small to conclude a sustained trend. Small sample sizes can produce wide confidence intervals, meaning the week-over-week change could be within the margin of error. In real-world contexts, if certain products (especially those with lower or higher approval rates) had few applicants relative to other products, an uptick or downtick of a few approvals can shift the overall average dramatically.
From a practical standpoint, you would investigate:
The total volume of applications for each product in both weeks.
Confidence intervals for each product-level approval rate as well as the overall approval rate.
Historical volatility, to see whether a drop of about 3 percentage points is typical week-to-week noise or truly an outlier.
In day-to-day business settings, it’s vital to have thresholds for what constitutes a meaningful change. Even if a standard statistical test indicates significance, you must interpret it in light of real-world context. For example, if overall approval rates for a given week are based on a small batch of highly specialized applicants, the observed drop could be purely noise. Data scientists often cross-validate the finding over multiple weeks before concluding there is a meaningful shift.
Pitfalls include:
Overreacting to a single-week variance.
Failing to factor in seasonal or cyclical trends that might produce dips naturally.
Misinterpreting significance if the product volumes are unbalanced (a few large-volume products and many small-volume products).
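One rough way to sanity-check the noise question is a normal-approximation confidence interval for each product's rate, again reusing df_week2 from the earlier snippet (for very small counts a Wilson or exact binomial interval would be more appropriate):

import numpy as np

# 95% normal-approximation confidence interval for each product's approval rate
p = df_week2['ApprovalRate']
n = df_week2['Applications']
se = np.sqrt(p * (1 - p) / n)
df_week2['ci_low'] = p - 1.96 * se
df_week2['ci_high'] = p + 1.96 * se
print(df_week2[['Product', 'Applications', 'ApprovalRate', 'ci_low', 'ci_high']])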
Could changes in credit policy or external economic factors lead to this paradoxical effect?
Internal policy shifts (such as stricter underwriting criteria) or external factors (like a recession or a policy change by a major lender) can affect applicant quality differently across products. For instance, a more stringent rule might disproportionately affect applicants to one product, but not enough to reduce its approval rate significantly if that product’s baseline acceptance threshold is already high. Meanwhile, the rest of the products might slightly increase their approval rates or remain flat because they are less impacted by the new rule.
This can combine with distribution changes: if new economic conditions make customers gravitate toward products with lower baseline approval rates, you might see the overall metric drop. Even if each product’s own rate is stable (perhaps because any new policy still doesn’t drastically impact them individually), the influx of applicants into the lower-approval product segment pulls down the overall approval.
Real-world edge cases:
Policy changes that only apply to certain types of loans or certain credit segments.
Fluctuating interest rates or promotions that alter how likely a customer is to apply for one product over another.
Macroeconomic shocks that degrade applicant creditworthiness across the board, but not in a uniform way.
How do we distinguish between a distribution-driven decrease and an actual performance drop in each product?
An effective diagnostic is to use a decomposition approach. One straightforward method is to compare the actual overall approval rate in the new week to a hypothetical overall rate if the product mix had stayed the same as the previous week. You can do this by taking last week’s product distribution (in terms of application volumes) and applying the new week’s product-level approval rates to it, then seeing what the resulting overall approval rate would have been.
If that hypothetical overall approval rate remains stable or even shows a rise, then the real observed drop is likely explained by distribution shifts. If, on the other hand, the hypothetical rate also shows a decrease, it indicates that product-level performance (i.e., actual approval outcomes) has also declined in a more fundamental way.
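Continuing the earlier snippet, a minimal version of this decomposition holds last week's application mix fixed and applies this week's product-level rates:

# Hypothetical overall rate: last week's mix, this week's product-level rates
mix_w1 = df_week1['Applications'] / df_week1['Applications'].sum()
counterfactual = (mix_w1 * df_week2['ApprovalRate']).sum()

print("Actual overall rate, week 2:", overall_week2)
print("Week 1 mix with week 2 rates:", counterfactual)
# If the counterfactual stays near week 1's overall rate while the actual week 2
# rate is lower, the drop is driven by the change in product mix rather than by
# product-level performance.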
An edge case here is incomplete or delayed data. If certain approvals haven’t yet been recorded for the new week, the computed product-level approval rates might be artificially deflated, making it appear like there’s a bigger shift in performance than there really is. Ensuring data synchronization and thorough data quality checks is critical.
Can sub-segmentation within each product reveal hidden reasons for the overall decline?
Sometimes, even at the product level, there can be underlying segments (for example, different credit score bands, different geographies, or different customer demographics) with varying approval rates. It might happen that within a single product, a higher proportion of riskier applicants appeared in the second week, keeping the product’s overall approval rate roughly the same but increasing the share of rejections. Meanwhile, less risky applicants might have shifted to another product.
To analyze this thoroughly:
Break down each product’s applicants by sub-segments such as credit score brackets or region.
Check if sub-segments with lower approval probabilities grew in volume.
Combine these insights with an overall look at how each product’s application base changed.
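A minimal pandas sketch of this kind of sub-segment breakdown (the table and column names below, such as credit_band, are toy illustrations rather than a real schema):

import pandas as pd

# Toy per-application data for a single product across two weeks
apps = pd.DataFrame({
    'week':        ['W1'] * 4 + ['W2'] * 4,
    'credit_band': ['high', 'high', 'low', 'low', 'high', 'low', 'low', 'low'],
    'approved':    [1, 1, 1, 0, 1, 1, 0, 0],
})

by_band = (
    apps.groupby(['week', 'credit_band'])
        .agg(applications=('approved', 'size'),
             approval_rate=('approved', 'mean'))
        .reset_index()
)
# Each sub-segment's share of that week's applications for this product
by_band['share_of_week'] = (
    by_band['applications'] / by_band.groupby('week')['applications'].transform('sum')
)
print(by_band)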
Real-world edge cases include:
A marketing campaign targeting new, riskier regions or demographics for a single product.
Seasonal factors causing certain sub-populations to apply more to a specific product.
Minor policy changes targeting niche sub-segments of applicants that only affect certain geographies or credit bands.
Could user behavior or marketing shifts drive more traffic to lower-approval products inadvertently?
User behavior and marketing are major drivers of distribution shifts. Even small changes in promotional materials, product placement on a website, or partner offers can funnel applicants differently:
A newly highlighted product on the main website might attract applicants who are less qualified, even if that product’s typical approval rate is stable.
Certain promotions or advertising campaigns might inadvertently target riskier segments, pushing up the number of applications for products whose typical approval is lower.
Additionally, changes in marketing budgets or strategies that focus on acquiring new users from different channels can shift the applicant pool’s composition. A direct marketing push in a region with historically lower credit scores would boost the volume of applications to certain products, depressing the overall approval if those products are more prone to rejections—even if the product-level approval rate still remains stable.
An edge case is that the marketing team might not realize these shifts unless cross-functional communication is thorough. They might see an overall increase in lead conversions but not realize it’s impacting the quality of applicants for some products. The data science team, upon noticing a paradoxical overall rate dip, should trace back changes in user acquisition strategies to see if that contributed to the difference.
If the total volume stays the same but the overall rate still drops, what might be happening?
It is tempting to assume that a distribution shift must show up as a change in total application volume, but the total can remain roughly constant while the composition changes internally: certain products see fewer applications while others see more, and the sum stays about the same.
For example, if product 1 (with a higher approval rate) had a volume decrease that was offset by a volume increase in product 2 (with a relatively lower approval rate), the net total applications remain stable, but the overall approval can decline. This underscores the importance of analyzing not only total volume but also how that total is partitioned.
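A tiny variation on the earlier demonstration makes this concrete; the numbers are purely illustrative, and the total volume is held fixed at 2,000 applications in both weeks:

# Two products, identical total volume each week, volume reallocated between them
rates = {'P1': 0.88, 'P2': 0.76}        # per-product approval rates, unchanged
apps_w1 = {'P1': 1500, 'P2': 500}       # week 1: more volume on higher-approving P1
apps_w2 = {'P1': 500, 'P2': 1500}       # week 2: same total, shifted toward P2

for label, apps in [('week 1', apps_w1), ('week 2', apps_w2)]:
    total = sum(apps.values())
    approvals = sum(apps[p] * rates[p] for p in apps)
    print(label, '- total applications:', total, '- overall rate:', round(approvals / total, 3))

# Both weeks have 2000 applications, but the overall rate falls from 0.85 to 0.79.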
An additional subtle scenario:
The same applicants might be eligible for multiple products, and small changes in recommendation algorithms or user journeys can steer them differently.
Regulatory or compliance changes might require funneling applicants into certain categories.
Both cases can mask the distribution shift because total volume doesn’t appear to fluctuate dramatically. Yet, behind the scenes, there is a reallocation effect causing the overall approval rate to drop.
How might seasonality or timing create misleading shifts in product applications and approval rates?
Certain times of the year—holidays, end-of-quarter business cycles, or even weather patterns—can produce spikes in one product category without necessarily altering product-level approval rates drastically. Seasonality often overlaps with changes in consumer behavior:
A holiday shopping season might lead to a surge in applications for a product known for funding short-term expenses.
A new year or back-to-school period could boost interest in another product line, shifting distribution again.
If the timing of these patterns is not accounted for, analysts might incorrectly attribute an overall approval drop to performance changes rather than a predictable seasonal spike in the share of applications for a lower-approval product. Edge cases include abrupt external events such as natural disasters or economic disruptions which cause short-lived but significant changes in application patterns that eventually revert to normal.
All these seasonal or timing factors could confound analyses if you only compare week-over-week without recognizing you might be comparing two weeks that are structurally different. A robust solution involves contrasting current trends to historical seasonally adjusted benchmarks, ensuring you’re making like-for-like comparisons.
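One rough way to build that kind of like-for-like comparison is to benchmark each product's current share of volume against a trailing average of prior weeks rather than against last week alone (a trailing average is only a crude stand-in for proper seasonal adjustment, and the weekly table and its column names below are toy illustrations):

import pandas as pd

# Hypothetical weekly application counts per product
weekly = pd.DataFrame({
    'week':         [1, 1, 2, 2, 3, 3, 4, 4],
    'product':      ['P1', 'P2'] * 4,
    'applications': [900, 300, 850, 350, 880, 320, 600, 700],
})

# Each product's share of total volume per week
shares = weekly.pivot(index='week', columns='product', values='applications')
shares = shares.div(shares.sum(axis=1), axis=0)

# Compare the latest week's shares to the trailing 3-week average
baseline = shares.iloc[:-1].tail(3).mean()
latest = shares.iloc[-1]
print('Trailing-average share:\n', baseline)
print('Latest-week share:\n', latest)
print('Deviation from baseline:\n', latest - baseline)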