ML Interview Q Series: Calculating Probability P(X<Y) by Integrating a Joint Density Function
There are two alternative routes for a ship passage. The sailing times for the two routes are random variables X and Y that have the joint density function

\[
f(x,y) = \frac{1}{10}\, e^{-\frac{1}{2}(y+3-x)} \quad \text{for } 5 < x < 10,\ y > x - 3,
\]

and f(x,y) = 0 otherwise. What is P(X < Y)?
Short Compact solution
Using the general formula for probabilities over regions in the (x,y)-plane, we compute

\[
P(X<Y) = \int_{x=5}^{10}\int_{y=x}^{\infty} \frac{1}{10}\, e^{-\frac{1}{2}(y+3-x)}\, dy\, dx.
\]

Performing the inner integration with respect to y gives a constant that does not depend on x. Integrating over x from 5 to 10 then yields

\[
P(X<Y) = e^{-3/2} \approx 0.2231.
\]
Comprehensive Explanation
Setup of the Problem
We have two random variables X and Y representing sailing times on two alternative routes. Their joint probability density function (pdf) is given by

\[
f(x,y) = \frac{1}{10}\, e^{-\frac{1}{2}(y+3-x)} \quad \text{for } 5 < x < 10,\ y > x - 3,
\]

with f(x,y) = 0 outside this region.
This means:
X takes values strictly between 5 and 10.
Y must be greater than x - 3. Equivalently, for a fixed x, the support in y starts at y = x - 3 and extends to ∞.
We want P(X < Y). This event corresponds to the set of all points (x,y) such that x < y. Because the pdf is zero outside the specified region, we only integrate where 5 < x < 10 and y > x - 3. However, for X < Y to hold, we also need y to be bigger than x. When we combine y > x - 3 with y > x, the stricter condition is y > x. Hence the bounds for y become y from x to ∞, while x ranges from 5 to 10.
Setting Up the Integral
The probability P(X<Y) is computed by integrating the joint pdf f(x,y) over the region {(x,y) : 5 < x < 10, x < y}. In text form:
P(X<Y) = ∫ (x from 5 to 10) ∫ (y from x to ∞) [ (1/10) e^(-1/2 (y + 3 - x)) ] dy dx.
Performing the Inner Integral
Fix x in the interval (5,10). Consider:
∫ (y from x to ∞) e^(-1/2 (y + 3 - x)) dy.
Factor out e^(-1/2 (3 - x)) to get:
= e^(-1/2 (3 - x)) ∫ (y from x to ∞) e^(-1/2 y) dy.
The integral ∫ e^(-1/2 y) dy = -2 e^(-1/2 y). Evaluating from y=x to ∞ gives:
[-2 e^(-1/2 y)] from x..∞ = -2(0) - (-2 e^(-1/2 x)) = 2 e^(-1/2 x).
Thus, the inner integral is:
e^(-1/2 (3 - x)) * [2 e^(-1/2 x)] = 2 e^(-1/2 (3)) = 2 e^(-3/2).
Notice that this result does not depend on x once we finish the integration with respect to y.
Completing the Outer Integral
We still have the factor 1/10 from the original pdf. So after the y-integration, the expression inside the x-integral is:
(1/10) * [ 2 e^(-3/2) ] = (2/10) e^(-3/2).
We now integrate this constant from x=5 to x=10:
∫ (x=5 to 10) (2/10) e^(-3/2) dx = (2/10) e^(-3/2) * (10 - 5) = (2/10) e^(-3/2) * 5 = e^(-3/2).
Hence we obtain P(X<Y) = e^(-3/2) ≈ 0.2231.
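As a quick symbolic cross-check (a sketch using sympy; not part of the original derivation), the same double integral can be evaluated exactly:

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)

# Joint pdf on its support (5 < x < 10, y > x - 3): (1/10) * exp(-(y + 3 - x)/2)
f = sp.Rational(1, 10) * sp.exp(-(y + 3 - x) / 2)

# P(X < Y): integrate y from x to infinity, then x from 5 to 10
inner = sp.integrate(f, (y, x, sp.oo))   # = (1/5) * exp(-3/2), independent of x
prob = sp.integrate(inner, (x, 5, 10))

print(sp.simplify(prob))   # exp(-3/2)
print(float(prob))         # ~0.2231
```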
Potential Follow-Up Question 1: Why does the integral for y start at y=x and not y=x-3?
Because the pdf's support requires y > x - 3, while the event X < Y adds the condition y > x. Between the two inequalities, y > x is always the stricter one, since x > x - 3 for every x. In particular, x exceeds 5 in this problem, so x - 3 lies between 2 and 7 and is always below x. Hence the region for X < Y imposes y > x, and we integrate y from x to ∞.
Potential Follow-Up Question 2: Are X and Y independent?
No, they are not. If they were independent, the joint pdf would factor into a product of a function of x alone and a function of y alone over a rectangular (product-set) support. Here the exponential term e^(-1/2 (y + 3 - x)) does factor into an x-part and a y-part, but the support {5 < x < 10, y > x - 3} is not a product set: the admissible values of y depend on x. Independence would require y to range over the same set regardless of x, whereas here y must exceed x - 3. Hence, X and Y are dependent.
Potential Follow-Up Question 3: How would we compute P(X > Y)?
We could follow a nearly identical procedure but integrate over the region x>y. We would set up:
P(X>Y) = ∫ (x=5..10) ∫ (y=-∞..x) f(x,y) dy dx,
but constrained also by y > x - 3 (from the pdf’s support). Because x > 5 on the support, x - 3 > 2, so the inner variable y effectively ranges from x - 3 to x. Carrying out the integration gives P(X>Y) = 1 - e^(-3/2). Equivalently, since P(X<Y) + P(X=Y) + P(X>Y) = 1 and P(X=Y) = 0 for continuous random variables, we get P(X>Y) = 1 - e^(-3/2) ≈ 0.7769.
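For completeness, a sketch of the direct computation confirms this value:

\[
P(X>Y) = \int_{5}^{10}\int_{x-3}^{x} \frac{1}{10}\, e^{-\frac{1}{2}(y+3-x)}\, dy\, dx
= \int_{5}^{10} \frac{2}{10}\left(1 - e^{-3/2}\right) dx
= 1 - e^{-3/2} \approx 0.7769.
\]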
Potential Follow-Up Question 4: How might we approach this problem with a numerical integration approach?
In Python, we could use a numerical integration library (e.g., scipy.integrate.dblquad). We would define the function f(x,y) in code, integrate over x from 5 to 10, and for each x integrate y from x to ∞. Because the tail extends to ∞, we either choose a sufficiently large upper limit or use a routine that can handle infinite bounds. For instance:
```python
import numpy as np
from scipy.integrate import dblquad

def f_xy(y, x):
    # Joint pdf; note that dblquad expects the inner variable (y) as the first argument.
    if 5 < x < 10 and y > x - 3:
        return 0.1 * np.exp(-0.5 * (y + 3 - x))
    return 0.0

# Probability X < Y: for each x in (5, 10), integrate y from x to infinity.
def region_y_lower(x):
    return x

def region_y_upper(x):
    return np.inf

prob_estimate, err = dblquad(f_xy, 5, 10, region_y_lower, region_y_upper)
print(prob_estimate)
```
We would expect the printed result to be extremely close to np.exp(-1.5).
Potential Follow-Up Question 5: Any intuitive interpretation for the result?
The constant factor and the exponential decay come from the assumption that as y increases, the probability density decreases at a certain rate. The region 5<x<10 restricts X, while Y must be larger than x-3. Because we only want X<Y, effectively we are capturing the portion of the exponential tail above y=x. The final number, e^(-3/2), is roughly 0.2231, indicating that the chance of route X taking less time than route Y is about 22.31% under this model.
Below are additional follow-up questions
If the joint distribution were not exponential-like or if the support region changed, how would that affect the approach?
To handle a more general joint density, you would still follow the same conceptual steps: identify the region in the (x,y)-plane where the event X<Y holds and where f(x,y) is non-zero, then set up the corresponding double integral. The primary difference would be in how you perform the integrations and how the bounds are determined. If the support region changes (for instance, if X could take values in a different interval, or if Y had an upper limit instead of infinite range), you would adjust the integration limits accordingly.
A subtlety arises if the region where X<Y overlaps partially with another boundary condition for Y. If, for example, the new distribution only supports y in a specific finite interval, you must ensure that the y bounds for X<Y do not exceed that finite range. This can introduce piecewise integrals if the boundary conditions cause the region of integration to break into multiple subregions.
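As an illustrative sketch of such a bound change (the cap Y_MAX and the function name f_xy_capped are hypothetical choices, not part of the original problem), suppose Y were additionally restricted to lie below some finite value:

```python
import numpy as np
from scipy.integrate import dblquad

Y_MAX = 20.0  # hypothetical upper limit on Y, purely for illustration

def f_xy_capped(y, x):
    # Same functional shape as the original density, restricted to y < Y_MAX.
    # NOTE: a genuine model with this support would also need the density
    # re-normalized to integrate to 1; this sketch only shows how the
    # integration bounds change.
    if 5 < x < 10 and x - 3 < y < Y_MAX:
        return 0.1 * np.exp(-0.5 * (y + 3 - x))
    return 0.0

# For P(X < Y) the inner variable y now runs from x up to Y_MAX (not infinity);
# if the cap could dip below x, the bounds would have to be clipped piecewise.
prob, err = dblquad(f_xy_capped, 5, 10, lambda x: x, lambda x: Y_MAX)
print(prob)
```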
How would you use a change of variables or transformations to simplify the integral?
Sometimes, rewriting the variables can make the integral simpler. For example, you could define:
u = x
v = y - x
Then y = u + v, and the Jacobian determinant of the transformation is 1. The event X < Y becomes v > 0, and the original support constraints translate to 5 < u < 10 and v > -3 (from y > x - 3). You would rewrite f(x,y) as f(u, u + v) in the new coordinates and integrate over u from 5 to 10 and over v from 0 to ∞, since v > 0 is the binding constraint once the event X < Y is imposed.
A potential pitfall here is incorrectly determining the new bounds in terms of v>0 versus v> -3, or mixing up the sign of the exponent. You must carefully check how the support condition y>x-3 translates to v> -3.
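For this particular density the calculation in the new coordinates collapses to a product of two elementary integrals, because f(u, u+v) no longer depends on u:

\[
P(X<Y) = \int_{u=5}^{10}\int_{v=0}^{\infty} \frac{1}{10}\, e^{-\frac{1}{2}(v+3)}\, dv\, du
= 5 \cdot \frac{1}{10} \cdot 2\, e^{-3/2}
= e^{-3/2}.
\]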
Could the event X=Y have non-zero probability if we extend or modify the distribution?
Under most continuous distributions (like the exponential-based joint distribution given), the event X=Y has zero probability because a single line in a continuous two-dimensional domain has measure zero. However, in some real-world shipping scenarios, route times might be modeled with discrete distributions (e.g., integer number of days). In a discrete or mixed continuous-discrete case, P(X=Y) can be non-zero if there is a positive probability that both X and Y take the same discrete value.
A subtlety in real-world data is partial discretization: for instance, route times are often recorded in entire days or half-days. This partial discretization might be accounted for by a continuous distribution with a strong peak at certain integer values. Then X=Y might still theoretically have probability zero in a purely continuous sense, but approximate computations on binned data might yield non-trivial mass at X=Y. Interviewers sometimes ask how you handle these corner cases when data is not purely continuous.
What happens if real shipping times are truncated or censored?
In practice, shipping times cannot exceed certain operational limits or might be left-censored if data collection started after a specific date. If the distribution is truncated above a certain point, you must re-normalize the pdf so that it integrates to 1 over the truncated region. Then the integral for P(X<Y) would have upper limits set by this truncation (say, Y cannot exceed 20 days).
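Concretely, using the 20-day cap on Y mentioned above as a hypothetical example, the truncated density is the original density rescaled by the probability mass it retains:

\[
f_{\text{trunc}}(x,y) = \frac{f(x,y)}{\displaystyle\int_{5}^{10}\int_{x'-3}^{20} f(x',y')\, dy'\, dx'}
\quad \text{for } 5 < x < 10,\ x - 3 < y < 20,
\]

and P(X<Y) would then be computed by integrating f_trunc over y from x up to 20.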
Censoring changes the approach similarly. For instance, if any sailing time above 15 days is reported only as "15+ days," you do not know the exact sample values beyond 15 days, so you must handle partial or total censoring. Methods from survival analysis or Bayesian updating might be used to estimate the distribution. A pitfall is forgetting to account for the missing or partially observed data and still applying the formula as if full data were available.
How would you interpret and manage correlated extremes in real-world data?
If X and Y represent routes that share environmental conditions (e.g., storms, port congestion), they could be heavily correlated in extreme cases. The simple parametric form given in the question might underestimate the joint probability of both routes having very long times simultaneously. In extreme value theory, tail dependence can be stronger than the average correlation suggests.
When dealing with correlated extremes, one might switch to a different joint distribution that accommodates heavy tails (like certain copula-based models or other heavy-tailed distributions). The integration approach for P(X<Y) still applies, but the functional form of f(x,y) can be more involved, requiring numerical methods or approximations. A pitfall is ignoring that in the real world, if route X experiences delays, route Y might also experience delays (e.g., due to the same weather system).
How would you estimate parameters for this model from real data, and what pitfalls might arise?
Parameter Estimation
You might assume f(x,y) has an exponential structure and estimate parameters via maximum likelihood. Concretely, if you propose that the joint density is proportional to exp(-1/2(y + 3 - x)) within the region, you would gather data pairs (x_i, y_i) for i=1..n, form the log-likelihood, and differentiate to find the parameter(s).
If the data strongly deviates from the assumed structure, you could get biased or inconsistent estimates.
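As a sketch (the one-parameter family below is an assumption for illustration, not something given in the problem): keep the same support but treat the exponential rate as unknown, f_λ(x,y) = (λ/5) e^(-λ(y + 3 - x)) for 5 < x < 10, y > x - 3, which reduces to the stated density at λ = 1/2. For data pairs (x_i, y_i), the log-likelihood and its maximizer are

\[
\ell(\lambda) = n\log\frac{\lambda}{5} - \lambda \sum_{i=1}^{n}\bigl(y_i + 3 - x_i\bigr),
\qquad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n}\bigl(y_i + 3 - x_i\bigr)}.
\]

Because the support does not depend on λ, setting the derivative of the log-likelihood to zero gives this closed-form estimate directly.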
Pitfalls
Overfitting: If you try to incorporate too many parameters or a more complex joint distribution without sufficient data, you risk overfitting.
Model misspecification: If the real relationship is not truly exponential-based, forcing that functional form could yield systematically biased results for P(X<Y).
Data Quality: Real shipping data might have outliers, missing records, or rounding. This can distort parameter estimates unless carefully handled.
How does numerical instability occur in the integration process, and how can we mitigate it?
When integrating functions of the form exp(-1/2(y + something)), you might encounter floating-point underflow if y becomes large, especially in languages or libraries with finite precision floating-point arithmetic. This can cause the integrand to evaluate to zero numerically even though it is not theoretically zero.
To mitigate this, you can:
Use log-transformations during integration (i.e., log-sum-exp techniques).
Employ libraries that perform adaptive quadrature or that handle integrals over infinite ranges with specialized transformations.
If you anticipate large exponents, re-center or scale the integrand so that it remains within a numerically stable range (a minimal sketch follows below).
A pitfall is not paying attention to small floating-point values and accidentally concluding the integral is zero for large y-range. This can lead to large errors in the computed probability.
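A minimal sketch of the kind of re-centering described above (this is one possible rearrangement, not the only one): substituting v = y - x removes the dependence on x from the integrand, keeps the exponent in a modest range for the values of v that matter, and reduces the computation to a single one-dimensional integral:

```python
import numpy as np
from scipy.integrate import quad

# After substituting v = y - x, the integrand for P(X < Y) no longer depends on x:
# P(X < Y) = (10 - 5) * integral from 0 to infinity of (1/10) * exp(-0.5 * (v + 3)) dv
def integrand(v):
    return 0.1 * np.exp(-0.5 * (v + 3))

tail, err = quad(integrand, 0, np.inf)   # quad handles the infinite upper limit
prob = 5 * tail
print(prob, np.exp(-1.5))                # both approximately 0.2231
```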
What if we wanted P(X<Y) but we suspect that X and Y might be swapped in some scenarios?
In some real-world contexts, you might not know which route time is X and which is Y. Or you might suspect that the data labeled “X route” is sometimes used by a different shipping company or the route designations occasionally got reversed in the data. If your labeling of X vs. Y is ambiguous, you may have to consider a mixture or data-likelihood approach that accounts for the probability of labeling error.
A subtle modeling approach might be to define latent variables indicating which route is which in each observation. This can lead to a more complicated inference problem requiring iterative methods like EM algorithms. Failing to account for such label-swapping in the data can result in incorrect estimates of P(X<Y).
How would you validate or sanity-check that e^(-3/2) is approximately 0.2231?
A quick numerical check: evaluate exp(-1.5) in most programming languages or calculators. In Python, running import math; math.exp(-1.5) gives approximately 0.22313016.
A deeper sanity check involves verifying that the exponent parameter (1/2) and the shift (3-x) were integrated correctly. Even a small arithmetic error in the exponent or factor of 1/2 could significantly change the result. Another way is to approximate the integral by subdividing the domain and doing a Riemann sum. If the result matches the analytical expression, that’s a good indication your derivation is correct.
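A minimal sketch of such a Riemann-sum check (the grid sizes and the finite cutoff Y_CUT are arbitrary choices; the exponential tail beyond the cutoff contributes negligibly):

```python
import numpy as np

# Midpoint Riemann sum over 5 < x < 10 and x < y < Y_CUT.
Y_CUT = 60.0
nx, ny = 200, 2000
dx, dy = 5.0 / nx, Y_CUT / ny
xs = 5.0 + (np.arange(nx) + 0.5) * dx    # midpoints in x
ys = (np.arange(ny) + 0.5) * dy          # midpoints in y

X, Y = np.meshgrid(xs, ys, indexing="ij")
density = 0.1 * np.exp(-0.5 * (Y + 3 - X))
mask = Y > X                             # the event X < Y (this also implies y > x - 3)
approx = np.sum(density * mask) * dx * dy

print(approx)   # should agree with exp(-1.5) ≈ 0.2231 to roughly three decimal places
```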