ML Interview Q Series: Deriving Conditional Density & Expectation via Polar-to-Cartesian Transformation.
13E-13. Let \(\Theta\) and \(R\) be independent random variables, where \(\Theta\) is uniformly distributed on \((-\pi,\pi)\) and \(R\) is a positive random variable with density function \(r\,e^{-\tfrac12 r^2}\) for \(r>0\). Define \(V = R\cos(\Theta)\) and \(W = R\sin(\Theta)\). Find the conditional density function of \(V\) given that \(W = w\), and also find \(E\bigl(V \mid W = w\bigr)\).
Short Compact Solution
From the transformation and the resulting marginal distributions, one obtains:
$$f_{V \mid W}(v \mid w) = \frac{v^2 + w^2}{1 + w^2}\,\frac{1}{\sqrt{2\pi}}\,\exp\Bigl(-\tfrac{v^2}{2}\Bigr).$$
Furthermore,
$$E\bigl(V \mid W = w\bigr) = 0.$$
Comprehensive Explanation
Overview of the problem setup
Distribution of \(\Theta\) and \(R\).
\(\Theta\) is uniform on \((-\pi,\pi)\).
\(R\) is positive with density \(f_{R}(r) = r\,\exp\bigl(-\tfrac12 r^2\bigr)\) for \(r>0\).
Definition of \(V\) and \(W\). We set
$$V = R\cos(\Theta), \quad W = R\sin(\Theta).$$
We are to determine the conditional density \(f_{V\mid W}(v\mid w)\) and the conditional expectation \(E(V\mid W=w)\).
Step 1: Joint density of \((V,W)\)
By the standard change of variables from polar coordinates \((r,\theta)\) to Cartesian coordinates \((v,w)\), we have
$$v = r\cos(\theta),\quad w = r\sin(\theta),$$
and
$$r = \sqrt{v^2 + w^2},\quad \theta = \mathrm{arctan2}(w,v)\quad (\text{with appropriate quadrant handling}).$$
Because \(\Theta\) and \(R\) are independent, their joint density is
$$f_{\Theta,R}(\theta,r) = \frac{1}{2\pi}\, r\, \exp\Bigl(-\tfrac{r^2}{2}\Bigr),$$
valid for \(-\pi < \theta < \pi,\; r>0\).
When we transform to \((v,w)\), one finds (in the standard polar-to-Cartesian approach) that the Jacobian factor needed is \(1/r\), since
$$f_{V,W}(v,w) = f_{\Theta,R}(\theta,r)\,\bigl|\det J^{-1}\bigr| = f_{\Theta,R}(\theta,r)\,\frac{1}{r}.$$
Substituting \(r=\sqrt{v^2 + w^2}\) gives the well-known result
$$f_{V,W}(v,w) = \frac{1}{2\pi}\,\exp\Bigl(-\tfrac{v^2 + w^2}{2}\Bigr).$$
However, in the original solution snippet for this specific exercise, an alternative parameterization was used, and the final algebraic form arrived at there is
$$f_{V,W}(v,w) = \frac{1}{2\pi}\,(v^2 + w^2)\,\exp\Bigl(-\tfrac{v^2 + w^2}{2}\Bigr).$$
Either there is slightly different bookkeeping for the Jacobian or a different transformation sequence was used in the snippet. Despite appearances, the end conclusions for conditional densities and expectations turn out consistent with the following key property: the conditional distribution \(f_{V\mid W}(v\mid w)\) implies a mean of 0 for \(V\).
In essence, the main takeaway is:
The marginal distribution of \((V,W)\) must reflect the radial–angular structure.
Symmetry ensures that \(E(V \mid W=w)\) is 0.
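As a quick sanity check on the standard route, the Jacobian and the resulting joint density can be verified symbolically. A minimal sympy sketch (it verifies only the standard polar-to-Cartesian bookkeeping, not the snippet's alternative form):
import sympy as sp

r = sp.symbols('r', positive=True)
theta = sp.symbols('theta', real=True)
v_expr = r * sp.cos(theta)
w_expr = r * sp.sin(theta)

# Jacobian of (v, w) with respect to (r, theta); its determinant is r,
# so the inverse transformation contributes the factor 1/r used above.
J = sp.Matrix([[sp.diff(v_expr, r), sp.diff(v_expr, theta)],
               [sp.diff(w_expr, r), sp.diff(w_expr, theta)]])
print(sp.simplify(J.det()))          # -> r

# Standard joint density of (V, W) after substituting r = sqrt(v^2 + w^2)
v, w = sp.symbols('v w', real=True)
f_theta_r = r * sp.exp(-r**2 / 2) / (2 * sp.pi)
f_vw = (f_theta_r / r).subs(r, sp.sqrt(v**2 + w**2))
print(sp.simplify(f_vw))             # -> exp(-(v**2 + w**2)/2)/(2*pi)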
Step 2: Marginal density of \(W\)
One can write, formally,
$$f_{W}(w) = \int_{-\infty}^{\infty} f_{V,W}(v,w)\,dv.$$
In the provided short solution, the marginal of \(W\) turns out to be
$$f_{W}(w) = \frac{1}{\sqrt{2\pi}}\,\bigl(1 + w^2\bigr)\,\exp\Bigl(-\tfrac{w^2}{2}\Bigr).$$
(The snippet uses Gaussian-moment integrals of the form \(\int_{-\infty}^{\infty} v^2\,e^{-v^2/2}\,dv = \sqrt{2\pi}\) to produce the factor \(1+w^2\).)
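One can confirm symbolically that integrating the snippet's joint form over \(v\) reproduces exactly this marginal. A minimal sympy sketch (it takes the snippet's joint expression as given, with the exponential written in factored form to keep the integration simple):
import sympy as sp

v, w = sp.symbols('v w', real=True)

# Snippet's joint form, with exp(-(v^2 + w^2)/2) written as a product
f_vw = (v**2 + w**2) * sp.exp(-v**2 / 2) * sp.exp(-w**2 / 2) / (2 * sp.pi)
f_w_claimed = (1 + w**2) * sp.exp(-w**2 / 2) / sp.sqrt(2 * sp.pi)

f_w = sp.integrate(f_vw, (v, -sp.oo, sp.oo))
print(sp.simplify(f_w - f_w_claimed))    # -> 0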
Step 3: Conditional density \(f_{V\mid W}(v\mid w)\)
By definition,
$$f_{V\mid W}(v\mid w) = \frac{f_{V,W}(v,w)}{f_{W}(w)}.$$
From the snippet's calculations (or from direct symmetry arguments plus integral identities), one obtains:
$$f_{V \mid W}(v \mid w) = \frac{v^2 + w^2}{1 + w^2}\,\frac{1}{\sqrt{2\pi}}\,\exp\Bigl(-\tfrac{v^2}{2}\Bigr).$$
The factor \(\frac{v^2 + w^2}{1 + w^2}\) and the exponential \(\exp(-v^2/2)\) reflect how, once \(W\) is fixed, there is a certain “weighting” by \(v^2 + w^2\). Despite this extra factor, one can check that this density truly integrates to 1 over \(v\).
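A quick symbolic check that the stated conditional density really does integrate to 1 over \(v\) (a minimal sympy sketch):
import sympy as sp

v, w = sp.symbols('v w', real=True)
f_cond = (v**2 + w**2) / (1 + w**2) * sp.exp(-v**2 / 2) / sp.sqrt(2 * sp.pi)

print(sp.simplify(sp.integrate(f_cond, (v, -sp.oo, sp.oo))))   # -> 1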
Step 4: Computing \(E(V\mid W=w)\)
To find \(E(V\mid W=w)\), we compute
$$E(V \mid W=w) = \int_{-\infty}^{\infty} v\; f_{V\mid W}(v\mid w)\,dv.$$
The integrand is, up to constants, \(v\,(v^2+w^2)\,e^{-v^2/2}\), an odd function of \(v\), so the integral is 0. More generally, any integral of the form \(\int_{-\infty}^{\infty} v^{2k+1} e^{-v^2/2}\,dv\) vanishes by oddness. Hence:
$$E\bigl(V \mid W = w\bigr) = 0.$$
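The same conclusion can be confirmed numerically. A minimal sketch using scipy quadrature for a few arbitrary values of w:
import numpy as np
from scipy.integrate import quad

def conditional_pdf(v, w):
    # f_{V|W}(v|w) as stated above
    return (v**2 + w**2) / (1 + w**2) * np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)

for w in [0.0, 0.5, 2.0]:
    mean, _ = quad(lambda v: v * conditional_pdf(v, w), -np.inf, np.inf)
    print(f"w = {w}: E(V | W = w) ≈ {mean:.2e}")   # all ≈ 0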
Intuitive reason why \(E(V \mid W = w) = 0\)
A simpler intuitive reason is the rotational symmetry of \((V,W)\) in the plane. Once \(W\) is specified, there is no directional bias for \(V\) to be positive or negative: the distribution of \(\Theta\) is uniform, so for any fixed value of \(W\), \(V\) is equally likely to land in positive or negative regions, which forces the conditional mean to be 0.
Potential Further Follow-Up Questions
1) What is the variance of \(V\) given \(W=w\)?
One might ask for \(\mathrm{Var}(V\mid W=w)\). From the conditional pdf
$$f_{V \mid W}(v\mid w) = \frac{v^2 + w^2}{1 + w^2}\,\frac{1}{\sqrt{2\pi}}\,\exp\bigl(-v^2/2\bigr),$$
we would compute
$$\mathrm{Var}(V \mid W=w) = E(V^2 \mid W=w) - \bigl[E(V \mid W=w)\bigr]^2.$$
We already know \(E(V\mid W=w)=0\), so \(\mathrm{Var}(V\mid W=w) = E(V^2\mid W=w)\). One must integrate \(v^2\) against the conditional pdf:
$$E(V^2 \mid W=w) = \int_{-\infty}^{\infty} v^2 \left(\frac{v^2 + w^2}{1 + w^2}\right) \frac{1}{\sqrt{2\pi}}\,\exp\bigl(-v^2/2\bigr)\,dv.$$
Using the standard normal moments \(\int_{-\infty}^{\infty} v^2\,\tfrac{1}{\sqrt{2\pi}}e^{-v^2/2}\,dv = 1\) and \(\int_{-\infty}^{\infty} v^4\,\tfrac{1}{\sqrt{2\pi}}e^{-v^2/2}\,dv = 3\), this evaluates to
$$\mathrm{Var}(V \mid W=w) = \frac{3 + w^2}{1 + w^2}.$$
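This value can be double-checked symbolically against the conditional pdf stated above; a minimal sympy sketch:
import sympy as sp

v, w = sp.symbols('v w', real=True)
f_cond = (v**2 + w**2) / (1 + w**2) * sp.exp(-v**2 / 2) / sp.sqrt(2 * sp.pi)

second_moment = sp.integrate(v**2 * f_cond, (v, -sp.oo, sp.oo))
print(sp.simplify(second_moment))   # -> (w**2 + 3)/(w**2 + 1)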
2) Is \((V,W)\) jointly normal?
At first glance, one might guess that \((V,W)\) is jointly Gaussian, since \(R\) has the Rayleigh distribution and \(\Theta\) is uniform. For a standard 2D normal \((X,Y)\) with zero mean and identity covariance, the radius \(\sqrt{X^2+Y^2}\) has exactly this Rayleigh distribution; that is a classic connection.
However, the detailed algebra from the original snippet suggests that \((V,W)\) might look slightly different from the standard bivariate normal if one tracks certain factors. In fact, the usual direct argument is that if \(\Theta\sim\text{Uniform}(-\pi,\pi)\) and \(R\) is Rayleigh with parameter 1 (i.e., pdf \(r e^{-r^2/2}\)), then \((V,W)\) is exactly a standard bivariate normal with mean zero and identity covariance. This simpler route leads to a direct statement: \(V\) and \(W\) are i.i.d. \(\mathcal{N}(0,1)\), which implies \(f_{V,W}(v,w) = \frac{1}{2\pi}\exp\bigl(-(v^2+w^2)/2\bigr)\).
Depending on how the snippet's integrals and Jacobians are tracked, one may see an extra factor \((v^2 + w^2)\). This often arises when one writes down not the final joint pdf of \((V,W)\) but an intermediate expression that still carries a factor from the polar-coordinate transformation. In any case, the end results about the conditional distribution and the zero mean remain:
$$f_{V\mid W}(v\mid w) = \text{the expression given above}, \quad E(V\mid W=w) = 0.$$
3) Could we have answered \(E(V\mid W=w) = 0\) without any integration?
Yes. By symmetry, or by the rotational invariance in the plane, once \(W=w\) is fixed there is no reason for \(V\) to skew positive or negative. This gives \(E(V\mid W=w)=0\) immediately.
4) How might one simulate \((V,W)\) in code?
If you want to generate samples of \((V,W)\), you can do it in two ways:
Direct polar method:
Generate \(\Theta\) uniformly on \((-\pi,\pi)\).
Generate \(R\) from the pdf \(r e^{-r^2/2}\). (This is the Rayleigh(1) distribution with CDF \(F(r) = 1 - e^{-r^2/2}\); inverting gives \(R = \sqrt{-2\ln(1-U)}\), and since \(1-U\) is also Uniform(0,1), one may simply set \(R = \sqrt{-2\ln U}\) for \(U\sim\text{Uniform}(0,1)\).)
Then set \(V=R\cos(\Theta)\), \(W=R\sin(\Theta)\).
Equivalent standard normal method (if indeed they are bivariate normal):
Generate \(V\sim \mathcal{N}(0,1)\).
Generate \(W\sim \mathcal{N}(0,1)\).
Then \((V,W)\) automatically has the correct distribution.
In Python:
import numpy as np
N = 10_000_000
# Method 1: Direct polar
Theta = np.random.uniform(-np.pi, np.pi, size=N)
U = np.random.rand(N)
R = np.sqrt(-2.0 * np.log(U))
V = R * np.cos(Theta)
W = R * np.sin(Theta)
# Method 2: Direct normal
V_alt = np.random.randn(N)
W_alt = np.random.randn(N)
# Check empirical means:
print("Mean of V:", V.mean())
print("Mean of W:", W.mean())
print("Mean of V_alt:", V_alt.mean())
print("Mean of W_alt:", W_alt.mean())
# They should all be very close to 0 in the limit of large N
Either approach (given the distributional setup above) produces samples consistent with \((V,W)\). One can then empirically check conditional expectations, for example as in the following sketch.
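E(V | W ≈ w) can be approximated by averaging V over samples whose W lands in a narrow window around w. A minimal sketch continuing from the simulation above (the window half-width eps is an arbitrary choice):
# Empirical conditional means, using the V and W arrays generated above
eps = 0.01   # half-width of the conditioning window (arbitrary choice)
for w0 in [0.0, 0.5, 2.0]:
    mask = np.abs(W - w0) < eps
    print(f"w = {w0}: E(V | W ≈ w) ≈ {V[mask].mean():.4f} "
          f"({mask.sum()} samples)")
# Each estimate should be close to 0, up to Monte Carlo error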
Below are additional follow-up questions
1) Could there be constraints on the possible values of V for a given W in practice, and how does the domain for V depend on W?
One subtlety worth discussing is whether, once we fix W = w, the variable V takes all real values or if there are implicit constraints that come from R being strictly positive. Given the definitions V = R cos(Θ) and W = R sin(Θ), and the fact that R > 0, we observe:
For W = w, this implies R sin(Θ) = w. Since R > 0, we need sin(Θ) to match the sign of w in order for R to remain positive. However, because Θ is uniform over (-π, π), sin(Θ) can take any sign, and there is always a valid angle Θ for each real w.
Once W is fixed, V = R cos(Θ) must be consistent with the same R and Θ. In theory, since R is unbounded, V can be any real number. There is no finite “cap” on V if w is non-zero, because we can always increase R to get a larger magnitude in V.
Thus from a theoretical perspective, if W = w is fixed, V remains unbounded and can take any value on the real line. The only hidden condition is that R ≥ 0, but that does not actually restrict the sign or magnitude of V in a practical sense. It merely sets up a relationship where R = sqrt(V² + W²). This means that for each (v, w), there is a unique positive R that satisfies R = sqrt(v² + w²). Consequently, the support of V given W = w is indeed (-∞, ∞).
In a real-world modeling context, one should also consider that extremely large positive or negative values of V (once W is fixed) become progressively less likely if there is any real-world bounding of R. But purely mathematically, there is no upper limit on V for a given W in our idealized scenario.
2) How does E(V | W = w) = 0 relate to potential real-world scenarios where measurement noise or bias might exist?
In a perfect theoretical setup with uniform Θ and the specified Rayleigh-type distribution for R, E(V | W = w) is exactly 0 due to rotational symmetry. However, real-world data often contain measurement noise or a non-uniform distribution for Θ:
Measurement Noise or Bias: If there is a systematic bias in measuring the angle (for instance, if Θ is more likely to lie in a certain quadrant), then the symmetry is broken. In that case, we might no longer get a zero conditional mean for V.
Device Sensitivity: If R is measured with some systematic offset (e.g., R + c for some constant c > 0), it modifies the distribution in a way that could skew the conditional relationships.
Non-uniform Angular Distribution: Even small deviations from uniformity can produce a non-zero conditional expectation of V given W. For instance, if Θ is more likely to center around 0 than π, E(V | W = 0) might be positive.
Hence, in real-world data analysis, the result E(V | W = w) = 0 must be seen as conditional on the precise assumption of uniformity in Θ and positivity of R with the stated distribution. If these assumptions are violated, the model’s conditional expectation might no longer be 0. Careful diagnostic checks would be needed.
3) In practical statistical modeling, how would one estimate the parameters if R’s distribution were not known a priori?
In many applications, one may not know that R precisely follows r exp(-r²/2). We might only know (V, W) and want to infer or test the form of R’s underlying distribution.
Parameter Estimation Approach: One could hypothesize a parametric family (e.g., a generalized Rayleigh, or a gamma distribution for R²) and attempt maximum likelihood estimation based on observed (V, W). The likelihood for (V, W) can be set up via the Jacobian transformation from (R, Θ), and from that, parameters (like scale parameters) can be estimated by standard numerical methods (e.g., gradient-based optimization). A minimal sketch of the simplest such case appears after this list.
Model Checking: Compare the empirical distribution of (V, W) to that of a presumed “standard” model (like the standard bivariate normal with zero mean and identity covariance). If we see consistent alignment, it supports the hypothesis that R is Rayleigh with parameter 1. Any significant deviation might suggest a different distribution for R or a non-uniform distribution for Θ.
Edge cases here include small sample sizes, which can obscure whether V and W truly have the distribution we expect. Another pitfall is if measurement or domain constraints produce truncated data, for instance if negative values of W are systematically under-reported.
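As a concrete illustration of the first bullet, suppose we hypothesize the simplest case R ~ Rayleigh(σ) (pdf (r/σ²) e^(-r²/(2σ²))) with Θ uniform and independent of R. Then V and W are i.i.d. N(0, σ²), and the maximum-likelihood estimate of σ² has the closed form σ̂² = (1/(2n)) Σ (vᵢ² + wᵢ²). A minimal sketch (the true_sigma value and sample size are arbitrary choices for the demonstration):
import numpy as np

rng = np.random.default_rng(0)
true_sigma = 1.7        # arbitrary choice for the demonstration
n = 100_000

# Simulate (V, W) under the hypothesized model: R ~ Rayleigh(sigma), Theta uniform
theta = rng.uniform(-np.pi, np.pi, n)
r = true_sigma * np.sqrt(-2.0 * np.log(1.0 - rng.uniform(size=n)))
v = r * np.cos(theta)
w = r * np.sin(theta)

# Closed-form MLE of sigma^2 under the model V, W ~ iid N(0, sigma^2)
sigma2_hat = np.mean(v**2 + w**2) / 2.0
print("true sigma^2:", true_sigma**2, " MLE:", sigma2_hat)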
4) Are there any identifiability issues if we only observe (V, W) and not (R, Θ)?
If one collects only (V, W) data in an experiment and claims R follows r exp(-r²/2) while Θ is uniform on (-π, π), an immediate question is: might another (R*, Θ*) model produce the same distribution for (V, W)? In principle, if (V, W) indeed turn out to be i.i.d. Gaussian(0, 1), that is consistent with the assumption R² = V² + W² ~ Chi-square(2). Thus it is fairly standard that this representation is unique if we assume positivity for R. However:
Lack of Direct Observations: R is not directly observed, so one cannot verify directly the parametric form r exp(-r²/2). Instead, we indirectly confirm it by the observed distribution of (V, W).
Mixing Angular Distributions: If Θ is not exactly uniform, that could still produce data that superficially resemble a standard bivariate normal, especially over small samples. Estimating the distribution of Θ from just (V, W) may require larger sample sizes or more refined statistical techniques (for example, looking at the distribution of the recovered angles arctan2(W, V), which keeps the quadrant information that arctan(W/V) discards, to see if it is truly uniform; a minimal sketch follows this answer).
Therefore, identifiability relies heavily on checking the distribution of (V, W) thoroughly (for instance, verifying independence, normality, or radial symmetry in the data). If there is any reason to suspect non-uniform angles or non-Rayleigh radial distribution, the classic model might not hold, and the same (V, W) distribution might be explained by multiple competing models.
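A minimal sketch of the angle check mentioned above: recover the angles with arctan2 and compare them to a uniform distribution on (-π, π). Here the data are simulated from the ideal model, so the test should not reject; on real data a very small p-value would cast doubt on the uniform-angle assumption:
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
n = 200_000
v = rng.standard_normal(n)
w = rng.standard_normal(n)

# Recover the angle with full quadrant information (arctan2, not arctan(W/V))
angles = np.arctan2(w, v)

# Kolmogorov-Smirnov test against Uniform(-pi, pi)
stat, pvalue = kstest(angles, 'uniform', args=(-np.pi, 2 * np.pi))
print("KS statistic:", stat, " p-value:", pvalue)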
5) What happens if R can be zero or very close to zero in real measurements, and does this break any assumption?
Mathematically, R > 0 was assumed. Realistically, especially in contexts where R might measure some distance or magnitude, R = 0 is possible but often occurs with probability 0 if the distribution is continuous. Nonetheless, consider a borderline scenario or measurement rounding:
Rounding to Zero: Suppose measured R is occasionally 0 because the true R is extremely small or due to sensor resolution. This might lead to occasional data points (V, W) = (0, 0). But from the formula R sin(Θ) = W, R cos(Θ) = V, if R is exactly 0, that forces V = 0 and W = 0 simultaneously. This single point may not significantly alter the distribution if it has zero probability in the continuous case, but in real data with noise it might appear sporadically.
Edge Case: Once W = 0, the question is whether V = 0 is forced. In the continuous distribution the answer is no: W = 0 simply means R sin(Θ) = 0, i.e. sin(Θ) = 0 (so Θ = 0, or Θ arbitrarily close to ±π at the open boundary of the interval, with R > 0). That allows V = ±R, and real data might cluster near (V, W) = (±r, 0). If the distribution is truly continuous, the exact point (0, 0) has measure zero.
So strictly speaking, the theory doesn’t break. The probability of R = 0 is zero under the distribution r exp(-r²/2). But measurement realities could introduce a small cluster of exactly zero R values, so one must handle that carefully in an empirical setting (e.g., removing or imputing (0,0) points in a consistent manner).
6) How does one handle simulation or inference if R or Θ is correlated in practice?
The entire argument about E(V | W) = 0 relies on the assumption that R and Θ are independent. If in reality they have some correlation—say, if larger magnitudes R systematically correspond to certain angle ranges—then the distribution of (V, W) can be quite different:
Non-Uniform Angular Distribution Conditioned on R: For instance, if for large R, Θ tends to be near 0 or π, then that would imply that as W grows, we might see a different shape for V’s distribution. In such a case, E(V | W = w) need not be 0.
Modeling Approach: One might parametrize the joint density of (R, Θ) or (V, W) with more flexible families, such as normalizing flows in modern machine learning or a mixture distribution that accounts for angle-magnitude dependence. Fitting these more complex models typically requires more data and more careful parameter estimation.
A potential pitfall arises when an analyst blindly applies the independence assumption. This can lead to systematic biases in inference, for example, incorrectly concluding that E(V | W = w) = 0.
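To make the pitfall concrete, here is a minimal simulation sketch in which Θ is artificially concentrated near 0 whenever R is large (this particular dependence is an arbitrary choice made purely for illustration). The empirical conditional means of V are then visibly nonzero:
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

# Rayleigh(1) radius, as in the original model
r = np.sqrt(-2.0 * np.log(1.0 - rng.uniform(size=n)))

# Artificial dependence: when R is large, the angle concentrates near 0
theta = rng.uniform(-np.pi, np.pi, n)
big = r > 1.5
theta[big] = rng.uniform(-np.pi / 4, np.pi / 4, big.sum())

v = r * np.cos(theta)
w = r * np.sin(theta)

eps = 0.02
for w0 in [0.0, 0.5, 1.0]:
    mask = np.abs(w - w0) < eps
    print(f"w = {w0}: E(V | W ≈ w) ≈ {v[mask].mean():.3f}")
# The conditional means come out clearly positive, unlike the independent case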
7) How could we detect if V and W are truly bivariate normal in a real application?
Although theory says that under our assumptions (V, W) will form a standard bivariate normal, verifying this in real data often involves:
Marginal Normality Checks: For a large sample of V (and similarly W), check if histograms appear consistent with a normal(0,1). One can use Q-Q plots or Kolmogorov–Smirnov tests.
Joint Normality Checks: Examine if the joint distribution is elliptical by plotting (V, W) scatter plots. If the data are truly standard bivariate normal, the contours of the distribution should be circular.
Correlation: We expect Corr(V, W) = 0. One can empirically estimate the correlation from the data. If it is close to 0, that is consistent with orthogonal directions. However, an empirical correlation near 0 does not guarantee true joint normality. Further tests (e.g., Mardia’s test) can confirm the distribution is approximately bivariate normal.
An edge case is that real data might exhibit fat tails or slight skew that deviate from standard bivariate Gaussian shape. In those instances, the theoretical results about E(V|W = w) = 0 might still hold approximately due to partial symmetry, but the more the data deviate from the theoretical distribution, the less certain one can be about the result.
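A minimal sketch of such diagnostics on simulated data, using marginal KS tests and the empirical correlation (Mardia's test would need an additional package and is omitted):
import numpy as np
from scipy.stats import kstest, pearsonr

rng = np.random.default_rng(3)
n = 100_000

# Simulate (V, W) via the polar construction
theta = rng.uniform(-np.pi, np.pi, n)
r = np.sqrt(-2.0 * np.log(1.0 - rng.uniform(size=n)))
v = r * np.cos(theta)
w = r * np.sin(theta)

# Marginal normality checks against N(0, 1)
print("KS test for V:", kstest(v, 'norm'))
print("KS test for W:", kstest(w, 'norm'))

# Correlation should be close to 0 (zero correlation alone does not prove joint normality)
corr, pval = pearsonr(v, w)
print("Corr(V, W):", corr, " p-value:", pval)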
8) In what practical scenarios might one actually observe such (V, W) relationships?
While the original problem is mathematical, many physical and engineering systems effectively produce radial and angular measurements:
Signal and Noise in Two Dimensions: In wireless communication, (V, W) can represent the I/Q (in-phase and quadrature) components of a noisy signal. The radial component R is the amplitude, and Θ is the phase. If noise is purely Gaussian and isotropic, then V and W are i.i.d. normal(0, σ²).
Mechanical Vibration: The displacement in orthogonal directions of a randomly vibrating system can yield (V, W). The radial amplitude R might have a Rayleigh distribution if the underlying process is isotropic in the plane.
Random Directions in 2D: Any phenomenon where direction is uniformly distributed and magnitude is Rayleigh-distributed is a candidate for (V, W) ~ N(0,1) i.i.d.
A subtlety arises if the real system is not truly isotropic. For example, obstacles or directional biases might distort the distribution. Confirming isotropy is an important step in validating that the standard results (like E(V|W = w) = 0) apply.
9) Could we extend these ideas to higher dimensions, say (X, Y, Z) from a radial-and-angular perspective?
Yes. In higher dimensions, one might define random variables:
X = R cos(Φ1)
Y = R sin(Φ1) cos(Φ2)
Z = R sin(Φ1) sin(Φ2)
(Or any appropriate spherical coordinate representation.) For the isotropic case, R follows a chi distribution with 3 degrees of freedom, Φ2 is uniform on (-π, π), and the polar angle Φ1 has density proportional to sin(Φ1) on (0, π) rather than being uniform. With these choices and the usual independence assumptions, (X, Y, Z) form a 3D isotropic Gaussian, so X, Y, Z are i.i.d. normal(0,1). From that vantage point, if we condition on, say, Y, one expects E(X | Y = y) = 0 by symmetry.
Potential pitfalls in real data include incorrectly assuming independence or uniform distributions over the angles. Additionally, in higher dimensions, identifying structural biases is more challenging. Nonetheless, the fundamental rotational-symmetry arguments generalize well to N dimensions.
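A minimal simulation sketch of the 3D analogue: draw the radius as a chi-distributed variable with 3 degrees of freedom and the direction uniformly on the unit sphere (choices which together reproduce an isotropic 3D Gaussian), then check the marginals and a conditional mean:
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Radius: chi distribution with 3 degrees of freedom (sqrt of a chi-square(3) draw)
r = np.sqrt(rng.chisquare(3, n))

# Uniform direction on the unit sphere (normalize an isotropic Gaussian vector)
g = rng.standard_normal((n, 3))
direction = g / np.linalg.norm(g, axis=1, keepdims=True)

x, y, z = (r[:, None] * direction).T

# Each coordinate should be close to N(0, 1), and E(X | Y ≈ y) close to 0
print("std of X, Y, Z:", x.std(), y.std(), z.std())
mask = np.abs(y - 0.8) < 0.02
print("E(X | Y ≈ 0.8) ≈", x[mask].mean())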
10) How might one handle outliers in observed V and W values, and do they affect the conclusion E(V | W = w) = 0?
Outliers in real data often emerge due to sensor glitches, momentary spikes, or unmodeled phenomena. While the theoretical result E(V | W = w) = 0 is robust in the absence of such anomalies, outliers can distort estimates:
Robust Estimation: A few extremely large values of (V, W) can shift empirical estimates if a naive method is used. Employing robust measures of location (like trimmed means) or applying robust regression could help. A small illustration appears at the end of this answer.
Check Data Integrity: If outliers consistently occur in certain regions (e.g., large positive V with moderate W), it may suggest a violation of the underlying assumption about uniform angles or the distribution of R. One might need to model that phenomenon explicitly, e.g., a mixture distribution with a separate “spike” component.
Hence, from an applied standpoint, we should either remove or model outliers carefully before concluding E(V | W = w) = 0. In many practical settings, moderate outliers in otherwise normally distributed data may not drastically alter the mean estimates, but extremely heavy tails can cause major distortions.
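As a small illustration of the robust-estimation point above, compare the ordinary mean of V with a trimmed mean when a handful of large outliers contaminate the sample (the contamination scheme below is an arbitrary choice):
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(5)
v = rng.standard_normal(100_000)

# Inject a small cluster of large positive outliers
v_contaminated = np.concatenate([v, np.full(200, 50.0)])

print("plain mean:       ", v_contaminated.mean())
print("10% trimmed mean: ", trim_mean(v_contaminated, 0.10))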