ML Interview Q Series: If your cost function involves a Monte Carlo approximation (e.g., in variational inference or policy gradients), how do you reduce variance in the gradient estimates?
📚 Browse the full ML Interview series here.
Hint: Techniques like baselines, control variates, or low-variance reparameterization
Comprehensive Explanation
When we use Monte Carlo methods in estimating gradients, the core challenge is often dealing with high-variance estimates. Monte Carlo approximations rely on sampling from a (potentially complex) distribution, and especially when our gradient depends on random variables, the variability in those samples can inflate the variance of the gradient estimate. High variance, in turn, slows or destabilizes learning. Several standard strategies exist to mitigate this:
Baselines and control variates are two conceptually related ideas that reduce variance by subtracting carefully chosen correction terms whose expected contribution to the gradient is zero. The reparameterization trick changes how the randomness enters the computation so that the gradient operator can pass inside the expectation more directly. Each approach aims to reduce variance without introducing much (or any) bias.
Monte Carlo Gradient Estimation
A commonly encountered Monte Carlo gradient estimation scenario arises in policy gradient methods (REINFORCE) and in variational inference. For instance, in policy gradient methods, you might see an update rule of the form:
gradient_theta J(theta) = E_{x ~ p_theta(x)}[ R(x) * gradient_theta log p_theta(x) ]
Here,
J(theta) is the objective function (for example, expected reward in reinforcement learning or the negative of the cost in a variational inference context).
p_theta(x) is a distribution parameterized by theta (often describing trajectories, actions, or latent variables).
R(x) is a reward (or sometimes a cost/negative log-likelihood) function that depends on the sampled variable x.
The expectation is often approximated by sampling x from p_theta(x). Direct Monte Carlo estimation of this gradient can lead to high variance, especially when R(x) is large or the distribution p_theta is wide.
Baselines (Control Variates)
One way to reduce variance is to subtract a baseline term b that does not depend on the sample x (though it may depend on the current parameters theta). The idea is that:
R(x) - b
has the same expected contribution to the gradient as R(x) alone, but can have much lower variance if b is a good approximation of the typical value of R(x). A commonly chosen baseline is a learned value function in reinforcement learning, or simply the empirical average of the sampled rewards.
In policy gradient terms, you might incorporate a baseline as:
R(x) - b(s)
where b(s) may depend on the state s but not on the sampled action. One subtlety is that if the baseline depends on the action actually sampled, it can break the gradient's unbiased property. In many reinforcement learning approaches, b(s) is the "value function," estimated by a separate function approximator and used in a manner that preserves the unbiased gradient estimate.
Why Baselines Help
By subtracting a quantity that is close to R(x), the difference R(x) - b has much smaller magnitude, and hence smaller variance, than R(x) alone. The expected gradient is unchanged because the baseline enters only through b * gradient_theta log p_theta(x), and E[gradient_theta log p_theta(x)] = 0 for any b that does not depend on the sample; the variance, however, can be significantly reduced.
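To see the effect numerically, here is a tiny PyTorch sketch (a toy Gaussian example with illustrative names, not tied to any particular environment) that estimates the gradient of E[R(x)] for x ~ Normal(mu, 1) with R(x) = x^2 using the score-function estimator, with and without a constant mean baseline:
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5)
n = 100_000

x = mu + torch.randn(n)                      # samples from Normal(mu, 1)
reward = x ** 2                              # R(x) = x^2; the true gradient d/dmu E[R] is 2*mu
score = x - mu                               # d/dmu log Normal(x; mu, 1), the score function

plain = reward * score                       # vanilla score-function estimate, per sample
baseline = reward.mean()                     # constant baseline approximating E[R(x)]
centered = (reward - baseline) * score       # baseline-corrected estimate, per sample

print("true gradient:", (2 * mu).item())
print("no baseline  : mean %.2f, var %.2f" % (plain.mean().item(), plain.var().item()))
print("with baseline: mean %.2f, var %.2f" % (centered.mean().item(), centered.var().item()))
Both per-sample estimators have the same mean (the true gradient), but the baseline-corrected one shows markedly lower empirical variance.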
Control Variates
Control variates generalize the baseline idea. A control variate is a function f(x) for which you already know the expected value. In practice, the difference:
R(x) - alpha * (f(x) - E[f(x)])
has a smaller variance than R(x) alone if alpha is chosen appropriately (for instance, via the closed-form alpha = Cov(R, f) / Var(f), or by regression). In reinforcement learning, the baseline is a special case of a control variate: the subtracted term is b * gradient_theta log p_theta(x), and the score function gradient_theta log p_theta(x) has expectation zero. For more complex distributions and scenarios, you might design or learn more sophisticated control variates.
Reparameterization Trick
Another major technique to reduce variance is the reparameterization trick, often used in variational autoencoders or other differentiable latent-variable models. Instead of sampling x directly from p_theta(x), you rewrite x as a deterministic function of theta and an independent noise epsilon. A common example in a Gaussian setting is:
x = mu(theta) + sigma(theta) * epsilon
where epsilon is sampled from a standard Normal(0, 1). This reparameterization makes the random variability come only from epsilon, which is independent of theta. The function mu(theta) + sigma(theta)*epsilon is differentiable w.r.t. theta, so you can move the gradient inside this transformation. This typically yields significantly lower variance than a score-function-based approach (like REINFORCE) because the gradient w.r.t. theta is more direct.
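As a minimal sketch (the names and the toy objective are illustrative assumptions), the PyTorch code below draws reparameterized Gaussian samples and backpropagates straight through the sampling path; torch.distributions exposes the same idea via rsample():
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)    # parameterize sigma > 0 as exp(log_sigma)

def f(x):
    return (x - 2.0) ** 2                            # arbitrary differentiable objective

eps = torch.randn(1024)                              # noise, independent of the parameters
x = mu + torch.exp(log_sigma) * eps                  # reparameterized samples
loss = f(x).mean()                                   # Monte Carlo estimate of E[f(x)]
loss.backward()                                      # gradients flow through the sampling path

print(mu.grad, log_sigma.grad)

# The same sampling via torch.distributions:
# x = torch.distributions.Normal(mu, torch.exp(log_sigma)).rsample((1024,))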
Practical Considerations
In practice, we often combine multiple variance reduction techniques. For example, in a variational inference setting with a Gaussian variational distribution, we use the reparameterization trick to get a low-variance gradient estimate of the ELBO (Evidence Lower BOund). Meanwhile, in policy gradient methods, we typically incorporate baselines (value function approximators) and possibly other advanced tricks like Generalized Advantage Estimation (GAE) to further reduce variance in the gradient.
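For the variational inference side, a minimal reparameterized ELBO gradient for a one-dimensional Gaussian posterior might look like the sketch below (the decoder log-likelihood is a stand-in and the names are illustrative); the policy gradient side is illustrated in the snippet that follows.
import torch

# Variational parameters of q(z | x) = Normal(mu, sigma^2) for one data point (illustrative)
mu = torch.tensor(0.3, requires_grad=True)
log_var = torch.tensor(-1.0, requires_grad=True)

def log_p_x_given_z(z):
    # Stand-in for a decoder log-likelihood; any differentiable function of z would do
    return -0.5 * (1.5 - z) ** 2

eps = torch.randn(64)
z = mu + torch.exp(0.5 * log_var) * eps                    # reparameterized samples from q(z | x)

recon = log_p_x_given_z(z).mean()                          # Monte Carlo reconstruction term
kl = -0.5 * (1.0 + log_var - mu ** 2 - log_var.exp())      # analytic KL(q || Normal(0, 1))
elbo = recon - kl

(-elbo).backward()                                         # maximize the ELBO
print(mu.grad, log_var.grad)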
Below is a small conceptual PyTorch snippet illustrating a policy gradient step with a simple baseline:
import torch
import torch.nn as nn
import torch.optim as optim
# Suppose we have a policy network that outputs action probabilities
class PolicyNetwork(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(PolicyNetwork, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = torch.relu(self.fc1(x))
# Output a logit for each possible action
return self.fc2(x)
policy_net = PolicyNetwork(input_dim=4, hidden_dim=128, output_dim=2)
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)
# Some baseline network (e.g., value function approximator)
class BaselineNetwork(nn.Module):
def __init__(self, input_dim, hidden_dim):
super(BaselineNetwork, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, 1)
def forward(self, x):
x = torch.relu(self.fc1(x))
return self.fc2(x)
baseline_net = BaselineNetwork(input_dim=4, hidden_dim=128)
baseline_optimizer = optim.Adam(baseline_net.parameters(), lr=1e-3)
# We gather some transitions from the environment (dummy example):
observations = torch.randn((10,4))
actions = torch.randint(0,2,(10,))
rewards = torch.randn(10)
# Compute policy loss with a baseline
log_probs = []
values = []
for obs, action in zip(observations, actions):
logits = policy_net(obs)
dist = torch.distributions.Categorical(logits=logits)
log_probs.append(dist.log_prob(action))
val = baseline_net(obs)
values.append(val)
log_probs = torch.stack(log_probs)
values = torch.cat(values).squeeze()
rewards = rewards.detach()  # rewards come from the environment and carry no gradient; detach is a harmless safeguard
# Subtract baseline (value) from rewards to reduce variance
advantages = rewards - values
policy_loss = -torch.mean(log_probs * advantages.detach())
# Value function (baseline) loss
value_loss = torch.mean((values - rewards)**2)
optimizer.zero_grad()
policy_loss.backward()
optimizer.step()
baseline_optimizer.zero_grad()
value_loss.backward()
baseline_optimizer.step()
The above code implements a typical policy update with a learned baseline (value function). By subtracting values from rewards, we reduce the variance of the policy gradient estimate.
How Baselines or Control Variates Impact the Bias
One critical consideration is ensuring the approach remains unbiased. If a baseline or control variate depends on parameters in a way that breaks the independence needed for the gradient estimate, you can inadvertently introduce bias. Typically, a baseline is chosen or learned with a separate objective, or chosen so that its gradient does not interfere with the policy gradient. This keeps the overall gradient estimate unbiased or at least introduces only minimal bias that can be managed.
Follow-up Questions
What is the difference between a baseline and a control variate?
A baseline in reinforcement learning is a special case of a control variate. Control variates are any functions whose expected value is known (or can be computed or approximated reliably); you subtract the control variate from your quantity of interest and add its known expectation back, which leaves the mean unchanged while reducing variance when the two are correlated. A baseline is the simplest, most widely used choice in policy gradients: it exploits the fact that the score function gradient_theta log p_theta(x) has zero expectation. In advanced scenarios, more sophisticated or problem-specific control variates can be used to further reduce variance.
How does the reparameterization trick reduce variance in practice?
When you use the reparameterization trick, you write your random variable x as a deterministic function of (theta, epsilon), where epsilon is an external noise source with a simple distribution. This lets you compute gradients by directly differentiating through the function that generates x. By contrast, if you simply treat x as coming from p_theta(x), you end up with gradient estimators that rely on log p_theta(x) and can lead to high variance (as in score-function estimators). The reparameterized gradient usually has far lower variance because small changes in theta cause smooth changes in the output, whereas the log-probability approach often yields high-variance estimates.
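A quick way to convince yourself is to compute both per-sample gradient estimates for the same toy objective and compare their spread (everything here is an illustrative example, not a general benchmark):
import torch

torch.manual_seed(0)
mu = torch.tensor(0.5)
x = mu + torch.randn(100_000)                  # x ~ Normal(mu, 1)

f = (x - 2.0) ** 2                             # objective; true gradient d/dmu E[f] = 2 * (mu - 2)

score_est = f * (x - mu)                       # score-function (REINFORCE) per-sample gradient
pathwise_est = 2.0 * (x - 2.0)                 # reparameterized per-sample gradient: df/dx * dx/dmu

print("true gradient  :", (2.0 * (mu - 2.0)).item())
print("score-function : mean %.2f, var %.2f" % (score_est.mean().item(), score_est.var().item()))
print("reparameterized: mean %.2f, var %.2f" % (pathwise_est.mean().item(), pathwise_est.var().item()))
Both estimators agree in expectation, but the reparameterized (pathwise) one has a far smaller variance on this toy problem.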
What are some pitfalls or limitations with variance reduction methods?
One pitfall is the added complexity of learning or engineering a good control variate or baseline. If the baseline is not well-chosen or is poorly fit, it may not reduce variance much or can even introduce bias if not handled properly. In the case of reparameterization, certain distributions (e.g., discrete distributions) are not trivially reparameterizable; special techniques or relaxations (like the Concrete/Gumbel-Softmax trick) may be required. Additionally, while variance is lowered, you should ensure that the computational overhead of your variance reduction technique does not outweigh the benefits. This can happen in large models with complicated baselines or advanced control variate designs.
How do you decide which method to use in practice?
It depends on the type of problem and distribution:
If you have continuous latent variables where a natural reparameterization is possible (such as Gaussian distributions), the reparameterization trick is often preferred because of its lower variance and strong empirical performance.
If your task is a reinforcement learning problem with discrete actions, you typically use policy gradient estimators plus baselines or advantage estimators.
For complex distributions or cases where reparameterization is not straightforward, control variates or more advanced baseline methods might be the best approach.
Sometimes you can combine these methods (e.g., use a reparameterization trick in part of the model and also use a control variate for another part).
In all cases, you monitor the variance of the gradient and the learning stability to see if the chosen variance reduction method is helping. If training remains unstable or the gradient estimates still show high variance, you may need a different or more elaborate approach.
Below are additional follow-up questions
What is the difference between variance and bias in Monte Carlo gradient estimates, and how do we address each type of error?
Variance reflects how much the gradient estimates fluctuate around their true mean value. When the variance is high, you might see large swings in gradient updates from one batch of samples to another. Bias, on the other hand, is a systematic offset where your gradient estimates deviate from the true gradient in a consistent (but incorrect) direction. High variance causes training instability or slow convergence because of the noisy signal, whereas high bias steers the model’s parameter updates away from the true optimum.
Addressing variance typically involves techniques like control variates, baselines, and reparameterization, each of which attempts to tighten the distribution of the gradient estimates around the true mean without shifting the mean itself. Addressing bias usually means ensuring that the gradient estimator remains unbiased—or if there is some bias, that it’s sufficiently small or reducible over time. For instance, a baseline can remain unbiased as long as it does not depend on the specific random sample in a way that correlates with the gradient. In some scenarios, a small bias may be acceptable if it drastically reduces variance, leading to better practical convergence.
Potential Pitfalls:
Introducing a baseline that inadvertently depends on the sampled actions or latent variables in a way that breaks the independence assumption can create bias.
Some advanced variance reduction strategies may trade a small amount of bias for reduced variance. In these cases, one must ensure that the introduced bias is manageable or eventually vanishes.
In policy gradients, how can we incorporate variance reduction techniques if we do not have an explicit parametric baseline?
If there is no explicit parametric baseline (like a value function approximator), you can still apply simpler baseline strategies. A common non-parametric approach is to use the empirical average of rewards collected in a batch of rollouts as a constant baseline. Concretely, you compute the mean reward across all sampled episodes (or steps) in the batch, then subtract that mean from each sampled reward before multiplying by the log-prob terms. This reduces variance, especially when the average reward is a fair approximation of the expected reward; strictly speaking, a batch mean that includes the current sample introduces a tiny bias, and a leave-one-out mean (excluding each sample from its own baseline) keeps the estimator exactly unbiased.
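A minimal sketch of this batch-mean baseline (log_probs and returns are hypothetical stand-ins for quantities gathered from rollouts), including the leave-one-out variant:
import torch

# Hypothetical quantities gathered from a batch of rollouts
log_probs = torch.randn(32, requires_grad=True)    # stand-in for log pi_theta(a_t | s_t) per step
returns = torch.randn(32)                          # stand-in for Monte Carlo returns per step

baseline = returns.mean()                          # constant, non-parametric baseline
advantages = returns - baseline                    # centered returns

policy_loss = -(log_probs * advantages).mean()     # REINFORCE loss with a batch-mean baseline
policy_loss.backward()

# Leave-one-out variant (exactly unbiased): each sample's baseline excludes that sample
n = returns.numel()
loo_baseline = (returns.sum() - returns) / (n - 1)
loo_advantages = returns - loo_baseline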
Another possibility is to design a hand-crafted baseline that depends on domain-specific knowledge. For example, if you know that certain states have a typical reward pattern, you can define a baseline that approximates that expected reward. The key requirement is that the baseline should not be correlated with the random noise in the sample in a way that changes the expected gradient.
Potential Pitfalls:
Hand-crafted baselines based on heuristic knowledge might become outdated or incorrect as the policy evolves over time.
Using a purely constant baseline can help, but might not be as effective as a learned baseline, especially if the reward distribution is highly non-stationary.
Could a poorly chosen baseline cause overcompensation and hinder learning?
Yes, in practice it can. For any valid (action-independent) baseline the gradient remains unbiased in expectation, but a baseline that consistently overestimates or underestimates the expected return makes the advantage term (reward - baseline) systematically too negative or too positive for the finite samples you actually draw, which translates into noisier, less informative updates. Overcompensation occurs when the baseline is so large that the advantage is negative for almost every sample, heavily damping or even flipping the apparent learning signal; underestimation keeps the advantage inflated and exaggerates updates in certain directions.
Monitoring the variance of the advantage or reward minus baseline is one way to detect a problematic baseline. If the baseline’s error systematically grows, it can degrade performance. A typical safeguard is to allow the baseline to be learned in tandem with the policy, usually by optimizing a separate loss that forces the baseline to predict returns accurately, but to do so in a way that does not bias the policy updates.
Potential Pitfalls:
If the baseline is updated too slowly, it may lag behind the policy changes and become stale.
If the baseline is updated too aggressively, it might chase recent noisy returns rather than capturing stable long-term estimates.
What is the difference between REINFORCE and the reparameterization trick from a theoretical standpoint?
REINFORCE (the score-function estimator) computes the gradient of an expectation over random variables via the log-likelihood trick. In simple text form, the idea is: the gradient of E[R(x)] can be written as E[R(x) * gradient of log p(x)]. This can lead to high variance, since R(x) may have large magnitude and the gradient of log p(x) can also fluctuate significantly.
The reparameterization trick, however, rewrites the random variable x as a deterministic function of parameters and a source of noise. For instance, if x ~ Normal(mu, sigma^2), we write x = mu + sigma * epsilon, where epsilon ~ Normal(0, 1). That way, the gradient can be taken directly with respect to mu and sigma, passing inside the expectation operator. This method typically has lower variance because the dependence on parameters is smoother and does not involve a product with log p(x).
Potential Pitfalls:
REINFORCE can be applied to a wide range of distributions, including discrete ones, whereas direct reparameterization is straightforward primarily for continuous, differentiable distributions.
The reparameterization trick requires you to be able to rewrite the sampling procedure in a differentiable manner. For complex or discrete distributions, specialized reparameterizations (e.g., Gumbel-Softmax) or relaxing assumptions are needed.
How can we mitigate variance in extremely noisy reward signals, such as in sparse reward tasks?
Sparse reward tasks, where the agent only occasionally sees a non-zero reward, are notoriously prone to high variance because most rollouts yield near-zero informative signals. A few strategies include:
Using reward shaping: Provide intermediate or shaped rewards that guide the agent toward the sparse reward states. One must be careful not to introduce heavy bias, but moderate shaping can reduce gradient variance by giving the policy more frequent, informative updates.
Hindsight or “goal replay” strategies: If the environment allows, re-labelling trajectories to consider alternative “goals” can effectively transform sparse-reward tasks into more frequent reward signals.
Exploration techniques: Encouraging more diverse state visits (e.g., via intrinsic motivation or curiosity-driven reward) can reveal the sparse rewards more consistently, reducing variance by enabling more stable learning signals.
Potential Pitfalls:
Over-shaping the reward can lead to the agent optimizing for surrogate behaviors that do not align with the true objective.
Intrinsic motivation signals might dominate the main reward if not tuned, leading to unwanted exploration loops.
When does the control variate approach become computationally expensive or infeasible?
Control variates often require you to estimate or compute E[f(x)], where f(x) is the chosen control variate. If the control variate is complex, or if its expectation is itself difficult to approximate, the overhead might become prohibitive. Moreover, you often need to optimize the coefficient alpha that scales the control variate in order to achieve maximum variance reduction. If your model or dataset is large, performing these additional computations for each update can slow down training significantly.
Another challenge arises if the distribution you are sampling from shifts frequently. In that case, the control variate needs to be updated or re-trained constantly to remain a good approximation. This can create a moving-target problem where the overhead is too high compared to the advantage you gain from variance reduction.
Potential Pitfalls:
If the cost of computing the control variate or its expectation is higher than the cost saved by variance reduction, you end up with a net slowdown.
Overfitting the control variate to a particular region of the distribution might lead to suboptimal variance reduction when the policy or parameters move away from that region.
How do we apply variance reduction techniques in hierarchical or multi-level models where multiple latent variables are present?
In hierarchical or multi-level models (e.g., nested latent variables), you might apply the reparameterization trick to each level where it is feasible. For example, if you have a top-level latent variable z1 from which you sample z2, you can reparameterize both distributions. In practice, you do something like:
z1 = g1(theta, epsilon1)
z2 = g2(z1, theta, epsilon2)
You then differentiate through both sampling procedures. If certain variables are discrete, you might need discrete reparameterization (like Gumbel-Softmax) or rely on score-function estimators with control variates for those parts. You could also incorporate separate baselines for each layer to handle the incremental variance introduced at each level of sampling.
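A minimal two-level sketch in PyTorch (the simple Gaussian structure and all names are illustrative assumptions), showing gradients flowing through both sampling levels:
import torch

# Parameters of the two conditional distributions (illustrative scalars)
theta1 = torch.tensor([0.0, 0.0], requires_grad=True)    # [mu1, log_sigma1]
theta2 = torch.tensor([1.0, 0.0], requires_grad=True)    # [weight on z1, log_sigma2]

eps1 = torch.randn(256)
eps2 = torch.randn(256)

# Level 1: z1 = g1(theta, epsilon1)
z1 = theta1[0] + torch.exp(theta1[1]) * eps1
# Level 2: z2 = g2(z1, theta, epsilon2), conditioned on z1
z2 = theta2[0] * z1 + torch.exp(theta2[1]) * eps2

loss = ((z2 - 3.0) ** 2).mean()     # some downstream objective
loss.backward()                     # gradients flow through both sampling levels
print(theta1.grad, theta2.grad)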
Potential Pitfalls:
The complexity of multiple layers can cause compounding variance if not addressed carefully, making single-level variance reduction insufficient.
If some layers are discrete or only partially differentiable, mixing reparameterization (for continuous parts) with score-function estimators (for discrete parts) can be tricky in terms of implementation and debugging.
How do we properly tune the coefficient alpha in a control variate approach for maximum variance reduction?
The coefficient alpha is typically chosen to minimize the variance of R(x) - alpha * (f(x) - E[f(x)]). A common approach is to analytically solve for alpha = Cov(R(x), f(x)) / Var(f(x)) if those quantities can be estimated. In practice, you compute sample estimates of Cov(R, f) and Var(f) from your batch of samples, then pick alpha accordingly.
Another approach is to treat alpha as a learnable parameter and optimize it via gradient descent based on the variance of the resulting estimator. This is more flexible but potentially more unstable if not carefully regularized, because alpha might fluctuate significantly during training.
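A quick sketch of the closed-form choice using sample estimates, with R and f as arbitrary correlated stand-ins whose mean E[f(x)] happens to be known:
import torch

torch.manual_seed(0)
x = torch.randn(10_000)
R = x ** 2 + 0.1 * torch.randn(10_000)     # quantity of interest (noisy)
f = x ** 2                                 # control variate with known mean E[f(x)] = 1

cov_Rf = torch.mean((R - R.mean()) * (f - f.mean()))   # sample covariance
alpha = cov_Rf / f.var()                                # alpha* = Cov(R, f) / Var(f)

corrected = R - alpha * (f - 1.0)          # subtract the control variate, add back its known mean
print("variance before:", R.var().item(), "after:", corrected.var().item())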
Potential Pitfalls:
If you rely purely on sample-based estimates of Cov(R, f) in small batches, alpha might be noisy and cause large swings in variance.
Overfitting alpha to reduce variance on a specific batch might degrade generalization for future updates, especially if your distribution evolves over time.
When do specialized methods like Gumbel-Softmax become relevant, and are they a form of reparameterization?
Gumbel-Softmax becomes relevant when dealing with categorical discrete variables that you want to reparameterize. It provides a continuous relaxation of discrete samples by using a softmax transformation of Gumbel noise. Mathematically, it approximates a one-hot vector with a differentiable function, enabling backpropagation through discrete choices.
Yes, Gumbel-Softmax is indeed a form of reparameterization specifically tailored for categorical distributions. It transforms Uniform(0,1) random noise into Gumbel(0,1) noise, adds it to the logits, and applies a softmax with a temperature parameter that controls how close the samples are to discrete one-hot vectors. As the temperature approaches zero, the samples become more nearly categorical, but the gradients become increasingly high-variance and can effectively vanish or explode.
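PyTorch exposes this relaxation directly through torch.nn.functional.gumbel_softmax; a minimal example (the logits, temperature, and downstream objective are arbitrary here):
import torch
import torch.nn.functional as F

logits = torch.randn(5, requires_grad=True)     # unnormalized log-probabilities over 5 categories

# Soft, differentiable sample; lower tau pushes it closer to one-hot but makes gradients noisier
soft_sample = F.gumbel_softmax(logits, tau=0.5, hard=False)

# Straight-through variant (shown for comparison): forward pass is one-hot, backward uses the relaxation
hard_sample = F.gumbel_softmax(logits, tau=0.5, hard=True)

loss = (soft_sample * torch.arange(5.0)).sum()  # any downstream differentiable objective
loss.backward()                                 # gradients flow back into the logits
print(logits.grad)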
Potential Pitfalls:
Using too low a temperature too soon can make the model updates very unstable.
This continuous relaxation is still an approximation to true discrete sampling. If the approximation is too crude, you might get suboptimal solutions or mismatch between training and evaluation (where you might do fully discrete sampling at test time).
If the distribution we sample from is discrete and large, is it always feasible to reparameterize? Or do we need alternative strategies?
For large discrete distributions (e.g., large vocabularies in language modeling), direct reparameterization is often infeasible. The standard Gumbel-Softmax reparameterization trick might become prohibitively expensive, especially if you need to sample from thousands or millions of possible outcomes each time. Additionally, the approximation can become coarse, and the memory and computation overhead can be significant.
In such cases, alternative strategies include:
Using approximate sampling techniques (e.g., importance sampling, skip-gram-like negative sampling).
Employing specialized gradient estimators like REINFORCE with variance reduction or discrete control variates that do not require enumerating the entire space.
Using hierarchical softmax or other structured decomposition to reduce the effective dimensionality of the problem, thereby making partial reparameterization or approximate discrete sampling more tractable.
Potential Pitfalls:
Large discrete spaces exacerbate the variance problem. Even with advanced estimators, you might still encounter slow convergence.
Implementing advanced sampling or hierarchical decompositions can be complex and error-prone, requiring careful debugging.