ML Interview Q Series: Explain the role of regularization terms in the cost function. How does adding L1 vs. L2 regularization affect the shape and optimization landscape of the cost function?
Hint: Think about how each regularization term penalizes parameters differently and impacts sparsity vs. smooth decay.
Comprehensive Explanation
Regularization terms are used to discourage overly large parameter values or overly complex models. They modify the cost function by adding a penalty for the magnitude of the parameters, thereby helping models generalize better and avoid overfitting. When we talk about L1 vs. L2 regularization, the key difference is how the penalty term is calculated and how it impacts the geometry of the optimization problem.
L1 Regularization (Lasso)
L1 regularization involves adding a penalty proportional to the absolute value of each parameter. A simplified cost function with L1 regularization (assuming a mean squared error term for illustration) can be written as:

J(theta) = (1/N) * sum_i (y_i - yhat_i)^2 + lambda * sum_j |theta_j|

where lambda > 0 controls the regularization strength. Because the penalty grows linearly with each |theta_j|, it can drive individual coefficients exactly to zero, which is the source of L1's sparsity-inducing behavior.

L2 Regularization (Ridge)

L2 regularization instead penalizes the squared magnitude of each parameter:

J(theta) = (1/N) * sum_i (y_i - yhat_i)^2 + lambda * sum_j theta_j^2

Below is how it differs from L1:
• Geometry of the penalty: In parameter space, the squared norm creates circular (or spherical in higher dimensions) contours. The circular shape lacks sharp corners, so it does not encourage zero coefficients as strongly as L1.
• Smooth shrinkage: All coefficients are smoothly reduced in magnitude (they “shrink” towards zero but not exactly to zero). This helps avoid overfitting by distributing the impact of regularization across all parameters rather than completely eliminating some.
• Differentiability: The squared term is differentiable everywhere, so gradient-based methods are straightforward to apply.
• Multi-collinearity handling: In contexts like linear regression, L2 regularization is known to be effective at handling correlated features by distributing weights among them.
How L1 vs. L2 Affects the Shape and Optimization Landscape
• L1 (diamond-shaped contours): The gradient direction can change abruptly where a parameter crosses zero. This geometry promotes sparsity, but can be more challenging to optimize when parameters hover around zero.
• L2 (circular contours): Encourages a smoother decrease in parameter magnitudes. It rarely drives parameters exactly to zero, leading to more uniform shrinkage.
Practical Implementation in Python
Below is a short snippet illustrating how you might implement L1 or L2 regularization in a simple PyTorch model context. Note that in practice, most frameworks offer built-in regularization options, but this shows a manual approach:
import torch
import torch.nn as nn
import torch.optim as optim

# Simple linear model
class SimpleModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)

# Instantiate model
model = SimpleModel(input_dim=10, output_dim=1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Example training loop with custom L1 or L2 penalty
lmbd = 0.001  # regularization strength
for epoch in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # L2 penalty: sum of squared parameters
    # (out-of-place addition keeps the accumulator in the autograd graph)
    l2_penalty = torch.tensor(0.0)
    for param in model.parameters():
        l2_penalty = l2_penalty + torch.sum(param ** 2)

    # L1 penalty: sum of absolute parameter values
    l1_penalty = torch.tensor(0.0)
    for param in model.parameters():
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

    # Choose either L1 or L2, not both (for demonstration)
    # total_loss = loss + lmbd * l2_penalty  # L2
    total_loss = loss + lmbd * l1_penalty  # L1

    total_loss.backward()
    optimizer.step()
The above snippet shows a conceptual illustration of how to manually add each penalty term. In real-world projects, you would pick the appropriate penalty based on your needs.
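For L2 in particular, PyTorch optimizers already expose the penalty through the weight_decay argument, so the manual L2 loop above can be replaced by a single line. A minimal sketch (the value 0.001 mirrors lmbd and is illustrative; for plain SGD without momentum, weight decay is equivalent to adding (weight_decay / 2) * sum(param**2) to the loss):

# Built-in L2-style regularization: weight_decay adds weight_decay * param
# to each parameter's gradient inside the update step
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)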
Why Does L1 Induce Sparsity?
L1 regularization effectively introduces a constraint region shaped like a diamond (the L1 ball). The corners of this diamond lie on the coordinate axes, and optimal solutions tend to land on those corners, which means some parameters are exactly zero. Moreover, the penalty’s derivative has constant magnitude (except at zero), so a weight approaching zero keeps feeling the same pull no matter how small it gets, and it tends to settle exactly at zero rather than merely shrinking asymptotically.
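A quick way to see this sparsity in action is to fit Lasso (L1) and Ridge (L2) on the same data and count exact zeros. A minimal sketch assuming scikit-learn is available; the synthetic data and alpha values are illustrative:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 20 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none

Lasso prunes most of the irrelevant features exactly to zero, while Ridge keeps every coefficient small but non-zero.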
What About the Bayesian Perspective?
In a Bayesian viewpoint, L1 regularization corresponds to a Laplace prior on the parameters, while L2 corresponds to a Gaussian prior. The Laplace prior (L1) places more mass near zero, leading to a higher likelihood of parameters being zero. The Gaussian prior (L2) instead favors smaller but non-zero coefficients.
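This correspondence is easy to check numerically: up to an additive constant, the negative log-density of a Laplace prior is an L1 penalty and that of a Gaussian prior is an L2 penalty. A small sketch assuming SciPy is available (the scale values are illustrative):

import numpy as np
from scipy.stats import laplace, norm

theta = np.linspace(-3, 3, 7)

# Negative log Laplace density: |theta|/b + const -> L1 penalty with lambda = 1/b
neg_log_laplace = -laplace.logpdf(theta, scale=1.0)
# Negative log Gaussian density: theta^2/(2*sigma^2) + const -> L2 penalty
neg_log_gauss = -norm.logpdf(theta, scale=1.0)

print(neg_log_laplace - neg_log_laplace.min())  # equals |theta|
print(neg_log_gauss - neg_log_gauss.min())      # equals theta^2 / 2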
How Do We Choose L1 vs. L2 in Practice?
Choice depends on your objectives and data:
• L1 is particularly useful when you want to perform automatic feature selection or you suspect that only a few features are truly relevant.
• L2 tends to work better in situations where you expect many small, correlated effects that all contribute to the outcome.
Is It Possible to Combine L1 and L2?
Yes. Elastic Net is a technique that combines L1 and L2 regularization, benefiting from both sparsity (via L1) and stable coefficient shrinkage (via L2). It can be useful if pure L1 or pure L2 does not yield an optimal solution for your particular dataset.
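For illustration, a minimal scikit-learn sketch (the alpha and l1_ratio values are hypothetical starting points, not recommendations):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=200)

# l1_ratio interpolates between pure L2 (0.0) and pure L1 (1.0);
# alpha scales the overall penalty strength
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.sum(enet.coef_ != 0))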
Follow-Up Questions
Could You Explain the Optimization Landscape Differences Between L1 and L2 in More Detail?
For L1, the absolute value penalty creates “kinks” (non-differentiable points) at zero, so when performing gradient-based optimization, the gradient changes abruptly if a coefficient tries to pass through zero. This can cause coefficients to become stuck at zero. For L2, the penalty is differentiable everywhere, making the cost function smooth and ensuring gradients vary gradually with changes in the parameters.
What Happens If We Use Very Large Lambda?
With very large lambda, the regularization term dominates:
• In L1, parameters get forced to zero en masse. You might end up with an overly sparse model that discards important features.
• In L2, all parameters become very small and close to zero, potentially oversimplifying the model.
Excessively large lambda can harm the model’s ability to capture the underlying data relationships.
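The effect is easy to observe empirically. In the sketch below (scikit-learn calls the regularization strength alpha; the data and grid are illustrative), the count of surviving coefficients collapses as alpha grows:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 + 0.1 * rng.normal(size=200)

for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: {np.sum(model.coef_ != 0)} non-zero coefficients")

At the largest alpha, even the genuinely predictive feature is driven to zero, which is exactly the underfitting failure mode described above.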
Why Do We Often See L1 Used in High-Dimensional Settings?
In high-dimensional problems, L1 helps by zeroing out irrelevant features, effectively performing feature selection. When you have thousands or millions of features (like in text analysis with bag-of-words), L1 is extremely useful because it naturally and automatically prunes out many useless features, improving interpretability and sometimes performance.
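As a toy illustration, here is L1-penalized logistic regression on a tiny bag-of-words problem. The corpus and C value are made up for demonstration (in scikit-learn, C is the inverse of lambda, and the liblinear solver supports penalty="l1"):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap now", "agenda for the meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (hypothetical labels)

X = CountVectorizer().fit_transform(docs)  # sparse bag-of-words matrix
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, labels)
print("non-zero weights:", np.sum(clf.coef_ != 0), "of", clf.coef_.size)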
How Does One Decide Which Regularization Method to Use?
Ultimately, it often comes down to experimentation and prior knowledge:
• If you suspect that only a handful of features drive the outcome, L1 is a good choice.
• If you suspect most features have some small contribution, L2 is often better.
• If you want a middle ground, Elastic Net is worth considering.
• Cross-validation is typically used to tune lambda (and the mixing parameter for Elastic Net); a sketch of this follows below.
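For the cross-validation step, scikit-learn provides estimators such as LassoCV that tune the strength automatically. A minimal sketch with illustrative data and an illustrative grid:

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=200)

# Search a log-spaced grid of alpha (lambda) values with 5-fold cross-validation
model = LassoCV(alphas=np.logspace(-4, 1, 30), cv=5).fit(X, y)
print("best alpha:", model.alpha_)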
These considerations highlight how the shape of the regularization penalty—whether forming a diamond (L1) or a sphere (L2)—impacts model behavior and optimization in fundamental ways.
Below are additional follow-up questions
Can L1 or L2 Regularization Alleviate Overfitting When Dealing With Outliers?
Outliers can strongly affect a model's parameters, especially in linear regressions. L2 regularization (Ridge) shrinks parameters but might still be pulled by a large outlier. L1 regularization (Lasso), however, can soften the influence of outliers by driving certain coefficients exactly to zero, but if an outlier exerts a strong effect on a single coefficient, the penalty might not be sufficient to completely negate it. A key subtlety is that while both methods reduce parameter magnitudes, they do not explicitly remove the impact of outliers on the loss. One might need robust loss functions (like Huber loss) in addition to or instead of standard regularization. Another edge case is when the dataset has multiple severe outliers in different dimensions: L1 may zero out some dimensions but leave others vulnerable, while L2 may shrink all dimensions but not necessarily mitigate an extremely large outlier. Thus, although L1 and L2 help control general overfitting by penalizing large coefficients, neither is a guaranteed solution to outlier-driven problems.
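As a sketch of the robust-loss route mentioned above, PyTorch's nn.HuberLoss can replace MSE while an L2-style penalty is applied via weight_decay. The model and hyperparameter values here are illustrative:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # stand-in model for illustration

# Huber loss is quadratic for small errors and linear for large ones,
# so a single extreme outlier contributes far less than under plain MSE
criterion = nn.HuberLoss(delta=1.0)

# weight_decay adds an L2-style penalty on top of the robust loss
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)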
How Do L1 and L2 Regularization Interact With Normalization or Standardization of Features?
Feature scaling practices significantly influence the behavior of both L1 and L2. If features are not on comparable scales, L2 might push one feature’s coefficient to shrink more (simply because that feature has a larger numerical range). Similarly, with L1, if one feature is orders of magnitude larger than another, it may get penalized differently, affecting which coefficients move to zero. When features are standardized or normalized, the regularization term influences all coefficients more uniformly. A pitfall arises if some features are not scaled properly, in which case either penalty may disproportionately affect certain dimensions, leading to suboptimal generalization. Proper preprocessing is essential to ensure the penalty is applied evenly and the model does not unfairly favor certain dimensions.
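One common way to guard against this in scikit-learn is to standardize inside a pipeline, so the penalty sees comparably scaled features and no test-set statistics leak into preprocessing. A minimal sketch with synthetic, illustrative data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] *= 1000.0  # one feature on a much larger scale
y = X[:, 0] / 1000.0 + X[:, 1] + 0.1 * rng.normal(size=200)

# StandardScaler runs before Lasso, so both informative features
# receive comparable coefficients despite the raw scale mismatch
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print(model.named_steps["lasso"].coef_)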
How Do Partial Derivatives Differ for L1 vs. L2 Regularization and Why Does This Matter?
For L2, the partial derivative of the penalty term with respect to a parameter theta_j is 2 * lambda * theta_j. This is a smooth, continuous function that is easy to handle in gradient-based optimizers. In contrast, the (sub)derivative of L1's absolute value term is lambda * sign(theta_j), which is discontinuous at theta_j = 0. This discontinuity matters because it creates a “sharp corner” in the optimization landscape: when theta_j crosses zero, the sign flips abruptly, which calls for subgradients or specialized optimization techniques. In practice, frameworks often implement coordinate descent or proximal gradient methods for L1 to handle these corners gracefully. A key pitfall is that naive gradient methods may not handle the discontinuity well, leading to numerical instability or failure to converge exactly to zero for some parameters.
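A standard way to handle the kink is the soft-thresholding operator, the proximal operator of the L1 penalty used by proximal gradient methods such as ISTA. A minimal NumPy sketch (the threshold value is illustrative):

import numpy as np

def soft_threshold(v, lmbd):
    # Proximal operator of lmbd * |.|: shrink each entry toward zero,
    # and set it exactly to zero once its magnitude drops below lmbd
    return np.sign(v) * np.maximum(np.abs(v) - lmbd, 0.0)

v = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft_threshold(v, 0.5))  # small entries land exactly at zero

This is why proximal methods recover exact zeros, whereas a plain gradient step on the L2 penalty (theta -= lr * 2 * lambda * theta) only shrinks parameters multiplicatively and never reaches zero.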
Are There Practical Scenarios Where L1 or L2 Regularization Fails to Prevent Overfitting?
Yes. If the data has complex relationships or very high variance with inadequate coverage, simple linear or logistic models (even regularized) can still overfit or underfit. In neural networks, merely adding L1 or L2 might not be enough to fully control overfitting if the network is extremely deep or over-parameterized. Additional techniques such as dropout, data augmentation, or early stopping might be required. In cases with extremely noisy data, both L1 and L2 can be overwhelmed if lambda is not well-tuned; too small a lambda fails to mitigate overfitting, while too large a lambda can cause underfitting. Thus, hyperparameter tuning, model selection, and architectural choices often need to complement regularization methods to ensure proper generalization.
How Does Regularization Impact Interpretability in Large Neural Networks?
Neural networks inherently have complex, layered representations. L1 can induce sparsity in the weight matrices, potentially simplifying the network’s effective complexity, but the underlying learned representations can still be quite abstract and not easily interpretable in a human sense. L2 simply shrinks the weights, so it does not necessarily enforce zero weights but does keep overall magnitudes smaller. Consequently, L2 alone does little to aid direct interpretability. A subtle edge case arises in tasks where certain layers or neurons can become “dead” if L1 pushes their parameters effectively to zero, diminishing that neuron’s influence. This can simplify model structure but might also degrade performance if the network architecture heavily relied on that neuron’s activity. Therefore, while L1 can help produce some level of interpretability by zeroing out parameters, deeper or more convolutional architectures can still remain opaque overall.
Can Combining Dropout and L2 Regularization Cause Any Issues?
Dropout randomly sets neuron activations to zero during training, reducing co-adaptations. L2 shrinks weights. When both are used aggressively, the effective training signal can become weak. The network might learn overly conservative weights, slowing down convergence or underfitting. A subtle pitfall is if you apply extremely high dropout (e.g., dropout rate close to 0.8) and a large L2 penalty: the model might not converge to a sufficiently rich representation. Nonetheless, combining dropout and moderate L2 typically works well in many real-world neural networks. Careful hyperparameter tuning and monitoring validation loss are needed to ensure you are not starving the network of the capacity to learn meaningful patterns.
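A minimal PyTorch sketch of the combination discussed above, with a moderate dropout rate and mild weight decay; the architecture and values are illustrative and should be tuned against validation loss:

import torch.nn as nn
import torch.optim as optim

# Dropout between layers reduces co-adaptation; weight_decay applies
# a mild L2-style penalty. Aggressive settings of both can underfit.
net = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)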
How Does the Choice of Regularization Interact With Noisy Inputs in a Nonlinear Setting?
In nonlinear settings (e.g., neural networks, tree-based models), noise in inputs can make it difficult for the model to learn stable patterns. L2 can help the network handle noisy inputs by ensuring the model does not rely on extremely large weights for unstable features, leading to smoother decision boundaries. L1 can force some input weights to zero, effectively ignoring certain noisy features. However, if noise is distributed across many features and none can be singled out as purely irrelevant, L1 might zero out some features that could still contribute slight predictive power, potentially harming performance. A real-world pitfall is that in very noisy datasets, neither approach alone may suffice if the network capacity is large; you might need advanced techniques like robust data preprocessing, domain-driven feature engineering, or specialized architectures that incorporate uncertainty modeling to fully mitigate the issues caused by noisy inputs.