ML Interview Q Series: How can you address the problem of overfitting in Linear Regression models?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
Overfitting in Linear Regression arises when the model captures not only the genuine underlying data relationships but also noise or spurious fluctuations. This leads to excellent performance on the training set but poor generalization on unseen data. The central idea to mitigate overfitting is to introduce mechanisms that simplify the model or reduce the variance in the coefficient estimates. Several approaches are commonly used.
Data Size and Feature Engineering
Collecting more data can allow the model to learn more robust relationships. If the number of features is too large relative to the number of observations, the coefficient estimates become unstable and tend to fit noise. Reducing irrelevant features (e.g., through domain knowledge or feature selection) helps shrink the model's hypothesis space.
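As an illustration, here is a minimal feature-selection sketch using scikit-learn's SelectKBest with the f_regression score; the synthetic data and the choice of k are placeholders, not recommendations.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

X = np.random.rand(100, 20)  # synthetic data: 100 observations, 20 candidate features
y = np.random.rand(100)

# Keep only the k features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))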
Cross-Validation
Using cross-validation provides a reliable measure of how the model performs across multiple subsets of the data. This helps detect overfitting, because if a model is overfitting, the performance will drop significantly on validation folds compared to the training set. Techniques such as k-fold cross-validation or leave-one-out cross-validation not only diagnose overfitting but can be used to tune hyperparameters (e.g., the regularization strength).
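For example, a small sketch of 5-fold cross-validation with scikit-learn; the data here is synthetic, so the scores themselves are not meaningful.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 10)  # synthetic placeholder data
y = np.random.rand(100)

# Each of the 5 folds serves once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Per-fold R^2:", scores)
print("Mean R^2:", scores.mean())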
Regularization Techniques
When linear regression models have many parameters, regularization is often the most direct and popular solution to address overfitting. Two frequently used regularization methods are Ridge Regression and Lasso Regression.
Ridge Regression
Ridge Regression introduces an L2 penalty on the coefficient sizes. The objective function augments the ordinary least squares loss with a term penalizing the sum of squared coefficients. It is expressed by the cost function:
J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \lambda \sum_{j=1}^{d} w_j^2
Here, w denotes the coefficient vector, x_i denotes the feature vector of the i-th example, y_i is the target value for the i-th example, N is the total number of training examples, d is the dimension of the feature vector, and lambda is a hyperparameter that controls the strength of the penalty. A larger lambda shrinks coefficients more aggressively, helping to reduce model variance and thus reduce the chance of overfitting, but potentially increasing bias.
Ridge Regression will not generally drive coefficients to zero but will constrain them to smaller magnitudes. This makes it especially suitable when you have many correlated features, since it distributes the penalty across correlated coefficients without ignoring them entirely.
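To see the shrinkage effect, here is a small sketch (with synthetic data) that fits Ridge at several penalty strengths; note that scikit-learn calls the penalty strength alpha rather than lambda.
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.rand(100, 10)  # synthetic placeholder data
y = np.random.rand(100)

# Larger penalties shrink the coefficient vector toward zero without zeroing it out
for alpha in [0.01, 1.0, 100.0]:
    coef_norm = np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)
    print(f"alpha={alpha}: ||w|| = {coef_norm:.4f}")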
Lasso Regression
Lasso Regression applies an L1 penalty on the coefficient magnitudes. Its objective function is:
J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \alpha \sum_{j=1}^{d} |w_j|
The hyperparameter alpha controls the amount of L1 shrinkage. An appealing property of Lasso is that some coefficients can become exactly zero, effectively performing feature selection. This can simplify the model substantially, making it more interpretable and potentially mitigating overfitting by removing less important features. However, if there are many correlated features, Lasso may arbitrarily keep one and zero out others, which may not be ideal if multiple features share similar significance.
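A quick sketch (synthetic data, arbitrary alpha) showing how Lasso drives some coefficients exactly to zero:
import numpy as np
from sklearn.linear_model import Lasso

X = np.random.rand(100, 10)  # synthetic placeholder data
y = np.random.rand(100)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Coefficients:", lasso.coef_)
print("Features dropped (coefficient exactly zero):", np.where(lasso.coef_ == 0)[0])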
Elastic Net
Elastic Net combines Ridge (L2) and Lasso (L1) penalties, offering a balance between the two. This is particularly useful when you have many correlated features but still want automatic feature selection properties. It is a practical approach in cases where purely Ridge or purely Lasso methods show weaknesses.
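A minimal Elastic Net sketch with scikit-learn; the alpha and l1_ratio values are illustrative and would normally be tuned by cross-validation.
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.random.rand(100, 10)  # synthetic placeholder data
y = np.random.rand(100)

# l1_ratio blends the penalties: 0.0 is pure L2 (Ridge-like), 1.0 is pure L1 (Lasso-like)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Non-zero coefficients:", (enet.coef_ != 0).sum())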
Early Stopping
Although less common in basic linear models, some implementations of linear regression (especially iterative ones like gradient descent) can use early stopping. Halting the optimization once the validation loss stops improving, even before the training loss has fully converged, helps avoid overfitting. This technique is more frequently discussed in the context of neural networks, but the core idea of stopping training at the point of best validation performance also applies here.
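As a sketch, scikit-learn's SGDRegressor (a gradient-descent-based linear model) supports early stopping against an internal validation split; the data and settings below are placeholders.
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.random.rand(500, 10)  # synthetic placeholder data
y = np.random.rand(500)

# Hold out 10% of the training data internally and stop once validation
# performance fails to improve for 5 consecutive epochs
sgd = SGDRegressor(early_stopping=True, validation_fraction=0.1,
                   n_iter_no_change=5, max_iter=1000, random_state=42)
sgd.fit(X, y)
print("Epochs actually run:", sgd.n_iter_)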
Dimensionality Reduction
Methods such as Principal Component Analysis can reduce feature dimensionality. This approach is helpful if the original features are highly correlated or if there is significant noise. By projecting data into a lower-dimensional subspace, the model focuses on the most dominant factors, thereby reducing overfitting risk.
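A short sketch of PCA followed by linear regression using a scikit-learn pipeline; the number of components is an illustrative choice, often tuned by cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.random.rand(100, 20)  # synthetic placeholder data
y = np.random.rand(100)

# Project onto the top 5 principal components, then fit ordinary least squares
pca_reg = make_pipeline(PCA(n_components=5), LinearRegression())
pca_reg.fit(X, y)
print("Explained variance ratios:", pca_reg.named_steps["pca"].explained_variance_ratio_)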
Practical Code Example
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
# Suppose X is your feature matrix and y is your target vector
X = np.random.rand(100, 10)
y = np.random.rand(100)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Ridge (L2 penalty): shrinks coefficients toward zero without eliminating them
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_score = ridge_model.score(X_val, y_val)
# Lasso (L1 penalty): can drive some coefficients exactly to zero
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_score = lasso_model.score(X_val, y_val)
print("Ridge Validation R^2 Score:", ridge_score)
print("Lasso Validation R^2 Score:", lasso_score)
In this code, the alpha parameter is the regularization strength for Ridge or Lasso. Cross-validation can be used to find the best alpha:
from sklearn.model_selection import GridSearchCV
alpha_values = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = {'alpha': alpha_values}
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5)
ridge_cv.fit(X_train, y_train)
print("Best alpha for Ridge:", ridge_cv.best_params_)
print("Best CV Score for Ridge:", ridge_cv.best_score_)
This procedure searches for the optimal alpha based on the average performance across the folds, helping to mitigate overfitting.
Follow-up Questions
How do you choose between Ridge and Lasso when combating overfitting?
Ridge is often beneficial in situations where you suspect correlated features might exist. It shrinks those coefficients but retains most features. Lasso zeroes out coefficients, effectively doing feature selection. One might prefer Lasso if there is a belief that many features are irrelevant or if interpretability is important, since zero-coefficient features are excluded from the model. However, for highly correlated features, Lasso can arbitrarily zero out one from a correlated group. This can be mitigated by using Elastic Net, which combines both penalties.
Why might cross-validation be a better measure of model performance than a simple train-validation split?
Cross-validation yields multiple estimates of out-of-sample performance by iterating which portion of data is used for training and which is used for validation. This reduces variance in the performance estimate, especially if the dataset is not very large. A single train-validation split could lead to unstable estimates if the split is not representative. Cross-validation helps ensure that every data point is used for both training and evaluation in different folds, improving reliability in tuning hyperparameters or comparing models.
Could you combine dimensionality reduction with L1/L2 regularization?
Yes. Dimensionality reduction through methods like PCA can be applied first to eliminate noisy, correlated features. Then, a linear regression model with an L1 or L2 penalty can be trained on the reduced feature set. However, if your goal is interpretability, you may lose some intuitive understanding of which original features matter after a PCA transformation, since principal components are linear combinations of original features. On the other hand, regularization directly on the original features can keep interpretability more transparent.
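A sketch of this combination, chaining PCA with a Ridge model in a single pipeline; the component count and alpha are placeholder values.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X = np.random.rand(100, 30)  # synthetic placeholder data
y = np.random.rand(100)

# PCA removes correlated/noisy directions first; Ridge then shrinks what remains
model = make_pipeline(PCA(n_components=10), Ridge(alpha=1.0))
model.fit(X, y)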
Is there any possibility of overfitting with regularization?
Regularization reduces the risk of overfitting, but a model can still overfit if the regularization hyperparameter is set inappropriately or if the training data is not representative. A very small lambda (Ridge) or alpha (Lasso) may not provide enough shrinkage, whereas a very large value can cause underfitting. Cross-validation is typically used to find an appropriate range for these hyperparameters.
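For instance, RidgeCV in scikit-learn sweeps a grid of penalty strengths with built-in cross-validation; the alpha grid below is only an example.
import numpy as np
from sklearn.linear_model import RidgeCV

X = np.random.rand(100, 10)  # synthetic placeholder data
y = np.random.rand(100)

# Evaluate each candidate alpha by cross-validation and keep the best one
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X, y)
print("Selected alpha:", ridge_cv.alpha_)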