ML Interview Q Series: How can the discrepancy between actual and predicted values be measured in a linear regression framework?
Comprehensive Explanation
The primary goal in linear regression is to fit a straight line (or in higher dimensions, a hyperplane) that best captures the relationship between input variables and the target variable. The measure of "best fit" is often quantified by how close the model’s predictions are to the actual data points. This closeness is usually measured by the error (also called the cost or loss).
One of the most commonly used error functions in linear regression is the Mean Squared Error (MSE). This error function measures the average squared distance between the model predictions and the real target values. Although there are variations—such as the Sum of Squared Errors (SSE), Root Mean Squared Error (RMSE), or Mean Absolute Error (MAE)—the essence remains: we compare predicted values to actual values and measure the difference.
Below is the key formula for MSE in a linear regression context:

MSE = (1/n) * sum_{i=1}^{n} (y_i - hat{y}_i)^2
Where:
n is the number of data points.
y_i is the actual target value for the i-th data sample.
hat{y}_i is the predicted value for the i-th data sample using the linear regression model.
In a practical sense, we use MSE (or a related measure) because squaring the residual (the difference between actual and predicted) penalizes larger errors more heavily, which often helps during optimization. From this perspective, the training process of linear regression (e.g., via the Normal Equation or Gradient Descent) will attempt to find the line parameters that minimize MSE.
Error Functions and Why We Square the Errors
Squaring the residuals has some intuitive and mathematical advantages. It ensures the error is always non-negative and emphasizes larger deviations. Moreover, MSE is differentiable with respect to the model parameters, making it straightforward to optimize using gradient-based methods like Gradient Descent.
On the other hand, if the dataset contains heavy outliers, MSE might be overly influenced by these points, because squaring the difference makes large residuals stand out. In those cases, some practitioners might turn to other metrics such as MAE, which can be more robust to large deviations.
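A quick way to see this sensitivity is to compare MSE and MAE on the same residuals when one target value is an extreme outlier (the numbers below are made up purely for illustration):

import numpy as np

# One target (50.0) is an extreme outlier relative to its prediction
y_true = np.array([3.0, -0.5, 2.0, 50.0])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

residuals = y_true - y_pred
print("MSE:", np.mean(residuals ** 2))      # ~445.3, dominated by the outlier
print("MAE:", np.mean(np.abs(residuals)))   # ~10.8, far less affected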
Example Code for Computing MSE
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Compute the difference for each data point
    residuals = y_true - y_pred
    # Square those differences
    squared_residuals = residuals ** 2
    # Average the squared differences
    mse = np.mean(squared_residuals)
    return mse

# Suppose y_true and y_pred are NumPy arrays of equal length
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

mse_value = mean_squared_error(y_true, y_pred)
print("MSE:", mse_value)
How Gradient Descent Minimizes This Error
When using gradient descent, the update to each parameter (for example, in a univariate linear regression, the slope m and intercept b) is derived by taking the partial derivative of the MSE cost function with respect to those parameters. Because MSE is smooth, finding these gradients and updating parameters iteratively is straightforward. At each iteration:
The model computes predictions hat{y}_i.
We calculate the residual (y_i - hat{y}_i).
We compute the gradient of MSE with respect to m and b.
We update m and b in the direction that reduces the MSE.
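To make these steps concrete, here is a minimal sketch of batch gradient descent for a univariate model y ≈ m*x + b. The learning rate, iteration count, and toy data are illustrative choices, not tuned values:

import numpy as np

def fit_line_gradient_descent(x, y, lr=0.01, n_iters=1000):
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        # Step 1: compute predictions
        y_pred = m * x + b
        # Step 2: compute residuals
        residuals = y - y_pred
        # Step 3: partial derivatives of MSE with respect to m and b
        grad_m = (-2.0 / n) * np.sum(x * residuals)
        grad_b = (-2.0 / n) * np.sum(residuals)
        # Step 4: move m and b in the direction that reduces MSE
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
m, b = fit_line_gradient_descent(x, y)
print("slope:", m, "intercept:", b)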
Potential Issues to Watch Out For
High Leverage Outliers
When there are extreme data points with large feature values, they can disproportionately impact the slope. Squared error intensifies the effect of large residuals, so the presence of just a few outliers can skew the result significantly.
Overfitting vs. Underfitting
While MSE gives you a concrete measure, it does not directly tell you whether you are overfitting or underfitting. Monitoring how MSE behaves on both the training and validation sets is important for identifying whether the model is too complex or too simplistic.
Follow-up Questions
Can the Mean Absolute Error (MAE) be used instead of MSE, and what would be the trade-offs?
When you use MAE, you measure the average of the absolute differences between predictions and targets. This function is more robust to outliers since large deviations are not squared but are taken in absolute value. On the downside, MAE can be less mathematically convenient for optimization because it is not differentiable at zero. In practice, the subgradient can be used, but it might not be as straightforward to minimize as MSE in a purely analytical sense.
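For comparison with the MSE snippet earlier, a minimal MAE implementation looks like this:

import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Absolute rather than squared residuals; the absolute value is not
    # differentiable at zero, which is why optimizers fall back to
    # subgradients for this loss
    return np.mean(np.abs(y_true - y_pred))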
How would you handle a situation where the cost function is not decreasing during gradient descent?
You might check several things:
Learning rate may be too large, causing the updates to overshoot the minimum.
Data might not be normalized or standardized, leading to pathological gradients.
Implementation bugs in gradient calculation or parameter updates.
Presence of NaN or infinite values due to exploding gradients.
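One simple diagnostic is to record the cost at every iteration and flag any increase. Below is a sketch reusing the univariate gradient descent setup from earlier; the learning rate default is deliberately too large so the check fires, and the function name is just for illustration:

import numpy as np

def fit_with_cost_tracking(x, y, lr=0.5, n_iters=100):
    m, b = 0.0, 0.0
    n = len(x)
    prev_cost = float("inf")
    for i in range(n_iters):
        y_pred = m * x + b
        cost = np.mean((y - y_pred) ** 2)
        if cost > prev_cost:
            # A rising cost usually means the learning rate is too large
            print(f"iteration {i}: cost rose from {prev_cost:.3f} to {cost:.3f}")
            break
        prev_cost = cost
        residuals = y - y_pred
        m -= lr * (-2.0 / n) * np.sum(x * residuals)
        b -= lr * (-2.0 / n) * np.sum(residuals)
    return m, b

On the toy data from the gradient descent example, calling this with lr=0.5 triggers the warning almost immediately, while lr=0.01 trains cleanly.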
Why does the Normal Equation approach not always scale well to large feature sets?
The Normal Equation requires computing a matrix inverse (or a related decomposition) of X^T X. If the feature matrix X is of size n x d, then X^T X is d x d, and inverting or factorizing it costs on the order of d^3 operations, which becomes prohibitive when d is large. Gradient-based methods are often more practical at scale because they rely on iterative updates and never need an explicit matrix inversion.
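A minimal sketch of the Normal Equation in NumPy, assuming the first column of X is a bias column of ones. np.linalg.solve is used in place of an explicit inverse for numerical stability, though the cost still grows roughly cubically in d:

import numpy as np

# Toy design matrix: a bias column of ones plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Solve (X^T X) theta = X^T y rather than inverting X^T X directly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept, slope:", theta)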
In practice, how do you typically evaluate the performance of a linear regression model aside from MSE on the training set?
One common approach is to hold out a validation set or use cross-validation. By measuring the MSE (or other metrics like R^2, MAE, or RMSE) on the training data and the validation data, you can assess if your model is overfitting or underfitting. Additionally, domain-specific error metrics might be more relevant in certain applications (for example, Mean Absolute Percentage Error for sales forecasting).
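As a sketch of that workflow, assuming scikit-learn is available, cross-validated MSE for a linear model can be obtained like this (the toy data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: 20 samples of one noisy linear feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=20)

model = LinearRegression()
# scikit-learn negates MSE so that higher scores are always better
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("per-fold MSE:", -scores)
print("mean MSE:", -scores.mean())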
How do you incorporate regularization in the error calculation for linear regression?
In regularized linear regression (such as Ridge or Lasso), a penalty term is added to the MSE cost. For Ridge Regression, we add the sum of squared parameter values multiplied by a regularization parameter (alpha). For Lasso, we add the absolute value of the parameters multiplied by alpha. This modifies the cost function so that it not only seeks to minimize error but also penalizes large coefficients, thereby reducing variance.
By carefully choosing the regularization parameter, we can control the bias-variance trade-off and help mitigate overfitting.
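As a sketch of how the penalty enters the cost, here is the Ridge version; the Lasso variant would replace the squared penalty with alpha * np.sum(np.abs(theta)). Following common convention, the intercept is assumed to be excluded from theta:

import numpy as np

def ridge_cost(y_true, y_pred, theta, alpha):
    # Standard MSE term
    mse = np.mean((y_true - y_pred) ** 2)
    # L2 penalty on the coefficients (intercept excluded by convention)
    penalty = alpha * np.sum(theta ** 2)
    return mse + penalty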