ML Interview Q Series: How can you verify whether a linear regression model satisfies all the usual assumptions needed for valid regression analysis?
Comprehensive Explanation
Linear regression is founded on several core assumptions. The validity of the model’s inference heavily depends on whether these assumptions are met. Although linear regression can sometimes still produce decent predictions when a few assumptions are slightly violated, major deviations can compromise interpretability and reliability. The assumptions typically include linearity, independence of errors, homoscedasticity, normality of error terms, and no or minimal multicollinearity among features. The following sections detail each assumption and how to verify whether a linear model meets these criteria.
Linearity of Relationship
One major assumption is that the dependent variable has a linear relationship with the predictors. To check linearity, residuals (the difference between the observed values and the model’s predicted values) are often examined. If the relationship is linear, the residuals should appear as random noise around zero with no discernible pattern when plotted against predicted values or individual predictors.
You can visually inspect the residual-vs-predicted plot. If the plot shows a clear curve or other systematic pattern, this suggests non-linearity and that predictor transformations or polynomial terms may be necessary (a funnel shape, by contrast, points to non-constant variance, covered below).
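As a minimal sketch of the polynomial-term remedy (using a hypothetical single predictor x and synthetic data, purely for illustration), a squared term can be added to the design matrix when the residual plot shows curvature:
import numpy as np
import statsmodels.api as sm

# Synthetic, curved relationship for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y_demo = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, 200)

# Design matrix with intercept, x, and x^2
X_poly = sm.add_constant(np.column_stack([x, x**2]))
model_poly = sm.OLS(y_demo, X_poly).fit()
print(model_poly.params)  # should roughly recover 2, 0.5, 0.3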
Independence of Errors
The model’s errors (residuals) should be independent of each other. When the data are sequential (time series) or organized in clusters, the errors might exhibit correlation. If your errors are correlated, some statistical tests (like t-tests for coefficients) can become invalid since they assume independent errors.
A Durbin-Watson test can formally test for correlated errors in a time-series context. A statistic substantially below 2 indicates positive autocorrelation in the residuals, while a value substantially above 2 indicates negative autocorrelation. In non-time-series contexts, you should also consider how the data were collected to ensure independence.
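A quick sketch using statsmodels (assuming X, with an intercept column, and y are defined as in the example further below):
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

model = sm.OLS(y, X).fit()
dw_stat = durbin_watson(model.resid)
# Roughly: near 2 -> little autocorrelation; well below 2 -> positive; well above 2 -> negative
print(f"Durbin-Watson statistic: {dw_stat:.3f}")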
Homoscedasticity (Constant Variance of Errors)
This assumption states that the variance of residuals remains constant across all levels of the predicted values. If the residuals exhibit patterns of increasing or decreasing spread, that suggests heteroscedasticity, which can cause problems in estimating the standard errors and confidence intervals accurately.
To check for homoscedasticity, you can again use the residual-vs-predicted plot. If residuals fan out or form a cone shape, the variance of errors is not constant. Sometimes, transforming the target variable (e.g., using log transformations when dealing with highly skewed data) can help stabilize the variance.
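As a hedged sketch of the log-transform remedy mentioned above (assuming the target y is strictly positive and that X and y are defined as in the example further below):
import numpy as np
import statsmodels.api as sm

# Refit on log(y); compare this model's residual-vs-fitted plot with the original one
model_log = sm.OLS(np.log(y), X).fit()
print(model_log.summary())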
Normality of Errors
Linear regression assumes that the errors follow a normal distribution with a mean of zero. Although this assumption can be slightly relaxed in large samples due to the Central Limit Theorem, checking normality is crucial for small-to-moderate datasets or when you aim to construct prediction intervals accurately.
You can use a Q-Q plot of the residuals to see how well they align with a normal distribution. A Shapiro-Wilk test can also be run to check normality formally. Minor deviations are often acceptable, but extreme violations can signal problems with your modeling approach or the presence of outliers.
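A minimal sketch of the Shapiro-Wilk check (assuming residuals comes from a fitted model, as in the code further below):
from scipy import stats

# Null hypothesis: the residuals are drawn from a normal distribution
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.4f}")
# Note: with very large samples, even trivial deviations can yield tiny p-values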
Minimal or No Multicollinearity
Multicollinearity arises when independent variables are highly correlated, making it difficult to estimate the influence of each variable individually. Standard errors of the regression coefficients increase, leading to wider confidence intervals and less reliable statistical significance tests.
You can check Variance Inflation Factors (VIFs) for each predictor. High VIF values (often above 5 or 10, depending on convention) signal serious collinearity issues. Examining correlation matrices among predictors is another simpler but less comprehensive approach.
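A quick sketch of the correlation-matrix check (assuming X_df is a hypothetical pandas DataFrame of predictors, without the intercept column):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

corr = X_df.corr()  # pairwise Pearson correlations among predictors
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()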
Outlier and Leverage Points Assessment
Individual observations that deviate significantly from the trend can disproportionately affect the fitted regression line. Outliers can distort parameter estimates, while leverage points (observations with unusual predictor values) can pull the regression line in their direction.
You can look at Cook’s distance and leverage values to identify potentially influential observations. If a small number of data points strongly influences the model, you may need to investigate these points to decide if they are valid or if they result from data errors.
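A brief sketch using statsmodels influence diagnostics (assuming model is a fitted OLS result, as in the code further below):
import numpy as np

influence = model.get_influence()
cooks_d = influence.cooks_distance[0]   # Cook's distance per observation
leverage = influence.hat_matrix_diag    # leverage (hat) values

# A common rough rule of thumb: flag observations with Cook's distance above 4/n
n = len(cooks_d)
print("Potentially influential observations:", np.where(cooks_d > 4 / n)[0])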
Model Equation
When you fit a linear regression, you typically solve for the parameters in an equation that can be expressed in matrix form. The central formula for multiple linear regression can be written as:
hat{y} = X beta
Where:
X is the design matrix containing the intercept and predictor variables. Each row corresponds to an observation, and each column corresponds to a feature (plus the intercept term).
beta is the vector of estimated regression coefficients. Each coefficient corresponds to a predictor (and the intercept).
hat{y} is the vector of model-predicted values.
Below this formula, one usually interprets the residuals as e = y - hat{y}, which should satisfy assumptions about independence, normality, and constant variance.
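As a minimal numerical sketch with synthetic data (the names X_demo and y_demo are hypothetical), the coefficients can be obtained from the normal equations, which is essentially what an OLS solver does internally, via a more numerically stable decomposition:
import numpy as np

rng = np.random.default_rng(42)
X_demo = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -0.5])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)  # solve (X^T X) beta = X^T y
y_hat = X_demo @ beta_hat        # hat{y} = X beta
residuals_demo = y_demo - y_hat  # e = y - hat{y}
print(beta_hat)  # should be close to true_beta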
Practical Checks with Python
A typical workflow in Python for diagnosing model assumptions involves packages like statsmodels or plotting libraries such as matplotlib and seaborn. For example, you might do:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
# Assuming X is your feature matrix (with intercept), y is your target
model = sm.OLS(y, X).fit()
residuals = model.resid
fitted = model.fittedvalues
# Residual plot
plt.scatter(fitted, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
# Q-Q plot for checking normality
fig = sm.qqplot(residuals, line='45', fit=True)
plt.show()
Inspecting these plots helps you identify non-linearity, changing variance, and potential outliers or influential points. Moreover, you can calculate the VIF to check for multicollinearity:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# feature_names should hold the column names of X (e.g., list(X.columns) for a DataFrame)
vif_data = pd.DataFrame()
vif_data["feature"] = feature_names
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
Any feature with a VIF beyond an acceptable threshold indicates a collinearity problem.
Common Follow-Up Questions
Can violations of these assumptions invalidate a linear regression model entirely?
Major assumption violations can indeed cause substantial issues, such as biased estimates or unreliable inference. However, minor deviations might not render the model useless, especially if the primary goal is prediction rather than interpretation. In practice, you might employ data transformations, robust regression, or generalized linear models if standard assumptions fail.
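For instance, a minimal sketch of robust regression with statsmodels (using Huber's T norm; X and y are assumed to be defined as in the earlier example):
import statsmodels.api as sm

# Robust linear model that downweights outlying observations
robust_model = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(robust_model.params)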
How can we handle correlated errors if they are detected in time series data?
When errors are correlated due to time series structures, you can use specialized models designed for autocorrelated data. Techniques like ARIMA or state-space models can be more appropriate. Alternatively, you can add lagged variables or differencing to capture the temporal dependencies, or employ Generalized Least Squares (GLS) which accounts for autocorrelation in the error terms.
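A hedged sketch of the GLS approach using statsmodels' GLSAR, which fits a regression with autoregressive errors (rho=1 assumes first-order autocorrelation; X and y are assumed as in the earlier example):
import statsmodels.api as sm

glsar_model = sm.GLSAR(y, X, rho=1)
results = glsar_model.iterative_fit(maxiter=10)  # alternately estimate beta and the AR coefficient
print(results.params)
print("Estimated AR(1) coefficient:", glsar_model.rho)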
What if the normality assumption is seriously violated?
If you discover the errors are significantly non-normal and your sample size is small, your confidence intervals and hypothesis tests could be off. You might perform a data transformation (log, Box-Cox, etc.) or use non-parametric methods. For large sample sizes, thanks to the Central Limit Theorem, the distribution of estimated coefficients often remains reasonably approximated by a normal distribution, reducing the severity of the normality violation.
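A brief sketch of the Box-Cox option (requires a strictly positive target; y is assumed as in the earlier example):
from scipy import stats

# lambda is chosen by maximum likelihood; lambda = 0 corresponds to a log transform
y_transformed, fitted_lambda = stats.boxcox(y)
print(f"Estimated Box-Cox lambda: {fitted_lambda:.3f}")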
Why is multicollinearity problematic if all I want is a predictive model?
High multicollinearity may not necessarily degrade predictive performance if the goal is purely to minimize mean squared error. However, it significantly complicates interpretation because the model has difficulty assigning relative importance to correlated features. Also, if you remove or alter the correlated features in future data, the model might not generalize well, leading to instability in coefficient estimates.
How do we decide if outliers need to be removed or if they offer critical information?
Context is the key. If outliers stem from data entry errors or measurement anomalies, you might remove or correct them. If they are valid but unusual data points, they might hold important information about extreme cases. You can conduct a sensitivity analysis by running the model with and without outliers to see how sensitive the results are. Domain knowledge often guides the final decision on whether to keep or remove an outlier.