ML Interview Q Series: What basic premises underlie linear regression, and how do these premises guide the correct application of the model?
Comprehensive Explanation
Linear regression aims to model the relationship between input variables and a continuous target variable by fitting a linear function of the inputs. A standard form of the model with n predictors can be represented as:
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
In this expression, \hat{y} is the predicted value of the target, \beta_0 is the intercept term, and \beta_i for i = 1, ..., n are the coefficients for the corresponding predictor variables x_i. The goal in ordinary least squares (OLS) linear regression is to find the coefficients that minimize the sum of squared differences between the predicted and actual target values.
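As a minimal sketch of this objective, the snippet below fits the coefficients on synthetic data (the data and true coefficient values are invented purely for illustration) using a numerical least-squares solver:
import numpy as np

# Synthetic data for illustration: 100 samples, 2 predictors, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# Prepend a column of ones for the intercept, then solve the least-squares
# problem. lstsq is preferred over explicitly inverting X^T X because it is
# more numerically stable.
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]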
Below are the key assumptions that underlie this model. When any of these assumptions is seriously violated, the credibility of statistical inferences such as p-values, confidence intervals, and predictions can be compromised.
Linearity of the Relationship
One fundamental assumption is that there is a straight-line (linear) relationship between each predictor variable and the target. This means that, with the other predictors held fixed, a change in any single predictor produces a proportional change in the expected value of the target. If the true relationship is not approximately linear, then applying linear regression without transformations or more sophisticated modeling approaches may lead to biased estimates of the coefficients.
Independence of Errors
Each data point's error term (the difference between the actual and predicted value) should be independent of the errors for the other data points. In time series, for instance, this assumption can be violated when consecutive observations are correlated (autocorrelation). In other domains, grouping or hierarchical structures might also introduce dependencies in the errors. Violating independence can lead to underestimated or overestimated standard errors and flawed statistical conclusions.
Homoscedasticity (Constant Variance of Errors)
The variance of the errors is assumed to be constant across all levels of the predictors. This is referred to as homoscedasticity. If the variance of the residuals changes with the level of the predictor (a phenomenon called heteroscedasticity), then ordinary least squares estimates remain unbiased but their standard errors become invalid, which leads to inaccurate hypothesis tests and confidence intervals.
Normality of Errors
For inference (e.g., constructing confidence intervals, conducting t-tests) in classical linear regression, it is assumed that the error terms are normally distributed around the regression line. This assumption is especially important for small sample sizes. If the sample size is large, then the Central Limit Theorem often mitigates moderate deviations from normality of residuals, but heavy-tailed or very skewed error distributions can still affect results.
No Perfect Multicollinearity
Predictors in the regression should not be perfectly linearly dependent on one another. When perfect or near-perfect linear relationships exist among predictors, the matrix inversion in the normal equations can become numerically unstable (or impossible). This results in inflated standard errors for the coefficients and difficulties in interpreting the significance of each predictor.
Proper Model Specification
Linear regression presupposes that the chosen set of predictors correctly captures the essential structure of the data. Omitting relevant variables (leading to omitted-variable bias) or including irrelevant variables can both lead to distorted estimates. If the true relationship includes polynomial or interaction terms, then failing to incorporate them explicitly can also bias the results.
Observations are Accurately and Randomly Sampled
Another pragmatic assumption is that the training data fairly represents the underlying population of interest and does not include systematic sampling biases. In real-world scenarios, violations might arise if there is selection bias or if the data is not representative of the conditions where the model will be applied in production.
How to Check and Mitigate Assumption Violations
While the question specifically asked about assumptions, an integral part of applying linear regression is to test whether these assumptions are met and to implement appropriate remedies when they fail:
Residual plots (predicted value vs. residual, Q-Q plots, etc.) help assess linearity, homoscedasticity, and normality of errors.
Durbin-Watson test (or similar) can be used for detecting autocorrelation in residuals; a short code sketch of this and related checks follows the list.
Variance Inflation Factor (VIF) assists in diagnosing multicollinearity.
Transformations (log, square root, or polynomial terms) or robust regression techniques can correct for certain kinds of nonlinearity or non-constant variance.
If violations are severe, more advanced models such as generalized linear models (GLMs) or specialized time-series and mixed-effects models may be appropriate.
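As a brief sketch of several of these diagnostics (Durbin-Watson, the Breusch-Pagan test for heteroscedasticity, and VIF), using synthetic data and rule-of-thumb thresholds that are assumptions for illustration only:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data for illustration, with two deliberately correlated predictors
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

X = sm.add_constant(pd.DataFrame({'x1': x1, 'x2': x2}))
model = sm.OLS(y, X).fit()

# Durbin-Watson: values near 2 suggest little first-order autocorrelation
print('Durbin-Watson:', durbin_watson(model.resid))

# Breusch-Pagan: a small p-value suggests heteroscedastic residuals
_, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)
print('Breusch-Pagan p-value:', bp_pvalue)

# VIF per non-constant column; values above roughly 5-10 often flag multicollinearity
for i, col in enumerate(X.columns):
    if col != 'const':
        print(col, 'VIF:', variance_inflation_factor(X.values, i))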
Potential Follow-up Questions
How would you verify normality of residuals, and why might it be less critical for large samples?
You can visually inspect a Q-Q (quantile-quantile) plot of the residuals to see whether they deviate substantially from a straight line, which would indicate non-normality. Additionally, formal statistical tests like the Shapiro-Wilk test can be used, although they can be overly sensitive for large datasets.
For large samples, the Central Limit Theorem implies that the sampling distribution of the coefficient estimates tends toward normal even if the individual error terms are not perfectly normal. This is why moderate deviations from normality are often considered tolerable with sufficiently large data. However, if the error distribution is extremely heavy-tailed, skewed, or includes strong outliers, even large-sample linear regression can produce misleading inferences.
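A minimal sketch of the Shapiro-Wilk check, using residuals from a toy fit on invented data:
import numpy as np
from scipy import stats

# Residuals from a toy straight-line fit; data are illustrative only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(scale=2.0, size=200)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

stat, p_value = stats.shapiro(residuals)
print(f'Shapiro-Wilk W={stat:.3f}, p-value={p_value:.3f}')
# A small p-value is evidence against normality, but with very large samples
# the test flags even trivial departures, so pair it with a Q-Q plot.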
If residuals show increasing spread as predictions increase, how does that affect the regression model?
This scenario hints at heteroscedasticity: the variance of errors is not constant across the range of predicted values. Although the estimates of the coefficients themselves remain unbiased, the calculated standard errors become incorrect. That invalidates the usual t-tests and F-tests and makes confidence intervals inaccurate.
One typical remediation is to apply a suitable transformation (e.g., log transform of the target if the magnitude of the output spans multiple orders). Another approach might be using weighted least squares, in which observations with higher variances get proportionally smaller weights.
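A minimal weighted-least-squares sketch on synthetic heteroscedastic data; the inverse-variance weights used here are an assumed heuristic for illustration, not a prescription:
import numpy as np
import statsmodels.api as sm

# Synthetic data whose noise standard deviation grows with x
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5 * x)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

# Weight each observation by the inverse of its assumed error variance
# (here taken to be proportional to x**2)
weights = 1.0 / x**2
wls_fit = sm.WLS(y, X, weights=weights).fit()

print('OLS std errors:', ols_fit.bse)   # computed under the (wrong) constant-variance assumption
print('WLS std errors:', wls_fit.bse)   # after reweighting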
What is the difference between correlation and multicollinearity in linear regression?
Correlation describes the linear relationship between two variables. By contrast, multicollinearity indicates a scenario in which multiple predictors, possibly more than two, have strong linear dependencies. This can inflate the variance of coefficient estimates, making it hard to separate the individual effect of each predictor. A predictor with high correlation to another variable might still be used if it brings unique explanatory power. However, when there is near-perfect linear dependence among predictors, the least squares solution becomes unstable or non-unique.
Can you explain how omitted-variable bias affects the assumption of correct model specification?
Omitted-variable bias arises when there is a variable not included in the model that is correlated with one or more of the included predictors. By leaving out that important factor, you attribute its effect to the predictors that remain. This makes your coefficient estimates for those predictors inaccurate, violating the assumption that the model correctly captures the structure in the data. The remedy is typically to either include the missing variable explicitly or adopt a modeling framework that can handle potential confounding factors.
How might you handle autocorrelation in a time-series context where residuals are not independent?
In time-series or other sequential data, consecutive observations are often correlated. Applying standard linear regression under the independence assumption is inappropriate in such scenarios. Instead, you can use time-series-specific models such as ARIMA or state-space models, or generalized least squares, which explicitly models the error covariance structure. These approaches account for correlation across time or consecutive observations, which corrects standard error estimates and improves predictive performance.
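As one possible sketch, statsmodels' GLSAR performs feasible generalized least squares with AR(p) errors; the AR(1) noise process and parameter values below are synthetic and purely illustrative:
import numpy as np
import statsmodels.api as sm

# Synthetic data whose errors follow an AR(1) process
rng = np.random.default_rng(3)
n = 300
x = np.linspace(0, 10, n)
ar_noise = np.zeros(n)
for t in range(1, n):
    ar_noise[t] = 0.8 * ar_noise[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 0.7 * x + ar_noise

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

# GLSAR alternates between estimating the regression coefficients and the
# AR parameter of the error process
glsar_model = sm.GLSAR(y, X, rho=1)
glsar_fit = glsar_model.iterative_fit(maxiter=10)

print('OLS std errors:  ', ols_fit.bse)
print('GLSAR std errors:', glsar_fit.bse)
print('Estimated AR(1) coefficient:', glsar_model.rho)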
Could linear regression still be applied if the error distribution is non-Gaussian?
Yes. OLS can still provide unbiased estimates of the coefficients even if the errors are not strictly normal, provided the other assumptions (linearity, independence, homoscedasticity) hold. However, normality of errors is crucial for valid confidence intervals, t-tests, and other inference measures under the standard linear regression framework. When errors are far from normal, it is common to switch to robust regression techniques or to transform the target variable so that the residuals become more nearly normal.
Python Code Example for Checking Assumptions
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
np.random.seed(0)
X = np.random.rand(100,1)*10
y = 3.5 * X[:,0] + 2 + np.random.randn(100)*2
# Create DataFrame
df = pd.DataFrame({'X': X[:,0], 'y': y})
# Fit OLS model using statsmodels
X_sm = sm.add_constant(df['X']) # adds the intercept term
model = sm.OLS(df['y'], X_sm).fit()
# Print summary
print(model.summary())
# Residual plot
residuals = model.resid
fitted = model.fittedvalues
plt.figure(figsize=(8,5))
sns.scatterplot(x=fitted, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
# Q-Q plot to check normality of residuals
fig = sm.qqplot(residuals, line='45', fit=True)
plt.title('Q-Q Plot of Residuals')
plt.show()
In the above script, model.summary() provides a thorough overview of the regression results, including coefficient estimates and diagnostic statistics. The residuals-versus-fitted plot, along with the Q-Q plot, helps assess linearity, homoscedasticity, and normality of errors.
Below are additional follow-up questions
How would you identify and handle outliers that might violate the linearity or normality assumptions?
Outliers can distort coefficient estimates and produce non-normal distributions of residuals. Visual diagnostics are typically the first step. A scatter plot of each predictor against the target, along with a residuals-versus-fitted plot, can help detect anomalous points that deviate significantly from the overall pattern. Influence diagnostics such as Cook's distance and leverage scores can also quantify how strongly individual data points pull on the fitted regression line.
When outliers are detected, you need to understand whether they arise from genuine extreme events or measurement errors. If they are measurement errors, it may be appropriate to remove or correct those data points. If they represent natural but extreme cases, consider using robust regression techniques (e.g., Huber or Tukey loss functions) that lessen the impact of large residuals. Alternatively, log or other transformations applied to the target or predictors can reduce the effect of extremely large values. In real-world scenarios, domain knowledge is critical to deciding whether an outlier should be kept, transformed, or removed.
A potential pitfall is routinely discarding outliers simply because they do not fit the model. This can lead to underestimating inherent variability in the data and a model that overstates its accuracy. Another subtle edge case is when a small number of outliers reveal an important subpopulation or changing trend that the model is failing to capture, indicating you might need a more flexible approach.
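A minimal sketch of these influence diagnostics in statsmodels, using synthetic data with one injected outlier; the 4/n cutoff is a common rule of thumb rather than a strict threshold:
import numpy as np
import statsmodels.api as sm

# Illustrative data with a single gross outlier injected at index 0
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.2 * x + rng.normal(scale=1.0, size=100)
y[0] += 25.0

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

influence = fit.get_influence()
cooks_d = influence.cooks_distance[0]   # Cook's distance per observation
leverage = influence.hat_matrix_diag    # leverage (hat values)

# Flag points whose Cook's distance exceeds the 4/n rule of thumb
threshold = 4 / len(x)
flagged = np.where(cooks_d > threshold)[0]
print('Flagged observations:', flagged)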
Does the scale of features affect the fitting of linear regression, and how might this interact with assumptions?
Ordinary least squares does not require feature scaling to obtain correct coefficient estimates. However, extremely large or small magnitudes in predictors can lead to numerical instability in matrix inversion, especially if those large-scale features are correlated. This can inflate standard errors or cause difficulties in interpreting the regression coefficients.
While scaling predictors (e.g., standardizing each feature to have zero mean and unit variance) does not change the model’s predictions, it can improve numerical stability, especially in cases with multiple correlated predictors. Scaling can also make regularization techniques (e.g., Ridge or Lasso) more consistent in how they penalize different predictors.
A pitfall is interpreting unscaled coefficients incorrectly. For example, if one feature is in centimeters and another is in kilometers, large coefficient magnitudes can be misleading. Another edge case is that scaled features can obscure domain-specific interpretability if you need to directly compare the effect of a one-unit change in each predictor on the target. Thus, you must strike a balance between improving numerical stability and preserving the interpretability of your regression coefficients.
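The sketch below, on invented data with wildly different feature scales, illustrates that standardizing the features leaves the predictions unchanged while making the coefficient magnitudes comparable:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data: one feature on a small scale, one on a very large scale
rng = np.random.default_rng(5)
X = np.column_stack([rng.uniform(0, 1, 100),
                     rng.uniform(0, 100000, 100)])
y = 3.0 * X[:, 0] + 0.0001 * X[:, 1] + rng.normal(scale=0.1, size=100)

raw_fit = LinearRegression().fit(X, y)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
scaled_fit = LinearRegression().fit(X_scaled, y)

# Predictions are identical up to numerical precision
print(np.allclose(raw_fit.predict(X), scaled_fit.predict(X_scaled)))  # True
print('Raw coefficients:   ', raw_fit.coef_)      # magnitudes reflect the units
print('Scaled coefficients:', scaled_fit.coef_)   # magnitudes are comparable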
Could polynomial or interaction terms help address violations of linearity, and how would you implement them?
Linear regression can incorporate polynomial terms or cross-product (interaction) terms to capture more complex relationships. For example, if the true relationship between a predictor x and the target y is quadratic, including x^2 as an additional predictor can approximate that curvature without fully discarding the linearity premise (it remains a “linear” model in the coefficients, just with transformed features).
You might decide to use polynomial expansions if a residual-versus-predictor plot shows systematic curvature. For interaction terms, you might suspect that the effect of one predictor depends on the level of another. Implementing these transformations can be done manually by adding x^2, x^3, or x1*x2 columns, or automatically via libraries such as sklearn.preprocessing.PolynomialFeatures in Python.
An important pitfall is the risk of overfitting if too high an order of polynomial terms is included or if you introduce numerous interaction terms without sufficient data. You also have to watch for increased multicollinearity because polynomial expansions often correlate strongly with the original features. Careful consideration, regularization, or domain knowledge can help mitigate these risks.
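A minimal sketch with sklearn.preprocessing.PolynomialFeatures on synthetic quadratic data (the coefficient values are invented for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data with a quadratic relationship between x and y
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# degree=2 adds the x^2 column; with multiple inputs it would also add
# pairwise interaction terms automatically
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

fit = LinearRegression().fit(x_poly, y)
print(poly.get_feature_names_out())   # ['x0', 'x0^2']
print(fit.intercept_, fit.coef_)      # roughly 1.0 and [2.0, 0.5]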
In what situations would you switch from linear regression to robust regression, and what assumptions might change?
Robust regression is chosen when you suspect that your data contain outliers or heavy-tailed error distributions that can unduly influence the ordinary least squares solution. Huber regression downweights large residuals in the loss function, while RANSAC repeatedly fits the model on consensus subsets of inliers; either way, extreme points no longer dominate the fitting procedure.
Robust methods still assume a reasonably linear relationship between predictors and the target, but they relax the strict normality of errors assumption and reduce the sensitivity to outliers. If data exhibit skewness, heteroscedasticity, or a small number of extreme points, robust regression can provide more stable and interpretable coefficient estimates.
One pitfall is relying solely on robust regression when deeper issues exist, such as strong autocorrelation or severely misspecified models. In such cases, robust regression might hide an underlying violation of independence or linearity. Another subtlety arises when outliers are informative signals about subpopulations or regime changes; discarding their influence might remove insights that a domain-specific model could exploit.
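A minimal sketch comparing OLS with Huber robust regression (statsmodels RLM) on synthetic data containing a few injected outliers:
import numpy as np
import statsmodels.api as sm

# Illustrative data with a handful of extreme points added to y
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
y[:5] += 30.0

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print('OLS slope:  ', ols_fit.params[1])
print('Huber slope:', huber_fit.params[1])   # typically closer to the true value of 2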
How might a non-random or biased sampling process impact the assumptions of linear regression?
Linear regression assumes that the data come from a population of interest in a manner that does not systematically distort the relationships. If your sampling process is biased—certain subgroups are overrepresented or underrepresented—then the estimated coefficients might reflect the idiosyncrasies of the sample rather than the true underlying population.
A pitfall is that even if your linear model perfectly fits the sample, its predictions may generalize poorly. The significance tests and confidence intervals also assume that your sample variability is representative of real population variability, which is not guaranteed if, for instance, you used convenience sampling or data from a source with systematic bias.
Subtle edge cases involve temporal changes or dynamic populations. For instance, if you are training a model on data from one time period but applying it later when consumer behavior or economic conditions have shifted, your assumptions of representativeness and independence can break down. Weighting or post-stratification adjustments may partially mitigate these biases, but careful domain knowledge is crucial.
What is the relationship between R-squared, adjusted R-squared, and the assumptions of linear regression?
R-squared measures the proportion of the variance in the target that is explained by the model’s predictors. Adjusted R-squared modifies this metric by penalizing models that add non-informative predictors. Both metrics can give you a sense of how well your model captures overall variation.
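Concretely, with n observations and p predictors, the adjusted value is computed as
\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
so the penalty grows with the number of predictors relative to the sample size, and adding a predictor raises the adjusted value only if it improves the fit by more than chance alone would.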
Violations of assumptions (e.g., nonlinearity, influential outliers, heteroscedasticity) can mask or inflate R-squared values. A model with a high R-squared might still be inappropriate if the errors exhibit patterns indicating unmodeled dynamics. Also, a high adjusted R-squared does not guarantee the assumptions are valid; it only suggests that extra predictors meaningfully contribute to explaining variance.
A potential pitfall is using R-squared or adjusted R-squared as the sole metric to evaluate model performance. They do not account for overfitting in complex transformations, nor do they warn you about multicollinearity or non-normal errors. You should always examine residual plots, leverage and influence diagnostics, and domain considerations.
What if the data have limited range or are heavily skewed, and how does that affect linear regression?
When data have limited range—for instance, a predictor that only varies over a very narrow interval—there can be difficulties separating the effect of that predictor from random fluctuations in the noise. The slope estimate for such a feature might be imprecise, leading to large standard errors. Similarly, if the target variable itself has limited range or is bounded (e.g., percentages between 0 and 1), standard linear regression assumptions about homoscedasticity and normal errors may fail dramatically.
Heavy skewness in either the predictors or the target can lead to non-normal residuals and potentially heteroscedastic errors. Common remedies include using log, square-root, or other transformations that make the distribution more symmetric and reduce the influence of extreme values. Alternatively, generalized linear models (GLMs) might be more appropriate for bounded outputs or counts.
A subtlety is that transformations like log(y) can alter the interpretation of the coefficients: a unit change in x might now be associated with a multiplicative change in y. Be sure to verify that the transformed model fits domain knowledge and to communicate the coefficient meaning clearly.
How can you test for and incorporate variable interactions to ensure your model is properly specified?
Even if individual predictors have a straightforward linear relationship with the target, their joint effect might not be purely additive. For instance, the effect of temperature on ice-cream sales might be stronger at higher levels of disposable income. You can test for these interactions by including cross-product terms x1*x2 in the model and checking their statistical significance or evaluating improvements in model fit metrics.
Pitfalls include blindly adding all pairwise interactions when you have many predictors, which can lead to combinatorial explosions in the number of features. This can make the model overfit and lose interpretability. Another subtlety is that a significant interaction in a small dataset may be sensitive to outliers, so you need to confirm that the data truly support the interaction rather than random noise. Robust or regularized methods can help control for spurious interactions when dealing with many predictors.
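As a minimal sketch, the statsmodels formula API makes it easy to add and test an interaction term; the variable names and simulated effect sizes below are assumptions chosen purely for illustration:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which temperature's effect on sales grows with income
rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({
    'temperature': rng.uniform(10, 35, n),
    'income': rng.uniform(20, 100, n),
})
df['sales'] = (5.0
               + 0.5 * df['temperature']
               + 0.2 * df['income']
               + 0.03 * df['temperature'] * df['income']
               + rng.normal(scale=5.0, size=n))

# 'temperature * income' expands to both main effects plus their interaction
main_only = smf.ols('sales ~ temperature + income', data=df).fit()
with_interaction = smf.ols('sales ~ temperature * income', data=df).fit()

print(with_interaction.params['temperature:income'])    # interaction estimate
print(with_interaction.pvalues['temperature:income'])   # its p-value
print(main_only.aic, with_interaction.aic)               # compare overall model fit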