What is Multicollinearity and How to Treat It?
Multicollinearity occurs when the independent variables in a regression analysis are highly correlated with one another, which affects the overall interpretation of the results. It inflates the uncertainty of the coefficient estimates and makes the p-values unreliable for identifying which independent variables are significant. As a result, we cannot cleanly examine the individual contribution of each independent variable to the dependent variable.
The absence of multicollinearity is one of the key assumptions that should be checked to get a reliable estimate from any regression model.
Why is Multicollinearity a Problem?
It would be hard for you to choose the list of significant variables for the model if the model gives you different results every time.
Coefficient estimates would not be stable, and it would be hard for you to interpret the model. In other words, you cannot tell how much the output will change if one of your predictors changes by 1 unit.
The unstable nature of the model may cause overfitting. If you apply the model to another sample of data, the accuracy may drop significantly compared to the accuracy on your training dataset.
Depending on the situation, slight or moderate collinearity may not be a problem for your model. However, it is strongly advised to address the issue if severe collinearity exists (e.g. a correlation > 0.8 between two variables, or a variance inflation factor (VIF) > 20), as in the sketch below.
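To make those thresholds concrete, here is a minimal sketch of how one might check pairwise correlations and VIF values with pandas and statsmodels. The simulated data, column names, and cutoffs are illustrative assumptions, not fixed rules.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Simulated predictors: x2 is deliberately built to be highly correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.2, size=500)
x3 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations: values above ~0.8 would flag a potential problem
print(X.corr().round(2))

# VIF per predictor, computed on a design matrix that includes an intercept
design = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(design.values, i) for i in range(1, design.shape[1])],
    index=X.columns,
)
print(vif.round(1))
```

In this simulated example the x1/x2 pair would typically show a correlation well above 0.8 and VIF values in the severe range, while x3 stays close to 1.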
When NOT fixing Multicollinearity is OK
It depends on the primary goal of the regression model.
The degree of multicollinearity greatly impacts the p-values and coefficients, but not the predictions and goodness-of-fit.
So, if your goal is to make predictions and not necessarily to understand the significance of the individual independent variables, then it is not mandatory to fix the multicollinearity issue.
Let's understand this statement:
"The degree of multicollinearity greatly impacts the p-values and coefficients, but not the predictions and goodness-of-fit."
p-value: a p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A high p-value for a predictor means that predictor is not significant.
Predictions: These are the model's outputs or the dependent variable values the model computes.
Goodness-of-fit test: This measures how well a statistical model fits a set of observations.
Now let's look at the two ways multicollinearity impacts a model:
Impact on p-values and coefficients:
Impact on predictions and goodness-of-fit
Impact on p-values and coefficients:
When multicollinearity exists, it can inflate the variance of the regression coefficients. This means the estimates of the coefficients can be unstable and vary significantly for small changes in the data, which could lead to wider confidence intervals.
Let's understand why this is so.
When we run a multiple regression analysis, the model tries to estimate the impact of each independent variable on the dependent variable while holding the other independent variables constant. This task becomes challenging when multicollinearity exists.
In reality, the regression coefficients are calculated using an algorithm that minimizes the residual sum of squares (RSS). When variables are highly correlated, there are many possible combinations of coefficients that could result in an almost equally low RSS. In other words, there's a large number of "equally good" models with slightly different interpretations, which leads to high variability in the estimated coefficients.
This is what we mean by "inflating the variance of the regression coefficients". The coefficients become sensitive to small changes in the model or the data, which results in unstable estimates.
This instability is reflected in higher standard errors for the coefficients, leading to wider confidence intervals. Consequently, the p-values for these coefficients might be larger than expected, causing you to fail to reject the null hypothesis (that these coefficients are zero), even when these variables are truly predictive. This is why multicollinearity can make the results of a regression analysis misleading.
(Here, note that the p-value for each variable tests the null hypothesis that the variable's coefficient is zero; a high p-value for a single predictor means it is not significant.)
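The variance inflation can be seen directly in a small simulation. The sketch below is my own illustrative setup, not taken from any particular dataset: it repeatedly draws samples with the two predictors either uncorrelated or strongly correlated and compares how much the estimated coefficient on x1 bounces around.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

def coef_sd(rho, n=200, n_rep=500):
    """Std dev of the estimated x1 coefficient across repeated samples,
    where x1 and x2 are drawn with correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(n_rep):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=n)
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        estimates.append(fit.params[1])   # coefficient on x1
    return np.std(estimates)

print("sd of beta_1 with rho = 0.0 :", round(coef_sd(0.0), 3))
print("sd of beta_1 with rho = 0.95:", round(coef_sd(0.95), 3))
```

On the variance scale the theoretical inflation factor is 1/(1 - rho^2), so with rho = 0.95 the spread of the estimate comes out roughly three times larger than with uncorrelated predictors. That is exactly the "inflated variance" described above.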
Further, with multicollinearity, you can get a situation where the individual p-values are not significant, even though the overall F-test is significant.
This happens because the p-value for each variable tests the null hypothesis that the variable's coefficient is zero, given that all the other variables are in the model. A high p-value for a single predictor means it is not significant. If two variables are highly correlated, each one will not contribute much to the explanation of the dependent variable over and above what the other variable contributes. This situation leads to high p-values for these coefficients.
Let's explain this: "with multicollinearity, you can get a situation where the individual p-values are not significant, even though the overall F-test is significant."
When we talk about p-values in the context of regression analysis, we're referring to the p-values associated with each individual predictor or coefficient in the model. A low p-value (typically below 0.05) for a coefficient suggests that this predictor is statistically significant in explaining the variation in the target variable, given the presence of other predictors in the model.
An F-test, on the other hand, is used to determine whether a group of variables as a whole significantly affects the dependent variable. This is typically used to test the overall significance of a regression model.
Now, when two or more of your predictors are highly correlated, these predictors might individually have high p-values (i.e., they are not significant), but the model as a whole (tested by the F-statistic) could still be significant.
Why does this happen?
This is because the predictors, while highly correlated, jointly have a strong predictive power over the target variable. However, because of the correlation between them, the model has a hard time attributing this predictive power to each of them individually, hence the high individual p-values.
In other words, the combined predictive power of these correlated variables is significant, but when we attempt to isolate their individual contributions, it becomes unclear (due to the shared variance between them) which predictor is truly responsible for the prediction, thus leading to higher p-values for the individual predictors.
This can lead to confusing situations where the overall model appears to be significant (based on the F-statistic), but none of the individual predictors appear to be significant (based on their p-values). This is a symptom of multicollinearity in the model and is a cue that you may need to address this issue for the model to be reliable and interpretable.
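A small simulated example (an assumed setup of my own, not from the discussion above) makes this concrete: with two nearly identical predictors, the overall F-test typically comes out highly significant while neither individual coefficient does.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)        # x2 is almost a copy of x1
y = x1 + x2 + rng.normal(scale=2.0, size=n)     # y depends on both, plus noise

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("Overall F-test p-value:   ", fit.f_pvalue)     # usually far below 0.05
print("Individual coef p-values: ", fit.pvalues[1:])  # often both well above 0.05
```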
Impact on predictions and goodness-of-fit:
Despite these issues, the overall prediction of the model can remain accurate. This is because multicollinearity affects the specific parameters of the model (i.e., the coefficients and their interpretation), but the combination of predictors can still produce a good fit to the data.
Similarly, a goodness-of-fit statistic, like R-squared, only cares about the total variance explained by the predictors, not how that total is divided up among the predictors. Therefore, even in the presence of multicollinearity, the R-squared value can still be high if the correlated predictors jointly explain the response variable well.
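The bootstrap sketch below (again an illustrative, assumed setup) shows the flip side: two refits of the same model can disagree noticeably on the individual coefficients yet agree closely on R-squared and on the predictions themselves.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # highly correlated pair
y = x1 + x2 + rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2]))

def refit(seed):
    """Fit OLS on a bootstrap resample of the rows."""
    idx = np.random.default_rng(seed).integers(0, n, size=n)
    return sm.OLS(y[idx], X[idx]).fit()

fit_a, fit_b = refit(10), refit(20)
print("coefficients A:", fit_a.params.round(2))
print("coefficients B:", fit_b.params.round(2))
print("R-squared A/B :", round(fit_a.rsquared, 3), "/", round(fit_b.rsquared, 3))

# Predictions on the original data track each other closely, relative to the
# overall spread of y, even though the coefficient estimates differ.
gap = np.abs(fit_a.predict(X) - fit_b.predict(X))
print("mean |prediction gap|:", round(gap.mean(), 3), "   std of y:", round(y.std(), 3))
```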
In conclusion, while multicollinearity can have a significant impact on the interpretation of a model (p-values and coefficients), it does not necessarily harm the model's predictive power or its overall fit to the data. However, this doesn't mean that multicollinearity should be ignored, as it weakens the statistical power of the model and can lead to misleading interpretations. It is typically better to address multicollinearity when identified, using methods such as removing variables, combining variables, or using techniques like PCA or Ridge Regression.
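As one example of the remedies listed above, here is a brief Ridge Regression sketch using scikit-learn. The simulated data and the alpha value are arbitrary illustrative choices; in practice alpha would be tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear predictors
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L2 penalty shrinks the coefficients of the correlated pair toward more
# moderate, similar values, trading a little bias for much lower variance.
print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```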