ML Interview Q Series: Explain the Stepwise Regression technique

Apr 07, 2025


Comprehensive Explanation

Stepwise Regression is a variable-selection technique often used in multiple linear regression to find a subset of predictors that gives a good model according to some predefined criterion (often p-values, F-statistics, or an information criterion like AIC or BIC). It adds or removes predictor variables iteratively, evaluating model performance at each step, until a stopping condition is met. Because the search is greedy, the result is not guaranteed to be the best possible subset. There are three main flavors:


Forward Selection: Start with no predictors and progressively add predictors that improve the model fit the most, stopping when no additional predictor contributes significantly to the model.
Backward Elimination: Start with all predictors and progressively remove those that contribute the least to the model, stopping when all remaining predictors meet the chosen significance criterion.
Bidirectional Elimination: Combine forward addition and backward removal in the same procedure, continuously adjusting the set of predictors.

The fundamental idea is to balance model complexity and explanatory power, avoiding an overly large model (which risks overfitting) or an overly small model (which may underfit and fail to capture important relationships).

Underlying Rationale and Statistical Tests

To evaluate the importance of adding or removing a particular predictor, stepwise methods typically rely on statistics like the partial F-test, t-tests on regression coefficients, or penalized metrics such as AIC and BIC. One common approach is the partial F-test, used to compare two nested models (a reduced model without a certain predictor versus a full model with it). The partial F statistic can be represented by:

$$
F = \frac{(RSS_{reduced} - RSS_{full}) / (df_{reduced} - df_{full})}{RSS_{full} / df_{full}}
$$

where:

RSS_reduced is the residual sum of squares for the reduced model (the model without the predictor in question).
RSS_full is the residual sum of squares for the full model (the model including the predictor).
df_reduced is the residual degrees of freedom of the reduced model (number of observations minus number of estimated coefficients).
df_full is the residual degrees of freedom of the full model, defined the same way.

If the F statistic is large enough (i.e., the p-value is below a threshold), it suggests that adding (or keeping) the predictor significantly improves model fit.
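As a rough illustration of the partial F-test described above, the sketch below computes the statistic with plain NumPy on synthetic data. The helper names (`rss_and_resid_df`, `partial_f`) and the simulated variables are assumptions made for this example, and the design matrices are assumed to have full column rank.

```python
import numpy as np

def rss_and_resid_df(X, y):
    """Fit OLS by least squares; return (RSS, residual degrees of freedom)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Residual df = observations minus estimated coefficients (full-rank X assumed)
    return float(resid @ resid), X.shape[0] - X.shape[1]

def partial_f(X_reduced, X_full, y):
    """Partial F statistic for the extra columns that X_full adds to X_reduced."""
    rss_r, df_r = rss_and_resid_df(X_reduced, y)
    rss_f, df_f = rss_and_resid_df(X_full, y)
    return ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)

# Synthetic data: y depends on x1 and x2, while x3 is pure noise
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_reduced = np.column_stack([ones, x1])
f_x2 = partial_f(X_reduced, np.column_stack([ones, x1, x2]), y)
f_x3 = partial_f(X_reduced, np.column_stack([ones, x1, x3]), y)
print(f"F for adding x2: {f_x2:.1f}")
print(f"F for adding x3: {f_x3:.1f}")
```

The statistic for the relevant predictor should come out far larger than the one for the noise predictor. Turning F into a p-value requires the F distribution with (df_reduced - df_full, df_full) degrees of freedom, which this sketch leaves out.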

When you are adding variables (forward selection), you check which variable’s inclusion yields the biggest improvement in model fit. When you are removing variables (backward elimination), you typically remove whichever predictor has the least significant contribution to the model based on a threshold for the F-test or t-test. In bidirectional methods, you alternately add and remove predictors at each step.

Detailed Mechanics

Forward Selection:
1. Begin with an empty model.
2. Among all predictors not yet included, find the one that leads to the greatest improvement in the model fit if added.
3. If that improvement is statistically significant (based on your chosen criterion), add the predictor.
4. Repeat until no remaining predictor significantly improves the model.
Backward Elimination:
1. Begin with all candidate predictors.
2. Check the least significant predictor (often judged by the highest p-value or smallest absolute t-value).
3. If it fails a significance criterion, remove it from the model.
4. Repeat until all remaining predictors have significant contributions.
Bidirectional Selection:
1. Start as in forward selection, but after adding a predictor, check whether any predictors in the model should be removed (backward check).
2. Continue until no more variables can be added or removed based on significance thresholds.

In practice, you can also use metrics like Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to decide whether adding or removing a predictor helps or hurts. The process iterates until a stopping point is reached, usually when no addition or removal passes the threshold criterion. Because the search is greedy, that stopping point is not guaranteed to be the best possible subset.
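The backward elimination steps can be sketched with AIC as the criterion instead of p-values. This is a minimal NumPy version under stated assumptions: `gaussian_aic` uses the least-squares form of AIC (correct only up to an additive constant, which does not affect comparisons), an intercept is always kept, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit up to an additive constant: n*ln(RSS/n) + 2k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def design(X, cols):
    """Intercept column plus the chosen predictor columns."""
    return np.hstack([np.ones((X.shape[0], 1)), X[:, np.asarray(cols, dtype=int)]])

def backward_eliminate_aic(X, y, names):
    """Drop one predictor at a time for as long as doing so lowers AIC."""
    kept = list(range(X.shape[1]))
    best = gaussian_aic(design(X, kept), y)
    while kept:
        # Score every model that omits exactly one currently kept predictor
        trials = [(gaussian_aic(design(X, [c for c in kept if c != j]), y), j)
                  for j in kept]
        score, drop = min(trials)
        if score >= best:      # no removal improves AIC: stop
            break
        best = score
        kept.remove(drop)
    return [names[j] for j in kept]

# Synthetic data: only x0 and x1 drive the response
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=300)
names = [f"x{j}" for j in range(5)]
selected = backward_eliminate_aic(X, y, names)
print("Kept predictors:", selected)
```

Swapping `2 * k` for `np.log(n) * k` in `gaussian_aic` would give a BIC-based variant of the same loop.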

Practical Considerations and Pitfalls

Multicollinearity: Highly correlated predictors can lead to unstable coefficient estimates. In stepwise regression, one correlated predictor might overshadow another, potentially excluding some variables that still hold relevant information.
Overfitting: If there are many predictors, stepwise regression (especially if purely based on p-values) can lead to overfitting. Cross-validation can help mitigate this by evaluating how well the selected model generalizes to unseen data.
Model Interpretability: Selecting variables solely on significance can distort interpretation, because the meaning of a coefficient depends on which other predictors are in the model. A predictor might not appear significant alone but become significant in combination with others.
Order Sensitivity: In forward selection, once you choose a predictor, you rarely revisit that choice if you keep a simple forward approach. This can lead to suboptimal sets of predictors, since a different order of selection might lead to a better final subset.
Computational Complexity: Stepwise procedures require iterative fitting of regression models. With a very large number of predictors, it can become computationally expensive, though typically less expensive than exhaustively searching all subsets.

Example Python Implementation

Below is a simple illustration of how one might implement a forward stepwise selection procedure in Python using statsmodels:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise_selection(data, response, predictors=None, p_value_threshold=0.05):
    """
    Forward stepwise selection using AIC as the selection criterion.

    data: pandas DataFrame containing the features and the response
    response: name of the response variable (string)
    predictors: list of predictor variable names (optional). If None, uses all columns except response
    p_value_threshold: not used by this AIC-based version; kept so existing calls still work
    """
    if predictors is None:
        predictors = [col for col in data.columns if col != response]

    y = data[response]
    selected_predictors = []
    remaining_predictors = list(predictors)

    # Baseline: AIC of the intercept-only model, so that the first predictor
    # is added only if it actually beats a model with no predictors.
    current_score = sm.OLS(y, np.ones((len(y), 1))).fit().aic

    while remaining_predictors:
        scores_with_candidates = []
        for candidate in remaining_predictors:
            # Fit the current model plus one candidate predictor
            X = sm.add_constant(data[selected_predictors + [candidate]])
            model = sm.OLS(y, X).fit()
            scores_with_candidates.append((model.aic, candidate))

        # Lower AIC is better, so take the minimum
        best_score, best_candidate = min(scores_with_candidates)

        if best_score < current_score:
            current_score = best_score
            selected_predictors.append(best_candidate)
            remaining_predictors.remove(best_candidate)
        else:
            # No candidate improves AIC: stop
            break

    return selected_predictors

# Example usage:
# df is a pandas DataFrame with columns ['y', 'x1', 'x2', 'x3', ...]
# chosen_predictors = forward_stepwise_selection(df, 'y')
# print("Selected Predictors:", chosen_predictors)

In this snippet, we use the AIC criterion to decide whether a newly added predictor improves the model. We sort by AIC in ascending order (lower is better). Forward selection stops when no predictor’s addition yields a better AIC than the current score.

What are the key differences among Forward Selection, Backward Elimination, and Bidirectional Elimination?

Forward selection starts with an empty set of predictors, adding one at a time. Backward elimination starts with the full set of candidate predictors, removing one at a time. Bidirectional combines these strategies by checking at each iteration whether any included predictors should be removed (backward step) and whether any excluded predictors should be added (forward step). The key difference is in the direction in which you search the predictor space, and therefore bidirectional can sometimes find a better subset more flexibly than purely forward or purely backward approaches.
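To make the bidirectional variant concrete, here is a small NumPy sketch in which every iteration considers all single additions and all single removals and takes the move that lowers AIC the most. The function names, the `max_steps` guard, and the synthetic data are assumptions for illustration; AIC is computed only up to an additive constant.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit up to an additive constant: n*ln(RSS/n) + 2k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def score(X, y, cols):
    """AIC of the model with an intercept plus the given predictor columns."""
    chosen = X[:, np.asarray(cols, dtype=int)]
    return gaussian_aic(np.hstack([np.ones((len(y), 1)), chosen]), y)

def bidirectional_aic(X, y, max_steps=100):
    selected = []
    best = score(X, y, selected)            # start from the intercept-only model
    for _ in range(max_steps):
        moves = []
        for j in range(X.shape[1]):
            if j in selected:
                trial = [c for c in selected if c != j]   # backward move: drop j
            else:
                trial = selected + [j]                     # forward move: add j
            moves.append((score(X, y, trial), trial))
        new_best, new_set = min(moves, key=lambda m: m[0])
        if new_best >= best:                # no single move improves AIC
            break
        best, selected = new_best, new_set
    return sorted(selected)

# Synthetic data: only columns 0 and 1 drive the response
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=300)
selected = bidirectional_aic(X, y)
print("Selected column indices:", selected)
```

Because each accepted move strictly lowers AIC, the loop cannot cycle; the step cap is only a safety guard.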

Why might Stepwise Regression be problematic with correlated features?

When features are correlated, stepwise selection can arbitrarily choose one feature over another, even if the other also contains valuable predictive information. This can produce:

Unstable selection and coefficients: Minor variations in the data can lead to big changes in which features are kept and in the coefficients estimated for them.
Potential omission of relevant predictors: Some correlated variables may never enter the model because a correlated predictor is already in the set and is deemed sufficient by the selection criterion.
Over or underestimation of effect sizes: The presence or absence of correlated features can inflate or deflate the estimated coefficients.

It is often better to inspect correlation structures beforehand, or consider dimensionality-reduction methods (like Principal Component Analysis) or regularization-based approaches (like Ridge or Lasso) that can handle correlated features more robustly.
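One way to inspect the correlation structure before running stepwise selection is to look at the correlation matrix together with variance inflation factors (VIF). VIF is a standard collinearity diagnostic that the discussion above does not name, so treat this NumPy sketch, its `vif` helper, and the synthetic data as illustrative assumptions.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        centered = X[:, j] - X[:, j].mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

print("Correlation matrix:\n", np.round(np.corrcoef(X, rowvar=False), 2))
vifs = vif(X)
print("VIF per predictor:", np.round(vifs, 1))
```

Large VIF values for a pair of predictors suggest that stepwise selection may pick between them more or less arbitrarily.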

How does Stepwise Regression compare to Best-Subset Selection?

Best-subset selection tries all possible subsets of predictors, evaluating each model based on a criterion (like AIC, BIC, or adjusted R^2), and then picks the best-scoring model. It is more exhaustive than stepwise methods. However, best-subset selection can be computationally infeasible for a large number of predictors (there are 2^p possible subsets if p is the number of predictors). Stepwise methods, on the other hand, iteratively search through the space of candidate models in a greedy manner, making them less computationally expensive but not guaranteed to find the absolute best subset.
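For a small number of predictors, best-subset selection can be written as an exhaustive loop, which makes the 2^p cost visible. The sketch below is an assumption-laden illustration: it scores each subset with a least-squares AIC (up to an additive constant), always includes an intercept, and uses hypothetical function names and synthetic data.

```python
from itertools import combinations

import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit up to an additive constant: n*ln(RSS/n) + 2k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def best_subset_aic(X, y):
    """Score every subset of columns; return (best column indices, number of fits)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    best_score, best_cols, n_fits = np.inf, (), 0
    for size in range(p + 1):
        for cols in combinations(range(p), size):
            chosen = X[:, np.asarray(cols, dtype=int)]
            current = gaussian_aic(np.hstack([ones, chosen]), y)
            n_fits += 1
            if current < best_score:
                best_score, best_cols = current, cols
    return list(best_cols), n_fits

# Synthetic data: only columns 0 and 1 drive the response
rng = np.random.default_rng(0)
p = 6
X = rng.normal(size=(300, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=300)

best_cols, n_fits = best_subset_aic(X, y)
print("Best subset:", best_cols)
print("Models fitted:", n_fits, "out of 2**p =", 2 ** p)
# Greedy forward selection evaluates at most p + (p-1) + ... + 1 candidate models
print("Forward selection upper bound:", p * (p + 1) // 2)
```

With p = 6 the exhaustive search fits 64 models against at most 21 candidate fits for forward selection; the gap grows exponentially with p.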

If you have a large dataset with many predictors, how do you handle Stepwise Regression?

In large-scale settings:

Computational Efficiency: A naive stepwise approach (forward or backward) can still be time-consuming if p (number of predictors) is very large. You may need more optimized search strategies, approximate criteria (e.g., partial correlation filtering), or stricter significance thresholds that end the search sooner.
Regularization: Methods like Lasso (L1-regularized regression) perform implicit variable selection and scale more gracefully to high-dimensional data. They often serve as more robust alternatives to stepwise selection and are straightforward to implement.
Screening: You can use domain knowledge or quick univariate screening to narrow down your candidate predictors. For instance, remove any predictor that has near-zero variance, or that shows no correlation to the target in preliminary tests.
Dimensionality Reduction: Combining stepwise regression with dimensionality-reduction techniques (like PCA) can help if the primary goal is prediction rather than interpretability.

In summary, stepwise regression can be a quick heuristic to select or eliminate predictors, but it has limitations, especially with high-dimensional, correlated data. Understanding these limitations and using appropriate validation techniques or regularization methods is crucial to ensure a stable and generalizable final model.

Below are additional follow-up questions

How can outliers impact the process and results of Stepwise Regression?

Outliers can have an exaggerated influence on which predictors are included or excluded from the model, because least-squares regression is quite sensitive to extreme values. If a single data point exerts a large influence on the slope estimates and significance levels, Stepwise Regression might interpret that as strong evidence in favor of including certain predictors or removing others. In real-world situations where measurement errors or anomalies are possible, this can cause the model to be shaped by random noise rather than true signal.

One subtle pitfall is that you might repeatedly “add” a predictor because it fits a handful of outliers perfectly. Once that predictor is in the model, it might mask the significance of another variable, because partial correlation structures change when new predictors enter. Similarly, if you perform backward elimination in the presence of outliers, variables that are genuinely important could be dropped because the outliers diminish their statistical significance. To deal with this:

Investigate potential outliers with diagnostic plots or leverage/influence metrics (for example, Cook’s distance).
Consider robust regression (e.g., using robust loss functions) or data transformation (e.g., log transformations) to reduce the effect of extreme observations.
Cross-validate models with and without suspected outliers to see if predictor selection remains stable.
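As a sketch of the influence diagnostics mentioned above, Cook's distance can be computed directly from the residuals and leverages of an OLS fit. The `cooks_distance` helper and the synthetic data with one planted outlier are assumptions for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance per observation for an OLS fit; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix H = X @ pinv(X)
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    s2 = resid @ resid / (n - p)            # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1.0 - h) ** 2

# Synthetic data on a clean line, plus one high-leverage outlier at the end
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
x[-1], y[-1] = 20.0, 0.0                    # far from the other x values and off the line

X = np.column_stack([np.ones(n), x])
d = cooks_distance(X, y)
print("Most influential observation:", int(np.argmax(d)))
print("Its Cook's distance:", round(float(d.max()), 2))
```

Observations with unusually large values are candidates for a closer look before trusting which predictors the stepwise procedure selects.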

In what scenarios might we prefer Stepwise Regression over regularization methods like Lasso or Ridge?

Stepwise Regression might be appealing in scenarios where:

Interpretability is paramount, and you want a relatively small set of predictors with clear significance tests. Stepwise methods yield a final model with a smaller subset of variables that is easier to read from a classical hypothesis-testing perspective, although p-values computed after selection tend to be optimistic because they ignore the search that produced the model.
The number of predictors is small or moderate relative to the sample size, making it computationally feasible to iterate through potential subsets.
You have strong domain knowledge that lets you watch each step of selection carefully. You might use a domain-guided approach, adding or removing variables while manually verifying the result’s plausibility at each step.

However, for very high-dimensional problems, Lasso often becomes more practical. Lasso’s L1 penalty continuously shrinks coefficients, forcing some to zero and effectively performing automatic feature selection. Ridge regression is also helpful when many correlated features are present because it stabilizes coefficient estimates through shrinkage. These approaches typically:

Are more stable when features are correlated (although correlation can still be tricky for Lasso).
Scale to large numbers of predictors more effectively.
Provide a single, more elegant approach to variable selection by tuning a regularization parameter with cross-validation.

Hence, Stepwise Regression is occasionally preferred for interpretability or for instructive, iterative variable selection, but regularization methods usually work better in large-scale or collinear scenarios.
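To show how an L1 penalty forces some coefficients to exactly zero, here is a bare-bones Lasso fitted by cyclic coordinate descent in NumPy. It is a teaching sketch under stated assumptions, not a replacement for a library implementation: the columns of X are standardized, y is centered, and the penalty `alpha`, the function names, and the synthetic data are chosen for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/(2n))*||y - Xb||^2 + alpha*||b||_1 by cyclic coordinate descent.

    Assumes each column of X has mean 0 and variance 1, and that y is centred.
    """
    n, p = X.shape
    b = np.zeros(p)
    resid = y.copy()                        # equals y - X @ b while b is all zeros
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]         # take feature j's contribution back out
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, alpha)   # valid because (1/n) * x_j'x_j = 1
            resid -= X[:, j] * b[j]
    return b

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(0)
n, p = 200, 5
X_raw = rng.normal(size=(n, p))
y_raw = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.5 * rng.normal(size=n)

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # standardise columns
y = y_raw - y_raw.mean()                                # centre the response

b = lasso_cd(X, y, alpha=0.3)
print("Lasso coefficients:", np.round(b, 3))
```

The nonzero coefficients are shrunk toward zero relative to least squares, which is the price paid for the automatic selection. In practice `alpha` would be tuned by cross-validation.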

How do we decide on the significance threshold or stopping criterion in Stepwise Regression?

Choosing the threshold is typically a balance between false positives (including predictors that do not truly matter) and false negatives (excluding predictors that are genuinely important). The traditional choice of 0.05 for p-values is somewhat arbitrary and may not always be the best for your domain or dataset. In practice:

Lowering the threshold (e.g., 0.01) can reduce false positives but risks excluding important predictors that barely miss the stricter cutoff.
Increasing the threshold (e.g., 0.10) might include more predictors, but that can lead to overfitting and inflated Type I error.
Information Criteria (AIC/BIC) provide an alternative to p-values, balancing model fit and complexity. BIC is stricter and often yields smaller models, while AIC is more lenient.
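The difference in strictness between the two criteria can be read off their standard definitions. Here k is the number of estimated parameters, n the number of observations, and \hat{L} the maximized likelihood; these symbols are introduced for this note.

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
$$

For a least-squares fit with Gaussian errors, -2\ln\hat{L} equals n\ln(RSS/n) plus a constant, so each additional predictor must reduce n\ln(RSS/n) by more than 2 under AIC and by more than \ln n under BIC. Since \ln n > 2 whenever n \ge 8, BIC applies the heavier penalty on all but the smallest datasets.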

A common pitfall is to keep adjusting the threshold to see if a variable “just misses” significance. This is akin to p-hacking and can invalidate statistical inferences. It is also a best practice to confirm your final model with cross-validation or hold-out sets, ensuring that decisions made based on thresholds still generalize to unseen data.

How do we handle interactions among predictors when performing Stepwise Regression?

Interactions arise when the effect of one predictor depends on the value of another. Stepwise methods can technically accommodate interactions, but you have to explicitly include interaction terms as potential predictors. For example, if you suspect x1 interacts with x2, you add x1*x2 as a separate predictor in your candidate list. Then the stepwise procedure can decide whether to include or remove that interaction term based on its incremental contribution.
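A minimal sketch of how interaction terms might be added to the candidate list before running a stepwise search, using NumPy. The helper name `add_pairwise_interactions` and the toy data are assumptions for illustration.

```python
from itertools import combinations

import numpy as np

def add_pairwise_interactions(X, names):
    """Append every pairwise product column x_i*x_j to X, with matching names."""
    cols, new_names = [X], list(names)
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])
        new_names.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), new_names

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
names = ["x1", "x2", "x3", "x4"]

X_int, names_int = add_pairwise_interactions(X, names)
print("Candidate predictors:", names_int)
print("Number of candidates:", X_int.shape[1])
```

With p main effects there are p(p-1)/2 pairwise products, so four predictors already become ten candidates.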

Pitfalls include:

Explosive growth in candidate predictors: If you have many predictors, the number of possible pairwise (and higher-order) interaction terms can become huge, making stepwise search very cumbersome.
Interpretation complexity: Even if an interaction term is selected, interpreting it in a classical sense can become difficult, especially if higher-order interactions are significant (like x1*x2*x3).
Hierarchy principle: Some statisticians prefer to include lower-order terms whenever an interaction is included (for example, keep x1 and x2 if x1*x2 is in the model), but a purely stepwise approach might drop x1 or x2 while retaining x1*x2, leading to interpretive oddities.

A typical recommendation is to base your choice of interactions on domain knowledge, or to rely on more flexible modeling approaches (like tree-based methods) if you anticipate complex interactions but want an automated approach.

How can we verify the stability of the variable selection obtained via Stepwise Regression?

To assess stability, you can perform:

Cross-Validation: Split the data into multiple folds. Perform the stepwise procedure on each training fold. Compare the selected models across folds. If the selected sets of predictors vary drastically, it indicates instability.
Bootstrap: Resample your dataset with replacement multiple times, apply the stepwise procedure to each bootstrap sample, and track how often each predictor is selected. If certain predictors are consistently selected, that suggests a stable choice.
Perturbation Tests: Introduce slight random noise in your training set (e.g., slight jitter in some observations) and see if the same subset of predictors emerges.

An unstable model might flip between including or excluding certain predictors when the data changes slightly, which is usually a sign that either the variables are highly correlated, or the signal from those predictors is not strong enough to be confidently detected. In real-world settings, interpret these results to decide whether you need more data, a different modeling approach, or more robust variable selection methods.
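The bootstrap check can be sketched as follows: rerun a forward selection on resampled data and record how often each predictor is chosen. This NumPy version assumes an AIC-based forward search (AIC computed up to an additive constant); the function names, the number of resamples, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of an OLS fit up to an additive constant: n*ln(RSS/n) + 2k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_select_aic(X, y):
    """Greedy forward selection; returns the chosen column indices."""
    n, p = X.shape
    ones = np.ones((n, 1))
    selected = []
    best = gaussian_aic(ones, y)            # intercept-only baseline
    while len(selected) < p:
        trials = [(gaussian_aic(np.hstack([ones, X[:, selected + [j]]]), y), j)
                  for j in range(p) if j not in selected]
        score, j_best = min(trials)
        if score >= best:
            break
        best = score
        selected.append(j_best)
    return selected

def selection_frequency(X, y, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each predictor is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample rows with replacement
        for j in forward_select_aic(X[idx], y[idx]):
            counts[j] += 1
    return counts / n_boot

# Synthetic data: columns 0 and 1 carry signal, columns 2 and 3 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)

freq = selection_frequency(X, y)
print("Selection frequency per predictor:", freq)
```

Predictors with frequencies near 1 are stable choices; those that hover around the middle are the ones whose inclusion depends on the particular sample.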

What if Stepwise Regression selects zero predictors?

It is possible that none of the candidate predictors pass the chosen significance threshold when using a purely statistical criterion. This usually indicates:

Weak relationship between the predictors and the target. Your dataset might not contain strong signals, or the effect sizes are too small relative to your sample size.
Overly stringent threshold that excludes variables that might still provide value in practice.
Data quality issues such as a small sample size, measurement errors, or incorrectly specified relationships (e.g., ignoring non-linearities or interactions).

In such situations, you may need to:

Consider transformations or non-linear models. If your relationship is not linear, standard linear stepwise regression can miss real signals.
Gather more data to improve statistical power.
Relax your threshold or use an information criterion to check if a small improvement from a predictor is still beneficial.

How can domain knowledge be incorporated effectively into Stepwise Regression?

Domain knowledge often informs decisions about:

Variable Inclusion/Exclusion: You might have strong prior beliefs that certain variables are crucial and should always remain in the model, or that certain variables are irrelevant. You can “lock in” or “lock out” specific predictors, preventing stepwise from inadvertently dropping or adding them.
Interaction Terms: You might know that certain features in your domain have synergistic or multiplicative relationships. Stepwise might only detect this if you explicitly add interaction terms, so domain expertise helps identify plausible interactions.
Threshold Tuning: If domain literature suggests that a predictor’s effect is meaningful even if the p-value is moderately above 0.05, you can adjust the significance threshold accordingly. Conversely, if you’re dealing with a critical safety-related model, you might want a very strict threshold to avoid false positives.

A common pitfall is blindly applying a stepwise algorithm without leveraging expert input. As a result, important variables can be dropped, or spurious predictors can be chosen, leading to models that are statistically well-fitting but practically suspect. Incorporating domain knowledge at each step helps ensure that the final model not only fits the data but also makes sense in context.

Rohan's Bytes
