ML Interview Q Series: How does the Recursive Feature Elimination (RFE) work?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
Recursive Feature Elimination (RFE) is a technique that ranks features according to their importance to a particular predictive model and eliminates the least important features step by step. It repeatedly builds a model using the remaining features and discards those features deemed least valuable until it reaches a desired number of features or until some stopping criterion is met.
The procedure typically involves these steps in an iterative fashion:
Begin by training a model (for example, a linear or tree-based model) on the entire feature set. Use the model to compute some measure of feature importance, such as the absolute values of learned coefficients in a linear model or impurity-based importance in a tree-based model. Identify and remove the least important features (either one at a time or in groups). Retrain the model on the reduced set of features. Repeat this process until you reach a specified number of features or until you remove all but the most critical ones.
One central way to measure feature importance in a linear model is to look at the absolute magnitude of the learned weight for a given feature. If the model is defined as y = w^T x + b, each coefficient w_j indicates the effect of the j-th feature x_j on the prediction. Features with smaller absolute coefficients are generally considered less relevant. The measure of feature importance for the j-th feature in a linear model can be:
where w_j is the coefficient corresponding to the j-th feature. Larger absolute coefficients usually indicate higher importance; smaller absolute coefficients imply lower importance. After identifying those features with the smallest absolute coefficients, they are removed, and the model is retrained using the remaining features.
When using tree-based models for RFE, the feature importance can be derived from how frequently and how effectively a feature is used to split nodes within the trees (for instance, by total reduction in impurity or by average Gini impurity reduction).
The advantage of RFE is that it takes into account interactions between features in each iterative refit. By iteratively retraining, it dynamically adjusts which features appear most relevant at each stage, potentially capturing interactions that straightforward single-pass ranking might miss. However, it can also be computationally more expensive than one-shot feature selection methods, because it must repeatedly retrain a model.
Below is a short Python code snippet illustrating how you might use RFE with a linear model. This example uses scikit-learn:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
# Load sample data
data = load_boston()
X, y = data.data, data.target
# Define a linear regression model
model = LinearRegression()
# Define how many features you want to select, say 5
rfe = RFE(estimator=model, n_features_to_select=5)
# Fit RFE
rfe.fit(X, y)
# Check which features were selected
selected_features = rfe.support_
ranking_of_features = rfe.ranking_
print("Selected features:", selected_features)
print("Ranking of features:", ranking_of_features)
This code automatically carries out the iterative elimination of features, refitting the model each time until the chosen number of features remain.
Why It Works
RFE is grounded in the intuition that when you remove the least important features and retrain, the model can readjust to the reduced feature space. This readjustment makes the rank ordering of the remaining features more accurate. Iterative retraining can be particularly beneficial if certain combinations of features overshadow the usefulness of others. Removing a few features might make previously overshadowed features appear more significant upon subsequent retraining.
Potential Challenges
One major challenge is computational cost. Each iteration involves model training, so if you have a large dataset or a computationally heavy model (like certain ensembles or neural networks), RFE can become quite time-consuming. Another subtlety is correlated features, which can sometimes lead to surprising importance estimates. If multiple features are correlated, the model might place all or most of the emphasis on one of them, mistakenly labeling the correlated others as unimportant.
When to Use RFE
RFE is particularly suitable when you have a modest number of features (tens or a few hundreds) and can afford multiple model retrainings. It is less ideal for very large feature spaces unless you have extensive computational resources. It also works well when your modeling approach provides a clear metric of feature importance. Linear and tree-based models are commonly used with RFE, but you can apply RFE to any estimator that yields a clear criterion for feature selection.
Follow-up Questions
Could you compare RFE with forward selection and backward elimination?
Forward selection starts with no features and adds the most important features one by one. Backward elimination starts with all features and removes the least important features one by one. RFE is a kind of backward elimination, but it differs in that it re-estimates feature importances at each step after removing features. Standard backward elimination might remove the least significant feature based on one global pass of coefficient p-values (in a statistical sense) or single-round importance. RFE, however, recalculates importance in a new retrained model every time it discards features, which often leads to a more robust subset, especially when features interact.
What if features are highly correlated?
When features are highly correlated, RFE might arbitrarily select one correlated feature over the other, even though both convey similar information. This can be problematic if you want to understand which feature is most truly predictive. One approach is to group correlated features before RFE and only keep one representative from each group. Another strategy is to use a model that can handle correlation or shrink coefficients appropriately (for instance, Lasso or Ridge regression). That way, the importance measure given to correlated features might be more balanced.
Can RFE be used with non-linear models or neural networks?
Yes, in principle. As long as the model has some well-defined notion of feature importance, you can use it as part of the RFE loop. For neural networks, one might compute feature importances by approximating how the loss changes when each feature is disturbed (such as using some version of input perturbation or gradient-based saliency). However, the computational overhead can be quite large, because neural networks are expensive to train, and RFE’s iterative retraining can multiply that cost.
How do we choose how many features to remove in each iteration?
There is no one-size-fits-all rule. Some implementations of RFE remove one feature at a time, which is the most fine-grained approach but can be slow if you have many features. Others remove a fixed fraction or fixed number of features at each step to speed up the process. The trade-off is precision versus computational efficiency. Removing one feature at a time can yield more precise results but takes more iterations. Removing multiple features can speed things up but might inadvertently discard some useful features.
Does RFE guarantee finding the optimal subset of features?
No. RFE does not guarantee finding the absolute best subset. It uses a greedy strategy of removing what appears to be the least important feature at each step. There can be cases where removing a seemingly unimportant feature early might lead to discarding certain beneficial interactions. It typically yields a good subset in practice, but it is still a heuristic. More exhaustive methods, like searching all subsets, can yield better results but are computationally impractical for large feature sets.
How to handle the computational cost of RFE on large datasets?
A practical approach is to combine RFE with a faster preliminary feature selection. You could first use a filter method (like removing features with low variance or high missing rates) to eliminate obviously irrelevant features. Then apply RFE on the smaller set. Alternatively, you might reduce the dimensionality using a method like PCA or autoencoders before applying RFE. These approaches can cut down the search space without relying solely on a brute-force RFE iteration for all features.
Below are additional follow-up questions
How do you choose a stopping criterion for RFE in practice?
One common stopping criterion is to stop removing features when you reach a specific number of features (for example, you may want only the top 10 features). Another approach is to use a metric such as validation accuracy or cross-validated performance to decide when to halt. You can remove features iteratively until the metric on a validation set stops improving.
A subtlety to watch out for is performance fluctuations caused by random noise or a small validation set. If your validation set is not large enough, you might see small but misleading improvements or declines in the metric. One safeguard is to use cross-validation over multiple folds and average the metric to make a more robust decision about when to stop. Another pitfall is ignoring domain knowledge. Sometimes domain expertise can inform you about a minimum set of features that must be included or maximum features that are computationally feasible in deployment. Deciding on that boundary without domain input can lead to suboptimal results.
How does RFE handle noisy features or outliers?
In scenarios where some features are heavily influenced by outliers or contain a lot of noise, the model’s importance measure can be unstable. For example, if using a linear model, a single outlier could alter the model’s coefficients drastically and mislead RFE into removing features that are actually useful. Similarly, in a tree-based model, noisy features might occasionally yield misleading splits that look important at a given iteration but do not generalize.
One approach is to apply robust scaling or transformations to reduce the effect of extreme values. Another tactic is to use robust estimators or regularized models (like Ridge or Lasso) that can stabilize the importance estimates. If a significant fraction of your features are noisy, you might also consider applying initial dimensionality reduction (like PCA) or a preliminary filter-based technique (like removing features with extremely low variance) before running RFE.
How does RFE behave in a multi-class or multi-label classification scenario?
In multi-class classification, the model’s importance measure typically becomes an aggregate across multiple classes. For linear models, you might have separate coefficient vectors for each class (in a one-vs-rest scheme), and you could aggregate the absolute values of coefficients across classes to form a single importance measure per feature. Tree-based models often compute importance by summing how each feature contributes to impurity reduction for each class. RFE then proceeds as usual, discarding features with the lowest aggregated importance.
In a multi-label classification scenario (where each sample can belong to more than one class simultaneously), you might need a specialized approach to measure feature importance, because you have multiple sets of predictions. One strategy is to calculate feature importance for each label independently and then aggregate (for instance, by summing or averaging the importance scores across labels). This approach can cause confusion if some features are extremely important to one label and less so for another, so you need to define a clear global measure to combine the label-specific importances. A major edge case is if the model architecture changes for multi-label tasks; not all methods provide straightforward feature importances in multi-label contexts.
How do you integrate cross-validation with RFE to find an optimal subset?
A robust way to integrate cross-validation is to wrap the entire RFE process inside a cross-validation loop. For each split, you run RFE and determine the best subset of features, then measure the performance on the validation fold. You aggregate performance across folds to ensure that the selected feature set is consistently good.
However, this can become computationally intense because each fold involves multiple model trainings (one per RFE iteration). One common practical compromise is to do a single pass of RFE on the full training set, then evaluate the chosen subset of features on a cross-validation basis. While less rigorous, it is significantly more efficient. Another subtlety is ensuring that you do not leak information between folds. You should fit RFE (including its internal importance computation) strictly within the training portion of each fold, so you do not inadvertently tune feature selection using validation data.
Can RFE be used in an online or streaming data context?
RFE is not inherently designed for online or streaming scenarios, because it iteratively retrains on the entire dataset. In streaming contexts, new data arrives continuously, and you often prefer incremental model updates without retraining from scratch. Applying RFE naively would require you to re-run feature importance calculations and remove features repeatedly as new data arrives, which may be computationally prohibitive.
Still, you can approximate RFE in streaming contexts by periodically recalculating feature importance using a portion of the new data—perhaps the most recent samples or a rolling window—and then deciding whether to keep or remove features. Edge cases arise when the data distribution shifts over time. A feature that was unimportant in earlier phases might become relevant later, but a straightforward RFE approach only ever removes features. To handle distribution shifts, you might consider a strategy that allows re-introduction of features or uses an adaptive approach such as dynamic feature weighting.
How does RFE handle highly imbalanced classification problems?
In highly imbalanced classification settings, the model’s importance metric can be skewed toward the majority class. For instance, in a linear model, if most samples are from one class, the coefficients may be driven by those samples, undervaluing features that could be crucial for the minority class.
A workaround is to use class-weighted models or sampling techniques (like oversampling the minority class or undersampling the majority class) during each iteration of RFE. This ensures that the importance calculations are more balanced across classes. Another subtlety is choosing an appropriate performance metric. Instead of accuracy, you might rely on F1-score or AUROC if you are more concerned with minority class detection. If the chosen metric is sensitive to the minority class, the feature importance measure used in RFE will also shift to reflect minority-class critical features.
How is RFE different from L1-regularized feature selection?
L1-regularized linear models (Lasso for regression or L1-regularized logistic regression for classification) can shrink certain coefficients toward zero, effectively performing feature selection in a single pass. RFE, on the other hand, is a stepwise method that iteratively removes the least important features and retrains.
In principle, both approaches can yield similar subsets of features, but the paths they take to get there differ. L1 regularization applies a penalty on the magnitude of the coefficients, leading to sparse solutions automatically. RFE explicitly discards features based on importance at each iteration. One subtlety is that L1 regularization can drop useful features if they happen to be correlated with others that get assigned larger coefficients, especially for large penalty terms. RFE might handle correlated features differently, retraining at each step. If a correlated feature is removed, the remaining one may grow in importance upon retraining.
How do you ensure the interpretability of RFE’s final set of features?
When interpretability is a priority, it is helpful to track the feature elimination path. Keep a record of which feature was removed at each iteration and why. This sequence can help stakeholders understand how the final subset was reached. If you are using linear models, you can examine the coefficients after each elimination round. For tree-based models, you can look at how the feature importance distribution changes over iterations.
A hidden pitfall is that some features might be critical from a domain perspective, even if the model finds them less important. Automatically removing domain-specific features without consulting domain experts can hamper interpretability and reduce trust. Another subtlety arises if your final model is complex, such as a random forest or gradient boosting machine. Even if you have fewer features, you can still face interpretability challenges. Tools like SHAP or partial dependence plots can help with post-hoc explanations.
What are the potential performance trade-offs of removing multiple features at once during RFE?
Removing multiple features in a single iteration can speed up the process significantly, especially if you start with many features. However, it also introduces a risk of eliminating some features that might have proven useful if the model were retrained more gradually. Large batch removals can overlook subtle interactions between features, because the importance of a given feature may appear small in the presence of certain other features and may grow once those others are removed.
A common compromise is to remove a small fraction of the features each time—perhaps 5% or 10%—so the process is more efficient than removing just one feature per iteration but not so aggressive that you discard valuable features prematurely. It is a balancing act between computational feasibility and the accuracy of your feature selection process. A hidden gotcha can arise if you set the batch size too large early in the process. You might quickly remove the most obviously redundant features but also lose some that need a closer, iterative re-check once certain dominating features are out of the picture.
Could you discuss integrating RFE with ensemble methods like Random Forest or XGBoost?
Random Forest and XGBoost both provide feature importance metrics, typically based on the total reduction in impurity or gain across all splits involving a particular feature. This can be directly plugged into an RFE loop: build an ensemble on the current set of features, compute feature importance, remove the least important features, and repeat.
However, one subtlety is that many splits in these ensemble methods might be near-ties in terms of gain, especially if you have noisy or correlated features. This can make importance rankings less reliable in the early iterations. Another edge case is that some features might never be chosen for splits if you use certain hyperparameters (like max_depth constraints or subsampling). If a feature never appears in a split, its importance might be reported as zero, prompting immediate removal. But it may still be valuable when combined with certain other features. Tuning ensemble hyperparameters before applying RFE can mitigate this pitfall by ensuring that the model is flexible enough to properly evaluate feature utility.