ML Interview Q Series: How does Recursive Feature Elimination (RFE) work?

Apr 07, 2025

Comprehensive Explanation

Recursive Feature Elimination (RFE) is a technique that ranks features according to their importance to a particular predictive model and eliminates the least important features step by step. It repeatedly builds a model using the remaining features and discards those features deemed least valuable until it reaches a desired number of features or until some stopping criterion is met.

The procedure typically involves these steps in an iterative fashion:

1. Train a model (for example, a linear or tree-based model) on the entire feature set.
2. Use the model to compute a measure of feature importance, such as the absolute values of learned coefficients in a linear model or impurity-based importance in a tree-based model.
3. Identify and remove the least important features (either one at a time or in groups).
4. Retrain the model on the reduced set of features.
5. Repeat steps 2 to 4 until you reach a specified number of features or until only the most critical ones remain.
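The loop described above can be sketched by hand in a few lines. This is a minimal illustration, not scikit-learn's implementation: the helper name `manual_rfe`, the synthetic data, and the choice to drop exactly one feature per round are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the feature with the smallest |coefficient| each round."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    eliminated = []                      # column indices in order of removal
    while len(remaining) > n_features_to_select:
        # Refit on the current subset, then rank by absolute coefficient
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


# Synthetic data (assumption): only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

kept, dropped = manual_rfe(X, y, n_features_to_select=2)
print("kept columns:", kept)
print("elimination order:", dropped)
```

Because the model is refit after every removal, the coefficients used for ranking always reflect the current feature subset rather than the original one.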

One common way to measure feature importance in a linear model is the absolute magnitude of the learned weight for each feature. If the model is defined as y = w^T x + b, each coefficient w_j indicates the effect of the j-th feature x_j on the prediction. The importance of the j-th feature can then be taken as:

$$\text{importance}(x_j) = |w_j|$$

where w_j is the coefficient corresponding to the j-th feature. Larger absolute coefficients usually indicate higher importance; smaller ones imply lower importance. This comparison is only meaningful when the features are on comparable scales, so standardize the inputs first. The features with the smallest absolute coefficients are removed, and the model is retrained on the remaining features.
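Coefficient magnitudes depend on the units of each feature, which matters when they are used as importance scores. The sketch below uses synthetic data (an assumption for illustration) in which two features contribute equally to the target, but one is expressed in units 1000 times larger. It compares the absolute coefficients before and after standardizing with `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): both features matter equally
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# Express the second feature in different units (same information)
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000.0

# |w_j| on the raw inputs: the rescaled feature looks almost irrelevant
raw_importance = np.abs(LinearRegression().fit(X_rescaled, y).coef_)

# |w_j| after standardizing: both features look similarly important
X_std = StandardScaler().fit_transform(X_rescaled)
std_importance = np.abs(LinearRegression().fit(X_std, y).coef_)

print("raw |w|:         ", raw_importance)
print("standardized |w|:", std_importance)
```

On unscaled inputs, an RFE run driven by |w_j| would likely discard the second feature first, even though it carries as much signal as the first.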

When using tree-based models for RFE, feature importance is usually derived from how much the splits on a feature reduce impurity (for instance, the total reduction in Gini impurity for classification or in variance for regression, averaged over the trees in an ensemble).
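As a rough illustration of the tree-based variant, the sketch below runs scikit-learn's `RFE` with a `RandomForestRegressor`, whose `feature_importances_` attribute supplies the impurity-based scores. The synthetic nonlinear target, the forest size, and the random seeds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (assumption): the target depends nonlinearly on columns 0 and 1 only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 5.0 * np.sin(X[:, 0]) + 5.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# RFE reads forest.feature_importances_ after each refit and drops one feature per round
rfe = RFE(estimator=forest, n_features_to_select=2, step=1)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)
print("selected columns:", selected)
print("ranking (1 = kept):", rfe.ranking_)
print("importances of the surviving features:", rfe.estimator_.feature_importances_)
```

A linear model would struggle to rank the squared term here, which is one reason to pick an estimator whose importance measure matches the kind of relationship you expect in the data.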

The advantage of RFE is that it takes into account interactions between features in each iterative refit. By iteratively retraining, it dynamically adjusts which features appear most relevant at each stage, potentially capturing interactions that straightforward single-pass ranking might miss. However, it can also be computationally more expensive than one-shot feature selection methods, because it must repeatedly retrain a model.
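To get a feel for the extra cost, the sketch below counts model fits for a basic RFE loop. The helper `rfe_fit_count` is hypothetical, and it assumes one fit per elimination round plus one final fit on the surviving features; a single-pass ranking would need only one fit.

```python
import math


def rfe_fit_count(n_features, n_features_to_select, step=1):
    """Hypothetical helper: fits performed by a basic RFE loop.

    Assumes `step` features are removed per round (fewer in the last
    round if needed) and one extra fit on the final subset.
    """
    elimination_rounds = math.ceil((n_features - n_features_to_select) / step)
    return elimination_rounds + 1


# 100 features down to 10, removing one feature per round
fits_one_at_a_time = rfe_fit_count(100, 10, step=1)
# Same target, removing 10 features per round
fits_in_groups = rfe_fit_count(100, 10, step=10)

print("fits with step=1: ", fits_one_at_a_time)
print("fits with step=10:", fits_in_groups)
```

Removing features in groups cuts the number of refits sharply, at the price of a coarser ranking within each removed group.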

Below is a short Python code snippet illustrating how you might use RFE with a linear model. This example uses scikit-learn:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Load a sample regression dataset (10 features, bundled with scikit-learn)
data = load_diabetes()
X, y = data.data, data.target

# Define a linear regression model
model = LinearRegression()

# Define how many features you want to select, say 5
rfe = RFE(estimator=model, n_features_to_select=5)

# Fit RFE
rfe.fit(X, y)

# Check which features were selected
selected_features = rfe.support_
ranking_of_features = rfe.ranking_

print("Selected features:", selected_features)
print("Ranking of features:", ranking_of_features)

This code automatically carries out the iterative elimination of features, refitting the model each time until the chosen number of features remains.
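If you would rather not fix the number of features by hand, scikit-learn also provides `RFECV`, which runs the same elimination inside cross-validation and keeps the subset size with the best score. The sketch below is one possible setup; the diabetes dataset, 5-fold cross-validation, and R² scoring are assumptions for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Bundled regression dataset with named features
data = load_diabetes()
X, y = data.data, data.target

# Cross-validated RFE: tries every subset size from all features down to 1
rfecv = RFECV(
    estimator=LinearRegression(),
    step=1,
    cv=5,
    scoring="r2",
    min_features_to_select=1,
)
rfecv.fit(X, y)

# Map the boolean mask back to feature names
chosen = [name for name, keep in zip(data.feature_names, rfecv.support_) if keep]
print("number of features chosen:", rfecv.n_features_)
print("chosen features:", chosen)
```

This multiplies the training cost by the number of folds, so it is best suited to small or medium feature sets.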

Why It Works

RFE is grounded in the intuition that when you remove the least important features and retrain, the model can readjust to the reduced feature space. This readjustment makes the rank ordering of the remaining features more accurate. Iterative retraining can be particularly beneficial if certain combinations of features overshadow the usefulness of others. Removing a few features might make previously overshadowed features appear more significant upon subsequent retraining.

Potential Challenges

One major challenge is computational cost. Each iteration involves model training, so if you have a large dataset or a computationally heavy model (like certain ensembles or neural networks), RFE can become quite time-consuming. Another subtlety is correlated features, which can lead to surprising importance estimates. If multiple features are correlated, the model might place all or most of the weight on one of them, so the others look unimportant and may be eliminated even though they carry the same signal.
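The correlated-feature effect is easy to reproduce in an extreme case. The sketch below is an illustration under assumptions: it uses an L1-regularized `Lasso` (not a model discussed above) on synthetic data where the second column is an exact copy of the first. How the weight is divided depends on the model and solver; here it is expected to land almost entirely on one column.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): column 1 duplicates column 0 exactly
rng = np.random.default_rng(0)
x0 = rng.normal(size=300)
X = np.column_stack([x0, x0.copy()])
y = 2.0 * x0 + rng.normal(scale=0.1, size=300)

# Both columns carry identical information, but the fitted weights need not be equal
coef = Lasso(alpha=0.1).fit(X, y).coef_
importance = np.abs(coef)

print("coefficients:", coef)
print("importance |w|:", importance)
```

An importance-based ranking would treat the near-zero column as the weakest feature, although removing either copy leaves the model's predictive power unchanged. In RFE this is usually harmless for accuracy but can mislead anyone who reads the ranking as a statement about which variables matter.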

When to Use RFE

RFE is particularly suitable when you have a modest number of features (tens to a few hundred) and can afford multiple model retrainings. It is less ideal for very large feature spaces unless you have extensive computational resources. It also works well when your modeling approach provides a clear metric of feature importance. Linear and tree-based models are commonly used with RFE, but you can apply RFE to any estimator that exposes a per-feature importance score after fitting (in scikit-learn, typically a coef_ or feature_importances_ attribute).
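When the estimator is wrapped in a pipeline, the importance scores live on an inner step rather than on the pipeline itself. A minimal sketch, assuming scikit-learn's `importance_getter` parameter of `RFE` and a pipeline whose step names (`scale`, `clf`) are chosen here for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled binary classification dataset with 30 features
X, y = load_breast_cancer(return_X_y=True)

# Scaling happens inside the pipeline, so it is redone on every reduced feature set
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tell RFE where to find the importance scores inside the fitted pipeline
rfe = RFE(
    estimator=pipe,
    n_features_to_select=5,
    importance_getter="named_steps.clf.coef_",
)
rfe.fit(X, y)

print("selected mask:", rfe.support_)
print("ranking:", rfe.ranking_)
```

Keeping the scaler inside the estimator means the coefficients are compared on standardized inputs at every elimination round.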

Follow-up Questions