ML Interview Q Series: Why is Logistic Regression frequently referred to as a linear model?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
Logistic Regression is often categorized as a linear model because, fundamentally, it applies a linear combination of the input features before passing that sum through a non-linear activation (the logistic or sigmoid function). Even though the final output is a non-linear function of the inputs (because of the sigmoid transformation), the core relationship between the features and the “log odds” is linear.
The primary equation in Logistic Regression can be seen in its log-odds form:
log(p / (1 - p)) = w^T x + b
Here, p is the predicted probability of the positive class, w is the weight vector, x is the feature vector, and b is the bias term. This expression shows that the log-odds (the logarithm of the ratio of the probability of belonging to the positive class to the probability of belonging to the negative class) is modeled as a linear function of the inputs. Despite the final prediction being squeezed through the sigmoid function, the model still belongs to the broader family of linear models because of this linearity in log-odds space.
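As a quick sanity check, the small NumPy sketch below (with arbitrary, made-up weights w, bias b, and inputs X) computes probabilities through the sigmoid and then inverts it: the recovered log-odds is exactly the linear combination w^T x + b.
import numpy as np
# Hypothetical weights, bias, and inputs chosen purely for illustration
w = np.array([0.8, -1.2])
b = 0.5
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]])
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
p = sigmoid(X @ w + b)          # predicted probabilities
log_odds = np.log(p / (1 - p))  # invert the sigmoid
print(log_odds)                 # matches X @ w + b up to floating-point error
print(X @ w + b)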
When making classification decisions (e.g., classifying an input as positive or negative), the decision boundary occurs where p = 0.5. Since the sigmoid outputs 0.5 exactly when its argument is zero, substituting p = 0.5 and solving gives w^T x + b = 0, so the boundary in feature space is a hyperplane described by a linear equation. This underlies the “linear” classification boundary characteristic of Logistic Regression.
Another way to see this linearity is through interpretation of the coefficients. Each weight in the model corresponds to the degree and direction of influence that a particular feature has on the log-odds of the target class. The linear combination of these inputs is then mapped by the logistic function into a probability. Hence, the model is considered linear in parameters w and b, even though the output is ultimately passed through a non-linear sigmoid.
This perspective has several important implications for how Logistic Regression behaves, how readily it can be extended (e.g., with L1 or L2 regularization penalties applied directly to the weights), and why it has certain biases and limitations when the true relationship between features and target is significantly non-linear (unless we engineer non-linear features or apply methods such as polynomial expansions).
Relation to Linear Regression
In traditional linear regression, the model predicts a continuous value y = w^T x + b. Logistic Regression modifies this by interpreting w^T x + b as the log-odds and then uses the logistic (sigmoid) function to map it to a probability between 0 and 1. The linear relationship, however, remains at the core of how the features combine.
A Quick Illustration in Code
import numpy as np
from sklearn.linear_model import LogisticRegression
# Suppose we have a simple dataset:
X = np.array([[0], [1], [2], [3], [4], [5]])  # already shaped (n_samples, 1), so no reshape is needed
y = np.array([0, 0, 0, 1, 1, 1])
# Fit a logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Check the learned coefficient and intercept
print("Coefficient (w):", model.coef_)
print("Intercept (b):", model.intercept_)
# Predictions
predictions = model.predict([[1.5], [2.5], [3.5]])
print("Predicted classes:", predictions)
The model here learns a single weight (coefficient) and a bias (intercept) that enter the prediction linearly. These parameters describe how a change in the single input feature shifts the log-odds of being in the positive class. The logistic function simply transforms that linear combination into a probability.
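To connect this back to the log-odds view, the sketch below refits the same toy data and uses scikit-learn's decision_function, which returns the raw linear score w*x + b. The predicted probability is just the sigmoid of that score, and the decision boundary sits where the score is zero, i.e., at x = -b/w.
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[0], [1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)
w = model.coef_[0, 0]
b = model.intercept_[0]
z = model.decision_function(X)                   # the linear score w*x + b
p = model.predict_proba(X)[:, 1]                 # P(y = 1 | x)
print(np.allclose(p, 1.0 / (1.0 + np.exp(-z))))  # True: probability = sigmoid(linear score)
print("Decision boundary at x =", -b / w)        # where w*x + b = 0, i.e., p = 0.5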
Potential Follow-up Questions
What makes the model “linear” when the output is passed through a non-linear activation?
The linearity of Logistic Regression resides on the scale of the link function: the log-odds of the positive class is a linear function of the input features. Even though the final mapping to a probability is a sigmoid (which is non-linear), the underlying combination of weights and features is a linear equation, and it is that equation which forms the decision boundary.
How does the decision boundary end up being linear?
The decision boundary is found where p = 0.5. Substituting this into the sigmoid function implies w^T x + b = 0. That is a linear equation in terms of the input features x. Hence, the boundary that separates one class from another is a straight line (in 2D) or a hyperplane (in higher dimensions).
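As a concrete sketch in two dimensions (using a synthetic blob dataset purely for illustration), the fitted coefficients directly give the equation of that separating line:
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# Two synthetic 2D clusters
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
model = LogisticRegression().fit(X, y)
w1, w2 = model.coef_[0]
b = model.intercept_[0]
# The boundary satisfies w1*x1 + w2*x2 + b = 0, i.e., x2 = -(w1*x1 + b) / w2
print(f"Boundary: x2 = {-w1 / w2:.3f} * x1 + {-b / w2:.3f}")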
How does regularization affect the linearity of Logistic Regression?
Regularization (L1 or L2) adds a penalty on the magnitude of the weights to control model complexity but does not introduce non-linear interactions among features by itself. The model remains linear in the sense that features are still combined using a dot product. If we want to introduce explicit non-linearity, feature engineering or kernel methods would be required.
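In scikit-learn, for example, the penalty is chosen through constructor arguments, and the prediction rule keeps the same linear-score-through-sigmoid form:
from sklearn.linear_model import LogisticRegression
# L2 is the default penalty; C is the inverse of the regularization strength
l2_model = LogisticRegression(penalty="l2", C=1.0)
# L1 requires a solver that supports it, such as liblinear or saga
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# Either way, predictions still come from a single linear score w^T x + b passed through the sigmoid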
Why not simply use linear regression for classification?
Linear Regression is not ideal for classification because:
Its output is unbounded, making it poorly suited for a probability interpretation.
Outliers in the feature space can heavily influence the line of best fit.
It does not inherently provide a proper decision boundary in a probabilistic sense.
Logistic Regression addresses these issues by constraining outputs between 0 and 1 and providing a well-defined probabilistic interpretation through the log-odds.
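A small sketch (on a made-up dataset with one extreme feature value) makes the first point concrete: Linear Regression's outputs can fall outside the [0, 1] range, while Logistic Regression always returns valid probabilities.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
X = np.array([[0], [1], [2], [3], [4], [5], [50]], dtype=float)  # note the extreme value at x = 50
y = np.array([0, 0, 0, 1, 1, 1, 1])
lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)
X_new = np.array([[10.0], [50.0]])
print("Linear regression outputs:", lin.predict(X_new))               # can exceed 1, so not valid probabilities
print("Logistic regression P(y=1):", log.predict_proba(X_new)[:, 1])  # always strictly between 0 and 1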
Do we need feature scaling for Logistic Regression?
Feature scaling can help with optimization speed and convergence, especially when using gradient-based solvers. It does not affect the linearity of the model but can stabilize training so that no single feature dominates due to its scale alone.
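With scikit-learn, a common pattern is to put a scaler in front of the model in a pipeline; X_train and y_train below are placeholders for your own data.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Standardize the features, then fit the (still linear) logistic model
clf = make_pipeline(StandardScaler(), LogisticRegression())
# clf.fit(X_train, y_train)  # X_train / y_train are placeholders for your own data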
How do we interpret the coefficients of Logistic Regression?
Each coefficient is interpreted as the effect of its corresponding feature on the log-odds of the target outcome. Specifically, a one-unit increase in a given feature corresponds to an additive change in the log-odds equal to that feature’s learned coefficient, holding other features constant. Equivalently, exponentiating the coefficient gives the multiplicative effect of that one-unit increase on the odds of the outcome (the odds ratio).
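Reusing the toy model from the code illustration above, exponentiating the learned coefficient gives that odds ratio directly:
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[0], [1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)
w = model.coef_[0, 0]
print("Additive change in log-odds per unit of x:", w)
print("Multiplicative change in the odds per unit of x (odds ratio):", np.exp(w))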
Does class imbalance break the linear assumption?
Class imbalance does not break linearity. However, extremely imbalanced datasets can lead to poor parameter estimates and biased decision boundaries. Techniques such as oversampling, undersampling, or appropriate class-weight adjustments help the model focus on the minority class without altering the linear form of the model.
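In scikit-learn, for example, class weights are passed to the constructor; they only reweight the loss, and the hypothesis is still the sigmoid of w^T x + b.
from sklearn.linear_model import LogisticRegression
# 'balanced' reweights classes inversely proportional to their frequencies
model = LogisticRegression(class_weight="balanced")
# An explicit weighting is also possible, e.g., making errors on class 1 count ten times as much
# model = LogisticRegression(class_weight={0: 1.0, 1: 10.0})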
What if the true decision boundary is highly non-linear?
Standard Logistic Regression might struggle when the real-world relationship between features and the target is significantly non-linear. You can introduce polynomial or other non-linear transformations of the features (manual feature engineering), which keeps the model linear in its parameters while allowing a curved decision boundary in the original feature space, or switch to models like Neural Networks, Random Forests, or Kernel SVMs, which go beyond the linear structure of Logistic Regression entirely.
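As a sketch of the feature-engineering route (X_train and y_train are placeholders for real data), a degree-2 polynomial expansion looks like this in scikit-learn:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# Expand (x1, x2) into (x1, x2, x1^2, x1*x2, x2^2); the model stays linear in these expanded features
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
# clf.fit(X_train, y_train)  # X_train / y_train are placeholders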