ML Interview Q Series: How would you implement logistic regression in a fully vectorized manner?
Comprehensive Explanation
Vectorized implementations of logistic regression allow us to process all training examples in a single set of matrix operations, avoiding explicit loops for computing gradients and predictions for each training example. This approach leverages efficient linear algebra libraries and hardware optimizations.
Data Representation
Suppose we have m training examples, each with n features. We organize these into:
A design matrix X of shape (m x n). Each row corresponds to one example, each column to a particular feature.
A label vector y of shape (m x 1). Each entry y_i is 0 or 1, indicating the binary class label of the i-th example.
A parameter vector theta of shape (n x 1). These are the weights we want to learn.
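As a quick sanity check on these shapes, here is a minimal sketch with made-up values; the assertion confirms that X theta produces one score per example:

import numpy as np

m, n = 4, 2
X = np.random.rand(m, n)                   # design matrix, one row per example
y = np.random.randint(0, 2, size=(m, 1))   # binary labels, shape (m x 1)
theta = np.zeros((n, 1))                   # weights, shape (n x 1)
assert (X @ theta).shape == (m, 1)         # one linear score per example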
Hypothesis Function
We use the logistic (sigmoid) function to transform a linear combination of inputs into a probability estimate. In vectorized form, the hypothesis vector hat{y} (shape: m x 1) is the element-wise sigmoid of the matrix product X theta. This is typically a single function call in numerical libraries (e.g., torch.sigmoid or tf.sigmoid).
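In NumPy this is a single line; a minimal sketch with illustrative values for X and theta:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0.1, 0.3], [0.2, 0.4]])   # (m x n) with m=2, n=2
theta = np.array([[0.5], [-0.5]])        # (n x 1), illustrative weights
y_hat = sigmoid(X @ theta)               # (m x 1): all m predictions in one call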
Cost Function
The usual objective function for logistic regression is the average cross-entropy loss. In vector form, combining all training examples:

J(theta) = -(1/m) * sum_{i=1}^{m} [ y^{(i)} * log(hat{y}^{(i)}) + (1 - y^{(i)}) * log(1 - hat{y}^{(i)}) ]

Here:
m is the number of training examples.
y^{(i)} is the true label (0 or 1) for the i-th example.
hat{y}^{(i)} is the predicted probability for the i-th example, given by the sigmoid applied to the i-th row of X times theta.
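The loss vectorizes the same way as the predictions. A minimal NumPy sketch (the eps clipping is an illustrative safeguard against log(0), not part of the formula itself):

import numpy as np

def cross_entropy(y, y_pred, eps=1e-12):
    # Clip predictions away from exact 0/1 so np.log never sees 0.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    m = y.shape[0]
    return -(1 / m) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))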
Gradient Computation
The gradient of the cost function with respect to theta can be computed in a fully vectorized manner:

grad J(theta) = (1/m) * X^T (hat{y} - y)

Where:
hat{y} - y is an m x 1 vector containing the difference between the predicted probabilities and the actual labels for each example.
X^T is of shape (n x m), so the product X^T (hat{y} - y) has shape (n x 1), matching the dimension of theta.
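A common way to validate a hand-derived vectorized gradient is a finite-difference check. This sketch assumes a cost_fn closure mapping theta to the scalar loss (e.g., built from the cross_entropy helper above); numerical_grad is a hypothetical helper name:

import numpy as np

def numerical_grad(cost_fn, theta, eps=1e-6):
    # Central differences, one coordinate at a time (slow; for debugging only).
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[j] = eps
        grad.flat[j] = (cost_fn(theta + step) - cost_fn(theta - step)) / (2 * eps)
    return grad

# Usage idea: np.allclose(analytic_grad, numerical_grad(cost_fn, theta), atol=1e-6)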
Parameter Update Rule
In gradient descent, we iteratively update theta by subtracting the gradient scaled by the learning rate alpha:
theta := theta - alpha * (1/m) * X^T (hat{y} - y)
Implementation in Python (Illustrative Code)
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def logistic_regression_train(X, y, alpha=0.01, num_iters=1000):
m, n = X.shape
theta = np.zeros((n, 1)) # Initialize parameters
for _ in range(num_iters):
# Compute predictions
y_pred = sigmoid(np.dot(X, theta))
# Compute the gradient
grad = (1/m) * np.dot(X.T, (y_pred - y))
# Update parameters
theta = theta - alpha * grad
return theta
# Example usage:
# X is (m x n), y is (m x 1)
X = np.array([[0.1, 0.3],
[0.2, 0.4],
[0.3, 0.7],
[0.4, 0.9]])
y = np.array([[0],
[0],
[1],
[1]])
trained_theta = logistic_regression_train(X, y, alpha=0.1, num_iters=1000)
print("Learned Parameters:\n", trained_theta)
This code outlines a straightforward version of the vectorized training process. In practice, frameworks like PyTorch or TensorFlow provide built-in gradient computation and can handle this optimization under the hood.
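As a point of comparison, here is a sketch of the same model in PyTorch, where autograd derives the gradient instead of our hand-coded X^T (hat{y} - y); the tensor names and hyperparameters are illustrative:

import torch

# Same toy data as the NumPy example above, as float tensors.
X_t = torch.tensor([[0.1, 0.3], [0.2, 0.4], [0.3, 0.7], [0.4, 0.9]])
y_t = torch.tensor([[0.], [0.], [1.], [1.]])

theta_t = torch.zeros((2, 1), requires_grad=True)
optimizer = torch.optim.SGD([theta_t], lr=0.1)

for _ in range(1000):
    optimizer.zero_grad()
    y_pred = torch.sigmoid(X_t @ theta_t)
    loss = torch.nn.functional.binary_cross_entropy(y_pred, y_t)
    loss.backward()   # autograd produces the same gradient we derived by hand
    optimizer.step()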
Potential Follow-Up Questions
What happens if our data is not scaled properly?
Improperly scaled features can slow down gradient-based optimization. For example, if some features lie between 0 and 1 while others are in the thousands, the loss surface becomes elongated, and gradient descent oscillates along the steep directions unless the learning rate is kept very small. A common fix is feature normalization or standardization (e.g., transforming each feature to have mean 0 and standard deviation 1), as sketched below.
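A minimal standardization sketch (in practice the statistics would be computed on the training split and reused at test time; the values here are made up):

import numpy as np

X = np.array([[0.1,  300.0],
              [0.2,  450.0],
              [0.3,  700.0],
              [0.4,  900.0]])   # second feature is ~1000x the first

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma     # each column now has mean 0, std 1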
How does regularization fit into the vectorized formulation?
Regularization is typically added as an extra term in the cost function, for example an L2 penalty term lambda/(2m) sum(theta_j^2). In the vectorized gradient, we add (lambda/m)*theta for L2 (excluding the bias term if present). This ensures that weight values do not grow too large, helping control overfitting.
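Concretely, the training loop changes by a single term in the gradient. A sketch of an L2-regularized variant of the earlier function (lam is an assumed name for the regularization strength; since this formulation has no explicit bias column, nothing is excluded here):

import numpy as np

def logistic_regression_train_l2(X, y, lam=0.1, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iters):
        y_pred = 1 / (1 + np.exp(-(X @ theta)))
        # Plain gradient plus the L2 penalty's contribution (lambda/m) * theta
        grad = (1 / m) * X.T @ (y_pred - y) + (lam / m) * theta
        theta -= alpha * grad
    return theta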
Why might we prefer batch gradient descent over stochastic methods, or vice versa?
Batch Gradient Descent (BGD) processes all examples in one pass to compute an update. It can converge smoothly but might be slow for very large datasets.
Stochastic Gradient Descent (SGD) updates parameters using a single example at a time, giving faster initial convergence but more fluctuations in parameter updates.
Mini-batch Gradient Descent strikes a balance between the two by using batches (e.g., 32, 64, or 128 examples). It usually offers better computational efficiency on modern hardware (due to vectorized mini-batch operations) and more stable convergence than pure SGD.
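A mini-batch variant of the earlier training loop might look like the following sketch; batch_size and the per-epoch shuffling scheme are illustrative choices:

import numpy as np

def minibatch_train(X, y, alpha=0.1, batch_size=32, num_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_epochs):
        perm = rng.permutation(m)              # reshuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]            # one vectorized mini-batch
            y_pred = 1 / (1 + np.exp(-(Xb @ theta)))
            theta -= alpha * (1 / len(idx)) * Xb.T @ (y_pred - yb)
    return theta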
Can logistic regression model non-linear decision boundaries?
By itself, logistic regression can only model linear decision boundaries in the original feature space. However, if we augment X with polynomial features or other transformations (like kernel mappings), logistic regression can capture non-linear boundaries. This approach is sometimes referred to as feature engineering for logistic regression.
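For example, augmenting a single feature with its square lets the same linear model fit a boundary that is curved in the original space (a minimal sketch; scikit-learn's PolynomialFeatures automates this for higher degrees and multiple features):

import numpy as np

x = np.array([[-2.0], [-0.5], [0.5], [2.0]])  # one original feature
X_poly = np.hstack([x, x ** 2])               # columns: x and x^2
# Training logistic regression on X_poly yields a boundary that is
# linear in (x, x^2) but quadratic in the original x.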
How do we handle class imbalance?
When y contains many more 0’s than 1’s (or vice versa), standard logistic regression training might bias the predictor toward the majority class. Methods include:
Collecting more data from the minority class if possible.
Using class weights in the cost function to penalize mistakes on the minority class more heavily.
Oversampling the minority class or undersampling the majority class, though this may introduce overfitting or reduce the data size.
All these techniques help adjust the learning process to pay more attention to the minority class, producing a more balanced classification outcome.
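Class weights drop straight into the vectorized gradient: each example's error term is scaled before the matrix product. A sketch under assumed names (w_pos, w_neg; a common heuristic sets them inversely proportional to class frequency):

import numpy as np

def weighted_grad(X, y, theta, w_pos=3.0, w_neg=1.0):
    m = X.shape[0]
    y_pred = 1 / (1 + np.exp(-(X @ theta)))
    w = np.where(y == 1, w_pos, w_neg)        # per-example weight, shape (m x 1)
    return (1 / m) * X.T @ (w * (y_pred - y))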