ML Interview Q Series: In what way do we carry out the training process for a Logistic Regression model?
Comprehensive Explanation
Logistic Regression is fundamentally a linear model that maps a linear combination of features into a probability between 0 and 1. The key to training this model is finding parameters that minimize a suitable loss function, typically the binary cross-entropy (log loss). Below is the core mathematical expression for the logistic (sigmoid) function:
Here, z is the linear combination w^T x + b, where w is the weight vector, x is the feature vector, and b is the bias term. The sigmoid squashes real-valued inputs z into the range (0, 1), which is interpreted as the probability that the sample belongs to the positive class.
The cost function for Logistic Regression is the average cross-entropy loss:
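$$J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\Big]$$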
Here, N is the total number of training examples, y_i is the true label (0 or 1), and ŷ_i is the predicted probability for the i-th example given by the sigmoid function. The terms w and b in J(w, b) are the parameters (weights and bias) we want to learn.
The training process follows these steps:
1. Initialize w and b (often with zeros or small random values).
2. Compute the predicted probability ŷ_i for each training sample x_i using the sigmoid function.
3. Calculate the cross-entropy loss J(w, b).
4. Compute the gradient of J(w, b) with respect to each parameter, i.e., the partial derivatives of the cost with respect to w and b.
5. Update w and b using gradient-based optimization (gradient descent or variants such as stochastic gradient descent, mini-batch gradient descent, Adam, etc.).
6. Iterate until convergence, i.e., until changes in the loss become negligible or a predetermined number of epochs is reached.
Below is a simple Python code snippet showing a basic implementation of logistic regression training using gradient descent (for illustration):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss_and_gradients(X, y, w, b):
    m = X.shape[0]
    z = np.dot(X, w) + b
    predictions = sigmoid(z)
    # Binary cross-entropy (log loss) averaged over the m examples
    loss = -np.mean(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    # Gradients of the loss with respect to w and b
    dw = (1 / m) * np.dot(X.T, (predictions - y))
    db = (1 / m) * np.sum(predictions - y)
    return loss, dw, db

def logistic_regression_train(X, y, lr=0.01, epochs=1000):
    # Initialize weights and bias
    w = np.zeros(X.shape[1])
    b = 0.0
    for i in range(epochs):
        loss, dw, db = compute_loss_and_gradients(X, y, w, b)
        # Gradient descent update
        w -= lr * dw
        b -= lr * db
        if i % 100 == 0:
            print(f"Epoch {i}, Loss: {loss:.4f}")
    return w, b

# Example usage:
# X_train, y_train are assumed to be the training data and labels
# w, b = logistic_regression_train(X_train, y_train, lr=0.01, epochs=1000)
This code performs iterative updates to find w and b that minimize the cross-entropy loss.
What if the training data is not linearly separable?
Logistic Regression does not require the data to be linearly separable. It fits the best linear decision boundary it can by maximizing the likelihood of the observed labels, i.e., by minimizing the cross-entropy loss, which acts as a smooth surrogate for classification error. Most real-world datasets are not strictly separable, yet Logistic Regression still produces probability estimates that are often useful in practice.
Does Logistic Regression only work for binary classification?
Logistic Regression in its basic form handles binary classification. For multi-class problems, one can generalize via strategies like One-vs-Rest or One-vs-One. In One-vs-Rest, for example, you train a separate binary classifier for each class, treating that class as positive and all others as negative; the final prediction is typically the class whose classifier returns the highest predicted probability. Another common generalization is multinomial (softmax) Logistic Regression, which replaces the sigmoid with a softmax over all classes and is trained with the categorical cross-entropy loss.
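As an illustration of the One-vs-Rest strategy, here is a minimal sketch using scikit-learn's OneVsRestClassifier wrapper; the toy dataset and settings are assumptions for demonstration, not part of the original answer:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class dataset, purely for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One binary logistic regression per class, each treating "its" class as positive
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# Final prediction is the class whose classifier outputs the highest probability
print(ovr.predict_proba(X[:3]))
print(ovr.predict(X[:3]))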
Why do we use cross-entropy instead of mean squared error?
Cross-entropy measures the discrepancy between the model's predicted probability distribution and the true distribution given by the labels, which matches the probabilistic (maximum-likelihood) interpretation of Logistic Regression. Combined with the sigmoid output, it also yields a convex objective whose gradient with respect to the linear score is simply the prediction error, so learning does not stall. Mean squared error, by contrast, makes the objective non-convex and multiplies the gradient by an extra sigmoid-derivative term that vanishes when predictions saturate, leading to slower convergence and poorly calibrated probabilities.
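To make the gradient argument concrete, the per-example gradients with respect to the linear score z = w^T x + b (a standard derivation, not part of the original write-up) are:

$$\frac{\partial}{\partial z}\Big[-y\log\hat{y} - (1-y)\log(1-\hat{y})\Big] = \hat{y} - y, \qquad \frac{\partial}{\partial z}\Big[\tfrac{1}{2}(\hat{y} - y)^2\Big] = (\hat{y} - y)\,\hat{y}\,(1-\hat{y})$$

with ŷ = σ(z). The extra ŷ(1 − ŷ) factor in the MSE gradient goes to zero whenever the sigmoid saturates, even if the prediction is badly wrong, which is exactly the slow-learning behavior that cross-entropy avoids.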
How can we avoid overfitting?
Using regularization methods such as L2 (ridge) or L1 (lasso) helps penalize large weight values and prevent overfitting. Most implementations of Logistic Regression include a regularization parameter. Early stopping during iterative optimization is also helpful if the validation loss starts to increase. Cross-validation is used to tune hyperparameters like regularization strength so that we do not overfit to a single training set.
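For example, a minimal sketch of tuning the L2 regularization strength with cross-validation in scikit-learn (parameter names follow its standard API, where C is the inverse of the regularization strength; X_train and y_train are assumed to exist as before):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate inverse-regularization strengths (smaller C = stronger L2 penalty)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation to pick the regularization strength by log loss
grid = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                    param_grid, cv=5, scoring="neg_log_loss")
# grid.fit(X_train, y_train)
# grid.best_params_, grid.best_score_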
Does feature scaling matter?
While Logistic Regression does not intrinsically require features to be on similar scales to produce a correct model, feature scaling often improves training stability and speed: in gradient-based optimization, features with vastly different magnitudes can cause slow or unstable convergence. When regularization is used, unscaled features are also penalized unevenly, which is another reason to standardize. Subtracting the mean and dividing by the standard deviation of each feature is a common approach.
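A small sketch of this, assuming scikit-learn's StandardScaler and pipeline utilities (X_train, y_train, X_test are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler is fit on the training data only and applied consistently at predict time
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train)
# model.predict_proba(X_test)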
Can Logistic Regression be used for large datasets?
Yes. Logistic Regression can handle large datasets efficiently, especially when combined with stochastic or mini-batch gradient descent. Libraries like scikit-learn, TensorFlow, and PyTorch allow for training on large datasets by batching. If the dataset is extremely large, one can sample mini-batches or use online updating. However, for very high-dimensional data, the computational cost and memory requirements can become a concern, and more specialized optimizations or distributed computing might be necessary.
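As one possible sketch of mini-batch / out-of-core training, scikit-learn's SGDClassifier optimizes the logistic loss with stochastic gradient descent and supports incremental updates via partial_fit (batch_iterator below is a hypothetical data loader, not a real function):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; 'log_loss' is the logistic loss
# (named 'log' in older scikit-learn versions)
clf = SGDClassifier(loss="log_loss", penalty="l2", learning_rate="optimal")

# Out-of-core / mini-batch training: feed batches as they stream in.
# The full set of classes must be given on the first call to partial_fit.
# for X_batch, y_batch in batch_iterator():   # hypothetical data loader
#     clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))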