ML Interview Q Series: What makes the binary cross-entropy loss used in logistic regression convex in its parameters?
Comprehensive Explanation
The binary cross-entropy (also referred to as the negative log-likelihood) in logistic regression is well known to be a convex function of the model parameters. This property guarantees a single global minimum that standard gradient-based optimization methods can find. Below is the central formula for the average binary cross-entropy loss in logistic regression, shown for a dataset of size N:

L(w) = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \ln\big(\sigma(w^\top x_i)\big) + (1 - y_i) \ln\big(1 - \sigma(w^\top x_i)\big) \Big]
Here:
L(w) is the average loss over all training examples.
y_i is the binary label for the ith data point, where y_i in {0, 1}.
x_i is the feature vector for the ith data point.
w is the parameter vector we aim to learn.
sigma(z) = 1/(1 + e^(-z)) is the logistic (sigmoid) function, mapping any real number to the (0,1) interval.
Why This Loss is Convex
Convexity with respect to the parameters means that if you draw the graph of L(w) as a function of w, it has a "bowl-shaped" form with no local minima other than the global one. The reason the above expression is convex in w can be understood by looking at how the sigmoid function and the negative log operation compose together:
The sigmoid function transforms a linear function w^T x_i into a probability, ensuring that it stays between 0 and 1.
Taking the negative log of this probability in the arrangement y_i ln(sigma(w^T x_i)) + (1 - y_i) ln(1 - sigma(w^T x_i)) yields, for every example, a term whose Hessian with respect to w is positive semidefinite. In more intuitive terms, the curvature introduced by -log(·) on top of the sigmoid combines in such a way that the overall expression is convex in w, as the short derivation below makes precise.
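To make the composition argument concrete, here is a short derivation in the notation already introduced (a sketch; sigma, w, x_i, and y_i are exactly as defined above). Writing \ell_i(w) for the i-th term inside the average:

\ell_i(w) \;=\; -\Big[ y_i \ln\big(\sigma(w^\top x_i)\big) + (1 - y_i) \ln\big(1 - \sigma(w^\top x_i)\big) \Big] \;=\; \ln\!\big(1 + e^{w^\top x_i}\big) \;-\; y_i\, w^\top x_i .

The first term is the softplus (a log-sum-exp, hence convex) of the affine function w^T x_i, and composing a convex function with an affine map preserves convexity; the second term is linear in w, hence also convex. The average of convex functions is convex, so L(w) is convex.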
Mathematical Insight (Second Derivative Argument)
If one computes the Hessian (the matrix of second partial derivatives) of the binary cross-entropy with respect to w, it turns out to be positive semidefinite, which is exactly the second-order condition for convexity. Although the logistic (sigmoid) function itself is not convex in z, the way it enters the negative log-likelihood sum produces a function that is convex in the parameter space: each training example contributes a non-negatively weighted outer product of its feature vector to the Hessian.
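Here is a sketch of that calculation in the notation above, writing \sigma_i = \sigma(w^\top x_i) for brevity:

\nabla_w L(w) \;=\; \frac{1}{N}\sum_{i=1}^{N} \big(\sigma_i - y_i\big)\, x_i,
\qquad
\nabla_w^2 L(w) \;=\; \frac{1}{N}\sum_{i=1}^{N} \sigma_i \big(1 - \sigma_i\big)\, x_i x_i^\top .

For any vector v, we have v^\top \nabla_w^2 L(w)\, v = \frac{1}{N}\sum_{i} \sigma_i (1 - \sigma_i) (x_i^\top v)^2 \ge 0 because \sigma_i (1 - \sigma_i) \ge 0, so the Hessian is positive semidefinite everywhere and L(w) is convex.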
Practical Significance of Convexity
Because the binary cross-entropy objective is convex in w, you can reliably apply gradient-based methods (e.g., gradient descent, stochastic gradient descent, or quasi-Newton methods such as L-BFGS). With a sensible learning rate and other hyperparameters, these methods converge toward the global minimum; there is no risk of being trapped in a spurious local minimum.
Example Python Snippet
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy_loss(w, X, y):
    # X is of shape (N, D), y is of shape (N,)
    # w is of shape (D,)
    # N is the number of samples, D is the dimensionality
    predictions = sigmoid(X.dot(w))
    # To avoid log(0), clip predictions
    eps = 1e-15
    predictions = np.clip(predictions, eps, 1 - eps)
    loss = -np.mean(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    return loss
This function computes the exact objective described above. You can then take gradients with respect to w (using an automatic differentiation library or manual derivation) and update w until convergence.
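For instance, a minimal full-batch gradient-descent loop built on the analytic gradient (1/N) X^T (sigmoid(Xw) - y) might look like the sketch below; it reuses the sigmoid helper from the snippet above, and the function names, learning rate, and iteration count are illustrative choices rather than prescriptions.

def binary_cross_entropy_grad(w, X, y):
    # Gradient of the average loss: (1/N) * X^T (sigmoid(Xw) - y)
    N = X.shape[0]
    predictions = sigmoid(X.dot(w))
    return X.T.dot(predictions - y) / N

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # Plain full-batch gradient descent; convexity means there is no
    # spurious local minimum for this loop to get stuck in.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= lr * binary_cross_entropy_grad(w, X, y)
    return w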
Are There Any Cases Where It Might Not Appear Convex?
Despite the theoretical guarantee of convexity in w, certain numerical issues, poor scaling of features, or extremely large feature spaces can make it challenging to observe the nice bowl-shaped structure in practice. Still, from a purely mathematical standpoint, the function remains convex, and these issues can often be mitigated by proper normalization or regularization.
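As one concrete (and hedged) illustration of those mitigations, a common pattern is to standardize the features and add an L2 penalty, for example with scikit-learn; the dataset and hyperparameters below are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardization improves conditioning; C is the inverse L2 regularization strength
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(X, y)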
Follow-Up Questions
Could You Explain How the Logistic Function Itself Is Not Convex, Yet the Overall Loss Is Convex?
The logistic function (the sigmoid) is S-shaped and not convex as a function of z: it is convex for z <= 0 and concave for z >= 0. However, we never optimize the sigmoid alone; we optimize a negative log-likelihood built on top of it. Composing the linear map w^T x_i with the sigmoid and then the negative log yields, per example, the term ln(1 + e^(w^T x_i)) - y_i w^T x_i derived earlier: a convex log-sum-exp (softplus) of an affine function of w plus a linear term, which is convex in parameter space. A quick numerical check of this distinction is sketched below.
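The check below reuses sigmoid and binary_cross_entropy_loss from the snippet above (the data are random and purely illustrative) and tests the midpoint inequality f((a+b)/2) <= (f(a) + f(b))/2:

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.uniform(size=100) < 0.5).astype(float)

# The sigmoid violates the midpoint inequality on [0, inf), where it is concave
a, b = 0.0, 3.0
print(sigmoid((a + b) / 2) <= (sigmoid(a) + sigmoid(b)) / 2)   # False

# The loss satisfies it for any pair of parameter vectors, as convexity requires
w1, w2 = rng.normal(size=5), rng.normal(size=5)
lhs = binary_cross_entropy_loss((w1 + w2) / 2, X, y)
rhs = (binary_cross_entropy_loss(w1, X, y) + binary_cross_entropy_loss(w2, X, y)) / 2
print(lhs <= rhs)                                              # True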
Does Convexity Guarantee a Unique Global Minimum?
Convexity guarantees that every local minimum is a global minimum; strict convexity additionally guarantees that the minimizer, if one exists, is unique. The binary cross-entropy loss in logistic regression is convex in w but not always strictly convex: perfect collinearity among features creates directions in parameter space along which the loss does not change, and perfectly separable data lets the loss keep decreasing as the weights grow without bound. In practice you will typically still identify a single best solution (up to directions in parameter space that do not affect the decision boundary), and adding L2 regularization makes the minimizer unique.
What Is the Role of Regularization in This Framework?
Adding L2 regularization (weight decay) or L1 regularization (lasso) to the binary cross-entropy objective keeps the problem convex in w. The L2 penalty is a strictly convex quadratic, so the regularized objective becomes strictly convex and remains smooth. The L1 penalty also preserves convexity but is no longer differentiable wherever a component of w equals zero; the L1-regularized logistic regression problem can still be solved with subgradient, proximal, or coordinate descent methods.
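A small sketch of the L2 case, reusing the unregularized loss defined earlier (the function name and the value of lam are illustrative):

def l2_regularized_bce_loss(w, X, y, lam=0.1):
    # A strictly convex quadratic penalty keeps the objective convex
    # (in fact strictly convex), which also makes the minimizer unique.
    return binary_cross_entropy_loss(w, X, y) + lam * np.sum(w ** 2)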
How Do We Extend This to Multi-Class Problems?
For multi-class classification, the logistic function is replaced by the softmax function, and the binary cross-entropy is replaced by the softmax (categorical) cross-entropy. The resulting objective is still convex in its parameters, so there are no suboptimal local minima, although the parameter space is larger (one weight vector per class, i.e., a weight matrix) and the minimizer is only unique up to adding the same vector to every class's weights, since the softmax is invariant to such shifts.
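A minimal NumPy sketch of that multi-class objective, assuming a weight matrix W of shape (D, K) and integer class labels y in {0, ..., K-1} (names and shapes here are illustrative):

def softmax_cross_entropy_loss(W, X, y):
    # X: (N, D), W: (D, K), y: (N,) with integer class indices
    logits = X.dot(W)
    # Subtract the row-wise max for numerical stability (leaves the result unchanged)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    N = X.shape[0]
    # Average negative log-probability of the true class
    return -np.mean(log_probs[np.arange(N), y])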
Could the Optimization Get Stuck in a Local Minimum Despite Convexity?
Convexity implies no local minima other than the global one. Numerical issues, however, could lead to very slow progress or near-plateaus that might feel like “traps.” Nonetheless, mathematically speaking, as long as the function is convex and well-defined everywhere, you will not encounter a true local minimum. Proper learning rates, initialization, and data scaling help avoid extremely flat regions or numerical instability.
How Is This Used in Practice?
Most binary classification tasks with a linear decision boundary can use logistic regression with a binary cross-entropy loss. Modern frameworks such as PyTorch, TensorFlow, and scikit-learn provide built-in implementations (e.g., nn.BCEWithLogitsLoss in PyTorch) that handle the numerics in a stable way. Gradient-based solvers then optimize the weights w to minimize this convex objective, making logistic regression a straightforward, interpretable, and robust choice for many classification scenarios.
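For example, a minimal PyTorch sketch (the sizes, data, and hyperparameters below are illustrative) pairs a single linear layer with nn.BCEWithLogitsLoss, which applies the sigmoid internally in a numerically stable way:

import torch
import torch.nn as nn

N, D = 256, 10
X = torch.randn(N, D)
y = (torch.rand(N, 1) < 0.5).float()   # random {0, 1} targets, purely for illustration

model = nn.Linear(D, 1)                # logistic regression = linear layer + sigmoid
criterion = nn.BCEWithLogitsLoss()     # fuses the sigmoid and the BCE loss stably
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X), y)      # raw logits go straight into the loss
    loss.backward()
    optimizer.step()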