ML Interview Q Series: Why is the ReLU activation function frequently chosen instead of Sigmoid in deep neural network architectures?
Comprehensive Explanation
One core difference between these two popular activation functions is how they handle gradients and outputs during backpropagation in deep networks. The ReLU activation tends to mitigate the vanishing gradient problem that often arises when using Sigmoid.
Sigmoid Function
The Sigmoid is defined as sigmoid(z) = 1 / (1 + e^(-z)), where z is the linear input w*x + b. It squashes its input into the range (0, 1). Although this is beneficial for tasks where a probabilistic interpretation is needed, the Sigmoid's gradient saturates for large positive or negative values of z, thereby slowing or even stalling the learning process in deep networks.
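A quick way to see this saturation numerically is to evaluate the Sigmoid gradient at a few input values (a minimal sketch using PyTorch autograd; the sample inputs are arbitrary):

import torch

# Gradient of Sigmoid: d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
for val in [0.0, 2.0, 5.0, 10.0]:
    z = torch.tensor(val, requires_grad=True)
    torch.sigmoid(z).backward()
    print(f"z = {val:>5.1f}  gradient = {z.grad.item():.6f}")

# Typical output: the gradient is 0.25 at z = 0 but roughly 0.000045 at z = 10,
# so large-magnitude inputs contribute almost nothing during backpropagation.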
ReLU Function
ReLU is defined as ReLU(z) = max(0, z), where z is the linear input w*x + b. It outputs zero for negative inputs and passes positive inputs through unchanged, which gives it a piecewise linear behavior. The gradient is 1 for all positive inputs and 0 for negative inputs. This simple form generally accelerates convergence in deeper networks because it avoids saturating gradients in the positive region and keeps gradient flow more robust during backpropagation.
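The same check for ReLU (again a minimal sketch with arbitrary sample inputs) shows the contrast:

import torch

# Gradient of ReLU at several input values
for val in [-5.0, -0.5, 0.5, 10.0]:
    z = torch.tensor(val, requires_grad=True)
    torch.relu(z).backward()
    print(f"z = {val:>5.1f}  gradient = {z.grad.item():.1f}")

# Prints 0.0 for the negative inputs and 1.0 for the positive ones,
# no matter how large the positive input is.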
Key Advantages of ReLU Over Sigmoid
Faster Training: Because ReLU has a constant gradient of 1 for z > 0, it avoids the flat-slope region that plagues Sigmoid, allowing for faster weight updates and convergence.
Mitigation of the Vanishing Gradient Problem: In Sigmoid, as z grows large (positively or negatively), the gradient becomes extremely small. ReLU’s positive side maintains a gradient of 1, so deeper layers can keep getting a relatively strong gradient signal.
Sparsity in Activations: ReLU outputs zero for any negative input. This creates sparse representations in hidden layers because many neurons output 0, which can lead to more efficient computations and less overfitting.
Better Gradient Flow in Deep Architectures: ReLU’s gradient is not confined to a small range, so it helps networks preserve the magnitude of gradients through many layers. This makes deep architectures easier to optimize.
Practical Example in Python
import torch
import torch.nn as nn
import torch.optim as optim
# Simple neural network with ReLU activation
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)  # Example: regression or single output

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
# Dummy training loop
model = SimpleNet()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
for epoch in range(5):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item()}")
In this example, using nn.ReLU helps ensure that gradients remain strong for positive-valued activations in the hidden layer, often leading to faster training compared with using Sigmoid in deeper layers.
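To make the comparison concrete, the rough sketch below stacks identical hidden layers with ReLU versus Sigmoid and compares the gradient magnitude that reaches the first layer; the depth and widths are arbitrary choices for illustration:

import torch
import torch.nn as nn

def first_layer_grad_norm(activation, depth=20, width=50):
    # Build a deep MLP with the given activation between Linear layers
    layers = []
    in_dim = 10
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    model = nn.Sequential(*layers)

    out = model(torch.randn(32, 10)).mean()
    out.backward()
    return model[0].weight.grad.norm().item()  # gradient norm at the first layer

torch.manual_seed(0)
print("ReLU   :", first_layer_grad_norm(nn.ReLU))
torch.manual_seed(0)
print("Sigmoid:", first_layer_grad_norm(nn.Sigmoid))
# The Sigmoid stack typically shows a much smaller first-layer gradient norm,
# illustrating how repeated saturation shrinks the signal reaching early layers.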
How Do We Handle the “Dying ReLU” Problem?
Sometimes, if many inputs during training end up being negative, neurons can “die” because their outputs remain at 0, yielding zero gradients. This can be mitigated by initializing weights carefully (e.g., Xavier or Kaiming initialization) and sometimes switching to variants like Leaky ReLU or Parametric ReLU.
Could Sigmoid Be Advantageous in Certain Situations?
Yes, Sigmoid remains valuable for output units in binary classification, where the network’s output is interpreted as a probability. Moreover, if the network is shallow or if there is a conceptual reason for having a bounded output, Sigmoid can still be a good choice. For example, in certain small-scale tasks or classical neural network applications, Sigmoid might be entirely sufficient.
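For example, a binary classifier might still use ReLU in the hidden layer while keeping a Sigmoid-style output. The sketch below uses nn.BCEWithLogitsLoss, which folds the Sigmoid into the loss for numerical stability; the layer sizes are arbitrary:

import torch
import torch.nn as nn

# ReLU hidden layer; the output is a raw logit interpreted through a Sigmoid
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1),  # raw logit; Sigmoid is applied inside the loss below
)
criterion = nn.BCEWithLogitsLoss()

logits = model(torch.randn(32, 10))
targets = torch.randint(0, 2, (32, 1)).float()
loss = criterion(logits, targets)
print(f"loss: {loss.item():.4f}")

# At inference time, convert logits to probabilities explicitly
probs = torch.sigmoid(logits)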
Why Does Sigmoid Often Lead to Vanishing Gradients?
For large magnitude inputs (positive or negative), the Sigmoid function’s output saturates near 1 or 0, and the slope becomes extremely small in those regions. During backpropagation, gradients are multiplied layer by layer, and very small gradients end up shrinking to near zero. This hampers the updates in the early layers, slowing training.
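The effect compounds with depth. The Sigmoid derivative never exceeds 0.25, so the factor contributed by n stacked Sigmoid activations is at most 0.25^n, before the weight matrices are even considered (a back-of-the-envelope sketch):

# Upper bound on the gradient factor contributed by n stacked Sigmoid activations
for n in [1, 5, 10, 20]:
    print(f"{n:>2} layers: at most {0.25 ** n:.2e}")

# 10 layers already cap this factor at roughly 9.5e-07.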
How Does Leaky ReLU Address the Dying ReLU Issue?
Leaky ReLU modifies ReLU by allowing a small non-zero gradient when z is negative, for example 0.01*z if z < 0. This ensures that neurons do not get stuck with a gradient of zero. Such a variation can improve the flow of gradients and reduce the likelihood of completely inactive neurons.
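In PyTorch this is a one-line change via nn.LeakyReLU; the negative slope of 0.01 below is simply the common default:

import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)

z = torch.tensor([-3.0, -0.5, 0.0, 2.0])
print(leaky(z))  # negative inputs are scaled by 0.01 instead of clamped to 0
# tensor([-0.0300, -0.0050,  0.0000,  2.0000])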
When Should You Consider Switching Back from ReLU to Sigmoid?
If you have a certain design constraint in the neural network where outputs must remain strictly between 0 and 1 (for instance, modeling probabilities or bounded features in intermediate layers), Sigmoid might be necessary. Additionally, if your problem specifically requires a saturating nonlinearity or has been historically proven to work with Sigmoid, you might choose to stick with it despite potential vanishing gradient issues.
Are There Initialization Methods Especially Suited to ReLU?
Yes, Kaiming (He) initialization is one such method tailored for ReLU-based networks. The principle is to set initial weights such that the variance of outputs and gradients is maintained across layers. This reduces the chance of zero or exploding gradients early in training and helps networks converge faster and more reliably.
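A minimal sketch of applying Kaiming (He) initialization to the Linear layers of a ReLU network in PyTorch (the architecture itself is arbitrary):

import torch.nn as nn

def init_weights(module):
    # Apply He initialization to every Linear layer, matched to the ReLU nonlinearity
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
model.apply(init_weights)  # .apply recursively visits every submodule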
Below are additional follow-up questions
How can we detect if many neurons in a ReLU-based network have “died” or become inactive?
One common sign that a substantial portion of ReLU neurons has “died” is when their outputs remain zero for a large fraction of your training batches. You can observe this during inference or training by monitoring the activation statistics in each layer. For instance, you might track the percentage of activations that are zero in a given layer. If this percentage is consistently high (say, more than 50% to 70%) and remains high over multiple training iterations, it might indicate the dying ReLU phenomenon.
A practical approach is to use built-in hooks or forward passes with debugging tools (e.g., in PyTorch, you can attach forward hooks to layers to record activation distributions). If you see that the distribution of activations is skewed heavily toward zeros over time, it’s a clue that many neurons aren’t contributing. In real-world scenarios, especially in deeper networks or in tasks with noisy or sparse data, large negative inputs to ReLU could mean entire layers end up with very few active neurons. In such cases, you might switch to Leaky ReLU, Parametric ReLU, or experiment with different weight initializations.
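A sketch of that kind of monitoring with a forward hook is shown below; the model is a placeholder, and what counts as "mostly dead" is up to you:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))

zero_fractions = []

def record_zero_fraction(module, inputs, output):
    # Fraction of activations in this batch that ReLU clamped to zero
    zero_fractions.append((output == 0).float().mean().item())

hook = model[1].register_forward_hook(record_zero_fraction)  # hook on the ReLU

with torch.no_grad():
    model(torch.randn(256, 10))

hook.remove()
print(f"Fraction of zero activations: {zero_fractions[-1]:.2%}")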
Can ReLU be used in tasks where negative outputs are important?
Since ReLU outputs zero for any negative input, if your application inherently requires negative output values at some stage (for example, certain regression tasks where the target can be negative, or tasks with specific sign-dependent transformations), ReLU might not be ideal in the final layer. However, it can still be used in hidden layers if the final layer has an activation function (or no activation function) allowing negative outputs. For instance, you might have hidden layers with ReLU but keep the final layer linear or use another activation that supports negative values.
A subtle pitfall occurs when the model’s intermediate representations need to maintain sign information for better feature distinction, such as in some audio or signal processing tasks. In such cases, consistently clamping negative values to zero might lose crucial information. One might then consider alternative activations like Leaky ReLU or ELU that preserve some negative range.
How does bias initialization affect ReLU performance?
If biases are set inappropriately (e.g., too far in the negative direction), a majority of neurons might produce negative weighted sums for most inputs early in training, causing zero outputs. This can stall learning because those neurons will not receive significant updates. A common approach is to initialize biases to a small positive value (such as 0.01) for ReLU layers so that a neuron's weighted sum is slightly more likely to be positive at the start of training.
On the other hand, setting biases too large in the positive direction can make neurons too active and can lead to exploding activations or gradients. This may manifest as overly large updates in the early epochs, leading to unstable training or divergence. Hence, it’s crucial to tune bias initialization or leverage methods such as Kaiming initialization that are specifically designed for ReLU-based networks.
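A small sketch of the small-positive-bias idea (0.01 is just one common choice):

import torch.nn as nn

layer = nn.Linear(10, 50)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.constant_(layer.bias, 0.01)  # slight positive bias keeps more units active early on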
Are there numerical stability issues with ReLU when inputs are extremely large or extremely small?
For very large positive inputs, ReLU simply passes them through, which can potentially lead to large activation values deeper into the network. This might not be a classic “numerical instability,” but it can cause exploding gradients if there is no other mechanism (like batch normalization or careful weight scaling) to keep the network’s outputs under control.
For extremely small or negative inputs, ReLU will output zero, which itself is stable. However, the gradient for negative inputs is zero, meaning no updates for those neurons. If many inputs are consistently negative, it can lead to a large fraction of dead neurons. Mitigation strategies include better data preprocessing, careful weight initialization, or adopting alternative activations like Leaky ReLU or SELU that maintain nonzero gradients for negative inputs.
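One of the mitigations mentioned above, inserting batch normalization before the ReLU, looks like this in PyTorch (a sketch; normalizing before versus after the activation is a design choice):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.BatchNorm1d(50),  # keeps the pre-activation scale in check before the ReLU
    nn.ReLU(),
    nn.Linear(50, 1),
)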
How do ReLU-based networks compare to SELU or GELU in modern architectures?
SELU and GELU are more recent activation functions designed to address certain shortcomings of ReLU. For instance, SELU has a self-normalizing property that can keep the mean and variance of activations near optimal ranges if combined with specialized weight initialization and certain architectural constraints (e.g., AlphaDropout instead of standard dropout). GELU, used in many transformer-based architectures, smoothly weights inputs based on their value, which some theories suggest leads to better performance in deep attention networks.
Nevertheless, ReLU remains popular due to its simplicity and computational efficiency. In many production contexts, ReLU-based models train fast, have well-known initialization strategies, and scale effectively on specialized hardware accelerators. SELU or GELU may deliver modest improvements but might need more careful hyperparameter tuning. A real-world pitfall is that the self-normalizing property of SELU is easily disrupted by small changes in the model architecture or by certain regularization techniques (like batch normalization), so you must carefully follow recommended practices.
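Because swapping activations in PyTorch is a one-line change, it is cheap to compare them empirically (a sketch; note that SELU also expects LeCun-style initialization and nn.AlphaDropout to retain its self-normalizing behavior):

import torch.nn as nn

def make_mlp(activation):
    return nn.Sequential(nn.Linear(10, 50), activation(), nn.Linear(50, 1))

relu_net = make_mlp(nn.ReLU)
gelu_net = make_mlp(nn.GELU)
selu_net = make_mlp(nn.SELU)  # pair with nn.AlphaDropout rather than nn.Dropout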
What about hardware optimizations or performance considerations for ReLU versus Sigmoid?
Modern hardware, including GPUs and specialized accelerators like TPUs, often has optimized instructions for ReLU because it’s simply a max operation between zero and the input. This operation can be implemented efficiently at scale. Sigmoid involves exponential operations, which are more computationally expensive and can lead to floating-point underflow or overflow for large magnitude inputs.
In large-scale deployments, ReLU can be significantly faster in both the forward and backward passes. Some frameworks also allow “in-place” ReLU, reducing the memory footprint during training by modifying activations directly instead of allocating additional tensors. One must be cautious with in-place operations because they can overwrite values needed for gradient computation; frameworks like PyTorch will raise an error if an in-place operation invalidates values required by autograd, so such misuse typically surfaces quickly.
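In PyTorch the in-place variant is just a constructor flag (a sketch):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(inplace=True),  # overwrites the Linear output tensor instead of allocating a new one
    nn.Linear(50, 1),
)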
Should we always choose ReLU for hidden layers, or are there cases where another function is better?
While ReLU is often the default for deep networks, especially for vision tasks (e.g., CNNs) and many feed-forward architectures, there are cases where other functions can outperform ReLU:
• If the data often contains negative-valued features that matter for deeper representations, a variant like Leaky ReLU might be more effective because it does not zero out negative values.
• When building very deep architectures without batch normalization, SELU might help maintain stable outputs through layers due to its self-normalizing effect.
• In transformer architectures, some tasks perform slightly better with GELU because of its smooth, probabilistic interpretation of gating.
A potential pitfall is to assume ReLU is the best choice without experimenting. While ReLU is usually a strong baseline, performance improvements or training stability might be found by carefully testing other activations.
Can ReLU’s zero “dead zone” be advantageous for certain forms of regularization?
It can. The fact that ReLU saturates to zero for negative inputs effectively acts as an intrinsic form of sparsity. Sparsity can serve as a regularizing mechanism: with fewer active neurons, the network’s capacity might be reduced, helping the model to generalize better. This property is often leveraged in autoencoder architectures, where learning sparse representations can highlight crucial features.
However, if too many neurons go inactive, the model’s capacity and expressive power shrink considerably. So while some sparsity can help, extreme sparsity can degrade performance. In practice, you want a balance—enough neurons actively participating to learn complex patterns but not so many that you lose the benefits of sparse representations.
What are potential research directions in activation functions that might supersede ReLU?
While ReLU remains a cornerstone, researchers continue to propose new functions that address its shortcomings. Examples include Swish, Mish, and other smooth variants that attempt to retain computational efficiency while avoiding zero gradients for negative values. Many of these aim to combine ReLU’s simplicity and strong gradient flow with the benefits of smooth transitions in negative domains. There’s also ongoing research into adaptive activation functions that learn their shape during training, potentially evolving better transformations than a fixed function like ReLU.
In real-world applications, these newer functions might yield slightly higher accuracy or better stability, but they could also introduce additional computational overhead. They are not necessarily drop-in replacements for ReLU in every architecture. Adopting them often requires thoughtful experimentation to see if the gains are meaningful compared to the simplicity and speed of ReLU.