ML Interview Q Series: How can you handle sample-dependent costs (e.g., different misclassification penalties for different examples) in a standard deep learning framework?

Mar 28, 2025


Hint: Multiply the loss by example-specific weights or dynamically adjust the gradient for each sample

Comprehensive Explanation

A common requirement in machine learning tasks is to treat certain examples as more critical than others. This often arises when misclassification of certain data points (e.g., fraudulent transactions in financial data) has a higher penalty compared to others. The fundamental approach is to assign different weights to each example and incorporate these weights into the loss calculation. Many deep learning frameworks (such as PyTorch or TensorFlow) allow you to multiply the per-sample loss by a scalar weight that reflects the importance or cost of misclassification for that example.


One way to express a weighted loss function is to sum (or average) the individual per-sample losses, each multiplied by its corresponding weight. For a general loss L_i for sample i, and weight w_i for sample i, the total weighted loss (averaged form) can be written as:

$$L_{\text{weighted}} = \frac{1}{N} \sum_{i=1}^{N} w_i L_i$$

Here, N is the number of samples, L_i is the unweighted loss for sample i, and w_i is the sample-dependent weight capturing how significant or costly an error on that sample is. When performing backpropagation, each loss term contributes to the gradient scaled by the corresponding weight w_i, effectively placing more emphasis on examples with higher weights.
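To make the gradient scaling explicit, here is a short derivation. It assumes the weights w_i are constants that do not depend on the model parameters, written here as θ (a symbol introduced only for this illustration):

$$\nabla_\theta \left( \frac{1}{N} \sum_{i=1}^{N} w_i L_i \right) = \frac{1}{N} \sum_{i=1}^{N} w_i \, \nabla_\theta L_i$$

Under that assumption, a sample with w_i = 3 contributes the same gradient as three copies of that sample with weight 1.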

If one needs to handle different classes with different costs, then a class-specific weighting scheme can be employed. Alternatively, if each example has a unique penalty, you can assign example-specific weights. The procedure remains the same.

In practice, to implement this in deep learning frameworks like PyTorch, you can either use built-in features for class weighting (for example, nn.CrossEntropyLoss(weight=class_weights) for class-level weighting) or use a manual technique to multiply each sample’s loss by its corresponding weight. An example approach in PyTorch is shown below.

import torch
import torch.nn as nn
import torch.optim as optim

# Example neural network
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 2)
)

# Example input and target
inputs = torch.randn(5, 10)  # batch_size=5, features=10
targets = torch.tensor([0, 1, 1, 0, 1])

# Define your per-sample weights (in practice, this could come from metadata about costs)
sample_weights = torch.tensor([1.0, 2.0, 1.5, 1.0, 3.0])

# Define your basic loss function without weighting
loss_func = nn.CrossEntropyLoss(reduction='none')

# Forward pass
logits = model(inputs)
loss_unreduced = loss_func(logits, targets)

# Multiply each sample's loss by the corresponding weight
weighted_loss = sample_weights * loss_unreduced

# Compute the mean (or sum) to get final scalar loss
final_loss = weighted_loss.mean()

# Backpropagate
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
final_loss.backward()
optimizer.step()

This example shows how each sample can have a different penalty. The key steps are to compute the loss for each sample individually, multiply by the sample weight, and then reduce it with either a mean or sum operation. When you call backward(), the gradients are automatically scaled according to these weights.
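For the class-level case, the sketch below compares the built-in `weight` argument of `nn.CrossEntropyLoss` with the manual per-sample route. The class weights are hypothetical values chosen for illustration. One detail worth noting: with the default `reduction='mean'`, the built-in weighted loss divides by the sum of the weights of the targets in the batch, not by the batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 2)                     # stand-in for model outputs
targets = torch.tensor([0, 1, 1, 0, 1])
class_weights = torch.tensor([1.0, 4.0])       # hypothetical: class 1 errors cost 4x

# Built-in class weighting (default reduction='mean')
builtin_loss = nn.CrossEntropyLoss(weight=class_weights)(logits, targets)

# Manual equivalent: look up each sample's class weight, then take a
# weighted average normalized by the sum of the weights
per_sample = nn.CrossEntropyLoss(reduction="none")(logits, targets)
w = class_weights[targets]
manual_loss = (w * per_sample).sum() / w.sum()

print(builtin_loss.item(), manual_loss.item())
```

Dividing by the sum of weights keeps the loss scale comparable across batches with different weight totals, whereas a plain `.mean()` lets heavily weighted batches produce larger updates. Which behavior is preferable depends on the task.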

You could also implement custom gradient scaling by altering the backward pass if you prefer direct gradient manipulation. However, multiplying the loss by sample-dependent weights is usually the most straightforward and least error-prone approach.

You can apply the same principle in TensorFlow/Keras by passing the sample_weight argument to model.fit() or to a loss object when calling it, or by writing a custom training loop in which you multiply the per-sample losses by the sample-specific costs.

How to Dynamically Adjust Gradients

Instead of directly multiplying the per-sample loss, you could intercept the gradient during backpropagation and multiply each gradient by a factor that depends on the sample cost. This is a more advanced approach often used when you want even more fine-grained control. For instance, you might wish to increase or decrease gradients only when errors exceed a certain threshold. Most practitioners find it simpler to incorporate dynamic costs by weighting the loss, but gradient hooks or custom backward passes provide another layer of flexibility.
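As a minimal sketch of the gradient-interception idea, the code below uses `Tensor.register_hook` on the logits to scale each row of the incoming gradient by that sample's weight, then checks that this matches the weighted-loss route. The model, data, and weights are made up for illustration, and this is only one of several places a hook could be attached.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 3)
inputs = torch.randn(6, 4)
targets = torch.tensor([0, 2, 1, 1, 0, 2])
sample_weights = torch.tensor([1.0, 2.0, 0.5, 1.0, 3.0, 1.5])  # hypothetical costs

# Route A: unweighted loss, but scale the gradient flowing into each logits row
logits = model(inputs)
logits.register_hook(lambda grad: grad * sample_weights.unsqueeze(1))
F.cross_entropy(logits, targets).backward()
hook_grad = model.weight.grad.clone()

# Route B: weight the per-sample loss directly
model.zero_grad()
per_sample = F.cross_entropy(model(inputs), targets, reduction="none")
(sample_weights * per_sample).mean().backward()
loss_grad = model.weight.grad.clone()

grads_match = torch.allclose(hook_grad, loss_grad, atol=1e-6)
print(grads_match)
```

The two routes agree here because the scaling is a fixed per-sample constant. The hook becomes useful when the scaling rule is something a loss multiplier cannot express easily, such as a condition on the gradient values themselves.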

Handling Edge Cases

A key consideration is ensuring that extremely large or small sample weights do not destabilize training. Excessively large weights might lead to exploding gradients, whereas extremely small weights might effectively mute the learning signal for that sample. Normalizing the weights (for example, so that they average to 1) or clipping them to a bounded range can help mitigate these issues.
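One possible way to keep weights in a safe range is sketched below. The helper name `normalize_weights` and the bounds 0.1 and 10.0 are assumptions for illustration, not recommended values; suitable bounds depend on the data.

```python
import torch

def normalize_weights(raw, min_w=0.1, max_w=10.0):
    """Clip raw weights to [min_w, max_w], then rescale so the mean is 1."""
    clipped = raw.clamp(min=min_w, max=max_w)
    return clipped / clipped.mean()

# Hypothetical raw costs spanning several orders of magnitude
raw_weights = torch.tensor([0.001, 1.0, 2.0, 500.0])
weights = normalize_weights(raw_weights)
print(weights)
```

Rescaling to a mean of 1 keeps the overall loss magnitude, and therefore the effective learning rate, close to the unweighted case, while clipping bounds the ratio between the largest and smallest weight.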

If you have an imbalanced dataset, class weights are frequently normalized so that the effective emphasis on each class is proportionally balanced rather than dominated by a small but highly weighted subset. Similarly, if each sample has a cost that changes over time, you can simply reassign or recalculate weights at each iteration or epoch.

Follow-Up Question 1

How do you determine the appropriate weight values in practice?

In many cases, the weights are set based on domain knowledge or class imbalance. If one class is encountered only 1% of the time but misclassification of that class is particularly costly, a common heuristic is to set its weight inversely proportional to its frequency, or to scale it by a manually defined ratio that reflects the cost of errors. In tasks involving variable financial losses, the weight for each sample might directly correspond to the monetary loss of misclassifying that sample.
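The inverse-frequency heuristic can be sketched as follows. The helper `balanced_class_weights` is a hypothetical name, and the formula N / (K * n_c), with N samples, K classes, and n_c samples in class c, is one common convention among several.

```python
import torch

def balanced_class_weights(targets, num_classes):
    """Weight each class by N / (K * n_c); a class with zero samples gets inf."""
    counts = torch.bincount(targets, minlength=num_classes).float()
    return targets.numel() / (num_classes * counts)

# Toy labels: 8 samples of class 0, 2 samples of class 1
targets = torch.tensor([0] * 8 + [1] * 2)
class_weights = balanced_class_weights(targets, num_classes=2)

# Expand to per-sample weights by indexing with the labels
sample_weights = class_weights[targets]
print(class_weights)
```

With this convention the per-sample weights average to 1 over the dataset, so the rare class is emphasized without changing the overall loss scale.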

Follow-Up Question 2

What if you are dealing with multi-label or multi-class classification where each class can have a different cost or different penalty in case of misclassification?

In multi-class or multi-label classification with different costs per class, you can maintain a vector of class weights or even a matrix of penalties. For instance, a cost matrix can specify the cost of predicting class j when the true class is i. You would then incorporate these values into the loss computation. If you need a per-sample approach beyond per-class, you can assign each sample its own weight just as before. During training, you would multiply the loss for each sample by that sample’s class- or instance-specific cost. The underlying math and code remain largely the same; only the weight or penalty value changes based on each sample’s metadata.
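One way to fold a cost matrix into a differentiable loss is to minimize the expected cost under the predicted class probabilities. This is a sketch of that specific choice, not the only option; the function name `expected_cost_loss` and the matrix values are hypothetical.

```python
import torch

def expected_cost_loss(logits, targets, cost_matrix):
    """Mean over the batch of sum_j C[true_i, j] * p(j | x_i)."""
    probs = torch.softmax(logits, dim=1)
    costs_for_true_class = cost_matrix[targets]   # shape: (batch, num_classes)
    return (costs_for_true_class * probs).sum(dim=1).mean()

# cost_matrix[i, j]: cost of predicting class j when the true class is i
cost_matrix = torch.tensor([
    [0.0, 1.0, 5.0],
    [2.0, 0.0, 1.0],
    [10.0, 3.0, 0.0],
])

confident_class0 = torch.tensor([[20.0, 0.0, 0.0]])
loss_correct = expected_cost_loss(confident_class0, torch.tensor([0]), cost_matrix)
loss_costly = expected_cost_loss(confident_class0, torch.tensor([2]), cost_matrix)
print(loss_correct.item(), loss_costly.item())
```

A confident correct prediction incurs a loss near 0, while confidently predicting class 0 for a true class 2 incurs a loss near the matrix entry of 10. In practice this term is sometimes combined with ordinary cross-entropy, since the expected cost alone can give weak gradients early in training.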
