ML Interview Q Series: How would you improve your CNN to handle mislabeled pug/pit bull data and tough conditions like fog?
Comprehensive Explanation
A neural network trained on mislabeled or partially unreliable data for two similar classes (pugs and pit bulls) can easily learn incorrect mappings. Furthermore, harsh environmental conditions such as rain or low visibility can degrade feature quality. Addressing these problems involves rethinking how you handle the training data labels and how you manage model generalization under uncertain conditions.
Incorporating Label Smoothing
One approach to tackle systematic labeling errors is to use label smoothing, which reduces the network’s tendency to become overly confident in a single label. Instead of training with a single, one-hot label, each class label can be softened to account for the possibility of mislabeling. The typical formula for label smoothing, where alpha is the smoothing parameter, K is the total number of classes, and y_i is the original ground-truth label indicator, looks like this:

\hat{y}_i = (1 - \alpha)\, y_i + \frac{\alpha}{K}

In this expression, hat{y}_i is the new soft label for class i, alpha determines how much smoothing is applied, and K is the total number of possible classes (all dog breeds in the dataset). This encourages the network to maintain some probability mass for other classes instead of being extremely certain about a single category. When pugs and pit bulls are frequently confused or mislabeled, label smoothing helps the model learn more robust, less overconfident decision boundaries.
Refining Data Quality
Another practical angle is to reduce the level of mislabeling. Whenever feasible, improving the dataset by identifying mislabeled pugs and pit bulls and correctly relabeling them is crucial. A small portion of accurately relabeled data can guide the network more effectively. More thorough data verification, active learning where uncertain samples are flagged for human review, or semi-supervised learning approaches can all help improve label quality.
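As a concrete sketch of the active learning idea, the snippet below flags training samples whose predictive entropy is high so they can be queued for human relabeling. The function name, the entropy threshold, and the assumption that a partially trained model and data loader already exist are all illustrative.

import torch
import torch.nn.functional as F

def flag_uncertain_samples(model, data_loader, entropy_threshold=1.0, device="cuda"):
    # Collect (batch index, position) pairs for samples whose predictive
    # entropy exceeds the threshold, so they can be prioritized for review.
    model.eval()
    flagged = []
    with torch.no_grad():
        for batch_idx, (images, _) in enumerate(data_loader):
            probs = F.softmax(model(images.to(device)), dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            for i in torch.nonzero(entropy > entropy_threshold).flatten():
                flagged.append((batch_idx, i.item()))
    return flagged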
Employing Data Augmentation
Training a model for robust performance in challenging conditions can be aided by incorporating domain-relevant data augmentation. If the robot will see dogs in fog, rain, or at a distance, synthetic or real transformations that simulate these scenarios can be added to the training pipeline. Examples might involve artificially blurring images, changing brightness/contrast, or applying random occlusions. By experiencing realistic variations during training, the network can learn more invariant features, making it more resilient under suboptimal conditions.
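A minimal augmentation pipeline along these lines might look like the following, using torchvision transforms to approximate blur, washed-out contrast, and occlusion; the specific parameter values are placeholders to be tuned against validation images from the target environment.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),  # distance / partial framing
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=7, sigma=(0.5, 3.0))], p=0.5  # fog-like blur
    ),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.3),  # poor lighting
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.3),  # random occlusion patches
])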
Using More Flexible Loss Functions
When there is a high chance of confusion between two classes, specialized loss functions or weighting schemes can be used. For instance, a cost-sensitive or focal loss can help the model adjust the penalty for misclassification of certain classes. If misclassification of pugs and pit bulls is particularly costly, weighting those classes more heavily in the loss computation can direct the model’s capacity toward getting those distinctions right.
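One way to realize this is a focal loss with optional per-class weights. The sketch below is illustrative: the class indices used for pugs and pit bulls in the usage lines are hypothetical, and gamma and the weights would need tuning on validation data.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    # Focal loss with optional per-class weights: gamma down-weights easy
    # examples so training capacity shifts toward hard, frequently confused ones.
    def __init__(self, gamma=2.0, class_weights=None):
        super().__init__()
        self.gamma = gamma
        self.class_weights = class_weights  # tensor of shape (num_classes,) or None

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=1)
        log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # true-class log-prob
        pt = log_pt.exp()
        focal = -((1.0 - pt) ** self.gamma) * log_pt
        if self.class_weights is not None:
            focal = focal * self.class_weights.to(logits.device)[targets]
        return focal.mean()

# Hypothetical usage: weight the pug and pit bull classes (indices 3 and 7) more heavily.
weights = torch.ones(10)
weights[3] = weights[7] = 2.0
criterion = FocalLoss(gamma=2.0, class_weights=weights)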
Introducing Auxiliary Tasks or Multi-Task Learning
Sometimes, giving the network more context can help it differentiate similar classes. An auxiliary task might be to classify the dog’s size, ear shape, or muzzle shape, training the model to extract more detailed features. This additional supervision can help the network learn more discriminative representations that separate visually similar breeds such as pugs and pit bulls. Multi-task learning setups, where the network must solve a related classification or detection problem, can often improve the primary classification accuracy as well.
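A rough multi-task sketch: a shared backbone feeds a breed head and an auxiliary head (here, muzzle shape). The backbone choice, the auxiliary attribute, and the loss weighting are assumptions, and the auxiliary labels would have to exist in the dataset.

import torch.nn as nn
from torchvision import models

class MultiHeadDogNet(nn.Module):
    # Shared feature extractor with two heads: breed classification plus an
    # auxiliary attribute (e.g., muzzle shape) that encourages finer features.
    def __init__(self, num_breeds=10, num_muzzle_types=3):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.breed_head = nn.Linear(feat_dim, num_breeds)
        self.muzzle_head = nn.Linear(feat_dim, num_muzzle_types)

    def forward(self, x):
        feats = self.backbone(x)
        return self.breed_head(feats), self.muzzle_head(feats)

# Training would combine both objectives, e.g.:
# loss = ce(breed_logits, breed_labels) + 0.3 * ce(muzzle_logits, muzzle_labels)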
Gathering Additional Data
Collecting targeted samples of pugs and pit bulls under varied conditions and labeling them accurately can be crucial. A balanced dataset with enough samples of each breed under different lighting and weather conditions is invaluable. If real-world data collection is challenging, simulated data from high-quality renderings or curated expansions of the existing dataset can also assist in bridging the gap between controlled training conditions and the complexity of the real world.
Implementation Example in PyTorch
Below is a short snippet illustrating how label smoothing might be implemented in a PyTorch training loop. This simple example highlights how you could convert the hard targets into soft distributions with a smoothing factor before computing the loss.
import torch
import torch.nn as nn
import torch.optim as optim

# Suppose 'model' is your CNN and 'train_loader' is your data loader
# yielding images and integer breed labels.
model = ...  # your model
optimizer = optim.Adam(model.parameters(), lr=1e-4)
alpha = 0.1        # smoothing factor
num_classes = 10   # example: total dog breeds

def smooth_labels(labels, alpha, num_classes):
    # Create a smoothed label tensor: the true class gets 1 - alpha, and the
    # remaining alpha is spread over the other K - 1 classes (a common variant
    # of the formula above).
    # labels shape: (batch_size,)
    with torch.no_grad():
        off_value = alpha / (num_classes - 1)
        on_value = 1.0 - alpha
        labels_onehot = torch.full(
            (labels.size(0), num_classes), off_value, device=labels.device
        )
        labels_onehot.scatter_(1, labels.unsqueeze(1), on_value)
    return labels_onehot

criterion = nn.KLDivLoss(reduction='batchmean')  # Kullback–Leibler divergence,
                                                 # used with log-softmax outputs

for images, labels in train_loader:
    images = images.cuda()
    labels = labels.cuda()

    # Forward pass
    outputs = model(images)  # shape: (batch_size, num_classes)
    log_probs = nn.functional.log_softmax(outputs, dim=1)

    # Convert hard labels to a smoothed target distribution (created on the
    # same device as the labels)
    smoothed_targets = smooth_labels(labels, alpha, num_classes)

    # Compute the KL divergence between predictions and soft targets
    loss = criterion(log_probs, smoothed_targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this example, the smooth_labels function converts the hard labels into soft distributions. The KL divergence loss is used with log-softmax outputs to compare distributions. This is one illustrative way to incorporate label smoothing.
Potential Follow-Up Questions
Could we simply remove the confusing classes and focus on the other breeds?
Removing problematic classes seems like a quick fix but can undermine the robot’s ultimate goal. If you remove pugs and pit bulls, you lose coverage for dogs that actually need to be rescued. From a data science perspective, ignoring hard classes is rarely ideal. Although you might reduce misclassifications, you are also eliminating important categories and diminishing the model’s utility. It is generally more effective to refine the dataset, introduce label smoothing, or use alternative techniques that preserve all necessary classes.
How can we quantify the improvement in performance for these two classes?
A confusion matrix is a standard tool to evaluate classification performance by showing true classes versus predicted classes. After retraining the network, you can compare the confusion matrix before and after applying label smoothing, data augmentation, or other methods to see how the pug-pit bull confusion ratio changes. You might also monitor class-wise precision, recall, and F1 scores. Another approach is to use the per-class accuracy or precision-recall curves for pugs and pit bulls.
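For instance, scikit-learn makes these comparisons straightforward; the toy labels and class names below are hypothetical stand-ins for predictions gathered from a held-out test set before and after retraining.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

class_names = ["pug", "pit bull", "beagle"]
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 0])   # placeholder ground truth
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 0])   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))        # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred, target_names=class_names))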
Are there possible downsides to label smoothing?
If label smoothing is set too high (large alpha), the network may oversmooth the training signals, losing valuable details and weakening the network’s capacity to distinguish among classes. It can also reduce the interpretability of the model’s predicted probability distributions since the network no longer assigns near-1.0 values to the predicted class. Tuning alpha based on validation performance helps avoid these pitfalls.
Could we use more advanced techniques like Bayesian Neural Networks or ensembles?
Bayesian methods and ensemble approaches can further account for uncertainties in model weights and predictions. In a Bayesian Neural Network, weights are treated as distributions, offering a better reflection of uncertainty. Ensembles can combine several independently trained networks, often improving overall performance and robustness. Both methods can be beneficial in scenarios where classes are hard to differentiate, and visual conditions are poor.
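A minimal ensemble sketch, assuming several independently trained models are already available: averaging their softmax outputs typically improves accuracy and gives a rough uncertainty signal when the members disagree.

import torch
import torch.nn.functional as F

def ensemble_predict(models, images):
    # Average softmax probabilities over independently trained members.
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(images), dim=1) for m in models], dim=0)
    mean_probs = probs.mean(dim=0)           # (batch_size, num_classes)
    return mean_probs.argmax(dim=1), mean_probs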
What if the fraction of mislabeled data is higher than 50%? Are there other ways to reduce its negative impact?
More extreme mislabeling levels might require active data cleansing, where the model flags examples it is uncertain about or which conflict with the existing labels. Those examples can then be prioritized for human review. Another possibility is using robust loss functions such as a noise-robust variant of cross-entropy or employing techniques from weak supervision, where multiple noisy labels are aggregated into a more reliable label estimate. These methods can help in systematically cleaning or adjusting labels in heavily corrupted datasets.
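As one example of a noise-robust objective, the generalized cross entropy of Zhang and Sabuncu (2018) interpolates between standard cross entropy and mean absolute error; a minimal sketch is shown below, with q as a tunable noise-tolerance parameter.

import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    # (1 - p_true^q) / q: behaves like cross entropy as q -> 0 and like MAE at
    # q = 1; larger q is more tolerant of mislabeled examples.
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
    return ((1.0 - p_true ** q) / q).mean()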
Below are additional follow-up questions
What if the physical appearance of certain pugs or pit bulls is atypical, causing confusion beyond just mislabeling?
When a breed deviates from its usual appearance (e.g., a very large pug or a pit bull with particular coat markings), the classification model might rely on visual cues that no longer hold. This atypical appearance can amplify confusion, especially if the mislabeled dataset also has unusual examples. One potential solution is to include more variants of each breed in the training set, covering a wide spectrum of appearances. Data augmentation can help simulate subtle morphological differences, but ultimately, having real-world images of unusual-looking dogs is critical. Another idea is to utilize a two-stage approach: one model for broad breed categorization and a specialized sub-model for borderline or uncertain cases, thus ensuring that even atypical samples are handled carefully. A pitfall arises if we assume all real-world dogs look similar to the canonical breed standards, leading to poor performance in the field when encountering outliers.
How do we address sudden domain shifts, such as a new environment or a different camera sensor, that might exacerbate breed confusion?
Domain shifts occur when the robot encounters lighting, scenery, or camera conditions drastically different from those in the training data. Even a robust model might falter if it has never seen examples taken from drastically altered viewpoints or captured by sensors with different noise characteristics. To counter this, domain adaptation techniques can be employed. One strategy is unsupervised domain adaptation, in which a feature extractor is learned to be invariant to the domain. Another approach is domain randomization, intentionally training on multiple synthetic variations of environments to build resilience. A potential edge case is if you rely solely on classical data augmentation (like brightness changes) but ignore sensor noise or lens distortion unique to your robot. Overfitting to a single sensor type or environment can be catastrophic if the robot must operate in a wide range of locations or with different camera hardware.
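A common way to implement the unsupervised adaptation idea is a DANN-style gradient reversal layer: a domain discriminator is trained on shared features while the reversed gradient pushes the feature extractor toward domain-invariant representations. The sketch below assumes a 512-dimensional feature vector and is illustrative only.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; the gradient is negated (and scaled) on the
    # way back, so the feature extractor learns to fool the domain classifier.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical domain discriminator on top of shared 512-d features:
domain_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))
# domain_logits = domain_head(grad_reverse(features, lambd=0.5))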
Could we benefit from a hierarchical classification scheme before identifying specific dog breeds?
When distinguishing pugs from pit bulls is especially tricky, a hierarchical model can first classify dogs into coarse categories (small brachycephalic vs. large muscular types) and then refine the classification with a narrower sub-classifier. This modular approach can make the model more interpretable and potentially reduce the negative impact of mislabeled data in the final breed category. However, one subtlety is ensuring the upper-level classifier is accurate. If it misroutes a pit bull to the category for small breeds, the second-stage classifier might wrongly confirm it as a pug. So, both levels must be trained with robust data, or else error propagation can occur. Another pitfall is deciding how many hierarchical layers to use and how to define the grouping so that it genuinely reflects distinct morphological or behavioral traits.
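A bare-bones version of the two-stage idea might route an image through a coarse group classifier and then a group-specific breed classifier; the model names and the grouping itself are placeholders.

import torch
import torch.nn.functional as F

def hierarchical_predict(coarse_model, fine_models, image):
    # image: tensor of shape (1, 3, H, W); fine_models maps each coarse group
    # index to its specialized breed classifier.
    with torch.no_grad():
        group_probs = F.softmax(coarse_model(image), dim=1)
        group = group_probs.argmax(dim=1).item()
        breed_probs = F.softmax(fine_models[group](image), dim=1)
    return group, breed_probs.argmax(dim=1).item()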
How might partial occlusion or only partial views of the dog (e.g., the tail, back, or silhouette) affect the model’s ability to distinguish the breeds?
The network’s reliance on facial or torso features may be challenged if the camera angle only shows the dog’s rear or if the dog is partially hidden behind an object. Distinguishing pugs from pit bulls can be particularly challenging because some of their silhouettes might appear similar in certain poses. Approaches to mitigate this include training on partial views, using attention-based mechanisms that can highlight critical regions, or leveraging sequential data in videos for context. A real-world hazard arises if the robot relies on a single snapshot to decide. If that image is uninformative—like a rear angle with a large tail wag or just the dog’s legs—the classification can be completely off. A solution is to gather multiple frames or vantage points whenever possible, then fuse the predictions for a final decision, thereby reducing the risk from a single ambiguous pose.
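A simple fusion scheme along these lines averages per-frame softmax outputs; the sketch below assumes all frames show the same dog and that a trained model is available.

import torch
import torch.nn.functional as F

def fuse_frame_predictions(model, frames):
    # frames: tensor of shape (num_frames, 3, H, W) captured from different
    # vantage points or timestamps; averaging reduces the impact of any single
    # ambiguous pose.
    with torch.no_grad():
        probs = F.softmax(model(frames), dim=1)   # (num_frames, num_classes)
    fused = probs.mean(dim=0)
    return fused.argmax().item(), fused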
Is it worthwhile to incorporate other sensory data, such as audio (barking) or even smell detection, into the breed classification process?
Yes, adding more modalities can help disambiguate visually similar breeds. Audio signals, like distinct barking patterns, or even advanced sensors (e.g., identifying unique chemical signatures for different dogs) can enrich the decision. Multimodal fusion can significantly reduce confusion if the additional signals are reliable. Nonetheless, incorporating new modalities introduces complexities, such as synchronizing data streams, dealing with sensor noise, and requiring labeled data for those modalities. If you discover that the audio sensor fails in a noisy city environment or that odor sensors are slow to respond, these edge cases might nullify any benefits of additional data. Careful feasibility studies and sensor integration strategies are crucial to avoid system overload or unreliable predictions.
What if the confusion between pugs and pit bulls has safety implications? How can we ensure false positives or false negatives are minimized?
In rescue scenarios, a false negative for a pit bull that is indeed missing could mean the robot overlooks the dog in danger, while a false positive might result in misdirected effort or ignoring an actual pug. To handle such safety-critical contexts, you can incorporate an adjustable decision threshold that lets you tune precision and recall for critical classes. You could require a higher confidence level for these classes, or use a post-processing logic that triggers additional checks if the confidence is borderline. Another approach is cost-sensitive learning, where errors for certain classes incur higher penalties in the objective function. The pitfall is if the threshold is set too high for pit bulls, resulting in many false negatives, thus failing to rescue them. Balancing these competing demands (precision vs. recall) requires domain-specific calibration and testing in realistic conditions.
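One lightweight way to express this is per-class confidence thresholds with a fallback action when no class clears its bar. The class indices and threshold values below are hypothetical and would be calibrated on validation data for the required precision/recall trade-off.

import torch.nn.functional as F

def classify_with_thresholds(logits, default_threshold=0.5, class_thresholds=None):
    # Accept a prediction only if its probability clears the per-class threshold;
    # otherwise return -1 to trigger extra checks (another view, human review).
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    decisions = []
    for c, p in zip(conf.tolist(), pred.tolist()):
        threshold = (class_thresholds or {}).get(p, default_threshold)
        decisions.append(p if c >= threshold else -1)
    return decisions

# Hypothetical usage: demand higher confidence for "pug" (index 3) and "pit bull" (index 7).
# decisions = classify_with_thresholds(logits, class_thresholds={3: 0.8, 7: 0.7})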
If we adopt continual learning to update the model while the robot is operating, how do we protect against catastrophic forgetting?
Continual learning allows the model to refine its understanding of pugs and pit bulls on the fly by leveraging new images and labels encountered in the real environment. However, neural networks can suffer from catastrophic forgetting, losing performance on previously learned tasks as they adapt to new data. To mitigate this, techniques like Elastic Weight Consolidation or experience replay can be used. Elastic Weight Consolidation penalizes large shifts in parameters critical for old tasks, while experience replay stores a small subset of historical data to retrain in tandem with new data. The edge case is if you rely solely on real-time data in an environment that never presents certain classes (e.g., the robot sees pit bulls but almost no pugs for weeks). The model might overfit to pit bulls and degrade in its pug classification. A balanced replay buffer that ensures coverage of all relevant classes is essential.
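A minimal sketch of such a buffer, with hypothetical capacities: it caps the number of stored examples per breed and samples roughly evenly across classes when mixing replayed data into a continual update.

import random
from collections import defaultdict

class BalancedReplayBuffer:
    # Stores at most `per_class` examples for every breed so continual updates
    # always rehearse all classes, not only the ones seen most recently.
    def __init__(self, per_class=50):
        self.per_class = per_class
        self.storage = defaultdict(list)

    def add(self, image, label):
        bucket = self.storage[int(label)]
        if len(bucket) < self.per_class:
            bucket.append(image)
        else:
            bucket[random.randrange(self.per_class)] = image  # overwrite a random slot

    def sample(self, n):
        # Draw an approximately equal number of examples from each stored class.
        classes = list(self.storage.keys())
        per_class_n = max(1, n // max(len(classes), 1))
        samples = []
        for label in classes:
            chosen = random.sample(self.storage[label], min(per_class_n, len(self.storage[label])))
            samples.extend((img, label) for img in chosen)
        return samples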