ML Interview Q Series: How can you illustrate two different techniques to visualize the internal representations learned by a convolutional neural network in an image classification setting?
Comprehensive Explanation
Convolutional Neural Networks operate by applying multiple learnable filters to an input image. These filters capture spatial hierarchies of features as we progress through the layers. Visualizing these learned features or the activations they produce can help us interpret the model's decision process. Many practitioners focus on two main strategies: visualizing the learned filters themselves, and visualizing the intermediate activations or saliency maps.
Core Mathematical Operation Underlying CNNs
To understand why visualizing learned features is possible, we typically recall the discrete convolution operation that CNNs rely on. The convolution of a 2D input image I with a 2D kernel K can be written as:

S(x, y) = (I * K)(x, y) = \sum_{i=-m}^{m} \sum_{j=-n}^{n} I(x - i, y - j) \, K(i, j)

Here, I is the 2D input (like an image), and K is the learned kernel or filter of size (2m+1) x (2n+1). The output S(x, y) is the value of the activation map at position (x, y). Each learned filter K tries to detect a specific pattern, such as edges, corners, and textures in early layers, or more complex shapes at deeper layers.
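As a quick numeric illustration of this operation, the snippet below applies a small kernel to a toy input with PyTorch; the Sobel-like kernel values are purely illustrative. Note that deep learning libraries actually implement cross-correlation (no kernel flip), which is what CNN layers learn in practice.

import torch
import torch.nn.functional as F

# A 5x5 single-channel "image" and a 3x3 Sobel-like kernel (illustrative values)
image = torch.rand(1, 1, 5, 5)                       # (batch, channels, height, width)
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]]])           # (out_channels, in_channels, kH, kW)

# F.conv2d implements cross-correlation (no kernel flip), the operation CNN layers learn
activation_map = F.conv2d(image, kernel, padding=1)
print(activation_map.shape)                          # torch.Size([1, 1, 5, 5])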
Visualizing the Learned Filters
Filters in the first convolutional layer can often be directly visualized as small patches. Each filter typically corresponds to an edge detector, color blob detector, or other low-level structures. In deeper layers, filters become more abstract. Viewing them can still be done by either:
• Directly plotting the filter weights as a grid of images when they are small enough.
• Using optimization-based techniques that attempt to create an image that maximally activates each filter.
By examining these visual representations, one gains an intuition about the kind of features each filter is extracting from the input.
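As a minimal sketch of the first option, the snippet below plots the 64 first-layer filters of a pretrained ResNet18 (an illustrative model choice) as RGB patches; the min-max rescaling is only for display purposes.

import torchvision.models as models
import matplotlib.pyplot as plt

model = models.resnet18(pretrained=True)
weights = model.conv1.weight.data.clone()            # shape: (64, 3, 7, 7)

# Rescale to [0, 1] so each 3-channel 7x7 kernel can be displayed as an RGB patch
weights = (weights - weights.min()) / (weights.max() - weights.min())

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i].permute(1, 2, 0).numpy())   # (7, 7, 3) RGB patch
    ax.axis('off')
plt.show()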
Visualizing Activation Maps or Saliency
Another method focuses on observing how the network responds to a particular input by examining intermediate feature maps or by generating saliency maps:
• Feature maps: If we feed a single image through the CNN and visualize the intermediate activations, we can see which parts of the image activate certain filters. This highlights which spatial regions are most relevant for each feature detector.
• Saliency maps and Grad-CAM: Techniques like Gradient-weighted Class Activation Mapping (Grad-CAM) use gradients of the target class with respect to feature maps to compute a coarse localization map. This heatmap is then upsampled and overlaid on the original input to show which regions contributed most strongly to the classification (a minimal sketch follows this list). Such maps can be very illustrative of how the network localizes objects.
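Below is a rough Grad-CAM sketch, assuming a pretrained ResNet18, an image at 'example_image.jpg', and layer4 as the target layer; the preprocessing and layer choice are illustrative rather than a canonical recipe.

import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import matplotlib.pyplot as plt

model = models.resnet18(pretrained=True)
model.eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations['value'] = out

def bwd_hook(module, grad_in, grad_out):
    gradients['value'] = grad_out[0]

# Hook the last convolutional block to capture its feature maps and their gradients
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open('example_image.jpg').convert('RGB')
x = transform(img).unsqueeze(0)

# Forward pass, then backpropagate the score of the predicted class
logits = model(x)
class_idx = logits.argmax(dim=1).item()
model.zero_grad()
logits[0, class_idx].backward()

# Grad-CAM: weight each feature map by its average gradient, sum, then apply ReLU
A = activations['value']                               # (1, C, H, W)
G = gradients['value']                                 # (1, C, H, W)
channel_weights = G.mean(dim=(2, 3), keepdim=True)     # (1, C, 1, 1)
cam = F.relu((channel_weights * A).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Overlay the heatmap on the (resized) input image
plt.imshow(img.resize((224, 224)))
plt.imshow(cam[0, 0].detach().numpy(), cmap='jet', alpha=0.4)
plt.axis('off')
plt.show()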
Both strategies—filter visualization and activation/saliency map visualization—are commonly employed to understand CNN behavior and confirm that the features align with human understanding of the task.
Follow-up Question: How does viewing the learned filters help diagnose potential problems in a CNN?
When you look at the learned filters in early or intermediate convolutional layers, you might find that some filters do not show clear structure or appear random. This can indicate a few scenarios. One possibility is that training was insufficient, or learning rates were improperly tuned, causing the model to fail to converge on meaningful features. Another is over-regularization, causing many filters to be suppressed and not learn any discriminative patterns. You might also discover that certain filters never get utilized, hinting at potential network architecture mismatches, or an imbalance in the dataset causing the CNN to devote more capacity to other classes of features.
Observing random or uniform filters can be an early warning sign that something in the training procedure should be revisited, such as checking if the data pipeline is correct, verifying that gradient backpropagation works, or adjusting hyperparameters like learning rate, regularization strength, or momentum factors.
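To catch such issues automatically, one can scan the convolutional layers for filters whose weights have collapsed toward zero. Below is a minimal sketch, assuming a pretrained ResNet18 and an illustrative norm threshold of 1e-2 (not a universal constant).

import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        # L2 norm of each filter; near-zero norms suggest suppressed or unused filters
        norms = module.weight.data.flatten(1).norm(dim=1)
        dead = (norms < 1e-2).sum().item()
        if dead > 0:
            print(f"{name}: {dead}/{module.out_channels} filters with near-zero norm")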
Follow-up Question: What are potential drawbacks or limitations of using Grad-CAM or other saliency-based methods?
Saliency-based methods like Grad-CAM, Guided Backpropagation, or vanilla Saliency Maps can be sensitive to noise and minor perturbations in the input. Sometimes, small image distortions can significantly change the highlighted regions, reducing interpretability. Also, these methods often rely on gradients, so they provide a local sensitivity measurement, not necessarily a complete global understanding of the model’s decision process.
Another limitation is that saliency maps can be somewhat coarse, especially for higher-layer representations. Grad-CAM’s output is typically a heatmap that must be upsampled to the original image resolution, so finer details of the network’s reasoning may be lost. Additionally, saliency maps can be misleading if they highlight edges or textures that coincide with important object structures but do not necessarily reflect how the entire CNN reasons about the class identity.
Follow-up Question: Can you provide a simple example in Python for generating intermediate feature maps?
Below is a conceptual example using PyTorch. It demonstrates how to extract and visualize feature maps from a specific convolutional layer of a CNN.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import matplotlib.pyplot as plt

# Suppose we use a pretrained model (e.g., ResNet18)
model = models.resnet18(pretrained=True)
model.eval()

# Let's pick the first convolution layer for feature map visualization.
# In ResNet, it's model.conv1. But you might want an intermediate layer for deeper features.
selected_layer = model.conv1

# Transform for the input image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Load an example image
img = Image.open('example_image.jpg').convert('RGB')
input_tensor = transform(img).unsqueeze(0)  # shape: (1, 3, 224, 224)

# Pass the image through the selected layer
with torch.no_grad():
    feature_maps = selected_layer(input_tensor)  # shape: (1, num_filters, H, W)

# Convert feature maps to numpy for visualization
features = feature_maps.squeeze(0).detach().cpu().numpy()  # shape: (num_filters, H, W)

# Plot a few filters
num_filters_to_show = 6
fig, ax = plt.subplots(1, num_filters_to_show, figsize=(12, 2))
for i in range(num_filters_to_show):
    ax[i].imshow(features[i], cmap='viridis')
    ax[i].axis('off')
plt.show()
In this snippet, we apply the first convolution layer directly to the input image and then visualize the resulting feature maps. These plots reveal which shapes or edges the CNN is emphasizing at the very first layer. The same idea extends to deeper layers (like layer1, layer2, or layer3 in ResNet) to observe higher-level feature extraction, but because those layers expect the outputs of the preceding layers rather than the raw image, the usual tool is a forward hook on the layer of interest, as sketched below.
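A minimal sketch of the hook-based approach, reusing model and input_tensor from the snippet above and picking layer2 as an illustrative target:

feature_store = {}

def capture(module, inputs, output):
    feature_store['layer2'] = output.detach()

handle = model.layer2.register_forward_hook(capture)

with torch.no_grad():
    model(input_tensor)  # full forward pass; the hook records layer2's output

handle.remove()
deep_features = feature_store['layer2'].squeeze(0).cpu().numpy()  # (num_filters, H, W)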
Follow-up Question: How do these visualization techniques translate to tasks outside of classification, like object detection or segmentation?
The same interpretability techniques generally apply because object detection and segmentation models often rely on backbone CNN architectures for feature extraction. For instance, in Faster R-CNN or Mask R-CNN, you could visualize the feature maps within the backbone or region proposal network. Saliency-based methods can also help highlight which parts of the image guided the detection or segmentation process. In tasks like segmentation, heatmaps may align more directly with semantic regions since the model output is already more spatially focused. However, caution is warranted because detection and segmentation pipelines sometimes have more complex multi-stage processes. Feature map visualization can still provide valuable insights, but you must pay close attention to which module or sub-network you are examining.
Follow-up Question: Are there potential performance impacts or implementation pitfalls when computing saliency maps in real-world applications?
One challenge is that computing saliency maps or Grad-CAM for large batches or very high-resolution images can be expensive. This is because gradient-based methods require a backward pass (and sometimes per-class or per-layer gradient computations) on top of the forward pass that inference alone would need. In practical applications like real-time deployments, the overhead may be too high unless you implement these computations efficiently or resort to approximate methods.
Implementation pitfalls might include mixing up layers when hooking into intermediate activations or incorrectly handling data that bypasses specific layers (such as skip connections). Ensuring that the computational graph remains intact for gradient-based methods is also crucial; inadvertently detaching from the graph can produce incorrect or empty saliency outputs. Thorough testing is essential to confirm that the visualization matches the correct target class and the correct layer of interest.
Follow-up Question: How might model interpretability and visualization methods evolve in the future?
There is ongoing research to produce more robust, fine-grained, and user-friendly interpretability tools. Some next-generation methods involve:
• Integrated Gradients and related feature-attribution techniques that offer stronger theoretical guarantees about how importance is distributed across the input (a bare-bones sketch appears after this answer).
• More advanced generative approaches that attempt to invert deep feature representations, revealing the model's internal notion of class structures or object prototypes.
• Interactive visualization frameworks that allow real-time layer-by-layer or neuron-by-neuron exploration of model internals.
These newer methods strive to give more holistic, reliable insights into how a CNN reasons about input data, bridging the gap between purely quantitative performance metrics and user trust or explainability requirements.
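As an example of the attribution direction, here is a bare-bones Integrated Gradients sketch, assuming a pretrained ResNet18, a random stand-in input, an all-black baseline, 50 interpolation steps, and a hypothetical target class; all of these choices are illustrative.

import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model.eval()

x = torch.rand(1, 3, 224, 224)      # stand-in for a preprocessed input image
baseline = torch.zeros_like(x)      # all-black reference input
steps = 50
target_class = 0                    # hypothetical class index

grads = []
for alpha in torch.linspace(0, 1, steps):
    # Interpolate between the baseline and the input, and differentiate the class score
    interp = (baseline + alpha * (x - baseline)).requires_grad_(True)
    score = model(interp)[0, target_class]
    grads.append(torch.autograd.grad(score, interp)[0])

# Average the gradients along the path and scale by the input-baseline difference
integrated_gradients = (x - baseline) * torch.stack(grads).mean(dim=0)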
Below are additional follow-up questions
How can we interpret feature representations in deeper CNN layers that do not yield simple visual patterns?
One major challenge with advanced CNN architectures is that deeper filters tend to capture more abstract or highly composite features. Unlike early layers—where filters might clearly correspond to edges or color blobs—the deeper layers are often sensitive to combinations of shapes or texture motifs. Because these features are more entangled, direct visualization of the raw filters may appear noisy or uninformative.
A frequently used approach is to employ activation maximization with regularization or techniques like DeepDream. These methods generate synthetic inputs that heavily activate a particular filter or neuron. Even though the visualizations can be surreal, one can often notice recurring shapes or contexts that point to which abstractions the network is encoding. Another route is to examine the receptive field of each neuron in the deeper layers by pinpointing which regions in a dataset’s images most activate that neuron. By clustering these highest-response patches, you may reveal consistent semantic concepts (e.g., eyes, wheels, or flower petals) that the neuron is detecting.
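A minimal activation-maximization sketch is shown below, assuming a pretrained ResNet18, an arbitrarily chosen filter in layer3, and simple L2 regularization via weight decay; real implementations usually add stronger priors (blurring, jitter) to obtain cleaner images.

import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)          # we only optimize the input image

captured = {}
def capture(module, inputs, output):
    captured['act'] = output
model.layer3.register_forward_hook(capture)

filter_idx = 10                                    # hypothetical filter to visualize
x = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05, weight_decay=1e-4)  # weight decay acts as an L2 prior

for step in range(200):
    optimizer.zero_grad()
    model(x)
    loss = -captured['act'][0, filter_idx].mean()  # maximize the filter's mean activation
    loss.backward()
    optimizer.step()

# x.detach() now approximates an input that strongly activates the chosen filter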
Potential pitfalls arise if a neuron is highly sensitive to a very specific subset of the training data or if the regularization used in activation maximization inadvertently obscures or stylizes the underlying features. The synthetic images might suggest coherent shapes when, in reality, the model’s learned feature is more distributed or context-dependent. Balancing interpretability with faithfulness to the underlying learned representation is key.
When could CNN visualization methods give a false impression of a model’s reliability?
Visualization techniques can sometimes highlight visually appealing filters or crisp saliency maps that might imply a strong understanding of objects in the scene. However, the model could be relying on spurious correlations or background cues to make predictions. For instance, if a dataset’s background color or watermark is a strong predictor of a label, the CNN might latch onto these artifacts. Saliency maps might show that the object region is highlighted, but in reality the network also heavily relies on peripheral cues not captured in the visualization.
A second scenario is when the saliency map is overly smooth or broad, making it appear as though the CNN focuses on the correct object area. However, the model might be ignoring crucial details that truly differentiate the class. Over-reliance on a single visualization approach can obscure these failings.
Edge cases include adversarial examples, where slightly altered inputs cause substantial changes in predictions yet might yield similar saliency maps. This discrepancy can make the visualization look consistent while the underlying decision boundary is fragile. Such a case demonstrates that interpretability tools can be fooled or might not reflect model vulnerability.
How does extensive data augmentation or domain shifts affect visualization and interpretability?
Data augmentation (e.g., random cropping, flipping, color jitter) introduces variability that can alter how filters develop. When visualizing filters from a heavily augmented dataset, you may notice more robust or invariant features. The CNN might learn to focus on texture-like patterns (color, edges) rather than specific orientations or positions because random rotations or flips force filters to generalize. This can be beneficial, but it can also obscure interpretability. If the filters appear more abstract or uniform, it might be harder to glean definitive concepts from a direct look.
Domain shifts—where the test data has different properties than the training set—can also distort visualization efforts. If the CNN is confronted with new textures or lighting conditions, its intermediate activations might deviate significantly from what you observed during training. This can reduce the reliability of saliency maps or heatmaps that were validated only on training or in-distribution data. For instance, some feature maps might respond poorly or not at all to unseen domain characteristics, making the visualizations less meaningful for debugging real-world performance problems.
How can unsupervised or generative methods aid CNN filter interpretation?
Unsupervised or generative models can complement standard CNN visualization by learning the underlying data distribution and providing realistic samples. One approach is to pair a CNN with a generative adversarial network (GAN) that can synthesize images which strongly activate certain CNN neurons or filters. Because GANs produce samples that look more like natural images, the resulting visualizations can be more human-interpretable. This helps in revealing whether a particular filter corresponds to certain semantic components rather than arbitrary patterns.
There are also methods that attempt to invert feature representations. For instance, feature inversion techniques reconstruct approximate input images from high-level CNN activations. If these reconstructions consistently show certain objects or shapes, it offers evidence that these deeper features represent specific semantic concepts. A potential caveat is that the inversion process can be imperfect, leading to artifacts, or it might not capture all the information that the CNN uses for classification.
How do CNN visualization approaches differ from those used for Transformer-based architectures?
Transformer models—originally popularized in natural language processing and now increasingly used in vision tasks (e.g., Vision Transformers)—rely on attention mechanisms. Visualization typically focuses on attention maps instead of convolution filters. For Vision Transformers, attention maps can be overlaid on the input image to show how different patches attend to each other. This is conceptually akin to saliency but arises from attention weights rather than spatial convolution.
While CNN filter visualization can be quite direct (since filters are literal matrices of weights), Transformers incorporate multi-head attention with complex positional encoding. The interpretability of these attention mechanisms can be more subtle. A high attention weight does not always signify a critical decision factor, but rather a correlation or gating mechanism. Consequently, interpretability researchers have to develop specialized techniques (like attention flow or gradient-based analyses) to dissect how the model processes the input. Additionally, layer-by-layer filter introspection in CNNs is replaced by head-by-head attention analysis in Transformers, which might be less intuitive to visualize.
What best practices can be followed when integrating CNN visualizations into model debugging workflows in production?
A recommended practice is to automate the generation of specific visualization artifacts (e.g., filter grids, saliency overlays) at regular intervals during model development. This allows one to quickly spot regressions—like filters that collapse to a single weight value—or confirm improvements—such as clearer activation maps after hyperparameter tuning. Integrating these checks into continuous integration pipelines ensures interpretability is not neglected.
It is also crucial to test visualizations on multiple representative inputs, including corner cases. If you only look at correctly classified examples, you might overlook how the network handles difficult or mislabeled samples. Reviewing saliency maps for both correct and incorrect predictions can highlight biases or illusions of correct focus.
One must remain aware of privacy constraints when dealing with user data. If the inputs are sensitive, direct visual debugging might not be permissible. In such cases, it may be necessary to rely on aggregated or anonymized visualizations.
Lastly, visual interpretability should be coupled with quantitative metrics. Depending solely on visual impressions can be misleading if a filter looks neat but has no positive effect on classification accuracy. Measuring class-wise performance and confidence calibration—alongside interpretability tools—helps ensure robust deployment.
How can CNN visualization be used in regulated industries or safety-critical applications (like healthcare or autonomous driving)?
In high-stakes domains, decision-making transparency can be a regulatory requirement. Visualizing the features can demonstrate to regulators and stakeholders that the model is attending to relevant regions (e.g., the tumor in a medical image). If the saliency map is consistently highlighting irrelevant tissue, that might invalidate the system’s credibility.
However, there are pitfalls. Regulators might over-interpret or misinterpret a saliency map, assuming it is an absolute explanation rather than a partial lens. Also, in settings like autonomous driving, the real-time nature of the system can make it difficult to store intermediate activations or generate saliency maps for every frame.
A systematic approach involves periodic audits where a subset of model inferences is subjected to thorough interpretability checks. Domain experts can then confirm whether the network focused on correct features (e.g., pedestrian bounding boxes) or if there are potential safety hazards (like ignoring small but relevant objects). Over time, this forms part of a compliance and risk management strategy in data-driven applications.