ML Interview Q Series: Explaining Complex Model Predictions with LIME, SHAP, and Feature Importance.
Model Interpretability: Many real-world applications require explaining model decisions. If you have a complex model (say a random forest or neural network) making important predictions (for loans, medical decisions, etc.), how can you make the model’s predictions more interpretable to users or stakeholders? Discuss methods like feature importance scores, partial dependence plots, LIME or SHAP explanations for individual predictions, or using simpler surrogate models for explanation.
Making a complex model’s predictions more interpretable is crucial in high-stakes applications. Approaches can be divided into global (model-level) interpretability and local (instance-level) interpretability. Global methods aim to describe how the model uses the features overall, while local methods generate explanations for specific single predictions. Below is a comprehensive discussion of these techniques and how to use them effectively in real-world scenarios.
Global Interpretability
Techniques under this heading help you understand the overall decision patterns of the model and how different features affect predictions in general.
Feature Importance Scores
For many models, such as Random Forests, one approach is to compute feature importance. This can be done in various ways:
Using built-in importance metrics. Tree-based models often track how much each split reduces impurity (for example, Gini impurity in a classification tree). Summing those reductions over all trees can yield a “Gini importance” measure. However, this measure can be biased if certain features have more splits purely by virtue of having many possible split points.
Using permutation importance. In this approach, you measure the model’s baseline performance on a validation set, then shuffle one feature’s values and measure the drop in performance. Features whose randomization hurts performance more are considered more “important.” This approach is generally more reliable than built-in importance metrics because it doesn’t depend on the model’s internal splitting criterion.
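As a quick illustration, here is a minimal sketch of permutation importance with scikit-learn; the synthetic data, the random forest, and the feature indices are placeholders rather than anything from a real loan or medical dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
# Synthetic placeholder data: 1000 rows, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Shuffle each feature on the validation set and measure the drop in score
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")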
While feature importance helps, it only indicates how important certain inputs are on average, not necessarily how they change the model’s output for specific values or particular instances.
Partial Dependence Plots
Partial Dependence Plots (PDPs) show how the model’s predicted output changes as you vary one (or two) input features while averaging the predictions over the observed values of the remaining features in the dataset. This helps you see whether increasing a given feature leads to systematically higher or lower predictions.
For example, if you have a feature called “Income” and you want to see how it influences loan approvals, you can generate a PDP that plots the average predicted probability of loan approval as income increases from a low to a high range. This plot may reveal non-linear relationships or threshold effects.
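A minimal sketch of a one-feature PDP with scikit-learn follows; the gradient boosting model and the use of column 0 as a stand-in for “Income” are illustrative assumptions, not part of any real pipeline.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
# Synthetic placeholder data: column 0 plays the role of "Income"
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] > 0.2).astype(int)
model = GradientBoostingClassifier().fit(X, y)
# Plot the average predicted response as feature 0 sweeps over its observed range
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="average")
plt.show()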
A subtlety arises when features are correlated. Because the PDP averages over the marginal distribution of the other features, it can evaluate the model on combinations of values that hardly occur in reality (for instance, high “Income” but very low “Credit Score,” if these are typically correlated). This can lead to misleading interpretations. Methods like Accumulated Local Effects (ALE) plots have been proposed as alternatives that handle correlated features more reliably by focusing on local shifts.
Surrogate Models for Explanation
Surrogate modeling refers to the practice of training a simpler, more interpretable model (such as a shallow decision tree or linear model) on the predictions made by the more complex “black-box” model. You then interpret the simpler model to glean insight about how the complex model behaves.
The usual workflow is to take your original dataset, generate predictions using the black-box model, and use these predictions as a “label” for training a simple, interpretable surrogate. For instance, you can fit a decision tree on the black-box predictions to approximate overall decision boundaries. By examining the tree’s structure, you might find approximate rules that describe the black-box model’s logic. Although this method is good for gleaning broad trends, it may fail to capture intricate interactions and might be inaccurate in certain regions of the input space.
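A minimal sketch of this global-surrogate workflow is shown below, using a random forest as the stand-in black box and a shallow decision tree as the surrogate; the data and all names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score
# Synthetic placeholder data with an interaction the surrogate must approximate
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0).astype(int)
# "Black-box" model
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Use the black box's predictions (not the true labels) as the surrogate's targets
bb_preds = black_box.predict(X)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, bb_preds)
# Fidelity: how often the shallow tree agrees with the black box
print("fidelity:", accuracy_score(bb_preds, surrogate.predict(X)))
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))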
Local Interpretability
Local methods produce explanations targeted at a single prediction. They are extremely useful when you want to explain why the model gave a particular output for a specific instance.
LIME (Local Interpretable Model-Agnostic Explanations)
LIME approximates the complex model in the vicinity of a single data point with a simpler, interpretable model (e.g., a small linear model). It does the following:
It creates perturbed samples around the instance you want to explain (slightly changing the feature values).
It gets predictions from the complex model for each perturbed sample.
It fits a simpler model (often linear) on these perturbed samples, trying to replicate the behavior of the black-box model locally.
From that simpler model, it extracts the local explanation, typically as feature weights that indicate how each feature influenced the black-box model’s prediction near that point.
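A minimal sketch of these steps with the lime package (assuming it is installed) looks roughly like the following; the random forest and synthetic data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer
# Synthetic placeholder data and model
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 3] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)
explainer = LimeTabularExplainer(X, mode="classification", feature_names=[f"x{i}" for i in range(5)])
# Explain one instance: LIME perturbs it, queries the model, and fits a local linear model
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # (feature, weight) pairs for this single prediction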
One main challenge is the choice of the kernel that defines “locality” and how many perturbed samples you generate. If the local region is not well-chosen, LIME can produce misleading explanations.
SHAP (SHapley Additive exPlanations)
SHAP is a theoretically grounded technique based on Shapley values from cooperative game theory. It attributes credit (or blame) to each feature by measuring how much that feature changes the model’s prediction when it is added into consideration. In practice:
It calculates a “baseline prediction,” often using a background dataset or the model’s mean prediction.
For a particular instance, it systematically looks at subsets of features (or approximates this in a more computationally feasible way) to figure out how much each feature contributes.
It allocates feature contributions such that they sum up to the difference between the baseline and the actual prediction.
SHAP provides both local explanations (feature contributions for a single instance) and global interpretability (by aggregating SHAP values across many instances). Because it rests on game-theoretic principles, it has desirable properties: attributions are allocated fairly (features that contribute more to the output receive larger SHAP values), and they sum to the difference between the instance-level prediction and the baseline prediction. The main drawback is computational cost for large datasets or very large feature sets, though efficient methods exist (e.g., KernelSHAP as a model-agnostic approximation and TreeSHAP as a fast exact algorithm for tree ensembles).
Putting It All Into Practice
In mission-critical contexts like loans or medical decisions, interpretability can be approached by combining these methods:
You might start with global feature importance (e.g., permutation importance) to see which features the model relies on.
You can create partial dependence plots or ALE plots to gain an understanding of how certain features influence predictions on average.
You can then pick key instances (such as loan applicants who were borderline accepted/rejected, or patients with critical health conditions) and apply local explanation methods like LIME or SHAP to produce instance-level explanations.
When you need simpler approximate “if-then” rules for quick communication, you might train a surrogate decision tree to mimic the complex model’s behavior and share that with stakeholders.
Through these steps, both the overall logic (global interpretability) and individual decisions (local interpretability) become clearer and more transparent.
Practical Example of Using SHAP in Python
import shap
import xgboost as xgb
import numpy as np
# Example dataset
X_train = np.random.randn(1000, 5)
y_train = np.random.randint(0, 2, size=1000)
# Train a model (example using XGBoost)
model = xgb.XGBClassifier().fit(X_train, y_train)
# Create SHAP Explainer
explainer = shap.TreeExplainer(model)
# Compute SHAP values for training set
shap_values = explainer.shap_values(X_train)
# Summarize the results
shap.summary_plot(shap_values, X_train)
Here, you see a summary plot that visualizes which features contribute most to predictions globally and how. You can also visualize local explanations by calling shap.force_plot.
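For instance, continuing the snippet above, a hedged sketch of a local explanation for the first training row (exact shapes and call signatures can vary slightly across shap versions):
# Local explanation for one row: base value plus per-feature contributions
shap.force_plot(explainer.expected_value, shap_values[0], X_train[0], matplotlib=True)
# The contributions sum to the gap between this prediction (in the model's margin space) and the baseline
print(explainer.expected_value + shap_values[0].sum())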
Challenges and Key Considerations
Interpretability can be subjective: different stakeholders may require different forms of explanation (executives might want simple narratives, data scientists might want detailed feature interactions).
Some techniques can be computationally expensive for large models or big datasets.
Certain plots (like PDPs) can be misleading when features are correlated.
In high-stakes domains, you might need to validate that local explanations are stable (i.e., small perturbations don’t wildly change the explanation).
There is always a trade-off between predictive performance (the model complexity needed to solve the prediction problem accurately) and interpretability (the transparency you want for stakeholders).
Follow-up Questions and Detailed Answers
What are the differences between LIME and SHAP in terms of explanation stability and theoretical grounding?
Both produce local explanations, but SHAP is based on cooperative game theory and guarantees properties like consistency and additivity. This means that if a model changes such that a feature contributes more to every possible coalition, that feature’s SHAP value will not decrease. LIME relies more on a local linear approximation and can produce different explanations depending on the random sampling around the instance. SHAP’s theoretical grounding typically leads to more stable explanations across different runs, but it can be more computationally intensive.
How can partial dependence plots be misleading for correlated features, and what are alternatives?
When features are correlated, partial dependence plots fix one feature while marginalizing or averaging over others, which can present unrealistic combinations of feature values. For highly correlated features, these plots may show a relationship that never truly appears in the actual dataset. ALE (Accumulated Local Effects) plots attempt to mitigate this by focusing on small local intervals of feature values and calculating the effect in those intervals. They sum these local effects across the distribution of the feature, which can provide more reliable insights when correlation is strong.
Why might random forest feature importance rankings be unreliable, and how to address it?
Standard impurity-based feature importances in random forests can be biased for several reasons: They favor continuous features with many split points or features with higher cardinality. They can inflate the importance of correlated features (multiple correlated features all get high Gini importance because they provide similar splits).
Permutation-based feature importance typically reduces these biases by measuring the actual drop in predictive performance when a feature’s values are shuffled. It is often recommended to rely on permutation importance, especially for real-world interpretability.
When training a surrogate model to explain a black-box model, how do we ensure fidelity?
To ensure the surrogate’s predictions approximate the black-box model well, you must:
Train on a representative sample of inputs covering a broad region of the input space.
Ensure your evaluation metric reflects how closely the surrogate tracks the black-box predictions, not just the ground truth.
Potentially train local surrogates (like LIME) in different regions if a single global surrogate does not capture complex interactions.
If the surrogate’s fidelity is low, the explanations drawn from it could be misleading.
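One hedged way to quantify fidelity is to score the surrogate against the black box’s own outputs on held-out data rather than against the ground-truth labels; everything in the sketch below (data, models, metric) is an illustrative choice.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# Synthetic placeholder data
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 5))
y = ((X[:, 0] + X[:, 1] * X[:, 2]) > 0).astype(int)
X_tr, X_ho, y_tr, _ = train_test_split(X, y, random_state=0)
black_box = GradientBoostingClassifier().fit(X_tr, y_tr)
# The surrogate regresses on the black box's predicted probabilities, not the labels
surrogate = DecisionTreeRegressor(max_depth=4).fit(X_tr, black_box.predict_proba(X_tr)[:, 1])
# Fidelity = how well the surrogate tracks the black box on data it did not see
fidelity = r2_score(black_box.predict_proba(X_ho)[:, 1], surrogate.predict(X_ho))
print(f"surrogate fidelity (R^2 vs. black-box probabilities): {fidelity:.2f}")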
Can interpretability methods substitute real domain expertise?
Interpretability techniques offer insight into how the model uses features, but domain experts bring critical knowledge about whether those relationships make sense in practice. If an interpretable model suggests that a certain demographic factor is essential in loan decisions, domain experts and ethical or regulatory considerations might reveal that using that factor directly or indirectly could be problematic. Hence, domain expertise is crucial for validating that the features used and the learned relationships are appropriate, fair, and consistent with domain knowledge.
How do you balance model performance with interpretability in regulated industries?
Regulated sectors (banking, healthcare, etc.) often require that decisions be auditable and explainable. One strategy is to start with the simplest interpretable model that achieves acceptable performance, such as a smaller decision tree or logistic regression with carefully selected features, if that meets domain accuracy requirements. If you must use a more complex model, complement it with robust interpretability strategies like SHAP or LIME, well-documented partial dependence plots, and thorough domain expert evaluation. You must also ensure you can produce the required disclosures or reason codes for decisions (often mandated by regulations like “adverse action notices” in lending).
How do you evaluate the quality of explanations produced by methods like LIME or SHAP?
One method is to perform a “local fidelity check” where you compare the black-box model’s predictions to the simple explanation model in a neighborhood of the point being explained. If the local fidelity is high, it indicates the explanation is faithful. Another approach is to conduct user studies or domain expert surveys to see if the explanations match known relationships or real-world expectations. You can also test the stability of explanations. If small perturbations to the input drastically change the explanation, you might suspect the explanation is too sensitive or the model is near a complex decision boundary.
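A hedged sketch of a simple stability check is shown below: it re-computes SHAP attributions on slightly perturbed copies of one instance and reports how much they move. The model, data, and noise scale are all illustrative choices.
import numpy as np
import shap
import xgboost as xgb
# Synthetic placeholder data and model
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
model = xgb.XGBClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
x0 = X[0]
base_attr = explainer.shap_values(x0.reshape(1, -1))[0]
# Perturb the instance slightly and see how far the attributions drift
for _ in range(5):
    x_pert = x0 + rng.normal(scale=0.05, size=x0.shape)
    attr = explainer.shap_values(x_pert.reshape(1, -1))[0]
    print("L2 change in attributions:", np.linalg.norm(attr - base_attr))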
How can correlated features affect SHAP values?
If multiple features are correlated, the question of how to properly allocate credit for the model’s output becomes more nuanced. SHAP tries to distribute contribution fairly based on the idea of marginal contributions, but correlated features may lead to overlapping explanations: each correlated feature might share partial responsibility. In practice, SHAP can handle correlations fairly well compared to methods that measure importance only by direct removal or addition, but the presence of highly correlated features still complicates interpretation. Sometimes domain insights or dimensionality reduction can help reduce correlation prior to modeling.
Are there any interpretability differences between deep neural networks and tree-based models with these techniques?
The general post-hoc interpretability techniques (LIME, SHAP, surrogate models, PDPs) can be applied to both neural networks and tree-based ensemble models. Tree-based models often come with built-in feature importance mechanisms, while neural networks may require approaches like gradient-based saliency (in computer vision) or integrated gradients in addition to the more general LIME/SHAP methods. For large-scale neural networks (like deep learning models for images, NLP, etc.), methods that exploit model structure—like Grad-CAM for convolutional neural networks—can be extremely helpful. However, the fundamental principles remain similar: global vs. local, and direct vs. surrogate-based explanations.
How do you handle interpretability in an online learning or streaming context?
In an online learning scenario, your model may update frequently as new data arrives. Techniques like LIME and SHAP still apply, but you need to recompute explanations regularly, or maintain a dynamic mechanism to approximate them in real time. Maintaining partial dependence plots in streaming contexts can be harder because the distribution of data might shift, so you need to update your reference distributions or historical data samples for the explanation. Some organizations choose to store snapshots of the model at intervals for audit purposes, so they can revisit a model’s state at key decision points if needed.
Below are additional follow-up questions
What is the difference between “interpretability” and “explainability”? Are these terms truly interchangeable, or do they have nuanced distinctions?
Interpretability typically refers to the extent to which a human can understand the cause-and-effect relationship within a model—essentially, how a model arrives at its decisions. When a model is interpretable, we can look at it (or at summaries or important features) and form an accurate mental picture of how it processes inputs and yields outputs. Interpretability usually implies relatively direct insight into the model’s inner mechanisms or structure.
Explainability can often have a slightly different emphasis. Many practitioners use “explainability” to describe any approach (post-hoc or otherwise) designed to clarify why a model made a certain prediction, even if that explanation doesn’t necessarily reflect the actual internal computation. For example, surrogate models (like a simple decision tree approximating a random forest) or instance-level methods (like LIME or SHAP) are frequently described as “explainability” tools because they produce explanations that humans can reason about, without necessarily offering direct transparency into the full complexity of the original black-box model.
In practical scenarios:
A model can be interpretable by design, such as a single decision tree or linear regression with only a few features. The structure itself is easy for humans to parse.
A model can be explainable through post-hoc techniques, but still be largely a “black box” in terms of actual structure (like a deep neural network or an ensemble with hundreds of trees). We rely on explanation techniques (like LIME, SHAP, partial dependence) to provide a window into its behavior.
Thus, while the terms are sometimes used interchangeably, the nuance is that interpretability often implies a more direct or inherent transparency, whereas explainability can involve producing human-readable justifications after the fact.
Are there pitfalls when using local interpretability methods (LIME, SHAP, etc.) in adversarial settings or when the model is designed to be adversarially robust?
In an adversarial setting, the model or the environment might be deliberately manipulated to fool explainability tools or to produce misleading explanations. Potential pitfalls include:
Perturbation Attacks: Local explanation methods like LIME create perturbed samples around the instance to learn a local decision boundary. An adversary could design a model (or manipulate features) such that small perturbations near a point lead the model to output drastically different predictions, confusing the local explanation and making it untrustworthy.
Gradient Masking or Defense Mechanisms: Robust models sometimes employ gradient obfuscation or other techniques to resist adversarial attacks. This can also interfere with certain explanation methods that rely on gradients or that assume a smoothly varying prediction surface. As a result, SHAP or LIME might yield stable-looking but incorrect explanations.
Misleading Model Behavior: If the model is intentionally designed to present plausible but deceptive feature relationships, standard local methods may not detect the deception. They only approximate how the model behaves around a specific instance and can miss hidden “backdoors” or triggers that radically alter predictions if certain hidden feature patterns arise.
In adversarial or security-sensitive contexts, it’s important to:
Validate explanations across multiple samples or multiple perturbation strategies.
Consider robust or certifiable explanation approaches, which attempt to provide worst-case bounds on how the model behaves.
Conduct thorough audits and pen-testing of the model’s interpretability pipeline, just like you would for security vulnerabilities.
If the model uses high-dimensional embeddings (like in NLP or deep learning on images), how can we interpret these features meaningfully?
When models operate on high-dimensional feature vectors (like word embeddings, sentence embeddings, or pixel embeddings from convolutional layers), the raw dimensions often lack direct human interpretability. To cope with that:
Dimensionality Reduction Visualization: Sometimes, you map embeddings into a lower-dimensional space using techniques like t-SNE or UMAP. This creates a 2D or 3D visualization where semantically similar data points cluster. Such plots can help identify general relationships or groupings in the latent space.
Saliency and Attention Mechanisms: In NLP, models with attention layers can highlight which words are most influential for a particular output. In vision tasks, gradient-based saliency or attention maps (like Grad-CAM) can reveal which regions of an image the network relies on. This partially addresses interpretability by pointing to the relevant elements in the raw input, rather than the abstract embedding itself.
Probing Classifiers: Another approach is to train small “probe” models on the learned embeddings to see what types of semantic or syntactic information they encode. For example, for word embeddings, one might train linear probes to predict part-of-speech tags or other linguistic features from the embeddings. The success of these probes can indicate what information is embedded (a minimal sketch of this idea appears at the end of this answer).
Feature Ablation: With local explanation methods, you can ablate or mask certain parts of the embedding (like zeroing out subsets of dimensions or entire word tokens) to see how the prediction changes. This is more relevant in frameworks like LIME for text classification, where you treat words or tokens as “features” and measure how removing them affects the model’s output.
The main pitfall is that the embedding itself might not map to direct “human-readable” semantics on a dimension-by-dimension basis. We often rely on analysis methods that project or highlight only the portion of the embedding that influences the model’s decisions. This yields interpretability in terms of input tokens or image regions, rather than the hidden dimensions of the embedding space.
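To make the probing-classifier idea concrete, here is a minimal sketch that trains a linear probe on frozen embeddings to test whether they encode some property; the random embeddings and labels below are placeholders standing in for, say, sentence vectors and a linguistic tag.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Placeholder for frozen model embeddings and a property we suspect they encode
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 256))
labels = (embeddings[:, :8].sum(axis=1) > 0).astype(int)
E_tr, E_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
# A simple linear probe: high accuracy suggests the property is linearly recoverable from the embeddings
probe = LogisticRegression(max_iter=1000).fit(E_tr, y_tr)
print("probe accuracy:", accuracy_score(y_te, probe.predict(E_te)))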
How should we respond if an interpretability analysis reveals potentially discriminatory patterns or biases in the model’s decisions?
When an analysis exposes that certain protected attributes (race, gender, age, etc.) or proxies for them significantly influence predictions in a way that may be discriminatory, organizations must take immediate steps to address it:
Identify Root Causes: Check if the dataset has inherent historical bias, or if the data collection process is skewed. Bias can creep in if the training set is unrepresentative of certain demographics or if certain negative outcomes are systematically over-represented in the data.
Fairness Metrics and Constraints: Adopt formal metrics (like disparate impact, equalized odds, or demographic parity) to quantify the level of bias; a small computation sketch appears after this list. Then consider building or adjusting the model with fairness constraints that reduce or eliminate these biases.
Data Augmentation or Rebalancing: If the interpretability method indicates that certain features are correlated with protected attributes and lead to biased decisions, gather more data or augment the dataset to ensure coverage of underrepresented groups. This might help the model learn less biased relationships.
Adversarial Debiasing / Post-processing: Use advanced methods to de-correlate feature representations from sensitive attributes. For instance, adversarial debiasing trains the model to make accurate predictions while preventing a parallel adversarial network from inferring the protected attribute.
Policy and Governance: Legal or regulatory frameworks might require you to provide reasons for decisions. Once you see evidence of potential discrimination, you may need to revise your entire pipeline—data collection, feature engineering, model selection—and document any remediation steps for compliance and ethical purposes.
A major pitfall is failing to act on these findings. While interpretability tools can highlight concerns, organizations must also create policies and processes for handling them. Another challenge is distinguishing legitimate differences (e.g., medical risk that correlates with age due to actual physiological factors) from illegitimate discrimination (e.g., a loan model that denies credit to certain racial groups). Domain expertise and fairness frameworks help to make these distinctions.
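As a small illustration of the fairness-metric step above, here is a hedged sketch of a disparate impact ratio computed from model decisions and a binary sensitive attribute; both arrays are random placeholders.
import numpy as np
# Placeholder model decisions (1 = approve) and a placeholder binary sensitive attribute
rng = np.random.default_rng(0)
y_pred = rng.integers(0, 2, size=1000)
group = rng.integers(0, 2, size=1000)
# Disparate impact: ratio of positive-outcome rates between the two groups
rate_g0 = y_pred[group == 0].mean()
rate_g1 = y_pred[group == 1].mean()
print(f"positive rates: {rate_g0:.2f} vs {rate_g1:.2f}, DI ratio: {rate_g0 / rate_g1:.2f}")
# A common rule of thumb flags ratios below roughly 0.8 (the "four-fifths rule") for further review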
How does one approach interpretability for multi-modal models that process multiple data sources (e.g., images plus text)?
Multi-modal models process and combine diverse inputs (like text, images, and sometimes tabular data). Interpreting such models can be more complex, as the interaction across modalities may not be intuitive. Approaches include:
Modality-Specific Explanation Tools: Use saliency maps for the image portion (highlighting important regions) and attention-based explanation or token importance for the text portion. This may be done in a single pipeline where you generate per-modality explanations and then combine them.
Joint Attention Mechanisms: Some multi-modal architectures use cross-attention or co-attention layers that explicitly align text tokens with image regions. Visualizing these attention weights can reveal how the model associates certain words or phrases with particular parts of the image.
Feature Importance via SHAP or LIME: You can treat each modality as a distinct “chunk” of features and measure how removing or altering the image part, or sections of the text, changes the final prediction. For instance, a multi-modal LIME variant might randomly mask the image or remove certain sentences to see which combination drives the model’s outputs the most.
Surrogate Models per Modality: Train simpler models that replicate the multi-modal model’s decisions, each focusing on a single modality. Interpreting them individually (or combining them) gives a partial view of how each input type contributes. This can be supplemented by a final surrogate that merges the partial surrogates, though fidelity may be harder to maintain.
Pitfalls include the possibility that the combined effect of two modalities is crucial (e.g., certain textual descriptions matter only when the image has specific features). Explaining synergy or interaction across modalities can be difficult with standard single-modality visualization techniques. It can also be computationally expensive to create a robust explanation if your model processes large images and lengthy text.
When should we prefer directly interpretable models over post-hoc explanations of complex models?
Directly interpretable models (like a simple decision tree, a linear model with few features, or rule-based systems) are often chosen when:
Regulatory or Legal Requirements: Certain industries require not only that you provide an explanation, but also that the model itself be transparent in structure. Healthcare decisions or certain credit scoring processes might necessitate a model that can be audited thoroughly.
Simplicity of the Problem: If a simple model achieves near state-of-the-art performance, then it’s typically better for clarity and maintenance to adopt it rather than a more complex black box. Complexity introduces interpretability challenges and potential for unexpected behavior.
Resource Constraints: Interpretable models may be faster to train, require fewer specialized libraries for interpretation, and be easier for stakeholders to understand. This is especially relevant if your team has limited experience with advanced black-box methods or you need near-instant interpretability without post-hoc overhead.
High Risk of Mistrust: In highly sensitive contexts, stakeholder trust might hinge on seeing a decision process that’s inherently understandable. If the cost of confusion or misinterpretation is extremely high (e.g., life-and-death medical diagnoses), a direct interpretable model might reduce risk.
Pitfalls of relying solely on inherently interpretable models:
They might not achieve the needed accuracy for extremely complex tasks. There can be a trade-off between interpretability and raw predictive power, especially for tasks like image classification or language understanding at large scales.
The simplicity might mask real-world complexity: If real patterns are non-linear or heavily interactive, a simple model could misrepresent them, leading to systematic errors.
If accuracy demands push you to more complex architectures, you then rely on post-hoc explanation methods. But the final choice often depends on balancing the cost of errors versus the need for transparency.