
Demystifying the Black Box: The Rise of Explainable AI

  • Writer: Shuvam Aich
  • Mar 21
  • 19 min read

Is Model understanding needed in every AI/ML Application?


Understanding how machine learning models behave matters far more in settings like health care, criminal justice, or loan approval decisions, because the decisions made there directly affect people's lives, health, and livelihoods (high-stakes situations). By contrast, if Facebook recommends the wrong friend to you, you might not be pleased, but it is hardly the end of the world.


When and Why Model Understanding?


Not all applications require an understanding of models and their predictions. Ad, product, or friend recommendations, for example, and more generally settings where no human looks at a prediction before the final decision is made, may not require model understanding at all, because the consequences of incorrect predictions are limited. Of course, if a system makes many incorrect predictions, a company might lose revenue, but as long as the number of incorrect predictions is small, the consequences are minor.


High-stakes decision-making settings:

  • Impact on human lives/health/finances

  • Settings relatively less well-studied, models not extensively validated

Accuracy alone is no longer enough:

  • Train/test data may not be representative of data encountered in practice

Auxiliary criteria are also critical:

  • Nondiscrimination

  • Right to explanation

  • Safety

Auxiliary criteria are often hard to quantify (completely):

  • For example: Impossible to predict or enumerate all scenarios violating the safety of an autonomous car.

Incompleteness in problem formalization:

  • Hinders optimization and evaluation.

  • Incompleteness ≠ Uncertainty; Uncertainty can be quantified.

Model understanding becomes critical when:

  • Models are not extensively validated in applications.

    • Train/test data is not representative of the data encountered in practice.

  • Key criteria are hard to quantify, calling for a "you will know it when you see it" approach.

Model understanding facilitates Debugging: Suppose we have a predictive model that takes images as input and predicts which animal each image contains. The developer passes in an image and the model correctly predicts "Siberian husky." But if you probe the model a little more and inspect which regions of the image it focuses on, you find that it is attending to the snow patches. So while the prediction looks fine, the model is not actually looking at the animal; it is effectively searching for snow in the background and using that to label the image as a Siberian husky.

This kind of understanding tells the developer that the model is relying on spurious or incorrect features and needs to be debugged and improved. It is a clear use case where model understanding helps with model debugging, which researchers, engineers, and scientists who work with machine learning models can find incredibly valuable.


Facilitates BIAS DETECTION:

Consider a predictive model used by a judicial facility to decide whether defendants should be released on bail. Such models take a defendant's details, for example their socioeconomic status, demographic attributes, and past criminal history, and predict whether the person is too risky to release.

In this setting, the model takes the defendant's details and produces a prediction, but that alone gives the judge no information about whether the prediction can be relied on.

But if we show the judge the key features the model is focusing on, say race and gender, the judge can see that the prediction is biased: decisions should not be based on features like race and gender. The judge can then disregard such a prediction and exercise their own judgment.

RECOURSE - Provide recourse to individuals who are adversely affected by model predictions:

Banks often employ such models to predict who should be given a loan. The loan applicant's details are passed to the bank, which runs its predictive model and determines whether the person should get a loan.

Now suppose an applicant is denied a loan. Simply telling that person "your loan is denied, sorry" is not helpful, because it leaves them with no means for further action.

If instead we tell them, "increase your salary by 5K or 10K, pay your credit card bills on time for about three months, then come back and reapply and you will get the loan," that is actionable information the applicant can use to improve their profile, come back, and reapply.

TRUST - Helps assess if and when to trust model predictions when making decisions: Consider a health care scenario in which a doctor relies on a model that takes patient data and labels patients as healthy or sick. Looking only at the predictions, the doctor has no way of knowing which ones to rely on and which ones require their own judgment. But with an understanding of what the underlying model is doing, for example seeing that the model uses irrelevant features such as ID numbers when making predictions on the female subpopulation, the doctor can quickly recognize that the model's predictions should not be trusted, at least for that subgroup.


DEPLOYMENT SUITABILITY - Allows us to vet models to determine if they are suitable for deployment in real world:

Continuing the previous scenario, if an authority such as the FDA examines an explanation of what the model is doing, it may realize that the model is using irrelevant features on half the population, namely all the female patients, and is therefore not ready for deployment. In this sense, model understanding also allows us to vet models and determine whether they are suitable for real-world deployment.

Achieving Model Understanding

Approach 1: Build inherently interpretable predictive models

Models such as linear regression, logistic regression, shallow decision trees, and small rule-based models are often considered inherently interpretable because a human can read off the rules or weights and understand what the model is doing when it makes a prediction. Of course, one could argue that a decision tree with 100 levels is no longer interpretable; even within these model classes, it is the sparser, shallower models that we can read off a page and make sense of.
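As a minimal sketch of this first approach (using scikit-learn and a synthetic dataset purely for illustration, not anything from the examples above), you can train a shallow tree and read its rules directly:

```python
# A minimal sketch of an inherently interpretable model: a shallow decision tree
# whose learned rules can be read off directly. Uses scikit-learn and a synthetic
# dataset purely for illustration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # shallow, hence readable
tree.fit(X, y)

# Print the tree as a set of human-readable if/then rules.
print(export_text(tree, feature_names=feature_names))
```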

Approach 2: Explain pre-built models in a post-hoc manner

We can take a black-box model, such as an extremely complex deep neural network, and pass it as input to an explainer algorithm, which in turn distills its behavior into something simpler that we can understand. For example, you can take a 200-layer deep neural network, pass it through an explanation algorithm, and that algorithm may return the important features associated with a particular prediction.

Inherently Interpretable vs Post-hoc:

The choice between inherently interpretable models and post-hoc explanations depends on the nature of the data, the complexity of the decision boundaries, and practical constraints like data availability.



1. When Inherently Interpretable Models are Sufficient

If a simple model—such as a decision tree with five levels—achieves high accuracy, then it is preferable. This is because:

  • The model is naturally transparent, eliminating the need for post-hoc explanations.

  • It provides direct interpretability without additional computational overhead.

  • There is no trade-off between accuracy and interpretability, making it the optimal choice.

Example:

  • A two-dimensional dataset that is linearly separable (e.g., distinguishing between emails as spam or not based on word frequency) can be handled well by logistic regression or a decision tree.

  • A low-dimensional structured dataset where decision rules can be explicitly modeled.
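A hedged sketch of the first example above (the word-count features are hypothetical; any small, roughly linearly separable dataset behaves the same way):

```python
# Sketch: a linearly separable, low-dimensional problem handled by logistic
# regression, whose coefficients can be read as per-feature evidence for "spam".
# The feature names are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two word-frequency features per email: counts of "free" and "meeting".
X_spam = rng.normal(loc=[5.0, 0.5], scale=1.0, size=(100, 2))
X_ham = rng.normal(loc=[0.5, 5.0], scale=1.0, size=(100, 2))
X = np.vstack([X_spam, X_ham])
y = np.array([1] * 100 + [0] * 100)  # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)
for name, weight in zip(["count('free')", "count('meeting')"], clf.coef_[0]):
    print(f"{name}: {weight:+.2f}")  # positive weight pushes towards "spam"
```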


2. When Post-hoc Explanations Become Necessary

In real-world applications, datasets are often high-dimensional, complex, and non-linearly separable. In such cases, inherently interpretable models may fail to achieve high accuracy. This happens in:

  1. Natural Language Processing (NLP) – Text data is inherently high-dimensional and requires models like transformers (e.g., GPT, BERT).

  2. Computer Vision – Image data has highly complex patterns, requiring deep learning (CNNs).

  3. Large Tabular Data – Ensemble methods like Random Forests and Gradient Boosting (e.g., XGBoost) outperform simple decision trees.

  4. Biomedical Data – Diseases and genetic patterns exhibit complex non-linear relationships.

Example:

  • If the dataset has highly curved, non-linear decision boundaries, simple models fail to generalize, making deep neural networks or ensemble methods necessary.

  • If data is limited (e.g., hospitals deploying AI models for patient scheduling), organizations may have to use a proprietary black-box model, leaving them dependent on post-hoc explanations.



Final Takeaway

  • If an inherently interpretable model provides sufficient accuracy → use it.

  • If accuracy demands a complex model → use post-hoc explanations.

  • In constrained settings where only black-box models are available → post-hoc explanations are the only option.



3. Summary of Decision Criteria

Scenario | Preferred Approach | Reason
Data is low-dimensional and simple | Inherently Interpretable | No need for complexity; high accuracy achievable with simple models
Data is high-dimensional and complex | Post-hoc Explanations | Simple models lack accuracy, requiring black-box models
Decision boundaries are linear or nearly linear | Inherently Interpretable | Simple models suffice for accurate predictions
Decision boundaries are highly non-linear | Post-hoc Explanations | Complex models better capture patterns
Computational efficiency is a concern | Inherently Interpretable | Lower training and inference time
Model must be explainable due to regulations (e.g., GDPR) | Inherently Interpretable | Guarantees direct interpretability
Black-box models are the only available option (e.g., proprietary AI systems) | Post-hoc Explanations | No alternative but to interpret a pre-trained model

What is an Explanation?

An explanation is an interpretable description of model behavior.

Two key properties: 

First, the explanation should faithfully describe the behavior of the classifier. If the explanation does not correctly describe model behavior, it is not useful, even if it is interpretable to the user.

Second, whatever we produce should be interpretable to the end user.

The complexity in this scenario comes from what exactly "understandable to the end user" means, and that depends heavily on the nature of the end users themselves: whether they are machine learning experts, domain experts, and so on.

There are different ways to explain a complex model to a user:

  1. Share Model Parameters – If the user understands machine learning, providing model parameters (like weights) can help them interpret the model.

  2. Show Example Predictions – Displaying inputs and corresponding outputs helps users see how the model behaves.

  3. Summarize with Rules or Trees – A decision tree or simple rules can capture key patterns in the model.

  4. Highlight Important Features – Identifying the most influential factors in a prediction makes the model's reasoning clearer.

  5. Explain How to Change Predictions – Showing what adjustments would flip the outcome helps users understand decision boundaries.

Each approach provides insight into the model’s behavior, depending on the audience’s expertise.

For example, if all we need is an interpretable description of model behavior, we could simply hand over all the model parameters theta; someone who builds models themselves, or a scientist, researcher, or engineer who understands machine learning, may be able to make sense of them. But we cannot hand all the model parameters theta to a doctor and expect them to make any sense of it.

Local v/s Global Explanations:

Local explanations: the goal of these methods is to explain individual predictions of the model. For a given prediction, how does it come about, and which factors influence it?

Global explanations, on the other hand, try to describe the complete behavior of the model. They aim to give a global picture of the model's behavior.

Local explanations typically help us unearth biases or a model's reliance on spurious features in the local neighborhood of a given instance, whereas global explanations help shed light on big-picture issues or biases affecting larger subgroups of the population.

So while local explanations help vet if individual predictions are being made for the right reasons, global explanations help vet if a model at a high level is suitable for deployment.


Feature Importance

Feature importance methods help explain a model’s predictions without requiring access to its internal structure. This makes them model-agnostic, meaning they can be applied to any model, including black-box systems.

Key Advantages:

  • Not restricted to specific models – works with decision trees, neural networks, SVMs, etc.

  • Easy to implement – no dependency on PyTorch, TensorFlow, or other specific frameworks.

  • Useful for black-box models – can analyze models even when internal details are unavailable (e.g., proprietary AI systems).

Common Model-Agnostic Techniques:

  1. SHAP (SHapley Additive exPlanations) – Assigns a contribution value to each feature using cooperative game theory.

  2. LIME (Local Interpretable Model-agnostic Explanations) – Approximates complex model behavior using simple interpretable models.

These techniques help researchers and engineers understand model behavior, even when they lack access to the model’s architecture. 
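As a sketch of typical usage (assuming the open-source shap package with an XGBoost model; the exact API can differ across versions, so treat the calls as an approximation and consult the package documentation):

```python
# Sketch: feature attributions for a tree-ensemble model using the open-source
# `shap` package. API shown as of recent versions; treat as an assumption and
# check the package docs for your installed version.
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
model = xgb.XGBClassifier(n_estimators=100).fit(data.data, data.target)

explainer = shap.Explainer(model)        # dispatches to an appropriate algorithm
shap_values = explainer(data.data[:50])  # attributions for the first 50 instances

# Per-feature contribution to one prediction.
for name, value in zip(data.feature_names, shap_values.values[0]):
    print(f"{name}: {value:+.3f}")
```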

LIME:

You literally take a point, perturb it a number of times to generate a local neighborhood, get the black-box model's predictions on that neighborhood, and then fit a linear model on those instances and their predictions.

Step-by-Step Explanation

LIME (Local Interpretable Model-agnostic Explanations) works by creating a local surrogate model that approximates the behavior of a black-box model around a specific instance. Here’s how it works:

1. Identify Important Dimensions

  • LIME selects the most influential features that impact the model’s prediction.

  • It ranks these features based on their relative importance in determining the output.

2. Generate Perturbed Samples

  • It slightly modifies the input instance.

  • These perturbed samples help analyze how small changes affect the model’s decision.

3. Predict Labels Using the Black-Box Model 

  • The black-box model makes predictions for each of these perturbed samples.

  • This allows LIME to understand how the model responds to similar but slightly different inputs.

4. Weigh Samples Based on Distance to Original Instance

  • Samples that are closer to the original input get higher importance (weight).

  • This ensures that the explanation remains faithful to the original instance rather than generalizing too much.

5. Train an Interpretable Model on the Weighted Samples

  • LIME fits a simple, interpretable model (e.g., linear regression) to the perturbed data.

  • This model acts as a local approximation of the black-box model’s behavior.

6. Explain the Prediction

  • The interpretable model reveals which features contributed most to the prediction.

  • The explanation helps users understand why the black-box model made a certain decision.
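A from-scratch sketch of these steps for tabular data (heavily simplified relative to the real LIME implementation: no discretization, no feature selection, and Gaussian perturbations with unit scale as an assumption):

```python
# Minimal from-scratch sketch of the LIME procedure for tabular data, covering
# steps 2-6 above. This is a simplification of the real implementation.
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_proba, x, num_samples=5000, kernel_width=0.75, rng=None):
    rng = rng or np.random.default_rng(0)
    # 2. Perturb the instance by adding Gaussian noise around it.
    Z = x + rng.normal(scale=1.0, size=(num_samples, x.shape[0]))
    # 3. Query the black-box model on the perturbed samples (probability of class 1).
    preds = predict_proba(Z)[:, 1]
    # 4. Weight samples by proximity to x using an exponential (RBF) kernel.
    dists = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # 5. Fit a weighted linear (ridge) model as the local surrogate.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, preds, sample_weight=weights)
    # 6. The surrogate's coefficients are the local feature importances.
    return surrogate.coef_

# Usage: coefs = lime_explain(black_box_model.predict_proba, x_instance)
```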

Customization in LIME

LIME is highly customizable, allowing users to control various aspects of the explanation process. Here’s how:


1. How to Perturb the Input?

LIME generates perturbed samples by modifying the original input slightly. The method depends on the data type:

  • Tabular Data: Randomly remove or modify feature values.

  • Text Data: Remove or replace words with similar alternatives.

  • Image Data: Remove or alter pixel regions (e.g., by turning them gray or blurring).


2. How to Measure Distance/Similarity?

LIME assigns weights to perturbed samples based on their distance to the original instance.

  • Common Distance Metrics:

    • Euclidean distance (for numerical data)

    • Cosine similarity (for text data)

    • Kernel-based weighting (exponential smoothing function)


3. How Local Should the Explanation Be?

LIME focuses on local explanations, meaning it only approximates the black-box model’s behavior near the given instance.

  • A smaller neighborhood → more faithful to the instance but less generalizable.

  • A larger neighborhood → more general, but the linear surrogate may be less faithful to the instance.

Customization: You can adjust the locality by controlling:

  • The number of perturbed samples

  • The weighting function that assigns importance to samples


4. How to Express the Explanation?

LIME presents explanations using interpretable models. Different formats include:

  • Feature Importance Scores: Shows which features contributed most to the prediction.

  • Decision Trees/Rules: Provides rule-based explanations.

  • Example-based Explanations: Shows similar instances with their outputs.

Customization: Users can choose the type of explanation that suits their audience (e.g., engineers may prefer weights, while non-technical users may prefer decision rules).
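The same knobs appear as parameters in the open-source lime package; a hedged usage sketch (placeholder names such as X_train, feature_names, model, and x_instance are assumptions, and parameter names should be checked against your installed version):

```python
# Sketch of the customization knobs in the open-source `lime` package
# (tabular explainer). Placeholder variables are assumptions.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train,            # used to learn feature statistics for perturbation
    feature_names=feature_names,
    class_names=["denied", "approved"],
    kernel_width=3.0,                 # how "local" the explanation is (distance weighting)
    mode="classification",
)

explanation = explainer.explain_instance(
    x_instance,
    model.predict_proba,              # the black-box prediction function
    num_features=5,                   # how many features to report
    num_samples=5000,                 # how many perturbed samples to generate
)
print(explanation.as_list())          # [(feature description, weight), ...]
```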


SHAP: Shapley Values as Importance

At a very high level, SHAP estimates the marginal contribution of each feature towards the prediction and averages this contribution across all possible orderings (permutations) of the features.

-Marginal contribution of each feature towards the prediction, averaged over all possible permutations

-Fairly attributes the prediction to all the features


As an illustration of how Shapley values measure the contribution of a feature xᵢ to a model's prediction:

  1. Full Feature Set (O): The model predicts P(y) = 0.9 when using all features.

  2. Feature Removed (O \ xᵢ): The model predicts P(y) = 0.8 when xᵢ is removed.

  3. Marginal Contribution (M(xᵢ, O)): The difference (0.9 - 0.8 = 0.1) is xᵢ's contribution with respect to this particular coalition; the Shapley value averages such differences over all possible orderings.
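A brute-force sketch of that averaging: compute exact Shapley values by iterating over all feature orderings, simulating a "removed" feature by filling it in from a background dataset (one common convention; only feasible for a handful of features):

```python
# Brute-force Shapley values: average the marginal contribution of each feature
# over all orderings. A "removed" feature is simulated by replacing it with
# values from a background dataset (one common convention).
import numpy as np
from itertools import permutations

def value(predict, x, subset, background):
    """Expected model output when only `subset` of x's features are known."""
    Z = background.copy()
    Z[:, list(subset)] = x[list(subset)]
    return predict(Z).mean()

def shapley_values(predict, x, background):
    n = x.shape[0]
    phi = np.zeros(n)
    perms = list(permutations(range(n)))
    for order in perms:
        known = []
        for i in order:
            before = value(predict, x, known, background)
            after = value(predict, x, known + [i], background)
            phi[i] += after - before          # marginal contribution of feature i
            known.append(i)
    return phi / len(perms)                   # average over all orderings

# Usage: phi = shapley_values(lambda Z: model.predict_proba(Z)[:, 1], x, X_background)
```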


Rule Based: Anchors

-Rule based variant of LIME

-Identify the conditions under which the classifier has the same prediction


To explain an instance x, you perturb x to generate a local neighborhood, just as in LIME, and then find a rule that correctly covers that local neighborhood.

A LIME explanation tells you the importance of each feature and the direction of that importance (positive or negative), whereas an anchor explanation is the rule that covers the local neighborhood.

Saliency Maps:

Saliency maps are a visualization technique used to explain deep learning models, especially in computer vision. They highlight the most important pixels in an image that influence a model's prediction.

How Do They Work?

  • The idea is to compute the gradient of the model’s output with respect to the input image.

  • The absolute values of these gradients indicate how much a small change in each pixel would affect the model’s decision.

  • The result is a heatmap overlay on the original image, showing the most important regions.
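A minimal PyTorch sketch of this computation (assuming a pretrained torchvision classifier and a preprocessed image tensor named image of shape (1, 3, H, W)):

```python
# Minimal vanilla saliency map: gradient of the top class score w.r.t. the input
# pixels, reduced to a (H, W) heatmap. `image` is an assumed preprocessed tensor
# of shape (1, 3, H, W); the pretrained ResNet-18 is just an example model.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

image = image.clone().requires_grad_(True)        # track gradients w.r.t. pixels
scores = model(image)                             # (1, num_classes) class scores
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()                   # d(top class score) / d(pixels)

saliency = image.grad.abs().max(dim=1).values[0]  # abs gradient, max over channels -> (H, W)
```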

Challenges:

  • Visually noisy and difficult to interpret

  • Gradient saturation

In some deep neural networks, gradients can saturate in parts of the network (for example, a ReLU unit passes zero gradient for negative pre-activations), leading to poor or misleading saliency maps.

When gradients saturate (i.e., become vanishingly small or flat), the saliency map loses its ability to highlight the features that actually drive the model's prediction.

SmoothGrad is a technique designed to address some of the challenges of saliency maps, particularly noise and instability.

SmoothGrad improves saliency maps by averaging gradients over multiple noisy inputs. Instead of using a single input image to generate the saliency map, SmoothGrad adds Gaussian noise to the input image and computes the gradient for each noisy version. The idea is to reduce the noise and highlight more stable, consistent features that contribute to the model's prediction.
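Building on the saliency sketch above, a hedged SmoothGrad variant (the noise level and sample count are illustrative choices):

```python
# SmoothGrad sketch: average the input-gradient over several noisy copies of the
# image before taking magnitudes. `model` and `image` are the same assumed
# objects as in the saliency sketch above.
import torch

def smoothgrad(model, image, target_class, n_samples=25, noise_std=0.15):
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image.detach() + noise_std * torch.randn_like(image)).requires_grad_(True)
        model(noisy)[0, target_class].backward()
        grads += noisy.grad
    return (grads / n_samples).abs().max(dim=1).values[0]   # (H, W) heatmap
```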

Gradient × Input: In Gradient × Input, the gradient of the model's output (e.g., a class score) is computed with respect to the input features (e.g., the pixels of an image). We don't use the gradient alone; we take its elementwise product with the input x itself.
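Continuing the same sketch, Gradient × Input is then just the elementwise product of that input-gradient with the input:

```python
# Gradient x Input sketch: elementwise product of the input-gradient with the
# input itself, summed over color channels to give a (H, W) attribution map.
grad_times_input = (image.grad * image.detach()).sum(dim=1)[0]
```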


Prototypes/Example-based

Use examples (synthetic or natural) to explain individual predictions

  • Influence Functions (Koh & Liang 2017)

    • Identify instances in the training set that are responsible for the prediction of a given test instance.

  • Activation Maximization (Erhan et al. 2009)

    • Identify examples (synthetic or natural) that strongly activate a function (neuron) of interest.

Influence Functions - Here we ask not just which pieces of the input are influential for the prediction, but which training points have the most influence on the test loss (and hence the prediction) for this particular test point.

The other approach we consider in this setting is activation maximization, whose goal is to identify examples (synthetic or natural) that strongly activate a neuron of interest.

Implementation flavors:

  • Search for natural examples within a specified set (a training or validation corpus) that strongly activate the neuron of interest. This involves searching through a pre-existing dataset for the examples that maximize the activation of a particular neuron. Challenge: the search might not always surface a clear, specific example that maximally activates the neuron.

  • Synthesize examples, typically via gradient ascent. Instead of searching for natural examples, this approach generates entirely new inputs, typically by applying gradient ascent (or gradient descent on the negated activation) to a random input (e.g., a noise image) so as to maximize the activation of the neuron.

Challenges: The main challenge is that the generated examples might not always resemble natural inputs, especially in complex networks. In some cases, the generated images or inputs might look highly unnatural or abstract, making them hard to interpret in a human-understandable way.
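A hedged sketch of the synthesis flavor: gradient ascent from random noise to maximize one unit's activation (the chosen layer, channel, and hyperparameters are arbitrary assumptions; practical implementations add regularizers such as jitter or blurring to keep the result looking natural):

```python
# Activation maximization sketch: start from noise and take gradient-ascent steps
# that increase the mean activation of one chosen channel in a chosen layer.
# Layer/channel choice and hyperparameters are illustrative assumptions.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)                   # only the input image is optimized

activations = {}
def hook(module, inputs, output):
    activations["target"] = output
model.layer3.register_forward_hook(hook)      # chosen layer (assumption)

channel = 42                                  # chosen unit/channel (assumption)
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -activations["target"][0, channel].mean()   # maximize the activation
    loss.backward()
    optimizer.step()
```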

Counterfactual Explanations

- Provide recourse to affected individuals

Counterfactual explanations tell us what features need to be changed and by how much to flip a model's prediction i.e., to reverse an unfavorable outcome.


Our goal: there is a point x in the negatively labeled region of the model's decision boundary, and we want to find another point in the positively labeled region that x can morph into. Take the point x and keep perturbing it, pushing it towards the decision boundary; once it crosses the boundary, stop. The question is that if we perturb x towards the boundary, it could go one way and become point CF2, or another way and become point CF1. Which should it become? That is where different approaches differ: the proposed algorithms for this problem differ in how they choose among candidate counterfactuals, and in how much access they need to the underlying predictive model, i.e., whether they can work with a black box or need access to the model's gradients.

Minimum Distance Counterfactuals:

The idea is: given a point x, find a point x' such that the distance d(x, x') is as small as possible and the model's output on x' is the positive label, i.e., minimize d(x, x') over x' subject to f(x') being positive. In other words, find the instance closest to the original instance whose predicted label is positive, and then ask x to change into x'.

x' is called the counterfactual and x is the original instance; d can be any distance metric, such as L2 (Euclidean) or Manhattan distance.
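A gradient-based sketch of this search, in the spirit of Wachter et al.'s formulation (trade off a prediction loss that pushes towards the positive class against the distance to x; assumes a differentiable PyTorch model that outputs the probability of the positive class):

```python
# Sketch of a minimum-distance counterfactual search for a differentiable model
# `model` that outputs the probability of the positive class. In the spirit of
# Wachter et al. (2017): minimize  lam * (f(x') - target)^2 + d(x, x').
import torch

def find_counterfactual(model, x, lam=1.0, steps=500, lr=0.01, target=1.0):
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = model(x_cf)
        loss = lam * (pred - target) ** 2 + torch.norm(x_cf - x, p=2)
        loss.backward()
        optimizer.step()
        if pred.item() > 0.5:              # crossed the decision boundary: stop
            break
    return x_cf.detach()
```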

Feasible and Least Cost Counterfactuals:

  • A is the set of feasible counterfactuals (specified by the end user); e.g., changes to race or gender are not feasible.

  • Cost is modelled as a total log-percentile shift (replacing the distance term in the initial equation).

  • Changes become harder when starting off from a higher percentile value.

Some features may be harder for a person to change in practice than others. For somebody living in one place it may be easy to increase the size of their house but not their salary, while for somebody elsewhere the reverse could be true. How can we incorporate this when recommending changes? Rather than looking only at distance in the input space, we should look at the cost associated with changing particular features when prescribing this kind of recourse.

As a first solution, this approach models cost as a log-percentile shift, which captures the fact that changes become harder the higher the percentile you start from.

Changing from the 90th to the 95th percentile, for example, may be much harder than going from the 50th to the 55th.

What if we have a black box or a non-linear classifier? 

Solution: Generate a local linear model approximation (e.g., using LIME) and then apply the framework of [Ustun et al., 2019].

Causally Feasible Counterfactuals:

The idea of causally feasible counterfactuals is to generate counterfactuals that adhere to the causal relationships and constraints inherent in the system, ensuring that the changes made to the input are not arbitrary but consistent with the underlying causal mechanisms.

  • Requires knowledge of the full causal graph; alternatively, feasibility constraints or a partial causal graph can be learned from user inputs.

  • Solving the objective requires access to the gradients of the underlying predictive model.

Data manifold closeness: the generated counterfactual should be "close to" the original data distribution.

Sparsity: ideally, only a small number of features should change in the counterfactual.

GLOBAL EXPLANATION

  • Explain the complete behavior of a given (black box) model

  • Help detect big-picture model biases persistent across larger subgroups of the population

  • Global explanations are complementary to local explanations

Collection of Local Explanations

How to generate a global explanation of a (black box) model?

  • Generate a local explanation for every instance in the data using LIME or SHAP

  • Pick a subset of k local explanations to constitute the global explanation

SP-LIME:

LIME explains a single prediction - Local behaviour for a single instance

Can’t examine all explanations -  Instead pick k explanations to show to the user

Criteria:

  • Representative: should summarise the model's global behaviour

  • Diverse: should not be redundant in their descriptions

SP-LIME uses submodular optimisation and greedily picks k explanations: at each step, it picks the explanation that differs most from (adds the most coverage beyond) the explanations already picked.

SP-LIME is model agnostic: it needs no access to the model's gradients or architecture.
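A simplified sketch of that greedy pick (W is a matrix of local explanation weights, instances by features, e.g. stacked LIME coefficients; the coverage function follows the LIME paper's idea of rewarding explanations that touch globally important, not-yet-covered features):

```python
# Simplified sketch of SP-LIME's greedy submodular pick. W[i, j] is the weight of
# feature j in the local explanation of instance i (e.g., stacked LIME coefficients).
# Greedily pick k explanations that maximize coverage of important features.
import numpy as np

def submodular_pick(W, k):
    importance = np.sqrt(np.abs(W).sum(axis=0))        # global importance of each feature
    covered = np.zeros(W.shape[1], dtype=bool)
    chosen = []
    for _ in range(k):
        def gain(i):
            newly = (np.abs(W[i]) > 0) & ~covered       # features i would newly cover
            return importance[newly].sum()
        best = max((i for i in range(W.shape[0]) if i not in chosen), key=gain)
        chosen.append(best)
        covered |= np.abs(W[best]) > 0
    return chosen

# Usage: indices = submodular_pick(local_weights_matrix, k=5)
```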

Representation Based

-Derive model understanding by analysing intermediate representations of a DNN

-Determine model's reliance on "concepts" that are semantically meaningful to humans

Network Dissection:

  1. Identify a broad set of human-labelled visual concepts.

  2. Gather the responses of hidden variables (convolutional filters) to the known concepts.

  3. Quantify the alignment of each hidden variable-concept pair.

1. How similar are the representations at the lower layers of a model compared to its higher layers?

In a deep learning model (such as a convolutional neural network or a transformer), lower layers typically capture more local or generic features, while higher layers capture more abstract or task-specific features.

  • Lower layers of a model often encode basic patterns or features such as edges, textures, and shapes (in image models) or basic syntactic structures (in NLP models). These features are often general and applicable to a wide range of tasks.

  • Higher layers, on the other hand, combine these low-level features into more complex representations, encoding more abstract concepts that are more relevant to the specific task the model is trained on (e.g., object recognition, sentiment analysis).

Concept Activation Vectors (TCAV): You start with two sets of images: one set represents the notion of "stripes", and the other contains random examples. You then obtain the activation vectors for all of those images from the model (from the layers of interest), train a linear classifier that separates the two classes of activation vectors, and take the vector orthogonal to its decision boundary, pointing in the direction of the concept of interest. Finally, you compute directional derivatives along this vector to determine how important the notion of stripes is to a given prediction. The key point is that we went from having the concept of stripes in our minds to a vector that quantifies what the notion of stripes means to the model.
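A hedged sketch of that construction (acts_concept, acts_random, and grad_wrt_activations are assumed to be precomputed: activation vectors from one layer for the "stripes" and random image sets, and the gradient of a class score with respect to those activations for one prediction):

```python
# Sketch of a concept activation vector (CAV). `acts_concept` and `acts_random`
# are assumed pre-extracted activation vectors from one layer of the model for
# the "stripes" images and the random images, respectively (shape: n x d).
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(
    np.vstack([acts_concept, acts_random]),
    np.array([1] * len(acts_concept) + [0] * len(acts_random)),
)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # direction of the concept

# Concept sensitivity for one prediction: directional derivative of the class
# score along the CAV, approximated by a dot product with the gradient of the
# score w.r.t. the layer activations (`grad_wrt_activations`, assumed computed
# elsewhere with autograd).
sensitivity = float(np.dot(grad_wrt_activations, cav))
```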

Model Distillation

The next class of global explanation approaches is model distillation, which has been around for quite a long time and is especially popular for tabular data. Suppose you have a set of data points that you can feed to the model, and you can obtain the model's predictions on each of them: you have query access to the predictive model, along with the dataset and the corresponding model predictions. You pass all of this through an explainer algorithm, which approximates or mimics these predictions using a simpler, interpretable model that is much easier to understand. Essentially, you take the input instances and the model's predictions and fit a simpler model that mimics the black box.
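A minimal sketch of this idea (black_box_model and X are assumed placeholders): query the black box on the available data and fit a shallow decision tree to mimic its predictions; fidelity measures how closely the surrogate matches the black box rather than the ground truth.

```python
# Model distillation sketch: fit a simple, interpretable surrogate (a shallow
# decision tree) to mimic the predictions of a black-box model on available data.
# `black_box_model` and `X` are assumed placeholders.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

y_blackbox = black_box_model.predict(X)          # query access to the black box

surrogate = DecisionTreeClassifier(max_depth=4).fit(X, y_blackbox)

# Fidelity: how often the surrogate agrees with the black box (not with the truth).
print("fidelity:", accuracy_score(y_blackbox, surrogate.predict(X)))
print(export_text(surrogate))
```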



 
 
 
