Shap: Possible problem with using conditional vs marginal expectation for dropped features in Tree SHAP?

Created on 5 Nov 2019 · 8 Comments · Source: slundberg/shap

Hi Scott,

First, thanks for this great package!

I'd like to mention a recent paper that presents a conceptual critique of one aspect of the SHAP package, and to get your opinion on the matter. In "Feature relevance quantification in explainable AI: A causality problem" it is claimed that, when attributing the difference between f(x) and E[f(X)] to individual features, using the conditional expectation for dropped features is misleading. The critique rests on a causal argument, described in section 3 of the paper. The paper further claims that using a "marginal expectation", without conditioning, produces a more reliable attribution, and demonstrates this with a very simple example and some numerics.

Thanks again!

ashermullokandov

Most helpful comment

Thanks for asking! That paper, "Feature relevance quantification in explainable AI: A causality problem", provides a very nice way of justifying why it is a good idea to perturb sets of features independently from the other features we are not perturbing (which is what the SHAP package already does when possible). The argument is that we want to know what would happen if we set a feature's value to something (written as E[f(X) | do(X_S = x_S)]), not just what we expect when we observe a feature's value to be something (written as E[f(X) | X_S = x_S]).

To help clarify that SHAP aligns with causal interventional perturbations (as described in that paper), I have actually renamed the feature_dependence="independent" option to feature_perturbation="interventional". Ironically, I had used the "interventional" wording a while ago for the same reason this paper describes (#269), but then dropped it because I feared it might confuse people. Now that there is a good paper to point to for details, I think it is worth using it again.

It is important to note that one context where using feature_perturbation="interventional" is not possible is when we are using TreeExplainer(model) without giving a background dataset. When you don't give a background dataset, it is not possible to perturb features entirely causally (since you only know how the training samples traversed the tree, not the feature values of those samples). In this case SHAP defaults (now with a warning) to feature_perturbation="tree_path_dependent". This option is not exactly a causal explanation (as defined in that paper), but it is the best you can do without a background dataset, and importantly it does not suffer from the problem of putting weight on features not explicitly used by the model.

I should also note here that the paper "The many Shapley values for model explanation" makes the same observation: using conditional expectations can put weight on features the model does not explicitly use (since those features could be correlated with features the model does use). The "baseline Shapley" and "randomized baseline Shapley" in that paper are identical to SHAP with feature_perturbation="interventional". I personally find the causal connection a more natural way of thinking about the problem than using individual baselines, but they both lead to the same result.
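
To make the point about unused features concrete, here is a minimal sketch in plain Python. It is not the SHAP implementation: the two-row dataset, the toy model `f`, and the helper names `v_interventional`, `v_conditional`, and `shapley` are assumptions made purely for illustration. The model reads only the first feature, and the second feature is a perfect copy of it.

```python
from itertools import permutations

# Toy data (assumed): the second feature is an exact copy of the first.
data = [(0, 0), (1, 1)]
x = (1, 1)  # the instance to explain


def f(row):
    """Toy model (assumed): uses only the first feature."""
    return row[0]


def v_interventional(present, x):
    """E[f(X) | do(X_S = x_S)]: overwrite the present features in every
    background row and average the model output."""
    total = 0.0
    for row in data:
        mixed = tuple(x[i] if i in present else row[i] for i in range(len(x)))
        total += f(mixed)
    return total / len(data)


def v_conditional(present, x):
    """E[f(X) | X_S = x_S]: average the model output over the background
    rows that agree with x on the present features."""
    rows = [r for r in data if all(r[i] == x[i] for i in present)]
    return sum(f(r) for r in rows) / len(rows)


def shapley(value_fn, x):
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = frozenset()
        for i in order:
            with_i = present | {i}
            phi[i] += value_fn(with_i, x) - value_fn(present, x)
            present = with_i
    return [p / len(orderings) for p in phi]


print(shapley(v_interventional, x))  # [0.5, 0.0]
print(shapley(v_conditional, x))     # [0.25, 0.25]
```

In this toy case the conditional value function splits the attribution evenly between the two features, even though the model never reads the second one, while the interventional value function assigns the unused feature exactly 0.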

Finally, I should also note that there is more nuance here than I have the time to go into, but I will post a link to a tutorial that covers this once I finish it.

slundberg on 12 Nov 2019

🚀2

All 8 comments

I just want to note that the claim in "The many Shapley values for model explanation" that Linearity is violated in CES (conditional expectation Shapley) does not hold up. On the one hand, Example 4.7 is computed incorrectly; on the other hand, the claim in Remark 4.5, on which it is based, seems flawed. Linearity in the situation described in Section 4.3 of that paper follows from the linearity of conditional expectations.
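
The linearity argument in this comment can be written out as follows. This is only a sketch of the commenter's reasoning, assuming the value function is v_f(S) = E[f(X) | X_S = x_S] with the data distribution and the explained instance x held fixed; it does not reproduce or check Example 4.7 or Remark 4.5 of the paper.

```latex
% Value function of a linear combination of two models f and g
v_{af+bg}(S) = E\big[a f(X) + b g(X) \mid X_S = x_S\big]
             = a\, v_f(S) + b\, v_g(S)

% Shapley values are a fixed weighted sum of value-function differences
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
            \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}
            \big[ v(S \cup \{i\}) - v(S) \big]

% Hence the attribution is linear in the model
\phi_i(v_{af+bg}) = a\, \phi_i(v_f) + b\, \phi_i(v_g)
```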

Snake707 on 29 Nov 2019

Hi Scott, thanks for clarifying the marginal vs. conditional expectation issue. I have a quick question about the "background dataset" you mentioned when explaining why it doesn't work with TreeExplainer. What does that background dataset refer to? Is it the whole dataset, including both training and test data?

ak1997892 on 7 Aug 2020

@ak1997892 There is no problem with the background dataset. Scott is saying that you need a background dataset when using the interventional approach to feature perturbation. You do not need a background dataset for the tree-path dependent approach.

Shapley values are computed by introducing each feature, one at a time, into a conditional expectation function of the model's output, f_x(S) = E[f(X) | do(X_S = x_S)], and attributing the change produced at each step to the feature that was introduced; this process is then averaged over all feature orderings. The subset X_S is the coalition of present features (i.e. we know what values these features take for data instance x).

The background dataset is used for marginalization. The complementary subset of features X_C is missing (i.e. we do not know what values these features take for data instance x). According to the documentation, the absence of a feature is simulated by replacing the feature with one of the values it takes in the background dataset. In order to use feature_perturbation="interventional", you must have access to background data (for example, the training dataset). You can use feature_perturbation="tree_path_dependent" when no background data is provided because it can infer the background distribution from the structure of the model: the tree-path dependent approach is to follow the decision trees and use the number of training instances that ended up in each leaf to represent the background distribution. I guess that feature_perturbation="tree_path_dependent" is also useful when your aim is to understand the decision-making process of a model whose training data is unavailable.

You can find more info in the documentation at https://shap.readthedocs.io/en/stable/ (look for 'shap.TreeExplainer'). You will see that when you choose the feature_perturbation="interventional" approach, runtime scales linearly with the size of the background dataset you use. So, to limit computational cost, you may not want to use your whole training dataset (depending on how many instances you have in X_train). Cheers!

LEMTideman on 10 Aug 2020

👍1

@LEMTideman Thanks for the answer. I understand that if you choose 'tree_path_dependent', the SHAP values are calculated using conditional expectations. But since the conditional expectation was shown to be an incorrect way to compute exact SHAP values, I want to try the interventional approach for tree-based models.

In this situation, as Scott suggested, I should provide a background dataset. Which dataset should I use as the background dataset? For model training, I have training data and test data. Should I use my whole training data, i.e. X_train, as background data? And for explainer.shap_values(), am I also supposed to pass X_train, or a single row?

ak1997892 on 10 Aug 2020

I don't think the tree-path dependent approach is incorrect. It does not give you a causal explanation, so if that is what you want, go for the interventional approach. As I mentioned above, it is generally recommended to use a portion of your training data as background data. You can take a bigger or smaller proportion depending on the computational resources you have at your disposal.

LEMTideman on 10 Aug 2020

@LEMTideman Got it! Yes, we are more interested in the causal explanation. And if we use the conditional expectation, in some cases it will actually give a misleading result, i.e. a SHAP value of 0 does not necessarily mean there is no causal relationship. Will it matter whether we use a bigger or smaller proportion as background data?

ak1997892 on 10 Aug 2020

@ak1997892 This can be found through experimentation. I asked a similar question here, and then just ran some experiments to see what was sufficient for good explanations from my model. I found that 100 training samples seemed to perform just about as well as 10,000, and was far, far faster. Note that I am using DeepSHAP, but the idea should apply to TreeSHAP as well.
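
A small experiment along these lines can be sketched in plain Python. This is not DeepSHAP or TreeSHAP: the toy model `f`, the uniform background data, and the helper `interventional_shap` are assumptions for illustration, and the result only shows how one would compare background sizes for a given model, not what size is sufficient in general.

```python
import random


def f(x0, x1):
    """Toy model (assumed): one linear term and one interaction term."""
    return 2 * x0 + x0 * x1


def interventional_shap(x, background):
    """Exact interventional Shapley values for a two-feature model.
    Each value function is one pass over the background rows, so the
    cost grows linearly with the background size."""
    n = len(background)
    v_none = sum(f(b0, b1) for b0, b1 in background) / n
    v_0 = sum(f(x[0], b1) for _, b1 in background) / n
    v_1 = sum(f(b0, x[1]) for b0, _ in background) / n
    v_both = f(x[0], x[1])
    phi0 = 0.5 * (v_0 - v_none) + 0.5 * (v_both - v_1)
    phi1 = 0.5 * (v_1 - v_none) + 0.5 * (v_both - v_0)
    return phi0, phi1


random.seed(0)
population = [(random.random(), random.random()) for _ in range(10_000)]
x = (1.0, 1.0)  # the instance to explain

full = interventional_shap(x, population)
small = interventional_shap(x, random.sample(population, 100))

print("10,000 background rows:", full)
print("   100 background rows:", small)
```

Repeating the comparison with several random subsamples, and with your own model and data, gives a sense of how much the attributions move as the background shrinks.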