Shap: Shapley values for variable importance?

Created on 14 Jan 2018 · 16 Comments · Source: slundberg/shap

Hi,

I'm curious if there is some way to use Shapley values to estimate global variable importance metrics for a model? Say I use dataset X to train classifier M, then run shap on every observation in X. Is there some principled way to extract variable importance measures from the resulting Shapley matrix? I could imagine a few potential options, e.g. featurewise means of absolute values or featurewise variances, but these all seem a little ad hoc. Perhaps the SVD of the Shapley matrix may be informative here?

Thanks,
David

dswatson
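
The three candidate summaries mentioned in the question can be sketched with NumPy. This is only an illustration: the matrix is random stand-in data rather than real SHAP output, and reading the first right singular vector as an importance profile is an assumption, not an established method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a SHAP matrix: 200 observations x 3 features, with
# per-feature scales 2.0, 0.5 and 0.1 (synthetic, not real model output).
shap_values = rng.normal(size=(200, 3)) * np.array([2.0, 0.5, 0.1])

mean_abs = np.abs(shap_values).mean(axis=0)   # featurewise mean |SHAP|
variance = shap_values.var(axis=0)            # featurewise variance

# SVD: absolute loadings of the first right singular vector
# (abs() removes the arbitrary sign of singular vectors).
_, singular_values, vt = np.linalg.svd(shap_values, full_matrices=False)
svd_loading = np.abs(vt[0])

for name, score in [("mean_abs", mean_abs), ("variance", variance),
                    ("svd_loading", svd_loading)]:
    print(name, np.argsort(score)[::-1])      # features, most important first
```

On data like this all three summaries put the largest-scale feature first, but they can disagree on the remaining features, which is part of why the choice feels ad hoc.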

Most helpful comment

The summary_plot function is designed to get a good visual summary of which features are most important. By plotting the distribution of a feature's importance over all observations we get a much better idea of its effect than could be captured by a single number. That being said, I have used the average absolute SHAP value to rank the features in the summary plot.

When we collapse a feature's importance across all samples to a single number we are forced to decide what we want to measure. For example, is a feature with high effect on a small number of observations more important than a feature with a small effect on many observations? I think the answer to this is application dependent.

A related question is assigning a P-value to a feature, and this is something I hope to release a notebook on soon.

slundberg on 15 Jan 2018

👍8
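
The trade-off described above (a large effect on few observations versus a small effect on many) can be made concrete with two hypothetical SHAP columns. The numbers are invented for illustration, and the maximum absolute value is just one possible alternative summary, not something the library prescribes.

```python
import numpy as np

n = 1000
# Two hypothetical SHAP columns (synthetic numbers chosen for illustration):
rare_large = np.zeros(n)
rare_large[:10] = 5.0            # strong effect on 1% of observations
common_small = np.full(n, 0.1)   # weak effect on every observation
common_small[::2] *= -1          # alternate the sign

for name, col in [("rare_large", rare_large), ("common_small", common_small)]:
    print(name,
          "mean|SHAP| =", np.abs(col).mean(),
          "max|SHAP| =", np.abs(col).max())
```

The mean absolute value ranks common_small higher, while the maximum ranks rare_large higher, so the "right" ranking depends on which of the two behaviours matters in the application.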

All 16 comments

That's a good point, the full distributions are definitely more informative than a single number. P-values could be a nice touch though, so I'm curious to see what you come up with there. On a related note, I'm wondering if there's some straightforward way to calculate confidence intervals for Shapley value estimates? It seems tough to do without introducing some parametric assumptions... but then again, maybe those assumptions are justified?

dswatson on 15 Jan 2018

You might be able to make some progress on a parametric approximation if you knew the Hessian of the model you were explaining. But in practice, if you have access to the model, I would just recommend bootstrapping (retrain the model on many bootstrap resamples of your training data).

slundberg on 15 Jan 2018

Oof, that could be really time-consuming for a complex model trained on a large dataset. But I suppose it's really the only completely nonparametric way to estimate confidence intervals here. I'm curious how you'd define the null distribution for a feature's Shapley values? I'm imagining some sort of t-test like you'd use with linear model coefficients, but again, we would need standard errors to make those calculations. I guess I'll just have to wait for that notebook on P-values!

dswatson on 16 Jan 2018

Hi Scott,

Have you been able to look into the potential of a p-value and confidence intervals? I believe your package can be an excellent source for statistical tests, but I am not sure how to gather such values myself. This package by Susan Athey, from the causal inference community, adapts random forests specifically to derive such statistical niceties: https://github.com/swager/grf

Let me know what you think.

Regards,
Derek

snowde on 1 Jul 2018

I haven't yet posted anything on p-values, but you can use bootstrap resampling (retrain the model many times) to get confidence intervals on the SHAP values. I recommend using the dot product between the SHAP values for a single feature on the original dataset vs. the bootstrapped samples to measure a global feature confidence interval. The R package you point out looks nice. They seem to be specifically deriving some asymptotic distributions for confidence intervals, which avoids the need for bootstrap sampling.

slundberg on 2 Jul 2018

Can you elaborate on this? If not, feel free to refer me to the future notebook. "I recommend using the dot product between the SHAP values for a single feature on the original dataset vs. the bootstrapped samples to measure a global feature confidence interval."

snowde on 2 Jul 2018

If you train an XGBoost model and then explain it on your training dataset to get a matrix of shap_values, you can consider a single column the importance of a feature across all samples. It will have both positive and negative values. What you want to know is whether those values have a meaningful trend or are just driven by random noise. To evaluate this, you can retrain your model on a bootstrap resample of your dataset and then explain it again on your original training data to get another matrix of shap_values. If you take the dot product (or correlation) between the same column in each of the two matrices, you will see how well the impacts of a feature in the first model agree with the impacts of that same feature in the other model. By repeating this many times you will get an estimate of the global stability of a feature (if the correlation is consistently greater than 0, then you have a significant feature).

There may be better ways to do this, but that's what I have done.

slundberg on 2 Jul 2018

👍1
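
The bootstrap procedure described above can be sketched end to end. To keep the example self-contained, this sketch swaps in an ordinary least-squares model for XGBoost and uses the closed-form SHAP values of a linear model with independent features, coef[j] * (x[j] - mean(x[j])), instead of calling the shap package. The data, the helper name fit_and_explain, and the number of resamples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 500, 200
X = rng.normal(size=(n, 3))
# Feature 0 drives y; features 1 and 2 are pure noise (synthetic data).
y = 2.0 * X[:, 0] + rng.normal(size=n)

def fit_and_explain(X_train, y_train, X_explain):
    """Fit least squares, return SHAP values of the linear model on X_explain.

    Assumes independent features, where the SHAP value of feature j is
    coef[j] * (x[j] - mean(x[j])).
    """
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef = np.linalg.lstsq(A, y_train, rcond=None)[0][:-1]
    return coef * (X_explain - X_train.mean(axis=0))

base = fit_and_explain(X, y, X)
corrs = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)             # bootstrap resample of rows
    boot = fit_and_explain(X[idx], y[idx], X)    # explain the ORIGINAL data
    for j in range(X.shape[1]):
        corrs[b, j] = np.corrcoef(base[:, j], boot[:, j])[0, 1]

# Share of resamples whose SHAP column agrees in direction with the original.
frac_positive = (corrs > 0).mean(axis=0)
print(frac_positive)
```

For a linear model each correlation collapses to plus or minus one, so this only shows how often the sign of a feature's effect survives retraining. With a tree model the correlations would take intermediate values and their spread would be informative as well.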

Hi,

Sorry if my question is not so relevant. If I understood correctly, you suggest taking a dot product between the SHAP values and a column vector representing a feature across all samples of the dataset. But if the original dataset contains a large number of examples, this value will probably be larger than for the bootstrapped dataset. My question is: shouldn't the dot product be normalized?

Thanks,
Max

msobroza on 21 Aug 2018

@msobroza you can if you want. If you are just checking the sign then it won't matter.

slundberg on 23 Aug 2018

👍1
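
A small sketch of the normalization point: dividing the dot product by both vector norms (cosine similarity) removes the dependence on the number of rows without changing the sign. The two columns are synthetic, and the helper name cosine is chosen here for illustration. Note that in the procedure described earlier both SHAP matrices are computed on the same original rows, so they have the same length anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
col_a = rng.normal(size=100)                      # SHAP column, model A (synthetic)
col_b = 0.5 * col_a + 0.1 * rng.normal(size=100)  # SHAP column, model B (synthetic)

def cosine(u, v):
    """Dot product divided by both norms, so the result lies in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Repeating the rows 10 times mimics explaining a 10x larger dataset.
big_a, big_b = np.tile(col_a, 10), np.tile(col_b, 10)

print("dot:   ", col_a @ col_b, big_a @ big_b)    # grows with the row count
print("cosine:", cosine(col_a, col_b), cosine(big_a, big_b))  # unchanged
```

Pearson correlation additionally centers each column, so its sign can differ from that of the raw dot product when the columns are not mean-centered.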

@slundberg: First of all, thanks for the great package!
Is it possible to get a list / an array of average Shapley values (instead of a plot)?
EDIT: Thanks I found a solution!

AlxndrMlk on 4 Oct 2018

👍1

+1.
I would be happy to get a list of feature coefficients.

unnir on 19 Mar 2019

Just do np.abs(shap_values).mean(0)

slundberg on 20 Mar 2019

👍3
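
For anyone who wants the ranked list rather than the bare array, the one-liner above can be paired with feature names. The feature names and the random matrix here are placeholders. With real output, shap_values would come from the explainer and is assumed to be a 2D array of shape (n_samples, n_features). For some multi-class models the output may instead hold one such array per class, in which case the same expression would apply to each class separately.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["age", "income", "tenure"]      # hypothetical names
# Synthetic stand-in for explainer output, shape (n_samples, n_features).
shap_values = rng.normal(size=(300, 3)) * np.array([0.2, 1.5, 0.7])

importance = np.abs(shap_values).mean(0)
ranked = sorted(zip(feature_names, importance), key=lambda p: p[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```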

slundberg wrote: "If you take the dot product (or correlation) between two of the same columns in each matrix you will see how well the impacts of a feature in the first model agree with the impacts of that same feature in the other model. By repeating this many times you will get an estimate of the global stability of a feature."

This is a nice suggestion. However for it to be statistically conclusive, there would have to be a lot of retraining done on many resampled datasets. In situations where that exercise is prohibitive (large datasets, limited compute resources, time constraints) what would you suggest as an alternative?

Btw, many thanks for this wonderful resource that you have chosen to make open-source!

hrsuraj on 20 Aug 2019

@hrsuraj you are right that bootstrapping can be expensive computationally. I don't know of any good closed form alternatives here though. One suggestion is that you don't have to use an entire large dataset for the explanations (though you may still want to for training).

One other thought is that if you have some type of bootstrapped random forest model you might be able to just sub-select trees instead of retraining the whole model.

slundberg on 1 Sep 2019

slundberg wrote: "If you take the dot product (or correlation) between two of the same columns in each matrix you will see how well the impacts of a feature in the first model agree with the impacts of that same feature in the other model. By repeating this many times you will get an estimate of the global stability of a feature."

I'm wondering if it would be reasonable to estimate the significance of a variable for a fixed model by simply bootstrap resampling the calculation of np.abs(shap_values).mean(0) over a large set of shap_value samples (training or validation data, depending on your goals). This would give you a confidence interval on the mean absolute SHAP value for each feature, and would not require retraining. Of course, you would also lose the variation that comes from the model fitting procedure. Since this hasn't been mentioned, is this not a robust way of assessing whether a variable has a significant impact on the behavior of a model?
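
A minimal sketch of that idea, assuming a fixed model whose SHAP matrix is already computed. The matrix here is synthetic, and the percentile interval and the 1000 resamples are illustrative choices rather than recommendations from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the SHAP matrix of one fixed model.
shap_values = rng.normal(size=(400, 3)) * np.array([1.0, 0.3, 0.05])
n, n_boot = shap_values.shape[0], 1000

abs_vals = np.abs(shap_values)
boot_means = np.empty((n_boot, abs_vals.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)          # resample rows with replacement
    boot_means[b] = abs_vals[idx].mean(axis=0)

# 95% percentile interval for each feature's mean |SHAP|.
lower, upper = np.percentile(boot_means, [2.5, 97.5], axis=0)
point = abs_vals.mean(axis=0)
for j in range(abs_vals.shape[1]):
    print(f"feature {j}: {point[j]:.3f} [{lower[j]:.3f}, {upper[j]:.3f}]")
```

One caveat with this design: a mean of absolute values is positive whenever any SHAP value is nonzero, so the interval excludes zero even for a pure-noise feature. The intervals are therefore more useful for comparing features with each other than for testing a single feature against zero.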