Shap: how to extract the most important feature names?

Created on 6 Jun 2019 · 8 Comments · Source: slundberg/shap

We can visualize feature importance by calling summary_plot, but it only outputs the plot, not any text.

Is there any way to output the feature names sorted by their importance as defined by shap_values?

Something like:

def get_feature_importance(shape_values_matrix):
    ... ???
    return np.array([...])

x = get_feature_importance(shape_values_matrix)
# x == np.array(['feature_1', 'feature_2', ... 'feature_n'])

Thanks.

fyears

All 8 comments

The array returned by shap_values is parallel to the data array you explained the predictions on, meaning it has the same shape as the data matrix you apply the model to. That means the names of the features for each column are the same as for your data matrix. If you have those names around somewhere as a list, you can pass them to summary_plot using the feature_names argument.

slundberg on 6 Jun 2019

I understand that shap_values_matrix.shape == train_data_matrix.shape.

clarifying

My issue is about getting the feature names and their shap values, rather than visualizing them.

example

Take the demo in readme.md for example, we know the summary_plot plots the following figure:

shap.summary_plot(shap_values, X, plot_type="bar")

[Figure: boston_summary_plot_bar]

However, I am wondering whether there is any way to get the names of the features ordered by importance:

# is there any function working like get_feature_importance? Or how to implement it?
shap.get_feature_importance(shap_values, X) == np.array(['LSTAT', 'RM', 'CRIM', ... 'CHAS'])

Or even better, output the numeric values of each feature:

# is there any function working like get_feature_importance_2? Or how to implement it?
shap.get_feature_importance_2(shap_values, X) == {
  'LSTAT': 2.6,
  'RM': 1.7,
  ...,
  'CHAS': 0.0
}

Thank you so much.

fyears on 10 Jun 2019

Ah. Well, the numbers for the bar chart are just np.abs(shap_values).mean(0), so if you take X.columns[np.argsort(np.abs(shap_values).mean(0))] you should get the feature names in order of increasing importance (but note I didn't run that code to test it). You could then zip that up into a dictionary if you like.
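A minimal sketch of this suggestion, with the order reversed so that the most important feature comes first. The data frame, the column names and the SHAP matrix below are made-up stand-ins for illustration, not output from a real model.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (assumptions): 4 samples x 3 features.
X = pd.DataFrame(np.zeros((4, 3)), columns=["CRIM", "RM", "LSTAT"])
shap_values = np.array([
    [0.1, -1.5, 2.0],
    [-0.2, 2.0, -3.0],
    [0.1, -1.5, 3.0],
    [0.0, 2.0, -2.0],
])

# Mean absolute SHAP value per feature: the numbers behind the bar plot.
importance = np.abs(shap_values).mean(0)

# np.argsort sorts ascending, so reverse it to put the largest value first.
order = np.argsort(importance)[::-1]
ordered_names = X.columns[order].to_numpy()

# Feature name -> importance, in descending order of importance.
importance_by_name = dict(zip(ordered_names, importance[order]))
```

Here ordered_names plays the role of the array asked for in the question, and importance_by_name the role of the dictionary.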

slundberg on 20 Jun 2019

👍6

I usually do this to get feature importance.

import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature, paired with the column names
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

thoo on 3 Jul 2019

👍23 ❤6 😄2

I think the shap_values format has changed.
For me it was necessary to add a sum function:

import numpy as np
import pandas as pd

# If shap_values is a list with one matrix per class, this averages over the classes
vals = np.abs(shap_values).mean(0)

# sum(vals) then adds up the rows (samples), leaving one value per feature
feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

clappis on 27 Dec 2019

👍11 ❤1

I think it would be useful to have this as a "feature_importance()" method. I will look at adding it as a pull request if anyone agrees.

On a different but related note: I have only read summary information on "shap", and from my understanding the shap_values relate to individual observations (not the global level). As such, I wondered how to use them for an obscure cross-validation situation I have. I would like to find feature importances based on the training data of all the folds during CV (fully stratified folds, so no danger of missing variables/values in a particular fold, etc.). I wondered if I should:
1) get a normalised absolute mean of the shap values for each fold (i.e., the values of all features in that fold add up to 100%) and then average it across the folds;
2) for each fold, get the shap values and append them to a shared large table, which is then used to calculate the absolute mean;
3) none of the above?
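To make options 1 and 2 concrete, here is a rough sketch using made-up per-fold SHAP matrices. The name fold_shap_values and the numbers in it are hypothetical, and the sketch only shows how the two calculations differ, not which one is preferable.

```python
import numpy as np

# Hypothetical SHAP matrices for two folds, each (n_samples, n_features).
fold_shap_values = [
    np.array([[1.0, -3.0], [-1.0, 3.0]]),
    np.array([[2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [-2.0, 2.0]]),
]

# Option 1: mean |SHAP| per fold, normalised to sum to 1,
# then averaged across folds.
per_fold = [np.abs(sv).mean(0) for sv in fold_shap_values]
option_1 = np.mean([m / m.sum() for m in per_fold], axis=0)

# Option 2: stack all folds into one table, then take mean |SHAP| once.
pooled = np.vstack(fold_shap_values)
option_2 = np.abs(pooled).mean(0)
```

Option 1 weights every fold equally, whereas option 2 weights every sample equally, so the two can disagree when the folds differ in size.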

JohnStott on 18 Feb 2020

👍8

If the feature importances do not add up to one, is that incorrect?

TAMANNA08 on 1 Oct 2020

import numpy as np
vals= np.abs(shap_values).mean(0)

feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False,inplace=True)
feature_importance.head()

For anyone seeing this more recently: even if you are doing a binary classification problem, the returned shap_values is now a list of matrices (as mentioned in the TreeExplainer documentation), even though the two classes are complementary (in the case of probabilities, the positive class output is 1 minus the negative class output).

To reflect that in the code snippet above, I had to explicitly select the positive class in my case in order to land on the correct shap values:

import numpy as np
# Index 1 selects the matrix for the positive class
vals = np.abs(shap_values[1]).mean(0)
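Since the return format seems to vary, one defensive option is a small helper that accepts either a single matrix or a per-class list. This is only a sketch: mean_abs_shap and its class_index parameter are hypothetical names, not part of shap, and the input arrays are made up.

```python
import numpy as np

def mean_abs_shap(shap_values, class_index=1):
    """Mean |SHAP| per feature for a 2-D array or a per-class list."""
    if isinstance(shap_values, list):
        # One (n_samples, n_features) matrix per class: pick one class.
        shap_values = shap_values[class_index]
    return np.abs(np.asarray(shap_values)).mean(0)

# Made-up binary-classification output where the classes mirror each other.
positive = np.array([[0.2, -0.4], [-0.6, 0.8]])
per_class = [-positive, positive]

vals_from_list = mean_abs_shap(per_class)   # list of matrices
vals_from_array = mean_abs_shap(positive)   # single matrix
```

Either result can then be zipped with the column names exactly as in the earlier snippets.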

jadhosn on 23 Oct 2020
