Shap: how to extract the most important feature names?

Created on 6 Jun 2019 · 8 Comments · Source: slundberg/shap

We can visualize the feature importance by calling summary_plot; however, it only outputs the plot, not any text.

Is there any way to output the feature names sorted by their importance, as defined by shap_values?

Something like:

def get_feature_importance(shap_values_matrix):
    ... ???
    return np.array([...])

x = get_feature_importance(shap_values_matrix)
# x == np.array(['feature_1', 'feature_2', ... 'feature_n'])

Thanks.

Most helpful comment

I usually do this to get feature importance.

import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature (one value per column of the data)
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

All 8 comments

The array returned by shap_values is parallel to the data array you explained the predictions on, meaning it has the same shape as the data matrix you apply the model to. That means the feature names for each column are the same as for your data matrix. If you have those names around somewhere as a list, you can pass them to summary_plot using the feature_names argument.

I understand that shap_values_matrix.shape == train_data_matrix.shape.

clarifying

My issue is about getting the feature names and their shap values, rather than visualizing them.

example

Take the demo in readme.md as an example. We know that summary_plot plots the following figure:

shap.summary_plot(shap_values, X, plot_type="bar")

[Figure: boston_summary_plot_bar]

However, I am wondering whether there is any way to get the feature names ordered by importance:

# is there any function working like get_feature_importance? Or how to implement it?
shap.get_feature_importance(shap_values, X) == np.array(['LSTAT', 'RM', 'CRIM', ... 'CHAS']) 

Or even better, output the numeric values of each feature:

# is there any function working like get_feature_importance_2? Or how to implement it?
shap.get_feature_importance_2(shap_values, X) == {
  'LSTAT': 2.6,
  'RM': 1.7,
  ...,
  'CHAS': 0.0
}

Thank you so much.

Ah. Well the numbers for the bar chart are just np.abs(shap_values).mean(0), so if you take X.columns[np.argsort(np.abs(shap_values).mean(0))] you should get the feature names in order (but note I didn't run that code to test it). You could then zip that up into a dictionary if you like.
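
A minimal sketch of that suggestion, assuming shap_values is a single 2-D array of shape (n_samples, n_features) and the feature names are available as a list. Note that np.argsort sorts in ascending order, so the sketch reverses the result to put the most important feature first. The name get_feature_importance is the hypothetical one from the question, not part of the shap API, and the numbers are made up for illustration.

```python
import numpy as np

def get_feature_importance(shap_values, feature_names):
    """Return (names, importances) sorted from most to least important.

    Hypothetical helper, not a shap function. Importance is the mean
    absolute SHAP value per column.
    """
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1]  # argsort is ascending, so reverse
    names = np.asarray(feature_names)[order]
    return names, importance[order]

# Made-up SHAP matrix: 3 samples, 3 features
demo_shap = np.array([[ 0.5, -2.0, 0.0],
                      [-0.5,  1.0, 0.1],
                      [ 1.5, -3.0, 0.2]])
demo_names = ["CRIM", "LSTAT", "CHAS"]

names, values = get_feature_importance(demo_shap, demo_names)
# Dictionary form: feature name -> mean |SHAP|
importance_by_name = dict(zip(names.tolist(), values.tolist()))
```

With a pandas DataFrame X, passing X.columns as feature_names should work the same way.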

I usually do this to get feature importance.

import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature (one value per column of the data)
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

I think the shap_values format has changed.
For me it was necessary to add a sum:

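import numpy as np
import pandas as pd

# Assuming shap_values is a list with one matrix per class, mean(0) averages over the classes
vals = np.abs(shap_values).mean(0)

# sum(vals) then adds up the rows (samples), leaving one value per feature
feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()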

I think it would be useful to have this as a "feature_importance()" method, which I will look at adding as a pull request if anyone agrees?

On a different but related note: I have only read summary information on shap, and from my understanding the shap_values relate to individual observations rather than to the global level. I wondered how to use them in an unusual cross-validation situation I have. I would like to find feature importances based on the training data of all the folds during CV (fully stratified folds, so no danger of missing variables or values in a particular fold). I wondered if I should:
1) get a normalised absolute mean of the shap values for each fold (i.e., the feature importances in that fold add up to 100%) and then average it across the folds;
2) for each fold, get the shap values and append them to one large shared table, which is then used to calculate the absolute mean;
3) none of the above?
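
Options 1 and 2 can be compared side by side. This is only a sketch: the per-fold SHAP matrices below are random placeholders standing in for whatever an explainer returns on each fold, and fold_shap, option_1 and option_2 are illustrative names. It does not settle which option is more appropriate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
# Placeholder per-fold SHAP matrices with different numbers of rows per fold
fold_sizes = (30, 50, 20)
fold_shap = [rng.normal(size=(n, n_features)) for n in fold_sizes]

# Option 1: normalise mean |SHAP| within each fold so it sums to 1,
# then average the normalised vectors across folds.
per_fold = [np.abs(s).mean(axis=0) for s in fold_shap]
option_1 = np.mean([v / v.sum() for v in per_fold], axis=0)

# Option 2: stack all folds into one table, then take mean |SHAP| once.
# Folds with more rows contribute more to the result.
option_2 = np.abs(np.vstack(fold_shap)).mean(axis=0)

# Feature indices from most to least important under each option
ranking_1 = np.argsort(option_1)[::-1]
ranking_2 = np.argsort(option_2)[::-1]
```

Option 1 weights every fold equally and yields relative shares; option 2 keeps the units of the model output and weights each fold by its number of rows.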

If the feature importances do not add up to one, is that incorrect?

import numpy as np
import pandas as pd
vals = np.abs(shap_values).mean(0)

feature_importance = pd.DataFrame(list(zip(features.columns, sum(vals))), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()
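
Mean absolute SHAP values are expressed in the units of the model output, so nothing forces them to sum to one. If relative shares are wanted, the column can be rescaled, as in this sketch. The values and feature names below are made up for illustration, and the share column is an assumed name.

```python
import numpy as np
import pandas as pd

vals = np.array([2.6, 1.7, 0.7])  # made-up mean |SHAP| values
feature_importance = pd.DataFrame({
    "col_name": ["LSTAT", "RM", "CRIM"],
    "feature_importance_vals": vals,
})

# Rescale so the importances sum to 1 (relative shares)
total = feature_importance["feature_importance_vals"].sum()
feature_importance["share"] = feature_importance["feature_importance_vals"] / total
```

The rescaling changes only the scale, not the ordering of the features.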

For anyone seeing this more recently: even if you are doing a binary classification problem, the returned shap_values is now a list of matrices, one per class (as mentioned in the TreeExplainer documentation). This holds even though one class's shap values follow from the other's (for probability outputs, the positive-class values are the negatives of the negative-class values).

To reflect that in the code snippet above, I had to specifically select the positive class in my case in order to land on correct shap values:

import numpy as np
vals= np.abs(shap_values[1]).mean(0)
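
A sketch of a helper that covers both formats mentioned in this thread: a single matrix, or a list with one matrix per class. The name mean_abs_shap and the class_index parameter are hypothetical, and the arrays are made up. Other output formats are not handled here.

```python
import numpy as np

def mean_abs_shap(shap_values, class_index=1):
    """Mean |SHAP| per feature for either output format.

    Hypothetical helper: if shap_values is a list (one matrix per class),
    pick class_index; otherwise treat it as a single 2-D matrix.
    """
    if isinstance(shap_values, list):
        shap_values = shap_values[class_index]
    return np.abs(np.asarray(shap_values)).mean(axis=0)

# Made-up binary-classification output: class 0 is the negation of class 1
pos = np.array([[0.2, -0.4],
                [-0.1, 0.3]])
as_list = [-pos, pos]

vals_from_list = mean_abs_shap(as_list)   # selects the positive class
vals_from_array = mean_abs_shap(pos)      # plain matrix input
```

In this made-up case the two classes are exact negations of each other, so the absolute values, and therefore the importances, are the same whichever class is selected.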