Shap: Is SHAP appropriate for mostly categorical data?

Created on 25 Jun 2019  ·  14 Comments  ·  Source: slundberg/shap

Hi, firstly thanks for providing SHAP. I have some survey data that I've analyzed using CatBoost. Almost all the features are categorical. I noticed an error when loading a TreeExplainer (minimal reproduction below), and I've since seen this post labelled ToDo: https://github.com/slundberg/shap/issues/292 discussing how categorical variables can't be split in the way SHAP expects. So my question is: is SHAP appropriate for this type of data? Any advice here is appreciated!

Minimal example:

import catboost  # needed for catboost.__version__ below
from catboost import CatBoostClassifier
import pandas as pd
import shap
print('Cat is:', catboost.__version__)
print('Pandas is:', pd.__version__)
print('Shap is:', shap.__version__)

features = dict()
features['Cat'] = ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No']
features['Hat'] = [1, 1, 1, 1, 0, 0, 0, 0]

cat_hat = pd.DataFrame(features)

model=CatBoostClassifier(iterations=5, 
                         depth=1, 
                         learning_rate=0.1, 
                         loss_function='Logloss', 
                         verbose=False)
model.fit(cat_hat['Cat'].values, cat_hat['Hat'].values, cat_features=[0])


# load JS visualization code to notebook
shap.initjs()
# explain the model's predictions using SHAP values
# (same syntax works for LightGBM, CatBoost, and scikit-learn models)
explainer = shap.TreeExplainer(model)

output:

Cat is: 0.15.1
Pandas is: 0.24.2
Shap is: 0.29.2

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-151-6315d3774046> in <module>
     24 # explain the model's predictions using SHAP values
     25 # (same syntax works for LightGBM, CatBoost, and scikit-learn models)
---> 26 explainer = shap.TreeExplainer(model)

~/miniconda3/envs/lew_conda/lib/python3.7/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_dependence)
     94         self.feature_dependence = feature_dependence
     95         self.expected_value = None
---> 96         self.model = TreeEnsemble(model, self.data, self.data_missing)
     97 
     98         assert feature_dependence in feature_dependence_codes, "Invalid feature_dependence option!"

~/miniconda3/envs/lew_conda/lib/python3.7/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
    592             self.dtype = np.float32
    593             cb_loader = CatBoostTreeModelLoader(model)
--> 594             self.trees = cb_loader.get_trees(data=data, data_missing=data_missing)
    595             self.tree_output = "log_odds"
    596             self.objective = "binary_crossentropy"

~/miniconda3/envs/lew_conda/lib/python3.7/site-packages/shap/explainers/tree.py in get_trees(self, data, data_missing)
   1148             # split features and borders go from leafs to the root
   1149             for elem in self.loaded_cb_model['oblivious_trees'][tree_index]['splits']:
-> 1150                 split_features_index.append(elem['float_feature_index'])
   1151                 borders.append(elem['border'])
   1152 

KeyError: 'float_feature_index'

All 14 comments

@ljmartin I've seen this when I tried to upgrade. You may want to use catboost 0.14.2 and shap 0.28.3; these seem to be the only compatible versions among the recent releases.

@annaveronika can you remember any related changes in Catboost 0.15?

I hit the same problem. CatBoostTreeModelLoader in the latest shap cannot handle categorical features. Some of the splits in my model, such as the last one below, have no float_feature_index:

{'split_index': 1272, 'float_feature_index': 174, 'split_type': 'FloatFeature', 'border': 4296.5}
{'split_index': 887, 'float_feature_index': 156, 'split_type': 'FloatFeature', 'border': 0.024754902347922325}
{'split_index': 2758, 'split_type': 'OnlineCtr', 'border': 10.999999046325684, 'ctr_target_border_idx': 0}
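
To make the failure concrete, here is a small standalone sketch that uses the three split records above. It is not shap's loader code; `unsupported_splits` and `naive_feature_indices` are hypothetical helpers that only mimic the blind `elem['float_feature_index']` lookup from the traceback.

```python
# Split records copied from the comment above.
splits = [
    {'split_index': 1272, 'float_feature_index': 174, 'split_type': 'FloatFeature', 'border': 4296.5},
    {'split_index': 887, 'float_feature_index': 156, 'split_type': 'FloatFeature', 'border': 0.024754902347922325},
    {'split_index': 2758, 'split_type': 'OnlineCtr', 'border': 10.999999046325684, 'ctr_target_border_idx': 0},
]


def unsupported_splits(splits):
    """Return the splits that have no 'float_feature_index' key.

    Hypothetical helper for illustration; not part of shap or CatBoost.
    """
    return [s for s in splits if 'float_feature_index' not in s]


def naive_feature_indices(splits):
    """Mimic the failing loop from the traceback: index every split blindly."""
    return [s['float_feature_index'] for s in splits]


missing = unsupported_splits(splits)
print([s['split_type'] for s in missing])  # ['OnlineCtr']

try:
    naive_feature_indices(splits)
except KeyError as err:
    print('KeyError:', err)  # KeyError: 'float_feature_index'
```

Only the OnlineCtr record trips the lookup, which matches the KeyError in the original report.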

SHAP in principle works fine for categorical data. However there are two issues you can run into with it:

  1. CatBoost has a special way of splitting on categorical features that, when used, essentially creates new features that are not in the original set of input features. These features allow whole groups of categories to be sent one way or the other at a split. Support for these types of trees has not been added to SHAP yet. (@annaveronika correct me if you have support for this inside CatBoost)

  2. When providing categorical data, some packages allow you to give string values, which the package then encodes as numerical values for you. SHAP does not directly support this step, so you have to have already encoded the strings into numeric categorical values.
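
As a rough illustration of the second point, the sketch below encodes the string column from the minimal example into integer codes before it would be handed to a model or explainer. `encode_categories` is a hypothetical helper written for this example, not a shap or CatBoost function.

```python
def encode_categories(values):
    """Map each distinct string to an integer code, in order of first appearance.

    Hypothetical helper for illustration; returns (codes, mapping).
    """
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        codes.append(mapping[v])
    return codes, mapping


# The 'Cat' column from the minimal example in the question.
cat = ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No']
codes, mapping = encode_categories(cat)
print(codes)    # [0, 0, 0, 0, 0, 1, 1, 1]
print(mapping)  # {'Yes': 0, 'No': 1}
```

Keep the mapping around so that codes can be translated back to labels when reading the explanations. Note that integer codes impose an arbitrary order on the categories, and a tree model can split on that order as if it were meaningful.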

@slundberg I am having a similar problem: when I try to use SHAP with a CatBoost model that was trained on data with categorical variables, I get the "float_feature_index" error. Is there a workaround for CatBoost? Without encoding the data, I mean, since CatBoost's way of dealing with categorical features is one of its advantages and I would like to avoid manual encoding.

We are looking into it right now.

@annaveronika thanks for your quick answer. I also saw your post stating that CatBoost has shapley values implemented (https://github.com/slundberg/shap/issues/58). Why have you deleted the tutorial for that? Is there an issue?

It looks like there are some problems in the shap library code when calculating SHAP values for CatBoost.
For now I suggest using CatBoost's own code for that; an example is in the link above.
It looks like this:

from catboost import Pool

model.fit(X, y)
shap_values = model.get_feature_importance(Pool(X, y), type='ShapValues')

And we'll try to fix the code, so that TreeExplainer also works.
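
For readers using this workaround: as far as I understand CatBoost's documentation, for a binary classifier the ShapValues result has one row per object and n_features + 1 columns, with the expected value in the last column. The sketch below assumes that layout and uses plain lists with made-up numbers instead of real model output; `split_shap_matrix` is a hypothetical helper.

```python
def split_shap_matrix(rows):
    """Separate per-feature contributions from the trailing expected-value column.

    Assumes each row has n_features + 1 entries, with the expected value last.
    Hypothetical helper; not part of CatBoost or shap.
    """
    contributions = [list(row[:-1]) for row in rows]
    expected_values = [row[-1] for row in rows]
    return contributions, expected_values


# Made-up numbers for two objects and two features, for illustration only.
rows = [
    [0.25, -0.5, 1.0],
    [-0.25, 0.5, 1.0],
]
contributions, expected_values = split_shap_matrix(rows)

# Contributions plus the expected value add up to the raw model output per object.
raw_outputs = [sum(c) + e for c, e in zip(contributions, expected_values)]
print(raw_outputs)  # [0.75, 1.25]
```

With a real model, `rows` would be the array returned by the `get_feature_importance` call above, and the per-feature part is what shap's plotting functions expect as SHAP values.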

The error occurs in get_trees, at the line shown in the traceback above, because the loader does not take into account that CatBoost has categorical features. The following code fixes it:

for elem in self.loaded_cb_model['oblivious_trees'][tree_index]['splits']:
    split_type = elem.get('split_type')
    if split_type == 'FloatFeature':
        # numeric split: feature index plus threshold
        split_feature_index = elem.get('float_feature_index')
        borders.append(elem['border'])
    elif split_type == 'OneHotFeature':
        # one-hot split on a categorical feature: compared against a category value
        split_feature_index = elem.get('cat_feature_index')
        borders.append(elem['value'])
    else:
        # remaining split types, e.g. OnlineCtr
        split_feature_index = elem.get('ctr_target_border_idx')
        borders.append(elem['border'])
    split_features_index.append(split_feature_index)

But why is a dense representation of the trees built for a CatBoost model at all? There is no need, since all SHAP values are available from CatBoost itself.

I suggest simply restoring the behavior of 0.29.1 (just remove these lines).

It would be really nice to add some tests for catboost to shap library to not break anything in the future.

I agree we need to add some catboost tests. I'll look into this on Monday. (The reason an internal version is built is to support some extra features available in the shap impl.)

@slundberg Do you plan for a fix?

Is this issue fixed in the latest version of CatBoost? I am trying to pass string values in my categorical variables, and it throws the error _catboost.CatBoostError: Bad value for num_feature[0,0]="ABC": Cannot convert 'b'ABC'' to float
I encounter the same error even when using get_feature_importance to retrieve SHAP values.
A similar issue, #1042, was raised recently in the CatBoost repository. I don't want to manually encode my categorical features.
