Shap: Feature Request: add support for sklearn.ensemble.HistGradientBoostingRegressor / Classifier to TreeExplainer

Created on 3 Feb 2020 · 6 Comments · Source: slundberg/shap

Steps to reproduce:

```python
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

# load JS visualization code to notebook
shap.initjs()

# train a tree-based model
X, y = shap.datasets.diabetes()

# model = GradientBoostingRegressor().fit(X, y)  # works for exact GBRT
model = HistGradientBoostingRegressor().fit(X, y)

# explain the model's predictions using SHAP
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True
# to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])
```

```python-traceback
/tmp/shap_demo.py in
15 # explain the model's predictions using SHAP
16
---> 17 explainer = shap.TreeExplainer(model)
18
19 shap_values = explainer.shap_values(X)

~/miniconda3/envs/pylatest/lib/python3.7/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:

~/miniconda3/envs/pylatest/lib/python3.7/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects

Exception: Model type not yet supported by TreeExplainer:
```

## Implementation notes

The code of the new `HistGradientBoostingRegressor` estimator is different from other tree-based models in scikit-learn, but it should be quite easy to adapt the explainer code to leverage the structure of the `model._predictors` collection. The source code of the `TreePredictor` data structure is here:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/predictor.py
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx


The nodes of the predictors are detailed in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/common.pxd
and are mapped to the `PREDICTOR_RECORD_DTYPE` array datatype:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/common.pyx

```python
PREDICTOR_RECORD_DTYPE = np.dtype([
    ('value', Y_DTYPE),
    ('count', np.uint32),
    ('feature_idx', np.uint32),
    ('threshold', X_DTYPE),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('gain', Y_DTYPE),
    ('depth', np.uint32),
    ('is_leaf', np.uint8),
    ('bin_threshold', X_BINNED_DTYPE),
])
```

This is considered private API of scikit-learn, but it should be quite easy to update the explainer code in the unlikely case that it changes.
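
To make the record layout above concrete, here is a minimal sketch of how one tree stored in such an array could be walked to a leaf. It is an illustration, not the scikit-learn or SHAP implementation: `NODE_DTYPE` is a hypothetical toy dtype that copies only a subset of the fields quoted above, `np.float64` stands in for `Y_DTYPE` and `X_DTYPE`, the root is assumed to sit at index 0, and `predict_one` is a made-up helper name. In a fitted model the arrays would presumably come from each `TreePredictor` in `model._predictors`, which is an assumption about the private API.

```python
import numpy as np

# Hypothetical toy dtype: a subset of the PREDICTOR_RECORD_DTYPE fields,
# with float64 standing in for Y_DTYPE / X_DTYPE (assumption).
NODE_DTYPE = np.dtype([
    ('value', np.float64),
    ('feature_idx', np.uint32),
    ('threshold', np.float64),
    ('missing_go_to_left', np.uint8),
    ('left', np.uint32),
    ('right', np.uint32),
    ('is_leaf', np.uint8),
])


def predict_one(nodes, x):
    """Walk one tree from the root (assumed at index 0) to a leaf."""
    idx = 0
    while not nodes[idx]['is_leaf']:
        node = nodes[idx]
        v = x[node['feature_idx']]
        if np.isnan(v):
            # Missing values follow the direction recorded on the node.
            go_left = bool(node['missing_go_to_left'])
        else:
            # Same "<=" convention discussed in this thread.
            go_left = v <= node['threshold']
        idx = node['left'] if go_left else node['right']
    return nodes[idx]['value']


# Hand-built stump: split on feature 0 at 0.5; nodes 1 and 2 are leaves.
nodes = np.zeros(3, dtype=NODE_DTYPE)
nodes[0] = (0.0, 0, 0.5, 1, 1, 2, 0)
nodes[1] = (-1.0, 0, 0.0, 0, 0, 0, 1)
nodes[2] = (2.0, 0, 0.0, 0, 0, 0, 1)

left_value = predict_one(nodes, np.array([0.5]))    # on the threshold: left
right_value = predict_one(nodes, np.array([0.7]))   # above the threshold: right
nan_value = predict_one(nodes, np.array([np.nan]))  # missing: left for this stump
```

An explainer would need the same fields (`feature_idx`, `threshold`, `left`, `right`, `value`, plus the per-node sample counts) to build its own dense tree representation.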

ogrisel

All 6 comments

Thanks for noting this! I decided to go ahead and add support, but I still have an issue: when a data point lands exactly on a threshold, SHAP's C code does something different from sklearn's HistGradientBoostingRegressor. We are both doing <= and using np.float64, so I'll need to keep digging into what is up there.

slundberg on 7 Feb 2020

Found the issue: it looks like GradientBoostingRegressor uses np.float32 input types, but the Hist version uses sklearn.ensemble._hist_gradient_boosting.common.X_DTYPE, which is np.float64.

slundberg on 7 Feb 2020
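
A small sketch of why the input dtype matters when a point lands exactly on a threshold. The value 0.1 and the variable names are chosen only for illustration; this is not taken from the SHAP or scikit-learn code. Casting the input to float32 rounds it to the nearest float32, which for 0.1 is slightly larger than the float64 value, so the `<=` test flips.

```python
import numpy as np

threshold = np.float64(0.1)   # split threshold stored as float64
x64 = np.float64(0.1)         # data point exactly on the threshold
x32 = np.float32(x64)         # the same point after a float32 cast

# Comparison done entirely in float64: the point goes left.
goes_left_64 = bool(x64 <= threshold)

# float32(0.1) is about 0.100000001490116, which exceeds float64(0.1),
# so the same point goes right once it has been cast.
goes_left_32 = bool(np.float64(x32) <= threshold)

print(goes_left_64, goes_left_32)
```

Only points that sit within float32 rounding distance of a threshold are affected, which is consistent with the mismatch showing up only for points landing exactly on a threshold.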

Indeed. Maybe we should make it possible to also use float32 thresholds if the training data was originally 32-bit float (before binning).

ogrisel on 7 Feb 2020

I just pushed support for HistGradientBoostingRegressor and HistGradientBoostingClassifier. The one outstanding issue is that explaining the loss or probability output of a multi-output HistGradientBoostingClassifier is not yet supported (you can only explain the margin), since it would require a significant refactor of some of our C++ code to support transformations that depend on multiple outputs simultaneously (like softmax). So I'll leave that for future work.

slundberg on 8 Feb 2020
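
To illustrate what "depends on multiple outputs simultaneously" means for softmax, here is a minimal sketch in plain Python. The margins are made-up numbers and `softmax` is a local helper, not a SHAP function. Changing only the margin of class 1 changes the probability of class 0, so the probability of one class cannot be computed from that class's margin alone.

```python
import math


def softmax(margins):
    """Numerically stable softmax over a list of per-class margins."""
    shift = max(margins)
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


p_before = softmax([1.0, 0.0, 0.0])
p_after = softmax([1.0, 3.0, 0.0])  # only the margin of class 1 changed

# The margin of class 0 is identical in both calls, yet its probability drops.
print(p_before[0], p_after[0])
```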

Thank you very much @slundberg!

ogrisel on 8 Feb 2020

Maybe a new release would be nice?