Hi,
I am using the latest XGBoost 1.1.1 with the scikit-learn interface. I trained a model, saved it with joblib.dump, and loaded it again with joblib.load. When I try to use the shap package to compute feature importance, the line explainer = shap.TreeExplainer(model.get_booster()) raises the following error:
Traceback (most recent call last):
File "boost.py", line 802, in <module>
explainer = shap.TreeExplainer(model.get_booster())
File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 121, in __init__
self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 726, in __init__
xgb_loader = XGBTreeModelLoader(self.original_model)
File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 1326, in __init__
self.name_obj = self.read_str(self.name_obj_len)
File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 1456, in read_str
val = self.buf[self.pos:self.pos+size].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte
The error does not occur with the previous xgboost 1.0.2, so I suspect it is caused by a new model format in the latest version that shap does not recognize.
Can you reproduce this issue?
Many thanks
I can reproduce this with xgboost 1.1.1 and 1.1.0. I think it has to do with the serialization changes in 1.1.0: https://github.com/dmlc/xgboost/releases/tag/v1.1.0
Looking forward to a fix for this issue in the next SHAP release...
I get the same error when running the example code from README.md:
import xgboost
import shap
# load JS visualization code to notebook
shap.initjs()
# train XGBoost model
X,y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-b96725122f4c> in <module>
14 # explain the model's predictions using SHAP
15 # (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
---> 16 explainer = shap.TreeExplainer(model)
17 shap_values = explainer.shap_values(X)
18
~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
119 self.feature_perturbation = feature_perturbation
120 self.expected_value = None
--> 121 self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
122 self.model_output = model_output
123 #self.model_output = self.model.model_output # this allows the TreeEnsemble to translate model outputs types by how it loads the model
~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing, model_output)
724 self.original_model = model
725 self.model_type = "xgboost"
--> 726 xgb_loader = XGBTreeModelLoader(self.original_model)
727 self.trees = xgb_loader.get_trees(data=data, data_missing=data_missing)
728 self.base_offset = xgb_loader.base_score
~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, xgb_model)
1324 self.read_arr('i', 29) # reserved
1325 self.name_obj_len = self.read('Q')
-> 1326 self.name_obj = self.read_str(self.name_obj_len)
1327 self.name_gbm_len = self.read('Q')
1328 self.name_gbm = self.read_str(self.name_gbm_len)
~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in read_str(self, size)
1454
1455 def read_str(self, size):
-> 1456 val = self.buf[self.pos:self.pos+size].decode('utf-8')
1457 self.pos += size
1458 return val
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 342: invalid start byte
I found a workaround in a comment on a similar issue:
model_barr = model.save_raw()[4:]
model.save_raw = lambda: model_barr
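For anyone who wants a slightly more defensive version of this: the sketch below strips the leading bytes only when they are actually there, so it should be harmless on older xgboost versions too. It assumes the 4 extra bytes are a `binf` marker at the start of the raw buffer (my reading of the 1.1.0 serialization change, not something confirmed in this thread). The names `BINF_MAGIC`, `strip_binf_header` and `patch_save_raw` are made up for illustration.

```python
# Assumption: xgboost >= 1.1.0 prefixes the raw binary buffer with this marker.
BINF_MAGIC = b"binf"


def strip_binf_header(raw):
    """Return the raw model bytes without a leading marker, if one is present."""
    raw = bytearray(raw)
    if raw.startswith(BINF_MAGIC):
        return raw[len(BINF_MAGIC):]
    return raw


def patch_save_raw(booster):
    """Replace booster.save_raw so it returns the buffer without the marker.

    `booster` is any object with a save_raw() method, e.g. an xgboost Booster.
    """
    stripped = strip_binf_header(booster.save_raw())
    # Accept and ignore any arguments so callers with extra parameters still work.
    booster.save_raw = lambda *args, **kwargs: stripped
    return booster
```

Usage would be `explainer = shap.TreeExplainer(patch_save_raw(model))` for a Booster trained with `xgboost.train`.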
I tried your workaround but a new error came up:
AttributeError: 'XGBClassifier' object has no attribute 'save_raw'
I have xgboost 1.1.1, what is your version?
Sorry, my fault, I was using the scikit-learn interface, so I needed to first extract the "booster" object.
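import shap
import xgboost as xgb
model = xgb.XGBClassifier()
# train_x / train_y are placeholders for your own training data;
# get_booster() fails on a model that has not been fitted (or loaded)
model.fit(train_x, train_y)
booster = model.get_booster()
# drop the first 4 bytes of the raw buffer, which shap cannot parse
model2 = booster.save_raw()[4:]
booster.save_raw = lambda: model2
explainer = shap.TreeExplainer(booster)
values = explainer.shap_values(test_x, test_y)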
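Since the AttributeError above comes from calling save_raw on the scikit-learn wrapper instead of the Booster, a small helper can hide the difference. This is only a sketch: `as_booster` is a hypothetical name, and it relies on nothing more than the wrapper exposing `get_booster()`, as `XGBClassifier` does in the snippet above.

```python
def as_booster(model):
    """Return the underlying Booster for a scikit-learn style wrapper.

    If `model` has a callable get_booster(), use it; otherwise assume
    `model` already is a Booster and return it unchanged.
    """
    get_booster = getattr(model, "get_booster", None)
    if callable(get_booster):
        return get_booster()
    return model
```

With this, the same workaround code can take either a fitted `XGBClassifier` or a Booster from `xgboost.train`: call `as_booster(model)` first, then patch `save_raw` on the result.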
Thanks! :+1:
import shap
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=42)
mymodel = xgb.fit(X_train, y_train)
mybooster = mymodel.get_booster()
# drop the first 4 bytes of the raw model buffer, which shap cannot parse
model_bytearray = mybooster.save_raw()[4:]
def myfun(self=None):
    return model_bytearray
mybooster.save_raw = myfun
explainer = shap.TreeExplainer(mybooster)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)