Shap: UnicodeDecodeError: 'utf-8' codec can't decode byte

Created on 9 Jun 2020 · 7 Comments · Source: slundberg/shap

Hi,
I am using the latest XGBoost (1.1.1) with the scikit-learn interface. I trained a model, saved it with joblib.dump, and loaded it again with joblib.load. When I try to compute feature importance with the shap package, the line explainer = shap.TreeExplainer(model.get_booster()) raises the following error:

Traceback (most recent call last):
  File "boost.py", line 802, in <module>
    explainer = shap.TreeExplainer(model.get_booster())
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 121, in __init__
    self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 726, in __init__
    xgb_loader = XGBTreeModelLoader(self.original_model)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 1326, in __init__
    self.name_obj = self.read_str(self.name_obj_len)
  File "/home/guido/.virtualenvs/ml/lib/python3.6/site-packages/shap/explainers/tree.py", line 1456, in read_str
    val = self.buf[self.pos:self.pos+size].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte

This error does not occur with the previous xgboost 1.0.2, so I think it is related to a new data format used by the latest version that shap does not recognize.

Can you reproduce this issue?
Many thanks
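
The traceback fails while decoding a length-prefixed string, which is the typical symptom of a reader starting at the wrong offset in a binary buffer. The sketch below reproduces that failure mode with a toy layout; the layout, the `HEAD` prefix, and the `read_name` helper are hypothetical and are not xgboost's real format or shap's real loader.

```python
import struct

def read_name(buf: bytes, pos: int) -> str:
    """Read an 8-byte little-endian length, then that many UTF-8 bytes."""
    (size,) = struct.unpack_from("<Q", buf, pos)
    start = pos + 8
    return buf[start:start + size].decode("utf-8")

# Toy layout (hypothetical): a length-prefixed name followed by binary
# payload that happens to contain 0xff bytes.
name = b"reg:squarederror"
old_layout = struct.pack("<Q", len(name)) + name + b"\xff" * 32

# The same data with a 4-byte prefix the old reader knows nothing about.
new_layout = b"HEAD" + old_layout

print(read_name(old_layout, 0))      # the reader and the layout agree

try:
    read_name(new_layout, 0)         # the length field is read 4 bytes early
except UnicodeDecodeError as exc:
    print("misaligned read failed:", exc)

print(read_name(new_layout, 4))      # skipping the prefix realigns the reader
```

In the toy case the misread length is huge, so the slice runs into the binary payload and `decode` hits a `0xff` byte, much like the error reported above.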

GuidoBartoli

👀1

All 7 comments

I can reproduce this with xgboost 1.1.1 and 1.1.0. I think it is related to the serialization changes in 1.1.0: https://github.com/dmlc/xgboost/releases/tag/v1.1.0

jklaise on 10 Jun 2020
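
Since the thread reports the error for 1.1.0 and 1.1.1 but not for 1.0.2, any workaround is best applied only on the affected versions. A minimal sketch of such a check follows; `reported_as_affected` is a hypothetical helper, and treating every 1.1.x release alike is an assumption, not something confirmed here.

```python
def reported_as_affected(version: str) -> bool:
    """Return True for xgboost versions this thread reports as failing.

    Assumption: all 1.1.x patch releases behave like 1.1.0 and 1.1.1.
    Nothing is claimed about later releases.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (1, 1)

for v in ("1.0.2", "1.1.0", "1.1.1"):
    print(v, reported_as_affected(v))
```

In a real script the argument would be `xgboost.__version__`.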

👍2

> I can reproduce this with xgboost 1.1.1 and 1.1.0. I think it is related to the serialization changes in 1.1.0: https://github.com/dmlc/xgboost/releases/tag/v1.1.0

Looking forward to a fix for this issue in the next SHAP release...

GuidoBartoli on 15 Jun 2020

I get the same error when running the example code from README.md:

import xgboost
import shap

# load JS visualization code to notebook
shap.initjs()

# train XGBoost model
X,y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])

Setting feature_perturbation = "tree_path_dependent" because no background data was given.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-b96725122f4c> in <module>
     14 # explain the model's predictions using SHAP
     15 # (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
---> 16 explainer = shap.TreeExplainer(model)
     17 shap_values = explainer.shap_values(X)
     18 

~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
    119         self.feature_perturbation = feature_perturbation
    120         self.expected_value = None
--> 121         self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
    122         self.model_output = model_output
    123         #self.model_output = self.model.model_output # this allows the TreeEnsemble to translate model outputs types by how it loads the model

~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing, model_output)
    724             self.original_model = model
    725             self.model_type = "xgboost"
--> 726             xgb_loader = XGBTreeModelLoader(self.original_model)
    727             self.trees = xgb_loader.get_trees(data=data, data_missing=data_missing)
    728             self.base_offset = xgb_loader.base_score

~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in __init__(self, xgb_model)
   1324         self.read_arr('i', 29) # reserved
   1325         self.name_obj_len = self.read('Q')
-> 1326         self.name_obj = self.read_str(self.name_obj_len)
   1327         self.name_gbm_len = self.read('Q')
   1328         self.name_gbm = self.read_str(self.name_gbm_len)

~/.pyenv/versions/3.8.2/envs/py382/lib/python3.8/site-packages/shap/explainers/tree.py in read_str(self, size)
   1454 
   1455     def read_str(self, size):
-> 1456         val = self.buf[self.pos:self.pos+size].decode('utf-8')
   1457         self.pos += size
   1458         return val

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 342: invalid start byte

homofortis on 19 Jun 2020

I found a workaround in a comment on a similar issue:

model_barr = model.save_raw()[4:]
model.save_raw = lambda: model_barr
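
The two lines above work because an attribute assigned on an instance shadows the method of the same name defined on its class. The sketch below shows that mechanism with a toy class; `ToyBooster` and its byte string are hypothetical stand-ins, not the real xgboost Booster.

```python
class ToyBooster:
    """Hypothetical stand-in for a booster; not the real xgboost class."""

    def save_raw(self) -> bytes:
        return b"HEADpayload"

booster = ToyBooster()
trimmed = booster.save_raw()[4:]   # call the original method once, drop 4 bytes

# An attribute set on the instance takes precedence over the class method,
# so later calls to booster.save_raw() return the trimmed bytes.
booster.save_raw = lambda: trimmed

print(booster.save_raw())          # b'payload'
print(ToyBooster().save_raw())     # b'HEADpayload'
```

The patch lives on that one object only, so it has to be reapplied to every booster that is created or reloaded later.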

homofortis on 19 Jun 2020

👍1

I tried your workaround but a new error came up:
AttributeError: 'XGBClassifier' object has no attribute 'save_raw'

I have xgboost 1.1.1; which version are you using?

GuidoBartoli on 20 Jul 2020

Sorry, my fault: I was using the scikit-learn interface, so I first needed to extract the "booster" object.

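import shap
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(train_x, train_y)  # train_x, train_y: your training data; get_booster() needs a fitted model
booster = model.get_booster()
model2 = booster.save_raw()[4:]  # raw model buffer without its first 4 bytes
booster.save_raw = lambda: model2
explainer = shap.TreeExplainer(booster)
values = explainer.shap_values(test_x, test_y)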
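
The AttributeError above comes from calling save_raw on the scikit-learn wrapper instead of on the booster it holds. A small helper can accept either kind of object; `as_booster` and the toy classes below are hypothetical and rely only on whether a get_booster() method is present.

```python
def as_booster(model):
    """Return the underlying booster if `model` is a scikit-learn style wrapper.

    Hypothetical helper: it only checks for a callable get_booster attribute.
    """
    get_booster = getattr(model, "get_booster", None)
    return get_booster() if callable(get_booster) else model


# Toy stand-ins so the sketch runs without xgboost installed.
class ToyBooster:
    def save_raw(self) -> bytes:
        return b"HEADpayload"


class ToyWrapper:
    def __init__(self):
        self._booster = ToyBooster()

    def get_booster(self) -> ToyBooster:
        return self._booster


wrapper = ToyWrapper()
print(hasattr(wrapper, "save_raw"))              # False, like XGBClassifier above
print(as_booster(wrapper).save_raw()[4:])        # b'payload'
print(as_booster(ToyBooster()).save_raw()[4:])   # b'payload'
```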

Thanks! 👍

GuidoBartoli on 20 Jul 2020

import shap
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=42)
mymodel = xgb.fit(X_train, y_train)

mybooster = mymodel.get_booster()
# drop the first 4 bytes of the raw model buffer
model_bytearray = mybooster.save_raw()[4:]

def myfun(self=None):
    return model_bytearray

# replace save_raw on this booster so shap reads the trimmed buffer
mybooster.save_raw = myfun

explainer = shap.TreeExplainer(mybooster)
shap_values = explainer.shap_values(X_train)

shap.summary_plot(shap_values, X_train)