First of all, congrats on the amazing shap package, @slundberg.
I understand that the following code produces the SHAP values for every feature of every observation in my training set:
explainer = shap.TreeExplainer(my_model)
shap_values = explainer.shap_values(X_train)
Then it is possible to plot the SHAP values of every feature for a single observation:
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:], link = 'logit')
However, my question is: how can I compute the SHAP values for a prediction on the test set?
Ideally, I would also like to see the 'importance' of each feature in every single prediction. In other words: which feature has the highest SHAP value in each prediction?
I believe that if you use your test set, you get the SHAP values for each prediction in that dataset.
@ddearauj, you mean pass the test set when computing the SHAP values? Something like this?
explainer = shap.TreeExplainer(my_model)
shap_values_test = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value, shap_values_test[0,:], X_test.iloc[0,:], link = 'logit')
That's right. my_model contains data about the training set that it was built on, so TreeExplainer uses that for computing conditional expectations. You pass to explainer.shap_values whatever samples you want to explain, which is often a test set.
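To answer the second part of the question (which feature has the largest SHAP value in each prediction), here is a minimal sketch. It assumes shap_values_test is a single 2-D array with one row per sample and one column per feature; some shap versions return one array per class for classifiers, in which case you would pick one of them first. The helper name top_feature_per_row and the example numbers are made up for illustration and are not part of shap.

```python
def top_feature_per_row(shap_values, feature_names):
    """For each row, return (feature name, SHAP value) of the feature
    with the largest absolute SHAP value.

    Hypothetical helper, not part of shap. Works on a 2-D NumPy array
    or a list of lists.
    """
    top = []
    for row in shap_values:
        values = [float(v) for v in row]
        # Rank by magnitude so strongly negative contributions count too.
        j = max(range(len(values)), key=lambda k: abs(values[k]))
        top.append((feature_names[j], values[j]))
    return top


# Illustrative stand-in for explainer.shap_values(X_test); not real output.
example_shap_values = [
    [0.1, -0.7, 0.3],
    [0.5, 0.2, -0.1],
]
example_features = ["age", "income", "tenure"]

top_features = top_feature_per_row(example_shap_values, example_features)
for i, (name, value) in enumerate(top_features):
    print(f"row {i}: {name} ({value:+.2f})")
```

With a real model you would call it as top_feature_per_row(shap_values_test, list(X_test.columns)). Ranking by absolute value is a design choice: drop the abs() if you only care about the feature that pushes the prediction up the most.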
Thank you! Could you also clarify what the base_value is? I have read here that it should correspond to the mean of the predictions of your classifier; however, I am using a LightGBM model and unfortunately that is not the case.
For TreeExplainer it is the mean of the output of the trees, which for logistic regression in LightGBM is the mean of the log-odds predicted by the model. This is different from the mean of the probabilities, because the logistic function is nonlinear.
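A small numeric sketch of why the two differ. The log-odds below are made-up numbers, not output from any model; with a real LightGBM model, the raw scores could probably be obtained with predict(..., raw_score=True), but that step is an assumption about your setup and is not shown here.

```python
import math


def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up raw model outputs (log-odds) for four samples.
log_odds = [-3.0, -1.0, 0.0, 4.0]

# What the base value corresponds to: the mean in log-odds space.
mean_log_odds = sum(log_odds) / len(log_odds)

# Converting that mean to a probability...
base_value_as_probability = sigmoid(mean_log_odds)

# ...is not the same as averaging the predicted probabilities.
mean_probability = sum(sigmoid(z) for z in log_odds) / len(log_odds)

print("sigmoid(mean log-odds):", round(base_value_as_probability, 4))
print("mean of probabilities: ", round(mean_probability, 4))
```

So if you compare the base value against the average predicted probability, expect a mismatch; compare it against the average raw score instead.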
Thank you, @slundberg! Completely understood!