Shap: KernelExplainer with textual data using pipeline

Created on 7 Nov 2018 · 15 Comments · Source: slundberg/shap

Hi, I looked around and found no examples that use KernelExplainer to explain predictions on text data, so I decided to test it out using the dataset I found on https://www.superdatascience.com/machine-learning/.

I ran into a problem in the KernelExplainer step at the very end, and I believe the cause is the way I pass the data and model to the explainer. Can anyone advise what I should change to make the explainer work? Thanks.

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import nltk

#Load the data
os.chdir('C:\\Users\\Win\\Desktop\\MyLearning\\Explainability\\SHAP')
review = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

#Clean the data
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def clean_text(df_text_column, data):   
    corpus = []
    for i in range(0, len(data)):
        text = re.sub('[^a-zA-Z]', ' ', df_text_column[i])
        text = text.lower()
        text = text.split()
        ps = PorterStemmer()
        text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        corpus.append(text)
    return corpus

X = pd.DataFrame({'Review':clean_text(review['Review'],review)})['Review']
y = review['Liked']

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Creating the pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer() 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
from sklearn.pipeline import make_pipeline
np.random.seed(0)
rf_pipe = make_pipeline(vect, rf)
rf_pipe.steps
rf_pipe.fit(X_train, y_train)

y_pred = rf_pipe.predict(X_test)
y_prob = rf_pipe.predict_proba(X_test)

#Performance Metrics
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) #Accuracy
metrics.roc_auc_score(y_test, y_prob[:, 1]) #ROC-AUC score

# use Kernel SHAP to explain test set predictions
import shap
explainer = shap.KernelExplainer(rf_pipe.predict_proba, X_train, link="logit")
shap_values = explainer.shap_values(X_test, nsamples=100)

# plot the SHAP values
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], X_test.iloc[0,:], link="logit")

All 15 comments

Could you post the error as well? One comment is that you should use something like shap.kmeans(X_train, 1) instead of passing the whole training dataset as the background.

It says AttributeError: 'numpy.ndarray' object has no attribute 'lower' upon running the shap.KernelExplainer.

When I tried to use shap.kmeans(X_train, 1), it says IndexError: tuple index out of range. I'm guessing I have to do some tweaks to the pipeline, but I'm still not able to solve it.

Ah. The issue is that you are passing text documents where each feature is really a word, but KernelExplainer expects a matrix or DataFrame where each feature is a column. I would like to directly support text input at some point, but for now you would have to do it manually. I would recommend explaining just the RF in your pipeline (using TreeExplainer), which will give you attributions for each word in the matrix produced by TfidfVectorizer. You could then use the get_feature_names method of TfidfVectorizer to see how those columns map to your original words.

Ideally we should be able to just take the whole pipeline object and do all this automatically. Along the same lines, it would be great to have a "text_plot" routine to display the importance of each text sample. But I don't have any timeline for either of these (unless someone wants to send a PR for it).
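A minimal sketch of the manual route described above, assuming a toy corpus in place of the cleaned reviews. The documents, labels, and the helper name explain_with_tree_shap are made up for illustration; note also that newer scikit-learn releases renamed get_feature_names to get_feature_names_out, so the sketch tries both.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned review text and the 'Liked' labels.
docs = ["good food", "bad service", "great food great place", "terrible bad food"]
labels = [1, 0, 1, 0]

# Vectorize outside the pipeline so the random forest can be explained on its own.
vect = TfidfVectorizer()
X_vect = vect.fit_transform(docs)  # SciPy sparse TF-IDF matrix
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_vect, labels)

# Column i of the TF-IDF matrix corresponds to words[i].
if hasattr(vect, "get_feature_names_out"):
    words = list(vect.get_feature_names_out())
else:
    words = list(vect.get_feature_names())

# Dense copy of the features; fine for a small vocabulary like this one.
X_dense = X_vect.toarray()


def explain_with_tree_shap(model, X):
    """Return SHAP values for a fitted tree model on a dense feature matrix."""
    import shap  # imported here so the vectorization steps above run without shap

    explainer = shap.TreeExplainer(model)
    return explainer.shap_values(X)
```

Calling explain_with_tree_shap(rf, X_dense) should then give attributions whose feature axis lines up with words. Depending on the shap version, the result for a classifier may be a list with one array per class or a single array with an extra class dimension, so it is worth checking the shape before plotting.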

Got it. Thanks for your inputs on this!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import nltk
from pathlib import Path

#Load the data

review = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

review = pd.read_csv("/home/qpaas/Desktop/rr.csv", sep='\t')

pd.read_csv("/home/mypc/Documents/pcap/s2csv")

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
def clean_text(df_text_column, data):
    corpus = []
    for i in range(0, len(data)):
        text = re.sub('[^a-zA-Z]', ' ', df_text_column[i])
        text = text.lower()
        text = text.split()
        ps = PorterStemmer()
        text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        corpus.append(text)
    return corpus

X = pd.DataFrame({'Review':clean_text(review['Review'],review)})['Review']
y = review['Liked']

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Creating the pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
from sklearn.pipeline import make_pipeline
np.random.seed(0)
rf_pipe = make_pipeline(vect, rf)
rf_pipe.steps
rf_pipe.fit(X_train, y_train)

y_pred = rf_pipe.predict(X_test)
y_prob = rf_pipe.predict_proba(X_test)

from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) #Accuracy
metrics.roc_auc_score(y_test, y_prob[:, 1]) #ROC-AUC score

# use Kernel SHAP to explain test set predictions

import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

Hi, I have tried replicating the above, but I am having trouble executing it. The error is as follows; could you please help me solve it?


ValueError Traceback (most recent call last)
in
2 import shap
3 explainer = shap.TreeExplainer(rf)
----> 4 shap_values = explainer.shap_values(X_test)

~/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in shap_values(self, X, y, tree_limit, approximate)
213 X = X.reshape(1, X.shape[0])
214 if X.dtype != self.model.dtype:
--> 215 X = X.astype(self.model.dtype)
216 X_missing = np.isnan(X, dtype=np.bool)
217 assert str(type(X)).endswith("'numpy.ndarray'>"), "Unknown instance type: " + str(type(X))

ValueError: could not convert string to float: 'present food aw'

I have also tried passing the data in vectorized form:
shap_values = explainer.shap_values(test_vectors)

which gives me the following response:

~/anaconda3/lib/python3.7/site-packages/shap/explainers/linear.py in __init__(self, model, data, nsamples, feature_dependence)
70 raise Exception("A background data distribution must be provided!")
71
---> 72 self.expected_value = np.dot(self.coef, self.mean) + self.intercept
73
74 self.M = len(self.mean)

AttributeError: 'LinearExplainer' object has no attribute 'mean'

The TreeExplainer error occurs because you are passing the text rather than the vectorized input the RandomForest expects. For the second error, I don't see enough detail to know what might be going wrong; perhaps a self-contained example would help. Also try master, since LinearExplainer recently got some updates.
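To make the first point concrete, here is a small sketch of pulling the fitted steps back out of a pipeline so that the forest is given the numeric TF-IDF matrix instead of raw strings. The toy documents and labels are invented for illustration; the step names are the ones make_pipeline derives from the lowercased class names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the cleaned reviews.
train_docs = ["good food", "bad service", "great place", "terrible food"]
train_labels = [1, 0, 1, 0]
test_docs = ["good place", "bad food"]

pipe = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipe.fit(train_docs, train_labels)

# make_pipeline names each step after its lowercased class name.
vect = pipe.named_steps["tfidfvectorizer"]
rf = pipe.named_steps["randomforestclassifier"]

# This numeric matrix (not the list of strings) is what the forest was trained on.
X_test_vect = vect.transform(test_docs)

# Sanity check: the forest on vectorized input reproduces the pipeline's output.
same = np.allclose(rf.predict_proba(X_test_vect), pipe.predict_proba(test_docs))
```

An explainer built on rf should then be handed X_test_vect (or a dense copy of it) rather than test_docs.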

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import nltk
from pathlib import Path

review = pd.read_csv("/home/qpaas/Desktop/rr.csv", sep='\t')

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
def clean_text(df_text_column, data):
    corpus = []
    for i in range(0, len(data)):
        text = re.sub('[^a-zA-Z]', ' ', df_text_column[i])
        text = text.lower()
        text = text.split()
        ps = PorterStemmer()
        text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
        text = ' '.join(text)
        corpus.append(text)
    return corpus
X = pd.DataFrame({'Review':clean_text(review['Review'],review)})['Review']
y = review['Liked']

As suggested, I have tried vectorising the data before passing it to the model:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_vect, y_train)
from sklearn.pipeline import make_pipeline

y_pred = rf.predict(X_test_vect)
y_prob = rf.predict_proba(X_test_vect)

from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) #Accuracy
metrics.roc_auc_score(y_test, y_prob[:, 1]) #ROC-AUC score

# use Kernel SHAP to explain test set predictions

import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test_vect)

This throws the following error; can you help me work out what actually went wrong?
TypeError Traceback (most recent call last)
in
2 import shap
3 explainer = shap.TreeExplainer(rf)
----> 4 shap_values = explainer.shap_values(X_test_vect)

~/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in shap_values(self, X, y, tree_limit, approximate)
214 if X.dtype != self.model.dtype:
215 X = X.astype(self.model.dtype)
--> 216 X_missing = np.isnan(X, dtype=np.bool)
217 assert str(type(X)).endswith("'numpy.ndarray'>"), "Unknown instance type: " + str(type(X))
218 assert len(X.shape) == 2, "Passed input data matrix X must have 1 or 2 dimensions!"

TypeError: No loop matching the specified signature and casting
was found for ufunc isnan

Thanks
Hemanth

Ahah!
I haven't loaded shap with appropriate things. I have executed the code well "as of now"
It was great to see the level of enthusiasm from the quick response from your end.
thanks Slundberg.

Sorry for necrobumping but @HemanthKumar-Thope I'm having the same issue as you, how did you fix it exactly?

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    print(shap_tree_explanation(trained_classifier, test_data))
  File "/home/architect/git_repositories/dissertation/text-classification/sensitivity_classifier/explainers.py", line 45, in shap_tree_explanation
    shap_values = explainer.shap_values(transformed_test_data)
  File "/home/architect/.local/share/virtualenvs/text-classification-YBkDjDw-/lib/python3.7/site-packages/shap/explainers/tree.py", line 227, in shap_values
    X_missing = np.isnan(X, dtype=np.bool)
TypeError: No loop matching the specified signature and casting was found for ufunc isnan

My explainer function is as follows:

def shap_tree_explanation(pipeline, data):
    # extract trained classifier from Pipeline and create tree explainer from it
    explainer = shap.TreeExplainer(pipeline["clf"])

    # transform data with pipeline's vectorizer
    transformed_data = pipeline["vect"].transform(data)

    # calculate SHAP values for tree model
    shap_values = explainer.shap_values(transformed_data)

    return shap_values

My pipeline is defined as such:

    pipeline = Pipeline(
        steps=(
            ("vect", get_vectorizer()),
            # HashingVectorizer(
            #     norm='l2',
            #     stop_words="english",
            #     strip_accents="unicode",
            #     lowercase=True,
            # ),
            ("clf", classifier),
        ),
        verbose=True,
    )

The data argument is a list containing a single string: the document whose prediction I want to explain.

@slundberg I am getting the same error as @HemanthKumar-Thope, but I can't find a solution, and none is given in this thread. Please help if you have any idea about it.

@PrasadKshirsagar Did you solve it? I have the same error. I just call
shap.TreeExplainer(model).shap_values(X_train) and get
TypeError: No loop matching the specified signature and casting was found for ufunc isnan.

I ran into this error too, and my input X was sparse. Since you are dealing with text, I guess yours is too?

A simple fix to try: X.toarray()
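A small sketch of that check, assuming the features come out of a scikit-learn text vectorizer as a SciPy sparse matrix. The helper name to_dense and the tiny example matrix are made up for illustration.

```python
import numpy as np
from scipy import sparse


def to_dense(X):
    """Return X as a dense ndarray, converting SciPy sparse input if needed."""
    if sparse.issparse(X):
        return X.toarray()
    return np.asarray(X)


# Tiny sparse matrix standing in for vectorizer output.
X_sparse = sparse.csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0]]))
X_dense = to_dense(X_sparse)

# np.isnan works elementwise on the dense array.
has_missing = bool(np.isnan(X_dense).any())
```

Keep in mind that a dense copy stores every zero explicitly, so with a large vocabulary it may be more practical to convert only the rows being explained.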
