Shap: How to speed up SHAP computation

Created on 8 May 2018 · 21 Comments · Source: slundberg/shap

Hi,

The package itself is really interesting and intuitive to use.
However, I notice that it takes quite a long time to run on a neural network with a practical number of features and samples when using KernelExplainer.
Question: is there any documentation that explains how to properly choose

the sample size fed into shap.KernelExplainer, and the guiding principle for choosing these samples;
the number of samples fed into explainer.shap_values (I assume it has something to do with the number of features, i.e. columns)?

For example, I have over 1 million records with 400 raw features (continuous + unencoded categorical).
Any suggestion would be appreciated.

[Screenshot taken 2018-05-08 at 2:50:41 pm]

The screenshot above shows an example using 50 samples in KernelExplainer as typical feature values, and 2000 cases with 500 repeats in the shap_values perturbation.

bingojojstu

👍1

Most helpful comment

FYI, the progress bar is only for KernelExplainer and SamplingExplainer. And for anyone else landing on this thread: DeepExplainer and GradientExplainer are now good alternatives in SHAP if you are using a neural network.

slundberg on 20 Aug 2018

👍6

All 21 comments

Explaining models with large input spaces (such as your 400 features) is challenging. I have two suggestions: 1) Try using a smaller set of reference values (instead of 50). If there is a value that means "missing", such as 0 (as in bag-of-words data), using it as the reference will speed things up a lot. 2) Think about whether there are ways to group the model features into units that can be treated together (such as one-hot groups, superpixels, etc.).

If that is not enough you could consider using a deep learning specific method such as DeepLIFT or integrated gradients. The output of DeepLIFT can be viewed as an approximation of SHAP values. Note both methods only handle a single reference value (rather than integrating over 50 like you are doing now).

Finally, without knowing the type of data it is hard to say, but you might also consider a GBM model like XGBoost. If your features are not correlated the way pixels, sequences, etc. are, GBMs are often state of the art, and Tree SHAP will explain instances very quickly for GBMs.
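As a rough illustration of why a smaller reference set helps, here is a back-of-the-envelope cost model for KernelExplainer. It assumes that each explained row costs about nsamples x (number of background rows) model evaluations, and that the "500 repeats" in the question correspond to nsamples=500. The function name is made up for this sketch and is not part of SHAP.

```python
def kernel_shap_model_rows(n_explained, nsamples, n_background):
    """Approximate number of rows passed to the model by KernelExplainer.

    Assumption: every explained row is evaluated on roughly
    nsamples * n_background synthetic rows (one copy of the background
    set per sampled feature coalition).
    """
    return n_explained * nsamples * n_background


# Setup from the question: 2000 rows, 500 samples each, 50 background rows.
rows_50 = kernel_shap_model_rows(2000, 500, 50)

# Same job with a 10-row background: 5x fewer model evaluations.
rows_10 = kernel_shap_model_rows(2000, 500, 10)

print(f"{rows_50:,} vs {rows_10:,} model evaluations")
```

Under this model the cost scales linearly with the background size, so summarizing the background (for example with shap.kmeans(X, k)) or using a single "missing" reference row cuts the work proportionally.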

slundberg on 9 May 2018

@bingojojstu, I was wondering how to show the progress bar while running. Is this something implemented in SHAP? Thanks.

ybeybe on 20 Aug 2018

@ybeybe yes it is implemented in SHAP.

bingojojstu on 20 Aug 2018

slundberg on 20 Aug 2018

👍6

I'm trying to implement shap_values with XGBoost and it is still taking forever. I have 35 features and limited the number of samples to 500. I set the tree_limit to 10, but I don't really understand what that input means - doesn't XGBoost provide a single tree to use? Why would there be multiple trees used? Any advice on how to decrease the run time would be super helpful.

Thanks

athorneak13 on 9 Dec 2018

Could you post the code you are using to explain the model? TreeExplainer is usually very quick. XGBoost trains many trees (often thousands), so tree_limit is useful if you want to explain the model with a different early stopping criterion than was used while training the model.

slundberg on 9 Dec 2018

I am experiencing the same issue with a relatively large dataset (3M rows x 30 features): the shap_values method is taking ages. Is the library expected to handle such sizes? The underlying model is LightGBM, and from what I have read around here, the implementation for gradient boosting models should be fast.

jarandaf on 10 Jan 2019

@jarandaf Fast is relative :) 3M is a lot of rows. If you post the runtime for 10k rows I can tell you if that sounds reasonable. Also if you have lots of trees in your model then the approximate=True option in TreeExplainer's shap_values method can give you fast approximate answers.
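A minimal sketch of how one might measure the runtime for 10k rows and extrapolate it to the full dataset. The helper names are hypothetical, and a stand-in function is used in place of explainer.shap_values so the sketch runs without a trained model. The extrapolation assumes the per-row cost stays constant, which may not hold at millions of rows.

```python
import time

import numpy as np


def seconds_per_row(explain_fn, X, n_rows=10_000):
    """Time explain_fn on the first n_rows of X and return seconds per row."""
    batch = X[:n_rows]
    start = time.perf_counter()
    explain_fn(batch)
    elapsed = time.perf_counter() - start
    return elapsed / len(batch)


def estimated_hours(per_row_seconds, total_rows):
    """Linear extrapolation of the measured per-row cost to total_rows."""
    return per_row_seconds * total_rows / 3600.0


if __name__ == "__main__":
    def fake_explain(batch):
        # Stand-in for explainer.shap_values (assumption for this sketch).
        return np.zeros_like(batch)

    X = np.random.default_rng(0).normal(size=(20_000, 30))
    per_row = seconds_per_row(fake_explain, X)
    print(f"{per_row:.2e} s/row, ~{estimated_hours(per_row, 3_000_000):.4f} h for 3M rows")
```

With a real TreeExplainer, passing `lambda b: explainer.shap_values(b)` and `lambda b: explainer.shap_values(b, approximate=True)` in turn would show how much the approximate option saves.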

slundberg on 19 Jan 2019

Any update on @athorneak13's query regarding SHAP TreeExplainer performance? I'm facing the same issue right now.

Code snippet -

import shap
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

The data set is quite small (~150 data points), and the model training process was quick too. Is there a way to use parallel processing here? It's taking the same time on a machine with 2 vCPUs and 8 GB of memory as on one with 8 vCPUs and 30 GB of memory. I noticed that it's using only 1 core and around 2.5 GB of memory in both cases.

astrixg on 19 Feb 2019

@astrixg Could you provide more details about the model? It seems like it should run quickly for only 150 samples.

slundberg on 19 Feb 2019

from xgboost import XGBRegressor
model = XGBRegressor(colsample_bytree=0.5, colsample_bylevel=0.9, learning_rate=0.5, min_child_weight=1, subsample=0.5, reg_lambda=0.7, max_depth=7, gamma=0, n_estimators=500)

no. of features = 97

astrixg on 20 Feb 2019

@slundberg Previously the shap_values generation was taking around 8-10 mins, which has now dropped to around 4 mins. While I haven't made any changes to the data or the model, I did upgrade the SHAP version, but not sure if that is what caused the change.

astrixg on 22 Feb 2019

You can try just calling shap inside XGBoost directly with its pred_contribs=True option. But with the model you gave it should take less than a second to explain 150 samples, so even 4 min seems like something else is going on.

slundberg on 2 Mar 2019

👍1

I'm also finding shap_values to be prohibitively and unusably expensive on real-world datasets. It takes a day of runtime on 20 cores at 100% each to finish on a small dataset with only 3.9M rows (~150 features each). Considering many real-world datasets have 100M+ rows, this is actually far more expensive than training the models, and it could likely take literally months to run shap_values on all of our datasets.

To clarify, this is all for a tree-based GBT model.

zpconn on 6 Mar 2019

@zpconn How many trees are you using? That does seem a bit slower than I would expect. You might also try a batch of 100k and compare the per-sample runtime. I have not tested the performance on datasets with millions of samples, and it could cause some memory-based slowdowns.

My other question is what do you plan to do with 100 million explanations? If you are just using summary statistics and plots then explaining just 10k will give plenty of detail. You can also try the approximate method if you have a lot of trees.
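A small sketch of the subsampling idea: explain a random 10k-row subset instead of the full dataset. The helper name and the placeholder data are assumptions for illustration. For a pandas DataFrame, X.iloc[idx] would replace X[idx].

```python
import numpy as np


def sample_rows(X, n=10_000, seed=0):
    """Return a random subset of at most n rows of X, without replacement."""
    rng = np.random.default_rng(seed)
    n = min(n, len(X))
    # Sorting keeps the sampled rows in their original order.
    idx = np.sort(rng.choice(len(X), size=n, replace=False))
    return X[idx]


X = np.random.default_rng(1).normal(size=(50_000, 8))  # placeholder data
X_small = sample_rows(X, n=10_000)
print(X_small.shape)
```

The subset would then be passed to explainer.shap_values(X_small), and the result to shap.summary_plot together with X_small.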

slundberg on 7 Mar 2019

👍1

Are there any parameters to control/force parallelization? "shap_values" seems to only load about 25% (=12 cores) of my CPU.
I'm running a custom model with KernelExplainer (at about 1.5 it/s) and it basically takes forever (3 days), even though the predict takes only a second on its own. Is there any architectural way to make it work better with SHAP (apart from obviously making it simply predict faster)?

dp-dpo on 24 May 2019

Reduce the background sample size.

bingojojstu on 11 Jun 2019

👍1

Hi @slundberg, great package, many thanks.

the case

I'm using a dataset with 12K observations and ~160K features. The neural net has ~200K parameters. The wall time is shown in the table below. It seems to grow much faster than linearly with the number of samples (roughly as a power law, per the estimate below), and not linearly as described in the docstring of shap.DeepExplainer. Using all 12K samples might take ~31 days. While monitoring the machine load I observed that only 2-3 CPU cores (of 16) were used and the GPUs weren't touched. I therefore tried to run the job in a tf.Session() with different numbers of cores, but that had no effect (see the second table below).

the question

The only way I can currently imagine to get a robust estimate of the importances is to run shap.DeepExplainer over multiple bootstrap samples, as a tradeoff between speed and accuracy. Do you have another or better idea? Thank you.

walltime as a function of samplesize

size [%] | #samples | runtime [min] | runtime [days]
-- | -- | -- | --
1 | 122 | 1.6 (= 95 s) |
5 | 610 | 66 |
100 | 12,218 | 44,262 (estimated)* | ~31 (estimated)*

* estimation:
log(66/1.6) / log(5) = 2.17
66 min * (100/5)^2.17 = ~31d
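For reference, the same power-law extrapolation can be recomputed from the two measured points in the table (1.6 min at 1% and 66 min at 5%). This is only a sketch: it assumes runtime follows t = c * p^k, and two points are a thin basis for a fit. With these inputs the exponent comes out at about 2.3 rather than 2.17 (the latter matches log(66/2) / log(5), i.e. 1.6 min rounded up to 2 min), which pushes the extrapolated full run to roughly 45 days or more. Either way the growth is polynomial (roughly quadratic) rather than exponential.

```python
import math

# Measured points from the table: (share of data in %, runtime in minutes).
p1, t1 = 1, 1.6
p2, t2 = 5, 66

# Exponent of the power law t = c * p**k through the two points.
k = math.log(t2 / t1) / math.log(p2 / p1)

# Extrapolate from the 5% measurement to 100% of the data.
t_full_min = t2 * (100 / p2) ** k
t_full_days = t_full_min / (60 * 24)

print(f"k = {k:.2f}, full run ~ {t_full_days:.0f} days")
```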

walltime as a function of #CPUS (using 1% = 122 samples)

PARAMS['CPUS'] | CPUs used (htop) | runtime of shap.DeepExplainer
-- | -- | --
8 | 2 - 3 | 96 sec
1 | 2 - 3 | 95 sec

related code (an attempt to force CPU-only execution and to vary the number of CPUs)

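if PARAMS['CPUS'] > 0:
    import os
    # Hide all GPUs so TensorFlow runs on the CPU only.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import tensorflow as tf
    import keras
    # Note: device_count sets the number of CPU *devices*, not the number of
    # threads; thread pools are sized by intra_op_parallelism_threads and
    # inter_op_parallelism_threads in ConfigProto.
    config = tf.ConfigProto(device_count={"CPU": PARAMS['CPUS']})
    keras.backend.tensorflow_backend.set_session(tf.Session(config=config))
    # Separate session, used only to log device placement.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    print(sess)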

thomasmooon on 1 Jul 2019

You can try just calling shap inside XGBoost directly with its pred_contribs=True option. But with the model you gave it should take less than a second to explain 150 samples, so even 4 min seems like something else is going on.

But it looks like pred_contribs only works for xgb.Booster. I trained an XGBClassifier model on 1.1M+ rows with 310 features. It takes 200+ minutes to get the SHAP values, and pred_contribs does not work for XGBClassifier.
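One possible workaround, sketched under the assumption that the model is a fitted scikit-learn style XGBoost wrapper: the wrapper exposes its underlying Booster through get_booster(), and that Booster accepts pred_contribs=True. The helper name xgb_contribs is made up for this sketch.

```python
def xgb_contribs(model, X):
    """SHAP-style contributions via XGBoost's built-in pred_contribs.

    model: a fitted XGBClassifier or XGBRegressor.
    X: 2-D array of rows to explain.
    """
    import xgboost as xgb  # imported here so the helper loads without it

    booster = model.get_booster()
    # For regression and binary classification the result has shape
    # (n_rows, n_features + 1); the last column is the bias term.
    return booster.predict(xgb.DMatrix(X), pred_contribs=True)
```

Usage would be `contribs = xgb_contribs(model, X)`. Each row of contributions sums to the model's raw margin output for that row, not to the predicted probability.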

s-wxy on 15 Jul 2020

Are the following code snippets equivalent, and if so, couldn't we easily parallelize each iteration of the for loop in the first?

shap_values = []
for idx in range(num_rows):
    shap_val = explainer.shap_values(data.iloc[idx,:])
    shap_values.append(shap_val)

shap_values = explainer.shap_values(data.iloc[:num_rows,:])
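A hedged sketch of the chunked version of this idea. For a deterministic explainer such as TreeExplainer, rows should be explained independently, so the two snippets are expected to agree once the per-row results are stacked. Sampling-based explainers such as KernelExplainer may differ by sampling noise. The helper below is hypothetical and takes any callable in place of explainer.shap_values. Whether threads actually speed things up depends on the explainer releasing the GIL, which is an assumption here; a process pool is the usual alternative and needs a picklable explainer.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunked_shap_values(explain_fn, X, n_chunks=8, max_workers=4):
    """Explain X in row chunks on a thread pool and stack the results.

    explain_fn maps a 2-D batch of rows to a 2-D array with one row of
    SHAP values per input row (e.g. explainer.shap_values).
    Assumes X has at least one row.
    """
    n = len(X)
    bounds = np.linspace(0, n, num=min(n_chunks, n) + 1, dtype=int)
    chunks = [X[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves chunk order, so rows stay aligned with X.
        parts = list(pool.map(explain_fn, chunks))
    return np.concatenate(parts, axis=0)
```

Chunks of a few thousand rows, rather than single rows as in the loop above, keep the per-call overhead small.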

christabella on 19 Jul 2020

Explaining models with large input spaces is challenging

But such a data shape is bound to be the most popular use case of SHAP, since it is so good at feature selection :) SHAP is the only method I have found to be superior (yielding better metrics on new data for the same number of features) to the previous state of the art, the "split" variable importance from LightGBM, for selecting features for any model in the boosted tree family. I intend to run GPUTreeExplainer on 2k features and millions of rows routinely as soon as the GPU version starts working.

I have seen some interesting results on real-life data for both regression and classification models (again, still using only samples for performance reasons), and the differences in favor of SHAP aren't small at all. In fact they are surprisingly large and tend to improve with sample size and increased realism, i.e. degree of model optimization. Xia Xiaomao et al. 2019 do not do it justice: they use unrealistic experimental setups, including toy datasets that are inadequate in both dimensions, a suboptimal algorithm (XGB, RF instead of LGBM), and a poor evaluation method (a single point at the full feature set, where the differences are smallest, instead of the whole curve across different feature set sizes). @slundberg, if you think that the use of SHAP as a feature selector is novel enough to merit your academic interest, I can email you those early results and some pointers on how to reproduce them on your datasets.