Hi @slundberg
If I have to use SHAP on large data (say 1 million records or more), how can I do it? It already takes a lot of time to compute on a small dataset with 30,000 records.
I came across shap.sample(data, 100), which takes a sample of the data. Doesn't this introduce variance and bias?
Please help me with this.
It depends on which explainer you are using. For TreeExplainer you could switch to tree_path_dependent or even the approximate mode to get more speed. For any explainer you can parallelize by running on different sets of samples. If you just want a global view of 1 million records then I would use shap.sample(X, 10000) and explain those samples for all the plots you might need.
@slundberg
Hi, thanks for the quick reply.
When we select 10,000 records out of a million, won't that bias the data? What if I wanted a global interpretation of all million records, without sampling and without taking more time?
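One way to get a feel for the sampling question is a small numerical check. The sketch below uses NumPy only and a synthetic attribution matrix (a stand-in, not real SHAP output), and it assumes the sample is drawn uniformly at random. It compares the mean absolute attribution per feature on all rows with the same statistic on 10,000 sampled rows, and estimates the standard error of the sampled version.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 200_000, 5

# Stand-in for a full matrix of per-row attributions (NOT real SHAP output):
# each column has a different spread, so the features differ in importance.
scales = np.array([2.0, 1.0, 0.5, 0.25, 0.1])
phi_full = rng.normal(size=(n_rows, n_features)) * scales

# Global importance computed on every row.
importance_full = np.abs(phi_full).mean(axis=0)

# The same statistic on a uniform random sample drawn without replacement.
idx = rng.choice(n_rows, size=10_000, replace=False)
abs_sample = np.abs(phi_full[idx])
importance_sample = abs_sample.mean(axis=0)

# Standard error of each sampled mean; it shrinks like 1 / sqrt(sample size).
std_error = abs_sample.std(axis=0, ddof=1) / np.sqrt(len(idx))

# Relative difference between the sampled and the full-data importance.
relative_gap = np.abs(importance_sample - importance_full) / importance_full
```

A uniform random sample gives an unbiased estimate of a mean, so what sampling adds to a global summary is noise rather than systematic bias, and std_error gives a rough idea of its size. This does not cover non-random subsets or rare subgroups, which may need stratified sampling.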
^ I think @vaishkiva asked a really important question. Looking forward to your answer @slundberg
SHAP estimation is embarrassingly parallel, so moving TreeExplainer to GPU can be the answer you are looking for.
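"Embarrassingly parallel" here means each row is explained independently, so the rows can be split into chunks, explained separately and stacked back together. The sketch below shows that pattern with the standard library and NumPy. explain_chunk is a hypothetical stand-in for a call such as explainer.shap_values(chunk): it uses the closed-form attributions of a linear model with independent features, weight times (value minus feature mean), so that the example runs without a trained model.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 6))

# Assumed linear model; with independent features its attributions are
# weight * (feature value - feature mean).
weights = np.array([3.0, -2.0, 1.0, 0.5, 0.0, -0.25])
baseline = X.mean(axis=0)


def explain_chunk(chunk):
    """Hypothetical per-chunk explainer, standing in for explainer.shap_values(chunk)."""
    return weights * (chunk - baseline)


# Split the rows into chunks and explain each chunk independently.
chunks = np.array_split(X, 8)
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(explain_chunk, chunks))  # map preserves chunk order

# Stack the per-chunk results back into one (n_rows, n_features) matrix.
phi_parallel = np.vstack(parts)

# Reference result computed in a single pass.
phi_serial = explain_chunk(X)
```

With a real explainer, a process pool or a tool such as joblib may be a better fit than threads, depending on whether the underlying implementation releases the GIL.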
In fact, this has been achieved recently by the good folks at RAPIDS, who report some very good results: order-of-magnitude speed-ups can be observed even for moderately sized datasets of 10k rows. So, unlike for the boosted-tree algorithms themselves, GPUs do not need Higgs-scale 10M-row datasets to outperform CPUs when computing SHAP values. See their paper at arXiv. To see that these speed-ups (and a reasonable GPU memory footprint) hold for big datasets as well, see NVIDIA's recent blog post, where SHAP value estimation by a GPU-enabled version of shap_values() was sped up 22 times on 11M rows of data.
Given that RAPIDS projects are incubated by NVIDIA, they take great care over portability, so their GPU implementation (see https://github.com/rapidsai/gputreeshap) takes the form of an open-source C++ library (a single CUDA header file).
The usage example they provide here, which demonstrates how to call their main GPUTreeShap function, keeps the entire simplified modeling pipeline on the GPU. The question is how they used XGBoost: did they include XGBoost's CUDA code, or did they invoke their CUDA library from Python using bridges like pycuda? Hopefully, we shall get their reply soon :)
It turns out the work on integrating the above CUDA version is already ongoing in slundberg/shap#1571!:)
This work is also integrated into xgboost 1.3 https://xgboost.readthedocs.io/en/latest/gpu/index.html#gpu-accelerated-shap-values.
1.3 is in release candidate phase at the time of writing and can be obtained via
python -m pip install xgboost==1.3.0rc1
@vaishkiva @ms8909: great news, guys. Thanks to @RAMitchell's CUDA work, there is now a GPU implementation, at least for tree-based algorithms (shap.GPUTreeExplainer()). Have a try yourselves, you may be pleasantly surprised by the performance improvements. Here's a reproducible performance comparison test for LightGBM models using our CUDA 10.1 py38 container: https://github.com/slundberg/shap/issues/1650#issuecomment-748482915.
When we moved the estimation of SHAP values on a 100k sample (times a few hundred columns) from the CPU (a single thread of Intel Xeon Platinum) to the GPU (a single Tesla V100), wall clock time shortened from an hour to... less than a minute!:) GPU has an enormous advantage here, because unlike LightGBM itself, the standard shap.TreeExplainer implementation was only single-threaded...