Shap: nsamples for TreeExplainer

Created on 26 Feb 2019 · 3 Comments · Source: slundberg/shap

I'm using a TreeExplainer to calculate SHAP values for a large dataset (~1.7m rows, ~170 cols), model generated with LightGBM. Is there a preferred method to take nsamples for the background to speed up the process? Or is this not possible with the TreeExplainer?

cbeauhilton

All 3 comments

The nice thing about TreeExplainer is that it does not need to take any samples (unless you use the feature_dependence="independent" option). If you are trying to get a summary of your data, I would just explain a random subsample of 10k rows and then plot whatever you want. If you really need to explain every sample very quickly, you could try approximate=True, which will probably be a reasonable approximation if you have lots and lots of trees.

slundberg on 2 Mar 2019
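
Editor's sketch of the suggestion above, not code from the thread: draw a random subsample of rows, explain only those rows with TreeExplainer, and optionally pass approximate=True to shap_values. The helper names sample_row_indices and explain_subsample are hypothetical, and the sketch assumes model is an already fitted tree model (for example LightGBM) and X is a pandas DataFrame.

```python
import random


def sample_row_indices(n_rows, n_sample=10_000, seed=0):
    """Return sorted indices of a random subsample drawn without replacement."""
    rng = random.Random(seed)
    k = min(n_sample, n_rows)  # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), k))


def explain_subsample(model, X, n_sample=10_000, approximate=False, seed=0):
    """Compute SHAP values for a random subsample of the rows of X.

    Assumptions (placeholders for your own objects): `model` is a fitted
    tree model and `X` is a pandas DataFrame of its input features.
    """
    import shap  # imported lazily so the sampling helper works without shap

    idx = sample_row_indices(len(X), n_sample, seed)
    X_sub = X.iloc[idx]

    explainer = shap.TreeExplainer(model)
    # approximate=True trades some accuracy for speed
    shap_values = explainer.shap_values(X_sub, approximate=approximate)
    return X_sub, shap_values
```

A call such as `X_sub, sv = explain_subsample(model, X)` followed by `shap.summary_plot(sv, X_sub)` would then plot only the sampled rows. Depending on the model type and shap version, shap_values may return one array per class, so the plotting call may need a single class selected first.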

Awesome, thanks, as always.

I was getting similar SHAP results when running debugging versions of the experiments with 10-20k samples as with the full 1.7m; I just wanted to make sure this was theoretically sound (enough).

So I will move forward with small samples for explanation, then probably just let the thing run overnight for the publication, more as an exercise in completeness than anything else.

cbeauhilton on 4 Mar 2019

Sounds good! Just beware of putting 1.7m points into the plotting functions. They might not like that many points...

slundberg on 11 Mar 2019
