I'm using TreeExplainer to calculate SHAP values for a large dataset (~1.7m rows, ~170 columns) with a model trained in LightGBM. Is there a preferred way to subsample the background (something like nsamples) to speed up the process? Or is this not possible with TreeExplainer?
The nice thing about TreeExplainer is that it does not need to take any samples (unless you use the feature_dependence="independent" option). If you are trying to get a summary of your data, I would just explain a random subsample of 10k rows and then plot whatever you want. If you really need to explain every sample very quickly, you could try approximate=True, which will probably be a reasonable approximation if you have lots and lots of trees.
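For anyone landing here later, a minimal sketch of the subsampling suggestion above. It assumes `X` is a 2-D NumPy-compatible array and `model` is an already fitted tree model such as a LightGBM booster. The helper names `subsample_rows` and `explain_subsample` are made up for illustration and are not part of shap.

```python
import numpy as np


def subsample_rows(X, n_samples=10_000, seed=0):
    """Return a random subset of rows of X, drawn without replacement."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    # Never ask for more rows than the data actually has.
    k = min(n_samples, X.shape[0])
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx]


def explain_subsample(model, X, n_samples=10_000, approximate=False, seed=0):
    """Compute SHAP values for a random subsample of X (hypothetical helper)."""
    import shap  # imported lazily so subsample_rows works without shap installed

    X_small = subsample_rows(X, n_samples=n_samples, seed=seed)
    explainer = shap.TreeExplainer(model)
    # approximate=True trades exactness for speed, as mentioned above.
    shap_values = explainer.shap_values(X_small, approximate=approximate)
    return X_small, shap_values
```

Usage would look roughly like `X_small, sv = explain_subsample(model, X)` followed by `shap.summary_plot(sv, X_small)`. Passing `approximate=True` is only worth it if you really do need all the rows.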
Awesome, thanks, as always.
I was getting similar SHAP results when running debugging versions of the experiments with 10-20k samples as with the full 1.7m. I just wanted to make sure this was theoretically sound (enough).
So I will move forward with small samples for explanation, then probably just let the thing run overnight for the publication, more as an exercise in completeness than anything else.
Sounds good! Just beware of putting 1.7m points into the plotting functions. They might not like that many points...