Shap: How is the "BaseValue" for TreeShap computed?

Created on 17 May 2018  Â·  3Comments  Â·  Source: slundberg/shap

There is one extra column that comes out (beyond the features) when you run the shap_values -- it is obviously the BaseValue. How is this computed? Intuitively it seems like it should be mean(y_hat), e.g. the mean response of all the predictions. This is what I would predict if I didn't know any of the features. But in doing this on a simple dataset -- it's close, but not exactly mean(y_hat). Wondering how that is computed? Also as a sanity check I checked whether mean(y) was close to it -- it is. That's good -- the trained random forest is an unbiased estimator.

screen shot 2018-05-16 at 10 55 25 pm

Most helpful comment

Ah! The code just uses the weighted_n_node_samples attribute from the
sklearn Tree objects to compute expectations. Since when you trained each
tree for the random forest it took only a subset of samples for each tree,
that means the expectation will be slightly different than when you put all
the samples through the tree (which is what happens when you apply to the
trained model to the training dataset). So the small difference you are
seeing is because in a random forest the entire training dataset is not
used to train each tree. That make sense?

On Thu, May 17, 2018 at 9:34 AM Sourav Dey notifications@github.com wrote:

I'm just using the sklearn Random Forest Regressor right now.

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
https://github.com/slundberg/shap/issues/90#issuecomment-389929738, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ADkTxaaJos_aeyHxI4bNRDcoWfsjRVOWks5tzaatgaJpZM4UCc8D
.

All 3 comments

Which model are you using? It should be the mean, but XGBoost (if you are using it) weighs its samples by the hessian. See the discussion at the end of #29

I'm just using the sklearn Random Forest Regressor right now.

Ah! The code just uses the weighted_n_node_samples attribute from the
sklearn Tree objects to compute expectations. Since when you trained each
tree for the random forest it took only a subset of samples for each tree,
that means the expectation will be slightly different than when you put all
the samples through the tree (which is what happens when you apply to the
trained model to the training dataset). So the small difference you are
seeing is because in a random forest the entire training dataset is not
used to train each tree. That make sense?

On Thu, May 17, 2018 at 9:34 AM Sourav Dey notifications@github.com wrote:

I'm just using the sklearn Random Forest Regressor right now.

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
https://github.com/slundberg/shap/issues/90#issuecomment-389929738, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ADkTxaaJos_aeyHxI4bNRDcoWfsjRVOWks5tzaatgaJpZM4UCc8D
.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

brookewenig picture brookewenig  Â·  3Comments

DiliSR picture DiliSR  Â·  4Comments

TdoubleG picture TdoubleG  Â·  4Comments

shoaibkhanz picture shoaibkhanz  Â·  4Comments

1vecera picture 1vecera  Â·  3Comments