There is one extra column that comes out (beyond the features) when you run the shap_values -- it is obviously the BaseValue. How is this computed? Intuitively it seems like it should be mean(y_hat), e.g. the mean response of all the predictions. This is what I would predict if I didn't know any of the features. But in doing this on a simple dataset -- it's close, but not exactly mean(y_hat). Wondering how that is computed? Also as a sanity check I checked whether mean(y) was close to it -- it is. That's good -- the trained random forest is an unbiased estimator.

Which model are you using? It should be the mean, but XGBoost (if you are using it) weighs its samples by the hessian. See the discussion at the end of #29
I'm just using the sklearn Random Forest Regressor right now.
Ah! The code just uses the weighted_n_node_samples attribute from the
sklearn Tree objects to compute expectations. Since when you trained each
tree for the random forest it took only a subset of samples for each tree,
that means the expectation will be slightly different than when you put all
the samples through the tree (which is what happens when you apply to the
trained model to the training dataset). So the small difference you are
seeing is because in a random forest the entire training dataset is not
used to train each tree. That make sense?
On Thu, May 17, 2018 at 9:34 AM Sourav Dey notifications@github.com wrote:
I'm just using the sklearn Random Forest Regressor right now.
—
You are receiving this because you commented.Reply to this email directly, view it on GitHub
https://github.com/slundberg/shap/issues/90#issuecomment-389929738, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ADkTxaaJos_aeyHxI4bNRDcoWfsjRVOWks5tzaatgaJpZM4UCc8D
.
Most helpful comment
Ah! The code just uses the
weighted_n_node_samplesattribute from thesklearn Tree objects to compute expectations. Since when you trained each
tree for the random forest it took only a subset of samples for each tree,
that means the expectation will be slightly different than when you put all
the samples through the tree (which is what happens when you apply to the
trained model to the training dataset). So the small difference you are
seeing is because in a random forest the entire training dataset is not
used to train each tree. That make sense?
On Thu, May 17, 2018 at 9:34 AM Sourav Dey notifications@github.com wrote: