Xgboost: XGBClassifier.predict_proba() does not return probabilities even w/ binary:logistic

Created on 22 Dec 2016 · 5 Comments · Source: dmlc/xgboost

When using XGBClassifier with early stopping, if we pass best_ntree_limit to predict_proba() and its value is less than n_estimators, the predicted probabilities are not scaled (we get values < 0 and also > 1). When best_ntree_limit equals n_estimators, the values are fine.

Please note that I am indeed using "binary:logistic" as the objective function (which should give probabilities).

Here's my snippet:

from xgboost import XGBClassifier

xgb_classifier_mdl = XGBClassifier(
    base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
    gamma=0, learning_rate=0.025, max_delta_step=0, max_depth=8,
    min_child_weight=1, missing=None, n_estimators=400, nthread=16,
    objective='binary:logistic', reg_alpha=0, reg_lambda=1,
    scale_pos_weight=4.8817476383265861, seed=1234, silent=True,
    subsample=0.8
)

xgb_classifier_y_prediction = xgb_classifier_mdl.predict_proba(
    X_holdout,
    xgb_classifier_mdl.best_ntree_limit
)

print(xgb_classifier_y_prediction)
print('min, max:', min(xgb_classifier_y_prediction[:, 0]), max(xgb_classifier_y_prediction[:, 0]))
print('min, max:', min(xgb_classifier_y_prediction[:, 1]), max(xgb_classifier_y_prediction[:, 1]))

Here are sample results I am seeing in my log:

[[ 1.65826225 -0.65826231]
[-0.14675128 1.14675128]
[ 2.30379772 -1.30379772]
...,
[ 1.36610699 -0.36610693]
[ 1.19251108 -0.19251104]
[ 0.01783651 0.98216349]]
min, max: -0.394902 2.55794
min, max: -1.55794 1.3949

As you can see, the values are definitely NOT probabilities; they should be scaled to lie between 0 and 1.

vatsan

Most helpful comment

The 2nd parameter to predict_proba is output_margin. Since you are passing a non-zero xgb_classifier_mdl.best_ntree_limit to it, you obtain marginal log-odds predictions which are, of course, not probabilities.
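A minimal sketch of what "marginal log-odds" means here, assuming the second column in the log above is the raw margin for the positive class. That reading is an inference from each logged row summing to 1, not something confirmed in this thread. Applying the logistic (sigmoid) function to a margin maps it into the (0, 1) range; the `sigmoid` helper is written out for illustration and is not a library call.

```python
import math


def sigmoid(margin):
    """Map a log-odds margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# Second-column values copied from the log in the question. Assumption:
# these are raw margins for the positive class, not probabilities.
margins = [-0.65826231, 1.14675128, -1.30379772,
           -0.36610693, -0.19251104, 0.98216349]

# Rebuild a two-column layout: (P(class 0), P(class 1)) for each row.
probabilities = [(1.0 - sigmoid(m), sigmoid(m)) for m in margins]

for p0, p1 in probabilities:
    print(f"{p0:.4f} {p1:.4f}")
```

Every value printed this way lies strictly between 0 and 1, which is what the question expected from binary:logistic.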

khotilov on 1 Feb 2017

👍3

All 5 comments

Aah, thanks @khotilov, my bad, I didn't notice the second argument. Closing this issue and removing my pull request.

vatsan on 1 Feb 2017

I faced the same issue; all I did was take the second column (index 1) from pred:
pred[:,1]

This might be a silly question, but how do I pass the best tree limit if the second argument is output_margin?

Mayanksoni20 on 7 Mar 2018

@Mayanksoni20
You can pass it in as a keyword argument:

xgb_classifier_y_prediction = xgb_classifier_mdl.predict_proba(
    Xtest,
    ntree_limit=xgb_classifier_mdl.best_ntree_limit
)
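
To see why the keyword form matters, here is a small sketch using a hypothetical stand-in function. It only mirrors the parameter names mentioned in this thread (`output_margin`, `ntree_limit`); the parameter order after `output_margin` and the defaults are assumptions, and it is not the library's implementation. The value 250 is made up.

```python
# Hypothetical stand-in: it just reports which parameter received which value.
def predict_proba(data, output_margin=False, ntree_limit=0):
    return {"output_margin": output_margin, "ntree_limit": ntree_limit}


best_ntree_limit = 250  # made-up value for illustration

# Positional call: the tree limit lands in output_margin, and any
# non-zero integer is truthy, so margin output gets switched on.
positional = predict_proba([[0.0]], best_ntree_limit)

# Keyword call: the tree limit reaches the parameter it was meant for.
keyword = predict_proba([[0.0]], ntree_limit=best_ntree_limit)

print(positional)
print(keyword)
```

Naming the argument keeps the call correct even if the positional order of optional parameters differs between versions.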

vatsan on 7 Mar 2018

What exactly are the two columns returned by predict_proba()?