I use the demo example xgboost/demo/guide-python/basic_walkthrough.py to show the issue.
First, let's train a simple regression model.
import numpy as np
import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'reg:linear' }
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)
Then we dump the model to a text file.
bst.dump_model('/tmp/dump.raw.txt')
This is the content of model dump text file.
booster[0]:
0:[f29<-1.00136e-05] yes=1,no=2,missing=1
1:[f56<-1.00136e-05] yes=3,no=4,missing=3
3:leaf=0.42844
4:leaf=-0.427938
2:[f109<-1.00136e-05] yes=5,no=6,missing=5
5:leaf=-0.485704
6:leaf=0.490741
booster[1]:
0:[f60<-1.00136e-05] yes=1,no=2,missing=1
1:[f67<-1.00136e-05] yes=3,no=4,missing=3
3:leaf=0.0137908
4:leaf=0.790517
2:leaf=-0.9226
Let's examine the first ten predictions.
preds = bst.predict(dtest, ntree_limit=1)
leafs = bst.predict(dtest, ntree_limit=1, pred_leaf=True)
print(preds[0:10])
print(leafs[0:10])
Below is the output: preds gives the plain regression values and leafs gives the leaf node indices.
[ 0.07206208 0.9284395 0.07206208 0.07206208 0.01429605 0.9284395
0.9284395 0.07206208 0.9284395 0.9284395 ]
[4 3 4 4 5 3 3 4 3 3]
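To make the mapping concrete, here is a small sketch (mine, not from the walkthrough) that looks up the value of a predicted leaf by parsing the dump of tree 0; it assumes the dump file and the leafs array from above:

leaf_values = {}
with open('/tmp/dump.raw.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith('booster[1]'):
            break  # only tree 0 matters for ntree_limit=1
        if 'leaf=' in line:
            node, value = line.split(':leaf=')
            leaf_values[int(node)] = float(value)
print(leaf_values[leafs[0]])  # -0.427938, the value of node 4 in booster[0]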
Here is the issue: for the first test example, preds[0] is 0.07206208, while leafs[0] gives node 4 of booster[0], whose leaf value is -0.427938. I would expect to get the same value from a straight prediction as from looking up the predicted leaf node. Setting ntree_limit=2 or ntree_limit=0 to use all trees still gives inconsistent predictions.
I am a first-time user of xgboost, so something could be wrong with my understanding. But is it possible that something is wrong with the dump_model member function?
+1, I have the same concerns. Rules with 1.00136e-05 seem highly suspicious to me.
[UPDATE] check http://stats.stackexchange.com/questions/193617/machine-learning-on-dummy-variables
Here is the cause I found. By default xgboost uses base_score=0.5, therefore the output of the predict call needs base_score subtracted from it to get the plain prediction from xgboost. In the example, preds[0] - base_score = 0.07206208 - 0.5 = -0.4279379, which is exactly the value of node 4 in booster[0].
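A quick numeric check of this arithmetic (a sketch assuming the preds array and the dump values from above):

import numpy as np
base_score = 0.5  # xgboost's default
leaf_value = -0.427938  # node 4 of booster[0] in the dump
print(np.isclose(preds[0] - base_score, leaf_value, atol=1e-5))  # True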
I tried to set base_score=0 when training, however it doesn't output the plain prediction value; I still need to subtract 0.5 from the prediction.
Oh, great, thanks for this explanation! You've just saved my night. Probably this unobvious effect deserves a separate issue...
But still, how do you interpret the comparisons f29<-1.00136e-05, f109<-1.00136e-05 for a binary feature?
[UPDATED] You may wonder how to interpret the < 1.00001 on the first line. Basically, in a sparse matrix there is no 0; therefore, looking for one-hot encoded categorical observations validating the rule < 1.00001 is just like looking for 1 for this feature.
xgboost treats sparse values as "missing", so they go into the missing branch independently of the split value.
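To make this concrete, a minimal sketch (mine, not from the thread; it assumes scipy is installed): zeros that are simply not stored in a CSR matrix are treated as missing, whereas an explicit 0.0 in a dense array is an observed value.

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

dense = np.array([[0.0, 1.0], [1.0, 0.0]])
dm_dense = xgb.DMatrix(dense)  # 0.0 entries are observed values (missing defaults to NaN)
dm_sparse = xgb.DMatrix(sp.csr_matrix(dense))  # unstored zeros are treated as missing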
Perhaps that should be in the FAQ...
@driftwoods:
I tried to set base_score=0 when training, however it doesn't output the plain prediction value; I still need to subtract 0.5 from the prediction.
Could you please provide an example of what you mean here? I've just tried 'base_score':0, and the predictions from the first tree were exactly the leaf values, as they were supposed to be.
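For reference, here is roughly what I ran (a sketch: the walkthrough's setup with 'base_score':0 added to the params):

param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'reg:linear', 'base_score': 0}
bst0 = xgb.train(param, dtrain, num_round, watchlist)
print(bst0.predict(dtest, ntree_limit=1)[0:10])  # matches the leaf values of booster[0] directly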
@khotilov
I was using an old version of xgboost (the one installed from pip, compiled in Dec 2015). Setting 'base_score':0 had the same effect as the default 'base_score':0.5; that is, you had to subtract 0.5 from the prediction to get the same result as the leaf value. This is a bug that has been fixed in the latest version, as your result shows.
Another bug I found in the pip version of xgboost is that it treats 0 as missing even in a non-sparse matrix. This is also fixed in the latest version. I strongly recommend that the maintainer of the xgboost pip package update it to the latest version.