I was experimenting with one-step incremental XGBoost ensemble construction. When the Booster is created by the `xgboost.train` function (Python library), everything works fine.
However, when I create the booster myself and then update it with `Booster.update` like this:
```python
booster_ = xgboost.Booster({'objective': 'reg:linear'})
booster_.update(dtrain, 1)
```
the Python process fails with a segmentation fault.
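For comparison, here is a minimal sketch of what "works fine" means above: building the same one-round model through `xgboost.train` (using the toy data from the reproduction below) does not crash.

```python
import xgboost

# Same toy data as in the reproduction below.
dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

# Creating the booster through xgboost.train works without issue:
booster = xgboost.train({'objective': 'reg:linear'}, dtrain, num_boost_round=1)
print(booster.predict(dtrain))
```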
## Environment info
Operating System:
* **python 3.6**: Mac OS X 10.10.5 (Darwin 14.5.0), Ubuntu 14.04.5 LTS (GNU/Linux 3.19.0-25-generic x86_64)
* **python 2.7**: Mac OS X 10.10.6 (Darwin 15.6.0)

Compiler:
* **python 3.6**: installed via `pip install xgboost`
* **python 2.7**: gcc 6.3.0 (built `--without-multilib`)

`xgboost` version used:
* **python 3.6**: 0.6 from pip
* **python 2.7.13**: git HEAD 4a63f4ab43480adaaf13bde2485d5bfedd952520
## Steps to reproduce
```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])
booster_ = xgboost.Booster({'objective': 'reg:linear', 'max_depth': 1})
booster_.update(dtrain, 1)
booster_.update(dtrain, 1)
```
The last line causes a segmentation fault. I am attaching the crash report for python 2.7.13.
I would like to clarify what my expectation is from doing incremental updates of an empty Booster object returned by

```python
booster_ = xgboost.Booster({'objective': 'reg:linear', 'max_depth': 1})
```

According to the general gradient boosting algorithm, I expected that after updating on a sample (X, y) in the DMatrix `dtrain` with

```python
booster_.update(dtrain, 1)
```

the empty booster would become either f_0(x), the constant prediction, or f_1(x), the prediction after one step of gradient boosting, as shown below (from Hastie, Tibshirani, Friedman; 2013 10th ed., page 361).
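For reference, the update I had in mind is the standard gradient tree boosting step (a condensed paraphrase of Algorithm 10.3 there; the notation is mine):

```latex
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma), \qquad
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm}),
```

where the R_{jm} are the leaf regions of a tree fitted to the negative gradients of the loss at f_{m-1}.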
I managed to trace the problem down to an empty `RegTree::FVec.data` vector in the provided reproducing example. The (manual) trace of the second call to `booster_.update` is as follows:

```cpp
this->UpdateOneIter(...);
  this->PredictRaw(train, &preds_);
    gbm_->Predict(data, out_preds, ntree_limit=0);
      PredLoopInternal<GBTree>(p_fmat, out_preds, 0, ntree_limit, true);
        PredValue(inst, gid, info.GetRoot(ridx), &feats, tree_begin, tree_end);
          int tid = trees[i]->GetLeafIndex(*p_feats, root_index);
```

It fails at include/xgboost/tree_model.h#L528:

```cpp
..., feat.fvalue(split_index), feat.is_missing(split_index), ...
```

It seems that `feat.data.size() == 0`. I don't know why this vector is still empty after the first call to `.update()`, but non-empty after the alternative call to `.train()`.
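For contrast, the path that leaves the vector non-empty can be exercised directly. As far as I can tell, `xgboost.train` runs the same `UpdateOneIter` loop internally, but first constructs its booster with the training matrix in the cache, so prediction works:

```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

# Two boosting rounds through xgboost.train: this calls UpdateOneIter twice,
# just like the two manual .update() calls above, but does not segfault.
booster = xgboost.train({'objective': 'reg:linear', 'max_depth': 1},
                        dtrain, num_boost_round=2)
print(booster.predict(dtrain))
```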
I found out what the root of the problem was. It turns out that creating an empty booster by calling

```python
booster_ = xgboost.Booster({'objective': 'reg:linear'})
```

only partially initializes the GBTree booster. In particular, the crucial parameter `num_feature` defaults to 0 and does not get updated to the proper number of features by a subsequent `.update()` call.
However, passing an explicit value for `num_feature` resolves the segmentation fault:

```python
booster_ = xgboost.Booster({'objective': 'reg:linear', 'num_feature': dtrain.num_col()})
```
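A complete run of the workaround on the toy data (a sketch; the only change from the reproduction above is the extra parameter):

```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

# With num_feature set explicitly, both updates complete without a crash:
booster_ = xgboost.Booster({'objective': 'reg:linear',
                            'max_depth': 1,
                            'num_feature': dtrain.num_col()})
booster_.update(dtrain, 1)
booster_.update(dtrain, 1)
print(booster_.predict(dtrain))
```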
I think that `xgboost.Booster()` should issue a warning when the `cache=()` argument is empty and `'num_feature'` is not explicitly set in the `params` argument.
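Until such a warning exists, a caller-side guard is straightforward (a hypothetical helper, not part of xgboost):

```python
import xgboost

def make_booster(params, cache=()):
    """Refuse the under-initialized combination that leads to the segfault."""
    # Hypothetical guard: require either a cached DMatrix or an explicit
    # num_feature, mirroring the warning proposed above.
    if not cache and 'num_feature' not in params:
        raise ValueError("pass a non-empty cache or set 'num_feature' explicitly")
    return xgboost.Booster(params, list(cache))
```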