I need to update a model with new data without retraining it from scratch. That is, incremental training for cases where not all the data is available right away.
This problem is similar to the "can't fit data in memory" problem raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters `process_type` and `updater`. The FAQ suggests using external memory via `cacheprefix`, but this assumes I have all the data ready.
The solution in #1686 uses several iterations over the entire data.
Another related issue is #2970, in particular https://github.com/dmlc/xgboost/issues/2970#issuecomment-354684604. I tried `'process_type': 'update'`, but it throws the error mentioned at the start of that issue. Without it, the model gives inconsistent results.
I tried various combinations of parameters for `train` in Python, but `train` keeps building the model from scratch, or something else goes wrong. Here are the examples.
In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:
```python
# train, train_1, train_2, test are prepared DMatrix objects;
# train_1 and train_2 are the two halves of train.
print('Full')
bst_full = xgb.train(dtrain=train, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_full.predict(test)))

print('Subset 1')
bst_1 = xgb.train(dtrain=train_1, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_1.predict(test)))

print('Subset 2')
bst_2 = xgb.train(dtrain=train_2, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_2.predict(test)))

print('Subset 1 updated with subset 2')
bst_1u2 = xgb.train(dtrain=train_1, params=params)
bst_1u2 = xgb.train(dtrain=train_2, params=params, xgb_model=bst_1u2)
print(mean_squared_error(y_true=y_test, y_pred=bst_1u2.predict(test)))
```
Here I'm looking to minimize the difference between the first and the fourth models, but it keeps jumping up and down, even with equal total boosting rounds in both methods.
Is there a canonical way to update a model with newly arriving data alone?
xgboost: 0.7.post3

Contributors saying new-data training was impossible at the time of writing:
In #2495, I said incremental training was "impossible". A little clarification is in order.

1. Training continuation (via `xgb_model`) does not do what many would think it would do. One gets undefined behavior when `xgb.train` is asked to train further on a dataset different from the one used to train the model given in `xgb_model`. The behavior is "undefined" in the sense that the underlying algorithm makes no guarantee that the loss over (old data) + (new data) would be in any way reduced. Observe that the trees in the existing ensemble had no knowledge of the new incoming data. [EDIT: see @khotilov's comment below to learn about situations where training continuation with different data would make sense.]
2. Another option is `'process_type': 'update'`. I think it is an experimental feature, so proceed at your own risk. To use the feature, make sure to install the latest XGBoost (0.7.post3). The feature is currently quite limited, in that you are not allowed to modify the tree structure; only leaf values will be updated.

Hope it helps!
@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google hangout session to discuss?
A vain guess: using an online algorithm for tree construction may do what you want. See this paper for instance.
Two limitations:
This paper is interesting too: it presents a way to find good splits without having all the data.
@Yunni The first item in @hcho3's reply reminds me of the newly added checkpoint feature in Spark.
We should have something blocking the user from using a different training dataset with this feature, to guarantee correctness.
Right. I think we should check `boosterType` as well. We can put in a metadata file which contains `boosterType` and a checksum of the dataset. Sounds good?
How do you get the checksum of a dataset? A content hash?
Yes. We can simply use an LRC checksum.
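A rough sketch of what such a metadata file could look like (hypothetical helper names; SHA-256 over the raw bytes stands in for whatever checksum is actually chosen):

```python
import hashlib
import json
import os
import tempfile

import numpy as np

def dataset_checksum(X, y):
    """Content hash over the raw bytes of features and labels."""
    h = hashlib.sha256()
    h.update(np.ascontiguousarray(X).tobytes())
    h.update(np.ascontiguousarray(y).tobytes())
    return h.hexdigest()

def write_metadata(path, booster_type, X, y):
    with open(path, 'w') as f:
        json.dump({'boosterType': booster_type,
                   'checksum': dataset_checksum(X, y)}, f)

def check_metadata(path, booster_type, X, y):
    """True only if both the booster type and the dataset match."""
    with open(path) as f:
        meta = json.load(f)
    return (meta['boosterType'] == booster_type
            and meta['checksum'] == dataset_checksum(X, y))

# Demo: the check passes for the original data, fails for modified data.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
y = np.array([0.0, 1.0, 0.0, 1.0])
path = os.path.join(tempfile.mkdtemp(), 'model.meta.json')
write_metadata(path, 'gbtree', X, y)
same = check_metadata(path, 'gbtree', X, y)            # True
changed = check_metadata(path, 'gbtree', X + 1.0, y)   # False
```

Hashing a large dataset is not free, as noted below; a cheaper proxy (e.g. hashing the shape plus a fixed sample of rows) could trade strictness for speed.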
Isn't it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.
@codingcat Indeed, at minimum we need to warn the user not to change the dataset for training continuation.
That said, I just found a small warning in the CLI example, which says:

> **Continue from Existing Model**
>
> If you want to continue boosting from an existing model, say `0002.model`, use
>
> ```
> ../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model
> ```
>
> xgboost will load from `0002.model` and continue boosting for 2 rounds, then save output to `continue.model`. However, beware that the training and evaluation data specified in `mushroom.conf` **should not change** when you use this function. [Emphasis mine]
Clearly we need to do a better job of making this warning more prominent.
@hcho3 While it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement and wouldn't implement any hard restrictions on that. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you won't get it (for the reasons you have described) by incrementally learning on either separate parts of the dataset or cumulatively increasing data.
However, there are applications where training continuation on new data makes good practical sense. E.g., when you get new data that is related but exhibits some sort of "concept drift", there is often a good chance that taking an old model learned on old data as "prior knowledge", and adapting it to the new data by training continuation on that new data, gives you a better-performing model for future data resembling the new data than training from scratch on either the new data alone or a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).
@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
I have a question for you: how does training continuation with boosting fare when it comes to handling concept drift? I read papers where the authors use random forests to handle concept drift, with a sliding window to deprecate old trees. (For an example, see this paper.)
Firstly, there is a paper about using a random forest to initialise your GBM model, getting better final results than either RF or GBM alone, and in fewer rounds. I cannot find it, however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...
Secondly, sometimes it is more important to train a model quickly. I have been working on some time-series problems where I have been doing transfer learning with LSTMs. I train the base model on generic historical data and then use transfer learning to fine-tune on specific live data. It would take too long to train a full new model on live data, even though ideally I would. I think the same could be true of xgboost. I.e., 95% of the model's optimal prediction is better than no prediction.
@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow continuous data "drifts" would be harder. But for strongly trending drifts, even that random-forest method might not work well, and some forecasting elements would have to be utilized. A lot depends on the situation.
Also, a weak spot of boosted-tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not so significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin on the histogram algorithm.
As for why predictive performance on future data similar to the "new data" is sometimes worse for a model trained on a combined "old data" + "new data" dataset than for training continuation on the "new data": the former model is optimized over the whole combined dataset, and that can happen at the expense of the "new data" when that data is somewhat different and relatively small.
I thought incremental training with minibatches of data (just like SGD) is roughly equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training run, or once every iteration?
I also need to use incremental learning. I've read all the links mentioned above, but I'm still confused.
Finally, is there any version of XGBoost that can retrain a trained model on a newly received data point or batch of data?
I've found the links below, which addressed this issue before the date of this post. Don't they work? Can't we do incremental learning with them? What's the problem with them?
https://github.com/dmlc/xgboost/issues/1686
https://github.com/dmlc/xgboost/issues/484
https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
https://github.com/dmlc/xgboost/issues/2495
@benyaminelc90
This particular answer got close to proper incremental learning:
Still, as I learned from @hcho3, GBM has limited capacity for updates without seeing all the data from the start. For example, it can easily update leaves but has difficulty altering splits.