Lightgbm: Subsampling rows with replacement

Created on 4 Nov 2017  ·  9 comments  ·  Source: microsoft/LightGBM

As far as I understand, the random forest (rf) mode differs from a genuine rf in three key aspects:

  1. Column subsampling is done per tree instead of per split.
  2. Row subsampling is done without replacement instead of with replacement.
  3. No out-of-bag (OOB) predictions are produced.

How realistic would it be to add a "bagging_with_replacement" option? If set to True, the rows would be subsampled with replacement, mimicking the idea of bagging. This might even be an interesting option for non-rf applications.

Labels: feature request, help wanted

All 9 comments

Related issue #883.

bootstrap seems to be a suitable name for the option. Row subsampling would then be performed whenever bagging_fraction < 1 or bootstrap = True.

It is not trivial to support this in the core algorithm.
However, a simple solution is to use weights: give weight 0 to rows that were not sampled, 1 to rows sampled once, and k to rows sampled k times...
This is easy to do in the Python package, since you can change the weights on each iteration.
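The weight trick can be sketched in a few lines of NumPy: draw a bootstrap sample of row indices and use each row's multiplicity as its weight (the array names here are just illustrative, not LightGBM API).

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10

# Draw a bootstrap sample: n_rows indices sampled with replacement.
boot_idx = rng.integers(0, n_rows, size=n_rows)

# The per-row weight is the number of times each row was drawn:
# 0 for rows never sampled, 1 for rows sampled once, k for rows sampled k times.
weights = np.bincount(boot_idx, minlength=n_rows)

# The weights always sum to n_rows, so the total mass of the weighted
# dataset matches the bootstrap sample size.
print(weights)
```

Passing such a `weights` array as the sample weights for one boosting round is then equivalent to training that round on the bootstrap sample itself.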

Good hint. I was actually not aware that case weights could be updated during training. Drawing each row's weight from a Poisson distribution with mean 1 will provide an efficient and approximately correct weight distribution, since a row's bootstrap multiplicity is Binomial(n, 1/n), which converges to Poisson(1) for large n.
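This approximation is easy to check numerically: both exact bootstrap multiplicities and Poisson(1) draws leave out roughly e^-1 ≈ 36.8% of the rows and have mean close to 1 (exactly 1 for the bootstrap). A quick sketch, not part of any LightGBM API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exact bootstrap multiplicities: count how often each row is drawn
# when sampling n indices with replacement from n rows.
counts = np.bincount(rng.integers(0, n, size=n), minlength=n)

# Poisson(1) draws, as suggested above, as a cheap stand-in.
poisson_w = rng.poisson(lam=1.0, size=n)

# Fraction of rows left out under each scheme (both near e^-1):
print((counts == 0).mean(), (poisson_w == 0).mean())
# Mean multiplicity under each scheme (both near 1):
print(counts.mean(), poisson_w.mean())
```

The Poisson version is cheaper because each row's weight can be drawn independently, with no global bookkeeping of the sampled indices.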

I am reopening this as

  1. I am still interested in this feature in order to be able to emulate random forests. Together with the relatively new "colsample_bynode", it would be very close to a native random forest.

  2. Sampling with replacement should be computationally more efficient than sampling without replacement.
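For reference, the near-rf configuration alluded to above might look like the sketch below. The parameter values are illustrative; `boosting="rf"`, `bagging_fraction`, and the per-node column subsampling parameter are real LightGBM options, but rows are still drawn without replacement, which is exactly the gap this issue asks about.

```python
# Illustrative LightGBM parameter set approximating a random forest.
rf_like_params = {
    "boosting": "rf",                # built-in random-forest mode
    "bagging_freq": 1,               # resample rows on every iteration
    "bagging_fraction": 0.632,      # ~ expected unique fraction of a bootstrap sample
    "feature_fraction_bynode": 0.3,  # per-split column subsampling (alias: colsample_bynode)
    "objective": "regression",      # placeholder objective for the example
}
```

The value 0.632 is chosen because 1 - e^-1 ≈ 0.632 is the expected fraction of distinct rows in a bootstrap sample, so subsampling that fraction without replacement roughly matches the bootstrap's coverage (though not its row multiplicities).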

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

Hi @guolinke,

However, a simple solution is to use weights: give weight 0 to rows that were not sampled, 1 to rows sampled once, and k to rows sampled k times...
This is easy to do in the Python package, since you can change the weights on each iteration.

I understand that this is an old and closed issue, but may I ask you to elaborate on this solution a little bit more? How can one change sample weights for each tree in the random forest?

One solution could be to use callbacks, I suppose; is that the only way?

Hi @rdbuf, yeah, a callback is the most convenient way to do this, I think.
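Concretely, the callback approach could look like the sketch below. It is illustrative only: `Dataset.set_weight`, `Dataset.num_data`, and the `before_iteration` callback attribute exist in the LightGBM Python package, but whether mid-training weight updates take effect on the next round may depend on the LightGBM version, so treat this as an assumption to verify.

```python
import numpy as np

def poisson_weight_callback(train_set, seed=0):
    """Return a LightGBM-style callback that redraws Poisson(1) row
    weights before every boosting round, emulating bagging with
    replacement.  `train_set` should be the same Dataset object that
    is passed to lgb.train (a sketch, not a tested recipe).
    """
    rng = np.random.default_rng(seed)
    n_rows = train_set.num_data()

    def _callback(env):  # env would be a lightgbm.callback.CallbackEnv
        # Weight 0 for rows left out, k for rows "drawn" k times.
        train_set.set_weight(rng.poisson(1.0, size=n_rows).astype(float))

    _callback.before_iteration = True  # run before, not after, each round
    _callback.order = 0
    return _callback

# Hypothetical usage:
#   booster = lgb.train(params, dtrain, callbacks=[poisson_weight_callback(dtrain)])
```

Because the weights are redrawn before each iteration, every tree effectively sees its own bootstrap resample of the training data.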

I see, thanks :)
