Lightgbm: Subsampling rows with replacement

Created on 4 Nov 2017  ·  9 comments  ·  Source: microsoft/LightGBM

As far as I understand, the random forest (rf) mode differs from a genuine rf in three key aspects:

  1. Column subsampling is done per tree instead of per split.
  2. Row subsampling is done without replacement instead of with replacement.
  3. No out-of-bag (OOB) predictions are produced.

How realistic would it be to add a "bagging_with_replacement" option? If set to True, the rows would be subsampled with replacement, mimicking the idea of bagging. This might even be an interesting option for non-rf applications.

Labels: feature request, help wanted

All 9 comments

Related issue #883.

bootstrap seems to be a suitable name for the option. Row subsampling would then be performed whenever bagging_fraction < 1 or bootstrap = True.

It is not trivial to support this in the core algorithm.
However, a simple solution is to use weights: give weight 0 to rows that were not sampled, 1 to rows sampled once, and k to rows sampled k times...
This is easy to do in the Python package, since you can change the weights on each iteration.
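The weight trick can be sketched in a few lines of NumPy: draw a bootstrap sample of row indices and use each row's multiplicity as its weight (the array names here are just illustrative, not LightGBM API).

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10

# Draw a bootstrap sample: n_rows indices sampled with replacement.
boot_idx = rng.integers(0, n_rows, size=n_rows)

# The per-row weight is the number of times each row was drawn:
# 0 for rows never sampled, 1 for rows sampled once, k for rows sampled k times.
weights = np.bincount(boot_idx, minlength=n_rows)

# The weights always sum to n_rows, so the total mass of the weighted
# dataset matches the bootstrap sample size.
print(weights)
```

Passing such a `weights` array as the sample weights for one boosting round is then equivalent to training that round on the bootstrap sample itself.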

Good hint. I was actually not aware that case weights could be updated during training. Drawing each row's weight from a Poisson distribution with mean 1 will provide an efficient and approximately correct weight distribution, since a row's bootstrap multiplicity is Binomial(n, 1/n), which converges to Poisson(1) for large n.
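This approximation is easy to check numerically: both exact bootstrap multiplicities and Poisson(1) draws leave out roughly e^-1 ≈ 36.8% of the rows and have mean close to 1 (exactly 1 for the bootstrap). A quick sketch, not part of any LightGBM API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exact bootstrap multiplicities: count how often each row is drawn
# when sampling n indices with replacement from n rows.
counts = np.bincount(rng.integers(0, n, size=n), minlength=n)

# Poisson(1) draws, as suggested above, as a cheap stand-in.
poisson_w = rng.poisson(lam=1.0, size=n)

# Fraction of rows left out under each scheme (both near e^-1):
print((counts == 0).mean(), (poisson_w == 0).mean())
# Mean multiplicity under each scheme (both near 1):
print(counts.mean(), poisson_w.mean())
```

The Poisson version is cheaper because each row's weight can be drawn independently, with no global bookkeeping of the sampled indices.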

I am reopening this as

  1. I am still interested in this feature in order to be able to emulate random forests. Together with the relatively new "colsample_bynode", it would be very close to a native random forest.

  2. Sampling with replacement should be computationally more efficient than sampling without replacement.
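For reference, the near-rf configuration alluded to above might look like the sketch below. The parameter values are illustrative; `boosting="rf"`, `bagging_fraction`, and the per-node column subsampling parameter are real LightGBM options, but rows are still drawn without replacement, which is exactly the gap this issue asks about.

```python
# Illustrative LightGBM parameter set approximating a random forest.
rf_like_params = {
    "boosting": "rf",                # built-in random-forest mode
    "bagging_freq": 1,               # resample rows on every iteration
    "bagging_fraction": 0.632,      # ~ expected unique fraction of a bootstrap sample
    "feature_fraction_bynode": 0.3,  # per-split column subsampling (alias: colsample_bynode)
    "objective": "regression",      # placeholder objective for the example
}
```

The value 0.632 is chosen because 1 - e^-1 ≈ 0.632 is the expected fraction of distinct rows in a bootstrap sample, so subsampling that fraction without replacement roughly matches the bootstrap's coverage (though not its row multiplicities).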

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

Hi @guolinke,

However, a simple solution is to use weights: give weight 0 to rows that were not sampled, 1 to rows sampled once, and k to rows sampled k times...
This is easy to do in the Python package, since you can change the weights on each iteration.

I understand that this is an old and closed issue, but may I ask you to elaborate on this solution a little bit more? How can one change sample weights for each tree in the random forest?

One solution could be to use callbacks, I suppose; is that the only way?

Hi @rdbuf, yeah, a callback is the most convenient way to do this, I think.
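Concretely, the callback approach could look like the sketch below. It is illustrative only: `Dataset.set_weight`, `Dataset.num_data`, and the `before_iteration` callback attribute exist in the LightGBM Python package, but whether mid-training weight updates take effect on the next round may depend on the LightGBM version, so treat this as an assumption to verify.

```python
import numpy as np

def poisson_weight_callback(train_set, seed=0):
    """Return a LightGBM-style callback that redraws Poisson(1) row
    weights before every boosting round, emulating bagging with
    replacement.  `train_set` should be the same Dataset object that
    is passed to lgb.train (a sketch, not a tested recipe).
    """
    rng = np.random.default_rng(seed)
    n_rows = train_set.num_data()

    def _callback(env):  # env would be a lightgbm.callback.CallbackEnv
        # Weight 0 for rows left out, k for rows "drawn" k times.
        train_set.set_weight(rng.poisson(1.0, size=n_rows).astype(float))

    _callback.before_iteration = True  # run before, not after, each round
    _callback.order = 0
    return _callback

# Hypothetical usage:
#   booster = lgb.train(params, dtrain, callbacks=[poisson_weight_callback(dtrain)])
```

Because the weights are redrawn before each iteration, every tree effectively sees its own bootstrap resample of the training data.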

I see, thanks :)
