LightGBM: [Feature Request] Auto early stopping in Sklearn API

Created on 18 Aug 2020 · 6 comments · Source: microsoft/LightGBM

Is it possible to perform early stopping using cross-validation or automatically sampling data from the provided train set without explicitly specifying an eval set?

feature request


All 6 comments

lgb.cv supports early stopping; does that meet your request?
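For reference, a minimal sketch of early stopping with lgb.cv. The early_stopping_rounds argument and the "auc-mean" result key match the lightgbm API current when this thread was written; newer releases move early stopping into callbacks=[lgb.early_stopping(...)]:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)
train_set = lgb.Dataset(X, label=y)

# lgb.cv runs 5-fold CV and stops adding boosting rounds once the mean AUC
# across folds has not improved for 50 consecutive iterations.
cv_results = lgb.cv(
    {"objective": "binary", "metric": "auc"},
    train_set,
    num_boost_round=1000,
    nfold=5,
    early_stopping_rounds=50,
)
print("best iteration:", len(cv_results["auc-mean"]))
```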

@guolinke I was actually looking for the same feature within the Sklearn API. I've changed the title now.

This is how sklearn's HistGradientBoostingClassifier performs early stopping (by sampling the training data). There are significant benefits to this in terms of compatibility with the rest of the sklearn ecosystem, since most sklearn tools don't allow passing validation data or early-stopping rounds.
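For concreteness, here is roughly what that built-in behavior looks like; these parameter names come from scikit-learn's documented HistGradientBoostingClassifier API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Note: scikit-learn releases before 1.0 also required
# `from sklearn.experimental import enable_hist_gradient_boosting`.
X, y = make_classification(n_samples=1000, random_state=42)

# scikit-learn internally holds out validation_fraction of the training
# data and stops once the validation score fails to improve for
# n_iter_no_change consecutive iterations -- no eval set is passed to fit().
clf = HistGradientBoostingClassifier(
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000,
)
clf.fit(X, y)
print("iterations actually run:", clf.n_iter_)
```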

Enabling this sort of functionality would allow a significant speedup in hyperparameter searching by taking advantage of sklearn's cross_val_score or RandomizedSearchCV, which are efficiently multiprocessed and can evaluate multiple parameter sets, or multiple folds, at once. This scales better for many datasets than throwing more cores at LightGBM directly.
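To illustrate the workflow being described, a sketch of the current state: LGBMClassifier drops straight into RandomizedSearchCV, but every fit runs the full n_estimators rounds because there is no way to supply an eval set through the search:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=42)

# Each candidate/fold fit runs in its own process via n_jobs; automatic
# early stopping would let each of these fits finish early instead of
# always running all 500 rounds.
search = RandomizedSearchCV(
    lgb.LGBMClassifier(n_estimators=500),
    param_distributions={
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=5,
    cv=3,
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```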

Ideally this would be implemented as an option, of course, and not replace the existing behavior.

For your consideration, we did have a discussion about this with the scikit-learn maintainers in #2270. Using early stopping with a random subset of the data (not a validation set you create yourself) can lead to misleading results, because of information leaking from the training data to the validation data.

That being said... I personally favor adding automatic early stopping to the scikit-learn interface specifically, even if that means we use train_test_split() like they do and set some early_stopping_rounds to pass through to LightGBM. The goal of the scikit-learn API is to allow people who are using scikit-learn to plug in LightGBM as a possible model in things like GridSearchCV. Even if we disagree with the decision that scikit-learn made about early stopping for HistGradientBoostingClassifier, now that that decision has been made I think LightGBM's scikit-learn interface should adapt to it.

But I am not a Python maintainer here, so will defer to @guolinke and others.
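A minimal sketch of what that suggested behavior could look like, assuming the train_test_split() approach described above. The class name and validation_fraction parameter are hypothetical, not part of LightGBM, and the early_stopping_rounds fit argument matches the lightgbm API current at the time of this thread:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split


class AutoEarlyStoppingLGBMClassifier(lgb.LGBMClassifier):
    """Hypothetical LGBMClassifier that creates its own eval set."""

    def __init__(self, validation_fraction=0.1, early_stopping_rounds=50, **kwargs):
        super().__init__(**kwargs)
        self.validation_fraction = validation_fraction
        self.early_stopping_rounds = early_stopping_rounds

    def fit(self, X, y, **fit_params):
        # Hold out a random slice of the training data, as scikit-learn's
        # HistGradientBoostingClassifier does. The information-leakage
        # caveat raised above still applies to this split.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=self.validation_fraction
        )
        return super().fit(
            X_train,
            y_train,
            eval_set=[(X_val, y_val)],
            early_stopping_rounds=self.early_stopping_rounds,
            **fit_params,
        )
```

An estimator along these lines could drop straight into GridSearchCV or cross_val_score without the caller ever constructing an eval set.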

Thanks @jameslamb - that's helpful background, and I see the concerns (especially since you can't pass a cv object into HistGradientBoostingClassifier, so you are at the mercy of train_test_split).

I would find this functionality helpful despite these drawbacks, but it is obviously not essential.

@guolinke is it possible to add this functionality, as @jameslamb mentioned?
