Is it possible to perform early stopping using cross-validation or automatically sampling data from the provided train set without explicitly specifying an eval set?
lgb.cv supports early stopping, does it meet your request?
@guolinke I was actually looking for the same feature within the Sklearn API. Changed the title now
This is how sklearn's HistGradientBoostingClassifier performs early stopping (by sampling the training data). There are significant benefits to this in terms of compatibility with the rest of the sklearn ecosystem, since most sklearn tools don't allow for passing validation data or early stopping rounds.
Enabling this sort of functionality would allow a significant speedup in hyperparameter searching by taking advantage of sklearn's cross_val_score and RandomizedSearchCV, which are efficiently multiprocessed and can evaluate multiple sets of parameters, or multiple folds, at once. This scales better for many datasets than throwing more cores at LightGBM directly.
Ideally this would be implemented as an option, of course, and not replace the existing behavior.
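To make the request concrete, here is a minimal stdlib-only sketch of the mechanism being asked for: automatically carve a validation set out of the training data inside fit, then stop iterating once the holdout loss stops improving. All names are illustrative; this is not LightGBM or scikit-learn API, and the "booster" here is just a 1-D linear model trained by gradient descent standing in for a real boosting loop.

```python
import random

def fit_with_auto_early_stopping(X, y, validation_fraction=0.2,
                                 n_iter=1000, patience=10, lr=0.02, seed=0):
    # Automatically sample a holdout from the provided training data,
    # instead of requiring the caller to pass an explicit eval set.
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_val = max(1, int(len(X) * validation_fraction))
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    w, b = 0.0, 0.0  # toy 1-D linear model standing in for a booster

    def mse(indices):
        return sum((w * X[i] + b - y[i]) ** 2 for i in indices) / len(indices)

    best_loss, rounds_without_improvement = float("inf"), 0
    for it in range(n_iter):
        # one "boosting round": a gradient step fit on the training split
        gw = sum(2 * (w * X[i] + b - y[i]) * X[i] for i in train_idx) / len(train_idx)
        gb = sum(2 * (w * X[i] + b - y[i]) for i in train_idx) / len(train_idx)
        w -= lr * gw
        b -= lr * gb

        val_loss = mse(val_idx)
        if val_loss < best_loss - 1e-6:
            best_loss, rounds_without_improvement = val_loss, 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # early stop: holdout loss has plateaued
    return w, b, it + 1

# Toy noise-free data: y = 2x + 1
X = [i / 10 for i in range(50)]
y = [2 * x + 1 for x in X]
w, b, n_rounds = fit_with_auto_early_stopping(X, y)
```

The caller only ever supplies `X` and `y`, which is exactly what makes this pattern compose with tools like cross_val_score.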
For your consideration, we did have a discussion about this with the scikit-learn maintainers in #2270. Using early stopping with a random subset of the data (not a validation set you create yourself) can lead to misleading results, because of information leaking from the training data to the validation data.
That being said...I personally favor adding automatic early stopping to the scikit-learn interface specifically, even if that means that we use train_test_split() like they do and set some early_stopping_rounds to pass through to LightGBM. The goal of the scikit-learn API is to allow people who are using scikit-learn to plug in LightGBM as a possible model in things like GridSearchCV. Even if we disagree with the decision that scikit-learn made about early stopping for HistGradientBoostingClassifier, now that that decision has been made I think that LightGBM's scikit-learn interface should adapt to it.
But I am not a Python maintainer here, so will defer to @guolinke and others.
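For illustration, the wrapper-level plumbing described above could look roughly like the sketch below: fit() does an internal train/test split and forwards the holdout as the eval set, so users never pass one explicitly. Everything here is hypothetical, not actual LightGBM code; `InnerBooster` is a toy stand-in that merely records what it was given, and the split is done with the stdlib rather than sklearn's train_test_split to keep the sketch self-contained.

```python
import random

class InnerBooster:
    """Toy stand-in that mimics a booster's fit(eval_set=...) signature."""
    def fit(self, X, y, eval_set=None, early_stopping_rounds=None):
        self.mean_ = sum(y) / len(y)  # "train" on the training split only
        self.eval_set_size_ = len(eval_set[0][0]) if eval_set else 0
        return self

class AutoEarlyStoppingWrapper:
    """Hypothetical wrapper: fit(X, y) with no user-supplied eval set."""
    def __init__(self, estimator, validation_fraction=0.1,
                 early_stopping_rounds=10, random_state=0):
        self.estimator = estimator
        self.validation_fraction = validation_fraction
        self.early_stopping_rounds = early_stopping_rounds
        self.random_state = random_state

    def fit(self, X, y):
        # carve an internal holdout out of the provided training data
        rng = random.Random(self.random_state)
        idx = list(range(len(X)))
        rng.shuffle(idx)
        n_val = max(1, int(len(X) * self.validation_fraction))
        val, tr = idx[:n_val], idx[n_val:]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        X_val, y_val = [X[i] for i in val], [y[i] for i in val]
        # the key step: forward the internal holdout as the eval set
        self.estimator.fit(X_tr, y_tr,
                           eval_set=[(X_val, y_val)],
                           early_stopping_rounds=self.early_stopping_rounds)
        return self

model = AutoEarlyStoppingWrapper(InnerBooster(), validation_fraction=0.2)
model.fit(list(range(100)), list(range(100)))
```

Because fit() takes only (X, y), the wrapper satisfies the estimator contract that GridSearchCV and friends expect, while the leakage caveat from #2270 still applies to the internally sampled holdout.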
Thanks @jameslamb - that's helpful background, and I see the concerns (especially since you can't pass a cv object into HistGradientBoostingClassifier, so you're at the mercy of train_test_split).
I would find this functionality helpful despite these drawbacks, but it is obviously not essential.
@guolinke is it possible to add this functionality like @jameslamb mentioned?