Is it possible to perform early stopping using cross-validation or automatically sampling data from the provided train set without explicitly specifying an eval set?
lgb.cv supports early stopping, does it meet your request?
@guolinke I was actually looking for the same feature within the Sklearn API. Changed the title now
This is how sklearn's HistGradientBoostingClassifier performs early stopping (by sampling the training data). There are significant benefits to this in terms of compatibility with the rest of the sklearn ecosystem, since most sklearn tools don't allow for passing validation data or early stopping rounds.
Enabling this sort of functionality would allow a significant speedup in hyperparameter searching by taking advantage of sklearn's cross_val_score and RandomizedSearchCV, which are efficiently multiprocessed and can evaluate multiple sets of parameters, or multiple folds, at once. This scales better for many datasets than throwing more cores at LightGBM directly.
Ideally this would be implemented as an option, of course, and not replace the existing behavior.
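To make the request concrete, here is a minimal stdlib-only sketch of the mechanism being asked for: automatically carve a validation set out of the training data inside fit, then stop iterating once the holdout loss stops improving. All names are illustrative; this is not LightGBM or scikit-learn API, and the "booster" here is just a 1-D linear model trained by gradient descent standing in for a real boosting loop.

```python
import random

def fit_with_auto_early_stopping(X, y, validation_fraction=0.2,
                                 n_iter=1000, patience=10, lr=0.02, seed=0):
    # Automatically sample a holdout from the provided training data,
    # instead of requiring the caller to pass an explicit eval set.
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_val = max(1, int(len(X) * validation_fraction))
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    w, b = 0.0, 0.0  # toy 1-D linear model standing in for a booster

    def mse(indices):
        return sum((w * X[i] + b - y[i]) ** 2 for i in indices) / len(indices)

    best_loss, rounds_without_improvement = float("inf"), 0
    for it in range(n_iter):
        # one "boosting round": a gradient step fit on the training split
        gw = sum(2 * (w * X[i] + b - y[i]) * X[i] for i in train_idx) / len(train_idx)
        gb = sum(2 * (w * X[i] + b - y[i]) for i in train_idx) / len(train_idx)
        w -= lr * gw
        b -= lr * gb

        val_loss = mse(val_idx)
        if val_loss < best_loss - 1e-6:
            best_loss, rounds_without_improvement = val_loss, 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # early stop: holdout loss has plateaued
    return w, b, it + 1

# Toy noise-free data: y = 2x + 1
X = [i / 10 for i in range(50)]
y = [2 * x + 1 for x in X]
w, b, n_rounds = fit_with_auto_early_stopping(X, y)
```

The caller only ever supplies `X` and `y`, which is exactly what makes this pattern compose with tools like cross_val_score.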
For your consideration, we did have a discussion about this with the scikit-learn maintainers in #2270. Using early stopping with a random subset of the data (not a validation set you create yourself) can lead to misleading results, because of information leaking from the training data to the validation data.
That being said...I personally favor adding automatic early stopping to the scikit-learn interface specifically, even if that means that we use train_test_split() like they do and set some early_stopping_rounds to pass through to LightGBM. The goal of the scikit-learn API is to allow people who are using scikit-learn to plug in LightGBM as a possible model in things like GridSearchCV. Even if we disagree with the decision that scikit-learn made about early stopping for HistGradientBoostingClassifier, now that that decision has been made I think that LightGBM's scikit-learn interface should adapt to it.
But I am not a Python maintainer here, so will defer to @guolinke and others.
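For illustration, the wrapper-level plumbing described above could look roughly like the sketch below: fit() does an internal train/test split and forwards the holdout as the eval set, so users never pass one explicitly. Everything here is hypothetical, not actual LightGBM code; `InnerBooster` is a toy stand-in that merely records what it was given, and the split is done with the stdlib rather than sklearn's train_test_split to keep the sketch self-contained.

```python
import random

class InnerBooster:
    """Toy stand-in that mimics a booster's fit(eval_set=...) signature."""
    def fit(self, X, y, eval_set=None, early_stopping_rounds=None):
        self.mean_ = sum(y) / len(y)  # "train" on the training split only
        self.eval_set_size_ = len(eval_set[0][0]) if eval_set else 0
        return self

class AutoEarlyStoppingWrapper:
    """Hypothetical wrapper: fit(X, y) with no user-supplied eval set."""
    def __init__(self, estimator, validation_fraction=0.1,
                 early_stopping_rounds=10, random_state=0):
        self.estimator = estimator
        self.validation_fraction = validation_fraction
        self.early_stopping_rounds = early_stopping_rounds
        self.random_state = random_state

    def fit(self, X, y):
        # carve an internal holdout out of the provided training data
        rng = random.Random(self.random_state)
        idx = list(range(len(X)))
        rng.shuffle(idx)
        n_val = max(1, int(len(X) * self.validation_fraction))
        val, tr = idx[:n_val], idx[n_val:]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        X_val, y_val = [X[i] for i in val], [y[i] for i in val]
        # the key step: forward the internal holdout as the eval set
        self.estimator.fit(X_tr, y_tr,
                           eval_set=[(X_val, y_val)],
                           early_stopping_rounds=self.early_stopping_rounds)
        return self

model = AutoEarlyStoppingWrapper(InnerBooster(), validation_fraction=0.2)
model.fit(list(range(100)), list(range(100)))
```

Because fit() takes only (X, y), the wrapper satisfies the estimator contract that GridSearchCV and friends expect, while the leakage caveat from #2270 still applies to the internally sampled holdout.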
Thanks @jameslamb - that's helpful background, and I see the concerns (especially since you can't pass a cv object into HistGradientBoostingClassifier, so you're at the mercy of train_test_split).
I would find this functionality helpful despite these drawbacks, but it is obviously not essential.
@guolinke is it possible to add this functionality like @jameslamb mentioned?