GroupKFold can be used in cross-validation to make sure a group is not split across two or more folds. This can be particularly important for preventing data/information leakage between training and validation folds (and, by extension, relative to test time). Sklearn implements this in the GroupKFold class, which LightGBM already uses.
However, it is only used for the lambdarank objective, whereas it would be very useful for classification and regression tasks as well.
For classification tasks, passing a group would mean the data is no longer stratified. However, for large datasets without severe class imbalance, group leakage can be much more problematic than minor differences in positive:negative label ratios.
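For context, here is a minimal sketch of how sklearn's GroupKFold keeps all samples from the same group in the same fold (toy data; names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 samples belonging to 3 groups.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Each test fold contains whole groups only, so no group
    # is ever split between train and test.
    print(train_idx, test_idx, groups[test_idx])
```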
This could be implemented with something like:

```python
if 'objective' in params and params['objective'] == 'lambdarank':
    ...  # existing lambdarank code here
elif group is not None:
    # _LGBMGroupKFold is LightGBM's compat alias for sklearn's GroupKFold
    group_kfold = _LGBMGroupKFold(n_splits=nfold)
    folds = group_kfold.split(X=np.zeros(num_data), groups=group)
```
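Under this proposal, a call might look like the following sketch; note that the `group` keyword on `lgb.cv` is hypothetical here and does not exist yet:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)  # 10 groups of 10 rows each

train_set = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'verbose': -1}

# `group=` is the proposed (hypothetical) argument: one group label per
# row, analogous to sklearn's `groups`; folds would come from GroupKFold.
cv_results = lgb.cv(params, train_set, nfold=5, group=groups)
```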
Happy to do a PR, with a little guidance :)
You're welcome to open the PR. We will help you after that :)
@guolinke - reading over the code, it seems there might be a nicer way to include GroupKFold and some of the other sklearn model selection cross-validation iterators.
The LightGBM Python package already uses the sklearn CV classes (StratifiedKFold and GroupKFold) internally. Rather than adding code to allow _just_ GroupKFold, it would be a similar amount of work to pass an _uninitialised_ sk.model_selection object to lgb.cv and use that. This would also give us GroupKFold, TimeSeriesSplit, and LeaveOneGroupOut.
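To make the _uninitialised_ option concrete: the user would pass the splitter class itself, and lgb.cv would construct it with its own nfold. A sketch, where `_build_splitter` is a hypothetical helper, not existing LightGBM code:

```python
from sklearn.model_selection import GroupKFold


def _build_splitter(folds_cls, nfold):
    # Hypothetical: lgb.cv instantiates the class the user passed in,
    # e.g. lgb.cv(params, data, folds=GroupKFold, nfold=5).
    return folds_cls(n_splits=nfold)


splitter = _build_splitter(GroupKFold, nfold=5)
```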
Even better, for a similar amount of work, we could pass an _initialised_ sk.model_selection object. This would allow users to use _any_ of the sklearn model selection cross-validation iterators, and would be my preferred option. The Python code for this might look something like:
```python
def _make_n_folds(full_data, folds, nfold, params, seed, fpreproc=None,
                  stratified=True, shuffle=True):
    """Make an n-fold list of Boosters from random indices."""
    full_data = full_data.construct()
    num_data = full_data.num_data()
    if folds is not None:
        if hasattr(folds, 'split'):
            # An initialised sklearn splitter: build the `groups` array
            # from the Dataset's query/group sizes and call split().
            group_info = full_data.get_group()
            if group_info is not None:
                group_info = group_info.astype(int)
                # Expand group sizes into one group id per row.
                flatted_group = np.repeat(range(len(group_info)),
                                          repeats=group_info)
            else:
                flatted_group = np.zeros(num_data, dtype=int)
            folds = folds.split(X=np.zeros(num_data),
                                y=full_data.get_label(),
                                groups=flatted_group)
        elif not hasattr(folds, '__iter__'):
            raise AttributeError("folds should be a generator or iterator "
                                 "of (train_idx, test_idx) tuples, or an "
                                 "object with a split method")
```
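With group sizes attached to the Dataset (via `set_group`), the splitter above would receive per-row group ids. A usage sketch, assuming the proposed splitter-passing API:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

data = lgb.Dataset(X, label=y)
data.set_group(np.full(10, 10))  # 10 groups of 10 consecutive rows

params = {'objective': 'binary', 'verbose': -1}
# Passing an initialised splitter as `folds` (the proposal above).
cv_results = lgb.cv(params, data, folds=GroupKFold(n_splits=5))
```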
This said, we can currently get this functionality with sklearn outside of lgb.cv:

```python
folds = sk.model_selection.GroupKFold().split(X, y, groups)
# other operations on X, y, data, groups etc.
lgb.cv(params, data, folds=folds)
```
The new code would look like:

```python
folds_object = sk.model_selection.GroupKFold()
# other operations on X, y, data, groups etc.
lgb.cv(params, data, folds=folds_object)
```
The main advantage of this is that, in the current workaround, `folds` is computed from X and y before they are passed to an lgb.Dataset, so the indices could correspond to a different shape, order, or grouping if X and y are operated on before training. With the _new_ version, `folds_object` would be guaranteed to split on the lgb.Dataset itself.
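A toy illustration of how precomputed indices go stale (names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

# Indices computed against the ORIGINAL row order...
folds = list(GroupKFold(n_splits=5).split(X, y, groups))

# ...silently go stale if the data is reordered or filtered afterwards:
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]
# `folds` still indexes the old order, so groups can now leak
# across the train/test boundary.
```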
In summary, we gain only a small amount of functionality by implementing only GroupKFold. Should we instead implement the initialised, safer sklearn splitter option for `folds`, even though the "less stable" functionality is already achievable through another, similar means?
+1 for supporting sklearn CV iterators. Please feel free to create a PR!