GroupKFold can be used in cross-validation to make sure a group is not split across two or more folds. This can be particularly important for preventing data/information leakage between training and validation folds (and, by extension, relative to test time). Sklearn implements this in the GroupKFold class, which LightGBM already uses.
However, it is only used for the lambdarank objective, whereas it would be very useful for classification and regression tasks as well.
For classification tasks, passing a group would mean the data is no longer stratified. However, for large datasets without severe class imbalance, group leakage can be much more problematic than minor differences in positive:negative label ratios.
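For context, here is a minimal sketch of how sklearn's GroupKFold keeps all samples from the same group in the same fold (toy data; names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 samples belonging to 3 groups.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Each test fold contains whole groups only, so no group
    # is ever split between train and test.
    print(train_idx, test_idx, groups[test_idx])
```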
This could be implemented with something like:

```python
if 'objective' in params and params['objective'] == 'lambdarank':
    ...  # existing lambdarank code here
elif group is not None:
    # _LGBMGroupKFold is LightGBM's compat alias for sklearn's GroupKFold
    group_kfold = _LGBMGroupKFold(n_splits=nfold)
    folds = group_kfold.split(X=np.zeros(num_data), groups=group)
```
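Under this proposal, a call might look like the following sketch; note that the `group` keyword on `lgb.cv` is hypothetical here and does not exist yet:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)  # 10 groups of 10 rows each

train_set = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'verbose': -1}

# `group=` is the proposed (hypothetical) argument: one group label per
# row, analogous to sklearn's `groups`; folds would come from GroupKFold.
cv_results = lgb.cv(params, train_set, nfold=5, group=groups)
```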
Happy to do a PR, with a little guidance :)
You're welcome to open the PR. We will help you after that :)
@guolinke - reading over the code, it seems there might be a nicer way to include GroupKFold and some of the other sklearn model selection cross-validation iterators.
The LightGBM Python package already uses the sklearn CV classes (StratifiedKFold and GroupKFold) internally. Rather than adding code to allow _just_ GroupKFold, it would be a similar amount of work to pass an _uninitialised_ sk.model_selection object to lgb.cv and use that. This would also give us GroupKFold, TimeSeriesSplit, and LeaveOneGroupOut.
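To make the _uninitialised_ option concrete: the user would pass the splitter class itself, and lgb.cv would construct it with its own nfold. A sketch, where `_build_splitter` is a hypothetical helper, not existing LightGBM code:

```python
from sklearn.model_selection import GroupKFold


def _build_splitter(folds_cls, nfold):
    # Hypothetical: lgb.cv instantiates the class the user passed in,
    # e.g. lgb.cv(params, data, folds=GroupKFold, nfold=5).
    return folds_cls(n_splits=nfold)


splitter = _build_splitter(GroupKFold, nfold=5)
```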
Even better, for a similar amount of work, we could pass an _initialised_ sk.model_selection object. This would allow users to use _any_ of the sklearn model selection cross-validation iterators, and would be my preferred option. The Python code for this might look something like:
```python
def _make_n_folds(full_data, folds, nfold, params, seed, fpreproc=None,
                  stratified=True, shuffle=True):
    """Make an n-fold list of Boosters from random indices."""
    full_data = full_data.construct()
    num_data = full_data.num_data()
    if folds is not None:
        if hasattr(folds, 'split'):
            # An initialised sklearn splitter: build the `groups` array
            # from the Dataset's query/group sizes and call split().
            group_info = full_data.get_group()
            if group_info is not None:
                group_info = group_info.astype(int)
                # Expand group sizes into one group id per row.
                flatted_group = np.repeat(range(len(group_info)),
                                          repeats=group_info)
            else:
                flatted_group = np.zeros(num_data, dtype=int)
            folds = folds.split(X=np.zeros(num_data),
                                y=full_data.get_label(),
                                groups=flatted_group)
        elif not hasattr(folds, '__iter__'):
            raise AttributeError("folds should be a generator or iterator "
                                 "of (train_idx, test_idx) tuples, or an "
                                 "object with a split method")
```
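With group sizes attached to the Dataset (via `set_group`), the splitter above would receive per-row group ids. A usage sketch, assuming the proposed splitter-passing API:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

data = lgb.Dataset(X, label=y)
data.set_group(np.full(10, 10))  # 10 groups of 10 consecutive rows

params = {'objective': 'binary', 'verbose': -1}
# Passing an initialised splitter as `folds` (the proposal above).
cv_results = lgb.cv(params, data, folds=GroupKFold(n_splits=5))
```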
This said, we can currently get this functionality with sklearn outside of lgb.cv:

```python
folds = sk.model_selection.GroupKFold().split(X, y, groups)
# other operations on X, y, data, groups etc.
lgb.cv(params, data, folds=folds)
```
The new code would look like:

```python
folds_object = sk.model_selection.GroupKFold()
# other operations on X, y, data, groups etc.
lgb.cv(params, data, folds=folds_object)
```
The main advantage of this is that, in the current workaround, `folds` is computed from X and y before they are passed to an lgb.Dataset, so the indices could correspond to a different shape, order, or grouping if X and y are operated on before training. With the _new_ version, `folds_object` would be guaranteed to split on the lgb.Dataset itself.
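A toy illustration of how precomputed indices go stale (names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

# Indices computed against the ORIGINAL row order...
folds = list(GroupKFold(n_splits=5).split(X, y, groups))

# ...silently go stale if the data is reordered or filtered afterwards:
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]
# `folds` still indexes the old order, so groups can now leak
# across the train/test boundary.
```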
In summary, we gain only a small amount of functionality by implementing only GroupKFold. Should we instead implement the initialised, safer sklearn splitter option for `folds`, even though the "less stable" functionality is already achievable through another, similar means?
+1 for supporting sklearn CV iterators. Please feel free to create a PR!