Lightgbm: Parameter min_data_in_leaf ignored by lightgbm.cv()

Created on 1 Sep 2020 · 38 comments · Source: microsoft/LightGBM

Environment info

Component: Python package

Operating System: Windows 10

CPU/GPU model: GeForce 960M

CMake version: 3.18.2

Python version: 3.8.3


LightGBM version: 3.0.0

Error message and / or logs

[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10

Reproducible example(s)

param = {
         'min_data_in_leaf':200,
         'feature_pre_filter' : False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class':21}

cvm = lightgbm.cv(param, nfold=4, train_set = train_data, categorical_feature=categorical_feature)

Steps to reproduce

  1. Run the script above.
  2. Receive the warning, [LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10.
  3. Notice the error lists min_data_in_leaf=10 instead of 200.
bug

Most helpful comment

are the hyperparameters unchangeable during the search in cv?

Good question. It is indeed a path-dependent issue, but it requires a conjunction of as many as three conditions:

  • at least two models (CV or single) trained in sequence (in a notebook or a .py script) (kudos to @Merudo), AND
  • a synonymous parameter name passed together with the primary one, AND
  • a change of the primary parameter's value to its default (a change measured from the previous model run to the current one).

Then the default value passed under the primary name takes precedence during training over the custom value passed under the synonym. The model object preserves both, so a reader (e.g. one extracting the model object from object storage in systems like mlflow) naturally assumes that the custom value was used for training, which is not true here. I tried various scenarios (paths) and synonyms, and all of them required these three conditions to produce an incorrect metric (one based on the default despite a custom value passed to a synonym). I can imagine it extends to other hyperparameters too. Reloading the dataset before each model is trained does not help. I hope someone can reproduce this with independent tests.

For example:

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data': 20,
 'min_data_in_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.11551373151193695
In repeat 0 we expect the user has set CUSTOM value to min_data and/or min_data_in_leaf
.. because model metric (0.11551) is consistent with using custom value(s)
.. but what she actually used for min_data was 20
.. and what she actually used for min_data_in_leaf was 1
.. min_data changed from 20 to 20
.. min_data_in_leaf changed from 1 to 1
.. as expected

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_child_samples': 1,
 'min_data_in_leaf': 20,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.09750119910994513
In repeat 0 we expect the user has set DEFAULT value to both min_child_samples and min_data_in_leaf..
.. because model metric (0.09750) is consistent with using default values
.. but what she actually used for min_child_samples was 1
.. and what she actually used for min_data_in_leaf was 20
.. min_child_samples changed from 20 to 1
.. min_data_in_leaf changed from 1 to 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-f6519d947615> in <module>
    106                         print(".. %s changed from %d to %d" % (DEFAULT_NAME, prev_test_val_for_def_name, test_val_for_def_name))
    107                         # assert(test_val == DEFAULT_VALUE) # OK (never fails)
--> 108                         assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
    109                         print(".. as expected\n")
    110 

AssertionError: 

All 38 comments

@Merudo I can't reproduce the issue given only the cv params. Please provide an MCVE.

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

categorical_feature = [0, 2]
param = {
         'min_data_in_leaf':200,
         'feature_pre_filter' : False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class':3}

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

The user-specified custom settings of this and many other hyperparameters seem to be used correctly in v3.0.0, at least judging from the fact that the user-specified values are preserved in the model object. I don't know whether that is strong enough evidence: it depends on whether the custom value can be replaced by the default at training time and then incorrectly preserved in the model object as if it had been used.

@Merudo I observed similar warnings issued by LightGBM when passing hyperparameter synonyms to a training function. I reproduced your warning message in v3.0.0 by passing such synonyms for multiple hyperparameters (not just for min_data_in_leaf), all set equal to their default values (so as to be certain which values are actually used):

Fold 5 of 5 2020-09-04 08:35:27.761541
[LightGBM] [Warning] bin_construct_sample_cnt is set=200000, subsample_for_bin=200000 will be ignored. Current value: bin_construct_sample_cnt=200000
[LightGBM] [Warning] feature_fraction is set=1.0, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=1.0
[LightGBM] [Warning] lambda_l1 is set=0, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0
[LightGBM] [Warning] num_threads is set=16, n_jobs=-1 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] min_gain_to_split is set=0.0, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.0
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_fraction is set=1.0, subsample=1.0 will be ignored. Current value: bagging_fraction=1.0
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] bagging_freq is set=0, subsample_freq=0 will be ignored. Current value: bagging_freq=0

The trigger seems to be the use of a synonym (rather than the "primary" parameter name, i.e. the one defined in lightgbm.LGBMClassifier). The question is what happens if a synonym is passed with a non-default value: is the default used then (which would indeed be a bug)?

Using a simplified version of @StrikerRUS's example, I could reproduce the warnings but not the problem (of the custom value being ignored). At least the logs indicate that a conflict between two synonyms is resolved in favor of the primary name (as defined in Parameters).

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=2, return_X_y=True)
train_data = lgb.Dataset(X, y)

param = {         
         'min_child_samples': 11,
         'min_data_in_leaf': 22,
         'bagging_freq': 22,
         'subsample_freq': 11,
         'feature_pre_filter' : False,
         'objective': 'binary',
         'metric': 'auc'}

cvm = lgb.cv(param, nfold=4, train_set=train_data)

Output:

[LightGBM] [Warning] min_data_in_leaf is set=22, min_child_samples=11 will be ignored. Current value: min_data_in_leaf=22
[LightGBM] [Warning] bagging_freq is set=22, subsample_freq=11 will be ignored. Current value: bagging_freq=22
[..]

But then again, we cannot inspect lightgbm.cv models (they are not preserved), so we don't really know which value was used. I will have to try this in production pipelines, where we do preserve CV model objects.

Yes, it looks like the problem of using multiple aliases at the same time. But without a full example we cannot be 100% sure.

At least the logs indicate that a conflict of two synonyms is resolved in favor of the primary name

Conflict resolution is deterministic, but I think there is no guarantee that the primary name will be used:
https://github.com/microsoft/LightGBM/blob/82e2ff7a018f2466f8102c80a7bafb9324238d19/include/LightGBM/config.h#L1053-L1061
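The resolution behavior described by the warnings can be pictured with a toy Python sketch (purely illustrative; this is NOT LightGBM's actual C++ implementation, and the ALIASES map and resolve() helper are hypothetical names): when both a primary name and an alias are present, the alias is dropped with a warning, and when only the alias is present, its value is copied to the primary name.

```python
# Toy sketch of alias conflict resolution (NOT LightGBM's actual C++
# code; see the linked config.h for the real logic).
ALIASES = {
    'min_child_samples': 'min_data_in_leaf',
    'subsample_freq': 'bagging_freq',
}

def resolve(params):
    """Return (resolved_params, warnings) after alias resolution."""
    resolved = dict(params)
    warnings = []
    for alias, primary in ALIASES.items():
        if alias in resolved:
            alias_val = resolved.pop(alias)
            if primary in resolved:
                # both names present: the primary value wins
                warnings.append(
                    "[Warning] %s is set=%s, %s=%s will be ignored. "
                    "Current value: %s=%s"
                    % (primary, resolved[primary], alias, alias_val,
                       primary, resolved[primary]))
            else:
                # only the alias present: copy it to the primary name
                resolved[primary] = alias_val
    return resolved, warnings

params, warns = resolve({'min_data_in_leaf': 22, 'min_child_samples': 11})
print(params)    # {'min_data_in_leaf': 22}
print(warns[0])  # mirrors the warning format quoted in the logs above
```

This toy always favors the primary name, which matches the warnings quoted above but, as noted, is not guaranteed by the real conflict-resolution code.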

I ran yet another test, using a realistic LGBMRegressor with a custom cross-validation function (a wrapper around lightgbm.cv) in which we preserve model objects. Logs and model object data suggest that the custom hyperparameter values were used correctly. Note that to trigger the warning I did not have to supply two synonyms as above; it was enough to pass non-primary names. I believe this behavior already existed in previous versions.

[LightGBM] [Warning] seed is set=123, random_state=123 will be ignored. Current value: seed=123
[LightGBM] [Warning] min_data_in_leaf is set=9, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=9
[..]
[LightGBM] [Warning] num_threads is set=16, n_jobs=16 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] bin_construct_sample_cnt is set=350000, subsample_for_bin=200000 will be ignored. Current value: bin_construct_sample_cnt=350000
[..]
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.140925 seconds.
You can set `force_col_wise=true` to remove the overhead.
[..]
A single [..] model (LGBMRegressor(bagging_fraction=0.85, bagging_freq=1,
              bin_construct_sample_cnt=350000, boost_from_average=True,
              device='cpu', feature_fraction=0.95, feature_fraction_bynode=0.7,
              is_unbalance=False, lambda_l1=0.158, lambda_l2=1.775,
              learning_rate=0.063, max_bin=1023, max_depth=85,
              min_child_weight=0.1, min_data_in_leaf=9,
              min_gain_to_split=8.9e-14, n_estimators=1100, n_jobs=16,
              num_iterations=1100, num_leaves=100, num_threads=16,
              objective='regression', random_state=123, seed=123, silent=False,
              tweedie_variance_power=1.5)) had the following metrics on the test set:
[..]

The only problem is that standardized accuracy metrics (here: R^2) have deteriorated (by at least half a percentage point) in v3.0.0 on all test sets compared to v2.3.1 (where I can reproduce archived metrics exactly). Because this regression model has optimized hyperparameters, a deterioration in accuracy is consistent with defaults being used instead of the custom hyperparameter values (despite all of the above evidence to the contrary). I will try to distinguish this from other sources of accuracy deterioration by replicating another model, this time with default hyperparameters (which should be immune to the issue reported by @Merudo).

Using default hyperparameters (but keeping all other settings unchanged from the above regression), I reproduced in v3.0.0 the metrics known from v2.3.1. On the other hand, as reported above, it was impossible to reproduce the known metrics when using non-default (optimized) hyperparameters in v3.0.0, i.e. to match the improved metrics obtained with those hyperparameters in v2.3.1.

So @StrikerRUS yes, there seems to be a genuine bug here (or elsewhere where LightGBM can fail to use non-default hyperparameters).

Using default hyperparameters (but keeping all other settings unchanged from the above regression) I reproduced the metrics known from v2.3.1 now in v3.0.0

We have a lot of breaking changes in 3.0.0, so it may be just a coincidence that your metrics are the same in 2.3.1 and 3.0.0 for a particular set of params.

As I know, there have been no changes in default values of params since 2.3.1.

it may be just a coincidence that your metrics are the same in 2.3.1 and 3.0.0 for a particular set of params.

In general yes, but not in this case, for various reasons. One of them is that exact equality arising as a confluence of several changes acting in opposite directions is a rather unlikely explanation (and one failing an Occam's razor test, if I may say so :) when we observe it on 4 mostly uncorrelated metrics, each computed on 3 different datasets, i.e. on 12 values in total...

For instance I can reveal here MSE values, as these metrics are dataset-dependent:

  1. default hyperparameters
    lightgbm: v2.3.1 (py37) | v3.0.0 (py38)
    test set1: 142091.4 | 142091.4
    test set2: 166397.3 | 166397.3
    test set3: 149005.5 | 149005.5

vs.

  2. custom hyperparameters
    lightgbm: v2.3.1 (py37) | v2.3.1 (py38) | v3.0.0 (py38)
    test set1: 137259.1 | 137259.1 | 140917.6
    test set2: 178320.1 | 178320.1 | 188110.4
    test set3: 145246.7 | 145246.7 | 151146.2

But I will nevertheless try to produce a self-contained example :)

It seems lgb.cv stores parameters entered in previous calls in the dataset, and uses those instead of the most recent parameters.

Example:

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

categorical_feature = [0, 2]
param = {
         'min_data_in_leaf':15,
         'min_child_samples': 30,
         'feature_pre_filter' : False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class':3}

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

param['min_data_in_leaf'] = 20

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

On the second call, I get the warnings:

[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000639 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 402, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000651 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000526 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000535 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15

It warns that min_data_in_leaf is set to 15, while it was actually set to 20.

Workaround

A workaround is to rebuild the dataset after each call to lgb.cv. For example:

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

categorical_feature = [0, 2]
param = {
         'min_data_in_leaf':15,
         'min_child_samples': 30,
         'feature_pre_filter' : False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class':3}

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

train_data = lgb.Dataset(X, y)    # rebuild the dataset to forget previous parameters entered
param['min_data_in_leaf'] = 20

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

In this case, there is no reference to min_data_in_leaf being 15.
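To make the rebuild harder to forget, the workaround can be wrapped in a small helper (hypothetical code, not part of the LightGBM API; here cv_fn would be lgb.cv and make_dataset any zero-argument factory such as lambda: lgb.Dataset(X, y)):

```python
def cv_with_fresh_dataset(cv_fn, make_dataset, params, **cv_kwargs):
    # Build a brand-new dataset for every run, so parameters cached
    # on a previously used Dataset object cannot leak into this call.
    # (Hypothetical helper; cv_fn would be lgb.cv and make_dataset
    # something like `lambda: lgb.Dataset(X, y)`.)
    return cv_fn(params, train_set=make_dataset(), **cv_kwargs)
```

Each sweep over candidate values of min_data_in_leaf then gets its own Dataset, at the cost of re-binning the data on every call.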

@mirekphd did you use bagging in your hyperparameter search?
If yes, could you try to disable bagging and compare the results again?

@Merudo @mirekphd it is a long thread; are the hyperparameters unchangeable during the search in cv?

I think it is related to #2594: the Dataset object will preserve its initial values.
The affected parameters are:

https://github.com/microsoft/LightGBM/blob/5b5f4e39a9d9b075ef0aedafb9e400ede521a34f/python-package/lightgbm/basic.py#L810-L828
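The preserving behavior can be pictured with a toy sketch (a hypothetical ToyDataset class, NOT the real lightgbm.basic.Dataset): the first call caches the full params dict, and later calls merge on top of it, so a stale alias from the first run survives into the second.

```python
# Toy sketch of a dataset object that caches construction-time params
# (hypothetical; NOT the real lightgbm.basic.Dataset).
class ToyDataset:
    def __init__(self):
        self.params = None

    def update_params(self, params):
        if self.params is None:
            # first cv() call: the dataset is built with these params
            self.params = dict(params)
        else:
            # later calls merge on top of the cached dict
            self.params.update(params)
        return self.params

ds = ToyDataset()
ds.update_params({'min_data_in_leaf': 15, 'min_child_samples': 30})
# The second run changes only min_data_in_leaf, but the cached
# min_child_samples alias survives, so the conflict warning fires again.
merged = ds.update_params({'min_data_in_leaf': 20})
print(merged)  # {'min_data_in_leaf': 20, 'min_child_samples': 30}
```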

As I know, there have been no changes in default values of params since 2.3.1.

I wrote a comprehensive metric-based unit test to cover all scenarios (synonyms of min_data_in_leaf), because I expected it to pass at least in the previous LightGBM release... it did not:

sklearn: 0.23.2
lightgbm: 2.3.1
[..]
{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data_in_leaf': 20,
 'min_data_per_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
In repeat 1 user must have set DEFAULT value to both min_data_per_leaf and min_data_in_leaf..
.. because metric was 0.13516
.. but what she actually used for min_data_per_leaf was 1
.. and what she actually used for min_data_in_leaf was 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-28-4df963d4f357> in <module>
     78                     print(".. but what she actually used for %s was %d" % (test_name, test_val))
     79                     print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))
---> 80                     assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
     81                     print(".. as expected\n")
     82 

AssertionError: 

In v3.0.0 the metrics change (for this model), but after adjusting for that, the test fails in the same place as before:

sklearn: 0.23.2
lightgbm: 3.0.0

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data_in_leaf': 20,
 'min_data_per_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
In repeat 1 user must have set DEFAULT value to both min_data_per_leaf and min_data_in_leaf..
.. because metric was 0.09750
.. but what she actually used for min_data_per_leaf was 1
.. and what she actually used for min_data_in_leaf was 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-4df963d4f357> in <module>
     78                     print(".. but what she actually used for %s was %d" % (test_name, test_val))
     79                     print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))
---> 80                     assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
     81                     print(".. as expected\n")
     82 

AssertionError: 

I think it is related to #2594: the Dataset object will preserve its initial values.
The affected parameters are:

Except that the parameter @Merudo first noticed as problematic, and which we have used ever since, min_data_in_leaf, is not on this list :)

The problem apparently existed even in an earlier version of LightGBM, v2.2.3, the oldest one I can test now:

sklearn: 0.21.3
lightgbm: 2.2.3
[..]
{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data_in_leaf': 20,
 'min_data_per_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
In repeat 1 user must have set DEFAULT value to both min_data_per_leaf and min_data_in_leaf..
.. because metric was 0.13847
.. but what she actually used for min_data_per_leaf was 1
.. and what she actually used for min_data_in_leaf was 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-e06fb7db843d> in <module>
     81                     print(".. but what she actually used for %s was %d" % (test_name, test_val))
     82                     print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))
---> 83                     assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
     84                     print(".. as expected\n")
     85 

AssertionError: 

Workaround

A workaround is to rebuild the dataset after each call to lgb.cv.

Sorry, the workaround does not seem to help (tested in v3.0.0) :)

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data_in_leaf': 20,
 'min_data_per_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
In repeat 1 user must have set DEFAULT value to both min_data_per_leaf and min_data_in_leaf..
.. because metric was 0.09750
.. but what she actually used for min_data_per_leaf was 1
.. and what she actually used for min_data_in_leaf was 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-cc0e847f53f6> in <module>
     78                     print(".. but what she actually used for %s was %d" % (test_name, test_val))
     79                     print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))
---> 80                     assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
     81                     print(".. as expected\n")
     82 

AssertionError: 

@mirekphd I'm afraid I don't understand the purpose of your tests. It seems you are testing whether the parameters passed are equal to the default parameter values?

You specified the value of min_data_per_leaf to be 1 while the default is 20, so the test is failing as expected.

@Merudo Please note that there is an issue with your code (https://github.com/microsoft/LightGBM/issues/3346#issuecomment-687245227): you have two aliases of the same parameter in one params dict:

         'min_data_in_leaf':15,
         'min_child_samples': 30,

https://lightgbm.readthedocs.io/en/latest/Parameters.html#min_data_in_leaf

I believe that consistently using only one parameter name (min_data_in_leaf, for example) solves the problem:

         'min_data_in_leaf':15,
#          'min_child_samples': 30,

I'm afraid I don't understand the purpose of your tests. It seems you are testing whether the parameters passed are equal to the default parameter values?

You specified the value of min_data_per_leaf to be 1 while the default is 20, so the test is failing as expected.

No, no: that part of the test (which checks whether both default parameter values were passed by the user) is executed only if the model metric equals the value expected from such default parameters (logloss_mean = np.mean(cvm['multi_logloss-mean']), archived when default parameters were surely used for training). I was trying to spare you the code, but here is the relevant part, if it makes things any clearer :)

                if (abs(np.round(logloss_mean - DEFAULT_MEAN,8)) < 1e-8):

                    # metric indicates that default was used for training
                    print("In repeat %d user must have set DEFAULT value to both %s and %s.." % (int(repeat_num)+1, test_name, DEFAULT_NAME))
                    print(".. because metric was %.5f" % logloss_mean)

                    # so test if *both* parameter names (custom and default) had *default* value
                    print(".. but what she actually used for %s was %d" % (test_name, test_val))                    
                    print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))                    
                    # assert(test_val == DEFAULT_VALUE) # OK (never fails)
                    assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
                    print(".. as expected\n")

                else:

                    # metric indicates that default was NOT used for training
                    print("In repeat %d user must have set CUSTOM value to %s or %s" % (int(repeat_num)+1, test_name, DEFAULT_NAME))
                    print(".. because metric was %.5f" % logloss_mean)

                    # so test if *either* parameter name had *custom* (non-default) value
                    print(".. but what she actually used for %s was %d" % (test_name, test_val))                    
                    print(".. and what she actually used for %s was %d" % (DEFAULT_NAME, test_val_for_def_name))                    
                    # assert(test_val != DEFAULT_VALUE) # OK (never fails)
                    assert(test_val != DEFAULT_VALUE or test_val_for_def_name != DEFAULT_VALUE)
                    print(".. as expected\n")

@mirekphd did you use bagging in your hyperparameter search?
If yes, could you try to disable bagging and compare the results again?

After disabling bagging (by setting bagging_fraction to 1.0 and bagging_freq to 0), the regression model metrics did change from those obtained with default hyperparameters (improving on one test set, deteriorating on the two others), and consistently improved over the values obtained in v3.0.0 with optimized hyperparameters. This indicates that defaults were not used and that it was possible to set custom values, at least for bagging. When using defaults, switching off bagging did not change the results (as expected, since bagging is off by default).

For instance I can reveal here MSE values, as these metrics are dataset-dependent:

default hyperparameters
lightgbm: v2.3.1 (py37) | v3.0.0 (py38) | v3.0.0 (py38) (no bag)
test set1: 142091.4 | 142091.4 | 142091.4
test set2: 166397.3 | 166397.3 | 166397.3
test set3: 149005.5 | 149005.5 | 149005.5

vs.

custom hyperparameters
lightgbm: v2.3.1 (py37) | v2.3.1 (py38) | v3.0.0 (py38) | v3.0.0 (py38) (no bag)
test set1: 137259.1 | 137259.1 | 140917.6 | 139325.0
test set2: 178320.1 | 178320.1 | 188110.4 | 179134.3
test set3: 145246.7 | 145246.7 | 151146.2 | 150236.3

I believe, consistently using only one parameter name (min_data_in_leaf for example) solves the problem:

         'min_data_in_leaf':15,
#          'min_child_samples': 30,

Yup, this is true: a metric-based regression test never fails when there is no name conflict to resolve:

                # param = {**orig_param, **{test_name:test_val}} # OK (never fails)
                param = {**orig_param, **{test_name:test_val}, **{DEFAULT_NAME:test_val_for_def_name}}

@Merudo
I think @StrikerRUS is right.
BTW, even with an alias in the parameters, the results are still correct.
You can check the returned results; they are different.

There is an additional warning when setting params on the dataset after the first cv:

>>> train_data._update_params(param)
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
<lightgbm.basic.Dataset object at 0x000002A6B2A0BFA0>

But the value should be right. I will fix the additional false warning.

@mirekphd We have a new change to ensure the consistency of bagging when using multi-threading.
This change will result in different bagging results, so the accuracy will differ when bagging is used.
Is the result consistently worse when using bagging, compared with the old versions?

@mirekphd We have a new change to ensure the consistency of bagging when using multi-threading.

I see; I will perform more tests then, moving my replication problems for non-default hyperparams to a separate GitHub issue. I will try to attribute the changes in accuracy to changes in the code.

Yeah, it is better to create one issue per problem.

is the hyperparameters unchangeable during searching in cv?

Good question. It is indeed a path-dependent issue, but it requires a conjunction of as many as three conditions:

  • at least two models (CV or single) to be trained in a sequence (in a Notebook or a .py script) (kudos to @Merudo) AND
  • a synonymous parameter name passed together with the primary one AND
  • a change in the primary parameter's name to default value (a change measured from previous model run to the current one)

Then the default parameter value passed via primary name takes precedence during model training over the custom value passed via the synonym. The model object preserves both, so a reader (extracting model object from object storage in systems like mlflow etc) naturally assumes that the custom value was used for training (which is not true here). I tried various scenarios (paths) and synonyms, and all required these three conditions to get an incorrect metric (one based on default despite custom value passed to a synonym). I can imagine it extends to other hyperparameters too. Reloading dataset before each model is trained does not help. I hope someone can reproduce this using independent tests.

For example:

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data': 20,
 'min_data_in_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.11551373151193695
In repeat 0 we expect the user has set CUSTOM value to min_data and/or min_data_in_leaf
.. because model metric (0.11551) is consistent with using custom value(s)
.. but what she actually used for min_data was 20
.. and what she actually used for min_data_in_leaf was 1
.. min_data changed from 20 to 20
.. min_data_in_leaf changed from 1 to 1
.. as expected

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_child_samples': 1,
 'min_data_in_leaf': 20,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.09750119910994513
In repeat 0 we expect the user has set DEFAULT value to both min_child_samples and min_data_in_leaf..
.. because model metric (0.09750) is consistent with using default values
.. but what she actually used for min_child_samples was 1
.. and what she actually used for min_data_in_leaf was 20
.. min_child_samples changed from 20 to 1
.. min_data_in_leaf changed from 1 to 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-f6519d947615> in <module>
    106                         print(".. %s changed from %d to %d" % (DEFAULT_NAME, prev_test_val_for_def_name, test_val_for_def_name))
    107                         # assert(test_val == DEFAULT_VALUE) # OK (never fails)
--> 108                         assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
    109                         print(".. as expected\n")
    110 

AssertionError: 
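The alias-precedence rule implied by the output above can be sketched in plain Python; `resolve_alias` below is an illustrative stand-in, not LightGBM's actual resolution code:

```python
# Illustrative sketch (NOT LightGBM's real code) of the rule the output above
# implies: when both the primary parameter name and a synonym are passed,
# the value under the primary name wins and the synonym's value is dropped.
def resolve_alias(params, primary, aliases):
    """Collapse synonyms onto the primary parameter name."""
    params = dict(params)
    for alias in aliases:
        if alias in params:
            value = params.pop(alias)
            if primary in params:
                # Primary name already present: the alias value is ignored,
                # mirroring the "... will be ignored" warning seen above.
                print(f"Warning: {primary} is set={params[primary]}, "
                      f"{alias}={value} will be ignored.")
            else:
                params[primary] = value
    return params

# The dangerous path: the custom value (1) arrives via the synonym while the
# primary name carries the default (20), so the custom value is discarded.
resolved = resolve_alias(
    {"min_data_in_leaf": 20, "min_child_samples": 1},
    primary="min_data_in_leaf",
    aliases=["min_child_samples"],
)
print(resolved)  # {'min_data_in_leaf': 20}
```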

And if you need proof that LightGBM somehow preserves hyperparameters from previous runs: set feature_pre_filter to True, then in the next iteration try to reduce min_data_in_leaf, and you should get this error from the model training function (see also the related issue: https://github.com/optuna/optuna/issues/1718):

.. min_child_samples changed from 20 to 20
.. min_data_in_leaf (default name) changed from 20 to 1
{'feature_pre_filter': True,
 'metric': 'multi_logloss',
 'min_child_samples': 20,
 'min_data_in_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<ipython-input-2-5734e0346237> in <module>
     94 
     95                     cvm = None
---> 96                     cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)
     97 
     98                     logloss_mean = 0

/opt/conda/lib/python3.8/site-packages/lightgbm/engine.py in cv(params, train_set, num_boost_round, folds, nfold, stratified, shuffle, metrics, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, fpreproc, verbose_eval, show_stdv, seed, callbacks, eval_train_metric, return_cvbooster)
    552         params['metric'] = metrics
    553 
--> 554     train_set._update_params(params) \
    555              ._set_predictor(predictor) \
    556              .set_feature_name(feature_name) \

/opt/conda/lib/python3.8/site-packages/lightgbm/basic.py in _update_params(self, params)
   1434                     self._free_handle()
   1435                 else:
-> 1436                     raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
   1437         return self
   1438 

LightGBMError: Reducing `min_data_in_leaf` with `feature_pre_filter=true` may cause unexpected behaviour for features that were pre-filtered by the larger `min_data_in_leaf`.
You need to set `feature_pre_filter=false` to dynamically change the `min_data_in_leaf`.

set feature_pre_filter to True,

feature_pre_filter is a Dataset parameter. Indeed, those parameters are stored in the Dataset object and cannot be changed in the train/cv functions. You should create a new Dataset to change them.

@mirekphd I think the purpose of parameter aliases is convenience, not to let the user pass a different alias every time in sequential training.
If this indeed is a problem, we can throw an error directly when the user tries this, or disable aliasing entirely.

Dataset parameter. Indeed, those parameters are stored in dataset object and cannot be changed in train/cv functions.

So this is why both @Merudo and https://github.com/optuna/optuna/pull/1774 have set this new hyperparameter feature_pre_filter to False. But leaving it at its breaking default of True and using the workaround you suggested ("create new dataset to change them") does not help solve the original issue: the default value is used for training, despite the user's wishes, whenever the primary (non-alias) name min_data_in_leaf changes from any non-default value to its default value of 20.

_As a note to self: thanks to the above comment I realized that one of the Dataset parameters, max_bin, was actually used (with non-default values) in our hyperparameter searchers (including those that produced the regression model whose metrics could not be replicated in v3.0.0 on non-default parameters, as I reported here). I switched it off (set it to the default) and re-ran the pipeline, but it did not help (metrics did not improve towards those from v2.3.1; they actually got worse :)._

not for the user to use a different alias every time in sequential training.
If this indeed is a problem, we can throw the error directly when the user tries this behavior, or disable the alias function totally.

I suspect the special conditions above, which may seem rare and bordering on user error, may occur in unexpected places as a side effect of interactions with other code at higher levels of abstraction, such as Ray Tune invoking Optuna invoking LightGBM, or multi-stage custom pipelines where a different set of hyperparameters is used at each stage (e.g. to quickly select features on all columns using defaults, and then to fit a parsimonious but heavily optimized deployment model based on the selected features). In such cases the current LightGBM warnings invariably get suppressed (e.g. by verbose=-1).

Throwing an error on seeing a "duplicate" in the dictionary passed to params (duplicate = any alias plus the primary name, or a pair of aliases) would help us pinpoint such situations in complex pipelines. We could then stop and repair inadvertent hyperparameter changes between models (maybe even ones occurring inside the cv functions under some special conditions?) and parameter duplications (especially conflicting ones). I'd argue it would be best to have this as a non-default option of train/cv, e.g. strict_alias_check=False?
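A strict check of this kind could be sketched as follows. The alias table here is a tiny illustrative subset of the full one in the LightGBM docs, and `strict_alias_check` is the hypothetical option proposed above, not an existing LightGBM flag:

```python
# Hypothetical strict duplicate check (strict_alias_check is the proposed
# option, not a real LightGBM parameter; the alias table is a small subset
# of https://lightgbm.readthedocs.io/en/latest/Parameters.html).
ALIASES = {
    "min_data_in_leaf": {"min_data_in_leaf", "min_data", "min_child_samples"},
    "bagging_fraction": {"bagging_fraction", "sub_row", "subsample"},
}

def check_duplicate_params(params, strict_alias_check=False):
    """Raise if two names for the same parameter appear in `params`."""
    if not strict_alias_check:
        return
    for primary, names in ALIASES.items():
        present = sorted(names & params.keys())
        if len(present) > 1:
            raise ValueError(f"Conflicting names for {primary!r}: {present}")

check_duplicate_params({"min_data_in_leaf": 200}, strict_alias_check=True)  # OK
# check_duplicate_params({"min_data": 20, "min_child_samples": 1},
#                        strict_alias_check=True)  # would raise ValueError
```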

Yeah, it is better to create one issue for one problem.

That other issue of replicating the SOTA metrics known from v2.3.1 has become less important, because v3.0.0 has shown a 2-fold training-speed improvement in hyperparameter optimization of parsimonious production models, which allowed us to quickly improve on the previous SOTA metrics. So the conclusion is that one may need to re-optimize the hyperparameters of existing models, given all the changes made in v3.0.0, to regain previous accuracy levels, but that it will take only half the time it did previously, provided the models are sufficiently parsimonious. Do you think it merits opening a separate issue, or is there already one where these two observations would fit nicely?

@mirekphd I am afraid the new bagging solution caused the performance drop.
If, with multiple seeds and the same bagging fraction/frequency, v3.0 is still worse, I think we can investigate it.

Regarding bagging_fraction, I find that it has no effect in the R implementation of LightGBM v3.

library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)

set.seed(123)

nobag_model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 100L
  , valids = valids
  , learning_rate = 1.0
  , early_stopping_rounds = 10L
  , bagging_fraction = 1 # no bagging
)

bag_model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 100L
  , valids = valids
  , learning_rate = 1
  , early_stopping_rounds = 10L
  , bagging_fraction = 0.1 #bagging
)

nobag_model$best_iter
nobag_model$best_score

bag_model$best_iter
bag_model$best_score

all.equal(nobag_model, bag_model)

@mosscoder Please don't forget to set bagging_freq.

Note: to enable bagging, bagging_freq should be set to a non zero value as well
https://lightgbm.readthedocs.io/en/latest/Parameters.html#bagging_fraction
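The condition can be captured in a small helper; `bagging_enabled` is purely illustrative (not part of the LightGBM API), with the documented defaults of bagging_fraction=1.0 and bagging_freq=0:

```python
# Illustrative check of the rule above: row bagging takes effect only when
# BOTH bagging_fraction < 1.0 AND bagging_freq > 0 are set.
def bagging_enabled(params):
    return (params.get("bagging_fraction", 1.0) < 1.0
            and params.get("bagging_freq", 0) > 0)

print(bagging_enabled({"bagging_fraction": 0.1}))                     # False
print(bagging_enabled({"bagging_fraction": 0.1, "bagging_freq": 1}))  # True
```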

Thanks for informing me of this. Perhaps defaulting bagging_freq to 1 whenever bagging_fraction < 1 would be more intuitive?

