Which parameters and which ranges of values would you consider most useful for hyperparameter optimization of LightGBM during a Bayesian optimization process for a highly imbalanced classification problem?
parameters denotes the search grid, and static_parameters denotes parameters which are applied statically during the search but not optimized.
parameters = [
dict(name="max_bin", type="int", bounds=dict(min=20, max=20000)),
dict(name="learning_rate", type="double", bounds=dict(min=0.001, max=0.3)),
dict(name="num_leaves", type="int", bounds=dict(min=100, max=4095)),
# dict(name="num_leaves", type="int", bounds=dict(min=100, max=45000)),
dict(name="scale_pos_weight", type="double", bounds=dict(min=0.01, max=2000.0)),
dict(name="n_estimators", type="int", bounds=dict(min=10, max=10000)),
dict(name="min_child_weight", type="int", bounds=dict(min=1, max=2000)),
dict(name="subsample", type="double", bounds=dict(min=0.4, max=1)),
dict(name="bagging_fraction", type="double", bounds=dict(min=0.3, max=1)),
dict(name="max_depth", type="int", bounds=dict(min=2, max=50)),
]
static_parameters = {'boosting_type': 'dart', 'reg_alpha': 0, 'reg_lambda': 2, 'is_unbalance': True,
'min_split_gain': 0, 'min_child_samples': 10, 'colsample_bytree': 0.8, 'subsample_freq': 3,
'subsample_for_bin': 50000,
'histogram_pool_size': detect_available_memory_for_histogram_cache()}
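For context, a search-space list like the one above is consumed by an optimizer's "suggest" step. The sketch below is hypothetical and stdlib-only: plain random sampling stands in for the Bayesian suggest step, and the grid is an abbreviated copy of the one above, just to show how each entry maps to one candidate configuration that gets merged with the static parameters.

```python
import random

# Abbreviated copy of the search grid above (illustration only).
parameters = [
    dict(name="learning_rate", type="double", bounds=dict(min=0.001, max=0.3)),
    dict(name="num_leaves", type="int", bounds=dict(min=100, max=4095)),
    dict(name="scale_pos_weight", type="double", bounds=dict(min=0.01, max=2000.0)),
]

def sample_config(parameters, rng=random):
    # Draw one candidate configuration from the search grid. A real
    # Bayesian optimizer would replace the random draws with a
    # model-based suggestion.
    config = {}
    for p in parameters:
        lo, hi = p["bounds"]["min"], p["bounds"]["max"]
        if p["type"] == "int":
            config[p["name"]] = rng.randint(lo, hi)
        else:
            config[p["name"]] = rng.uniform(lo, hi)
    return config

# Merge with the statically applied parameters before training.
static_parameters = {"boosting_type": "dart", "is_unbalance": True}
config = sample_config(parameters)
train_params = {**static_parameters, **config}
```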
For heavily unbalanced datasets such as 1:10000:
Never tune these parameters unless you have an explicit requirement to tune them:
@Laurae2 thanks. When searching the latest Python documentation for the sklearn API, some of the parameters no longer seem to be present. This applies to is_unbalance and scale_pos_weight. Are these still accessible, i.e. via kwargs?
Also, I did not find anything in the documentation saying that only one of the two should be applied (XOR); i.e., in the past I had the best results with:
is_unbalance: True, scale_pos_weight: 0.1. Note, in Parameters.md they are still present - just not in the sklearn API.
USE A CUSTOM METRIC (to reflect reality without weighting, otherwise you have weights inside your metric with premade metrics like xgboost)
Could you please explain this one a bit more? I thought that calculating something like the F-beta score based on the classification results should be sufficient here.
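On the custom-metric point, an unweighted F-beta can itself be wired up as the custom metric. A minimal sketch, assuming the `(preds, eval_data) -> (name, value, higher_is_better)` return convention that LightGBM uses for custom eval functions; the 0.5 decision threshold and the beta value are my assumptions, not from the thread.

```python
def fbeta_eval(beta=1.0, threshold=0.5):
    # Build a LightGBM-style custom eval function computing an
    # unweighted F-beta score from raw predicted probabilities.
    def _eval(preds, eval_data):
        y_true = eval_data.get_label()
        tp = fp = fn = 0
        for p, y in zip(preds, y_true):
            pred = 1 if p >= threshold else 0
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 1:
                fn += 1
        b2 = beta * beta
        denom = (1 + b2) * tp + b2 * fn + fp
        fbeta = (1 + b2) * tp / denom if denom else 0.0
        # (metric name, value, higher_is_better)
        return f"fbeta@{beta}", fbeta, True
    return _eval
```

Such a function would typically be passed via the `feval` argument at train time; since it ignores sample weights entirely, it "reflects reality without weighting" in the sense discussed above.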
Also, the hyperparameter tuning guide suggests looking at min_data_in_leaf. What about the other, similar parameters: min_child_weight, min_sum_hessian_in_leaf?
@geoHeil never mind the last part (about the metric); it seems it was fixed in LightGBM (the bug is still present in xgboost). The weights are not applied for the metric computation, which is great to see.
If I remember correctly, scale_pos_weight and is_unbalance interact with each other. They should still be accessible.
See here: https://github.com/Microsoft/LightGBM/blob/master/src/objective/binary_objective.hpp#L73-L82
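A rough Python paraphrase of the label-weight logic in the linked binary_objective.hpp may make the interaction concrete. This is illustrative only, based on my reading of that revision: is_unbalance derives a positive-class weight from the class counts, and scale_pos_weight then multiplies on top of it (which is why setting both mixes their effects).

```python
def binary_label_weights(cnt_positive, cnt_negative,
                         is_unbalance, scale_pos_weight):
    # Paraphrase of LightGBM's binary objective weighting (illustrative,
    # not authoritative): is_unbalance up-weights the minority class by
    # the count ratio, then scale_pos_weight scales the positive class.
    weight_pos, weight_neg = 1.0, 1.0
    if is_unbalance and cnt_positive > 0 and cnt_negative > 0:
        if cnt_positive > cnt_negative:
            weight_neg = cnt_positive / cnt_negative
        else:
            weight_pos = cnt_negative / cnt_positive
    weight_pos *= scale_pos_weight
    return weight_pos, weight_neg
```

Under this reading, is_unbalance: True with scale_pos_weight: 0.1 (as mentioned above) would dampen the automatic count-ratio weight rather than replace it.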
@wxchan question below:
@geoHeil: When searching the latest Python documentation for the sklearn API, some of the parameters no longer seem to be present. This applies to is_unbalance and scale_pos_weight. Are these still accessible, i.e. via kwargs?
@Laurae2 indeed, in the C++ code they are accessible - but https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py does not contain any references to these parameters.
@geoHeil You can check here for the whole list of parameters: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.md
@Laurae2 indeed - as written above, the parameters are documented there and are used in the C++ code. I just find it strange that the Python code, in particular the scikit-learn API, does not use any of these parameters.
May I clarify one more parameter: the hyperparameter tuning guide suggests looking at min_data_in_leaf. What about the other, similar parameters: min_child_weight, min_sum_hessian_in_leaf?
@geoHeil I would just use min_child_weight unless it is really needed to expand the optimization dimensions (or if it is required to squeeze out a few extra digits on the performance metric).
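On the "think about the hessian possible value range" point from the list above: for binary logloss, the per-sample hessian with respect to the raw score is p(1-p), which is bounded by 0.25. A small sketch of that arithmetic (my illustration, not from the thread):

```python
import math

def logloss_hessian(p):
    # Hessian of binary logloss w.r.t. the raw score for one sample with
    # predicted probability p; maximized at p = 0.5 (value 0.25).
    return p * (1.0 - p)

def min_samples_for_hessian_threshold(min_child_weight):
    # Each sample contributes at most 0.25 to the leaf's hessian sum, so
    # a leaf needs at least this many samples to clear the threshold
    # even in the best case.
    return math.ceil(min_child_weight / 0.25)
```

This is why a min_child_weight bound like "sample size / 1000" is dataset- and loss-dependent: confident predictions (p near 0 or 1) contribute almost nothing to the hessian sum.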
They are still accessible via kwargs. There is a test case here https://github.com/Microsoft/LightGBM/blob/master/tests/python_package_test/test_sklearn.py#L101. drop_rate is set via kwargs.
I remove some parameters only shown in some of boosting_types or some of tasks.
If neg/pos is about 45-65, should we set scale_pos_weight to 1 or to neg/pos?
When I use BayesianOptimization (Bayesian tuning) to tune xgboost, if I set scale_pos_weight to neg/pos the KS on the validation dataset is about 18, but if I set scale_pos_weight to 1 the KS on the validation dataset goes up to 38... I don't know why. Can somebody help explain?
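The "KS" here presumably refers to the Kolmogorov-Smirnov statistic between the score distributions of the two classes, as commonly reported in credit scoring (often multiplied by 100, matching the 18 vs 38 figures). A minimal stdlib sketch of that statistic, as an assumption about what is being measured:

```python
def ks_statistic(scores_pos, scores_neg):
    # Maximum gap between the empirical CDFs of positive-class and
    # negative-class scores, evaluated at every observed score.
    thresholds = sorted(set(scores_pos) | set(scores_neg))
    best = 0.0
    for t in thresholds:
        cdf_pos = sum(s <= t for s in scores_pos) / len(scores_pos)
        cdf_neg = sum(s <= t for s in scores_neg) / len(scores_neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

Note that KS depends only on the ranking of scores, so a difference this large between the two settings means scale_pos_weight changed the trained model's ranking, not merely its calibration.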
Thank you @Laurae2 for providing the guidelines for training models on imbalanced datasets.
I am dealing with a multi-class problem of this kind. As a starting point, I am following the values you provided.
Do these also extend to the multi-class problem?
For heavily unbalanced datasets such as 1:10000:
* max_bin: keep it only for memory pressure, not to tune (otherwise overfitting)
* learning rate: keep it only for training speed, not to tune (otherwise overfitting)
* n_estimators: must be infinite (like 9999999) and use early stopping to auto-tune (otherwise overfitting)
* num_leaves: [7, 4095]
* max_depth: [2, 63] and infinite (I personally saw metric performance increases with such 63 depth with small number of leaves on sparse unbalanced datasets)
* scale_pos_weight: [1, 10000] (if over 10000, something might be wrong because I never saw it that good after 5000)
* min_child_weight: [0.01, (sample size / 1000)] if you are using logloss (think about the hessian possible value range before putting "sample size / 1000", it is dataset-dependent and loss-dependent)
* subsample: [0.4, 1]
* bagging_fraction: only 1, keep as is (otherwise overfitting)
* colsample_bytree: [0.4, 1]
* is_unbalance: false (make your own weighting with scale_pos_weight)
* **USE A CUSTOM METRIC** (to reflect reality without weighting, otherwise you have weights inside your metric with premade metrics like xgboost)

Never tune these parameters unless you have an explicit requirement to tune them:

* Learning rate (lower means longer to train but more accurate, higher means faster to train but less accurate)
* Number of boosting iterations (automatically tuned with early stopping and learning rate)
* Maximum number of bins (RAM dependent)
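The "n_estimators must be infinite + early stopping" advice above boils down to a patience rule: keep boosting while the validation metric improves, stop after it stagnates for a fixed number of rounds. A hypothetical stdlib sketch of just the stopping logic (actual training would use LightGBM's built-in early stopping support, e.g. early_stopping_rounds):

```python
def early_stopping_round(val_scores, patience=100, higher_is_better=True):
    # Given per-round validation scores, return the 1-based round at
    # which training would stop: either when `patience` rounds pass
    # without improvement, or at the end of the sequence.
    best, best_round = None, 0
    for i, s in enumerate(val_scores, start=1):
        improved = best is None or (s > best if higher_is_better else s < best)
        if improved:
            best, best_round = s, i
        elif i - best_round >= patience:
            return i
    return len(val_scores)
```

With n_estimators set effectively to infinity, this rule is what actually decides the number of boosting iterations, which is why the list says not to tune n_estimators directly.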
@salilmishra23 yes, but avoid the following hyperparameters for multiclass because they are not supposed to be relevant for multiclass:
And use the trick found here (boost_from_average = False):
Thanks, @Laurae2. These are the best guidelines I have found on this topic. Could you clarify whether this is a typo:
* bagging_fraction: only 1, keep as is (otherwise overfitting)
It contradicts what was said earlier:
* subsample: [0.4, 1]
@panshi-wang it should be bagging_freq actually. Fixed the typo.