LightGBM: categorical_feature does not seem to work?

Created on 7 Sep 2017 · 19 Comments · Source: microsoft/LightGBM

My code:

```python
cate_cols = [0, 2, 4, 6, 7, 8, 9, 10]

d_cv = lightgbm.Dataset(X.values, label=Y.values, max_bin=127, categorical_feature=cate_cols)

self._model = lightgbm.train(self._params, d_cv, self._iter)

print(self._params)
print(d_cv.params)
```

And the output of this snippet:

{'objective': 'regression_l1', 'lambda_l1': 0.01, 'verbose': 1, 'min_hessian': 0.01, 'learning_rate': 0.01, 'max_bin': 127, 'boosting_type': 'gbdt', 'bagging_freq': 12, 'bagging_fraction': 0.85, 'sub_feature': 0.8, 'num_leaves': 16, 'min_data': 100}

{'objective': 'regression_l1', 'lambda_l1': 0.01, 'verbose': 1, 'min_hessian': 0.01, 'learning_rate': 0.01, 'max_bin': 127, 'boosting_type': 'gbdt', 'bagging_freq': 12, 'bagging_fraction': 0.85, 'sub_feature': 0.8, 'num_leaves': 16, 'min_data': 100}

It seems that my setting for the categorical_feature parameter did not take effect, whereas the max_bin setting clearly did.

The expected output, in my view, should be something like:

{'objective': 'regression_l1', 'lambda_l1': 0.01, 'verbose': 1, 'min_hessian': 0.01, 'learning_rate': 0.01, 'max_bin': 127, 'boosting_type': 'gbdt', 'bagging_freq': 12, 'bagging_fraction': 0.85, 'sub_feature': 0.8, 'num_leaves': 16, 'min_data': 100, 'categorical_feature': [0,2,4,6,7,8,9,10]}

{'objective': 'regression_l1', 'lambda_l1': 0.01, 'verbose': 1, 'min_hessian': 0.01, 'learning_rate': 0.01, 'max_bin': 127, 'boosting_type': 'gbdt', 'bagging_freq': 12, 'bagging_fraction': 0.85, 'sub_feature': 0.8, 'num_leaves': 16, 'min_data': 100, 'categorical_feature': [0,2,4,6,7,8,9,10]}

Am I right?

I also compared the performance metrics between _categorical_feature=cate_cols_ and _categorical_feature=[]_. The difference was negligible; the setting changed nothing.

Thanks-)
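
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the setup above. The data is synthetic and the numbers are illustrative, but the names (X, Y, cate_cols) mirror the original post:

```python
# Minimal sketch of the setup above, with synthetic data. Illustrates that
# categorical_feature is consumed by the Dataset constructor and is not
# echoed back through Dataset.params, which is what surprised the poster.
import numpy as np
import lightgbm

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 11)).astype(float)  # stand-in feature matrix
Y = rng.normal(size=500)                              # stand-in regression target

cate_cols = [0, 2, 4, 6, 7, 8, 9, 10]
params = {'objective': 'regression_l1', 'max_bin': 127, 'verbose': 1}

d_cv = lightgbm.Dataset(X, label=Y, categorical_feature=cate_cols)
model = lightgbm.train(params, d_cv, num_boost_round=10)

print(d_cv.params)  # no 'categorical_feature' key, even though the setting is in effect
```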

All 19 comments

@wxchan

categorical_feature seems to be passed to C++ successfully, but it is not saved to Dataset.params.
@guolinke is there any quick way to get model params now?

@wxchan sorry, I didn't get it

@PayneJoe You might already know this: lgb.train(categorical_feature=cate_cols) might work.
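
For readers following along, the suggestion amounts to something like this (a sketch; d_cv, params and cate_cols as in the original post; this keyword was later deprecated in favour of setting it on the Dataset):

```python
# Sketch of the suggestion: pass categorical_feature to train() directly.
# d_cv, params and cate_cols are assumed to be defined as in the original
# post. (The keyword was later deprecated; prefer setting it on the Dataset.)
model = lightgbm.train(params, d_cv, num_boost_round=100,
                       categorical_feature=cate_cols)
```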

Much appreciated for your tip, @marugari.

I did not notice that the categorical_feature parameter is also available in the train method. I'll give it a thorough try.

Do the performances differ between using categorical_feature on the Dataset and using categorical_feature in the train method?

@wxchan How did this happen? And how can we use this parameter? Thanks-)

@PayneJoe Eh... good question. I think they should perform the same.

I think lgb.Dataset(categorical_feature=) didn't work as of last December.
https://github.com/marugari/Notebooks/blob/master/LightGBM.ipynb

If the parameter is passed successfully, the split thresholds become integers.
(The following files are large!)
https://raw.githubusercontent.com/marugari/Notebooks/master/output/lgb_cat.txt
https://raw.githubusercontent.com/marugari/Notebooks/master/output/lgb.txt
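
A programmatic version of this check (a sketch; `model` is any trained Booster, and the field names are those produced by LightGBM's dump_model):

```python
# Sketch: inspect a trained Booster to see whether any split treats a feature
# as categorical. In LightGBM's JSON dump, categorical splits use the '=='
# decision type, while numerical splits use '<='.
def categorical_split_features(booster):
    used = set()

    def walk(node):
        if 'split_feature' in node:  # internal node (leaves lack this key)
            if node.get('decision_type') == '==':
                used.add(node['split_feature'])
            walk(node['left_child'])
            walk(node['right_child'])

    for tree in booster.dump_model()['tree_info']:
        walk(tree['tree_structure'])
    return used

print(categorical_split_features(model))  # e.g. {0, 2, 4} if those columns were split categorically
```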

@marugari It's a known bug. I just tested, it should be fixed.

@wxchan any updates on this?

Sorry, I was very busy last month. I can confirm there is a bug with categorical_feature: both ways (setting it in the Dataset and setting it in train()) seem not to work. I will try to fix it this week.

@wxchan
is this fixed?

Seems to be working for me in 2.0.10 (at least on Windows with R).

I think the current Python version is working too; the reason is the same as I stated here: https://github.com/Microsoft/LightGBM/issues/893#issuecomment-328267584. We could give users a better way to confirm that a categorical feature is set but perhaps not used for any split.

I find the categorical_feature parameter useless; my code looks like this:

```python
train_set = lgb.Dataset(X_train_temp, y_train_temp, params=None, categorical_feature=cat_features)
valid_sets = lgb.Dataset(X_val, y_val, reference=train_set, params=None, categorical_feature=cat_features)

model = lgb.train(params, train_set, categorical_feature=cat_features)
```

And the program result shows no difference when I add the categorical_feature parameter; the categorical columns contain numeric data.
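
As an aside, one alternative worth trying for integer-coded columns is to mark them with the pandas 'category' dtype and let categorical_feature='auto' pick them up. A sketch, assuming X_train_temp is a pandas DataFrame and cat_features holds column names:

```python
# Sketch of an alternative: use the pandas 'category' dtype so LightGBM
# detects the categorical columns itself. Assumes X_train_temp is a pandas
# DataFrame and cat_features is a list of its column names.
for col in cat_features:
    X_train_temp[col] = X_train_temp[col].astype('category')

train_set = lgb.Dataset(X_train_temp, y_train_temp,
                        categorical_feature='auto')  # 'auto' reads the dtypes
```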

@wxchan hello! Has the bug been fixed? I keep getting the warnings

```
UserWarning: Using categorical_feature in Dataset.
UserWarning: categorical_feature in param dict is overridden.
```

and I'm worried that the model is not seeing my categorical variables at all.
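
Those warnings are informational rather than fatal: they report which of the duplicate specifications wins. A sketch of one way to keep the output quiet, by setting categorical_feature in exactly one place (here, only on the Dataset) and dropping it from the params dict; X, Y and cate_cols stand in for your own data:

```python
# Sketch: specify categorical_feature in exactly one place. Dropping the key
# from the params dict avoids the 'param dict is overridden' warning; the
# Dataset-level setting remains in effect. Assumes `import lightgbm as lgb`.
params = {'objective': 'regression_l1'}  # no 'categorical_feature' key here
d_train = lgb.Dataset(X, label=Y, categorical_feature=cate_cols)
model = lgb.train(params, d_train, num_boost_round=100)
```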

Any updates on this?

@arrayslayer I think this was fixed long ago.
