LightGBM: Categorical features not accepted

Created on 27 Oct 2017 · 7 Comments · Source: microsoft/LightGBM

## Environment info
Operating System: Win 10
CPU: i4
C++/Python/R version:
Python 3

Error Message:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields

Reproducible examples

On a dataset with object and float64 columns, I use the following code:

import numpy as np
import lightgbm as lgbm

lgb_train = lgbm.Dataset(X_train, y_train)
lgb_test = lgbm.Dataset(X_test, y_test, reference=lgb_train)

params = {'objective':'multiclass'}
params['metric'] = 'auc'
params['num_class'] = 4

# indices of every column that is not float64 (np.float was removed in NumPy 1.24)
categorical_features = np.where(X_train.dtypes != np.float64)[0]

# train
gbm = lgbm.train(params,
                 lgb_train,
                 num_boost_round=100,
                 valid_sets=[lgb_train, lgb_test],
                 categorical_feature=categorical_features)

Source

laurazh

Most helpful comment

@laurazh You could do both things. Here is the working example:
https://github.com/Microsoft/LightGBM/blob/d94ec89b68b46a1af484773fbd771b71f312a754/tests/python_package_test/test_engine.py#L481-L500

StrikerRUS on 30 Oct 2017

👍3

All 7 comments

The object dtype is not supported yet; you need to convert those columns to int. We might support it in a later version.

wxchan on 27 Oct 2017
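
As a rough illustration of "convert to int", the sketch below turns every object column of a small made-up frame into integer codes using pandas only. The frame and its column names are hypothetical stand-ins for `X_train`; nothing here comes from the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for X_train: one object column and one float64 column.
X_train = pd.DataFrame({
    "color": pd.Series(["red", "yellow", "blue", "red"], dtype="object"),
    "size": [1.5, 2.0, 0.5, 3.0],
})

# Find the object-dtype columns, i.e. the ones the ValueError complains about.
object_cols = [col for col in X_train.columns if X_train[col].dtype == object]

# Replace each object column with integer codes. Categories are sorted first,
# so here "blue" -> 0, "red" -> 1, "yellow" -> 2.
X_int = X_train.copy()
for col in object_cols:
    X_int[col] = X_int[col].astype("category").cat.codes.astype("int64")

print(X_int.dtypes)
```

After this step the frame holds only int and float columns, which is what the error message asks for.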

The object columns are my categorical data, so should I convert them to the pandas "category" dtype, or should I LabelEncode them?

laurazh on 30 Oct 2017

@laurazh You can do either; LabelEncode is more reliable.

wxchan on 30 Oct 2017
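
Both options can be sketched with pandas alone. One practical point, which is an assumption rather than something stated in the thread, is that the train and test frames should share a single category-to-code mapping. The example fixes the categories from the training data and reuses them for the test data; the frames and column names are made up.

```python
import pandas as pd

# Hypothetical train/test frames; the test set happens to contain no "blue".
X_train = pd.DataFrame({"color": pd.Series(["red", "yellow", "blue"], dtype="object")})
X_test = pd.DataFrame({"color": pd.Series(["yellow", "red"], dtype="object")})

# Option 1: pandas "category" dtype, with the categories fixed from the training data.
color_dtype = pd.CategoricalDtype(categories=sorted(X_train["color"].unique()))
train_cat = X_train["color"].astype(color_dtype)
test_cat = X_test["color"].astype(color_dtype)

# Option 2: label encoding, taken here from the same categoricals via .cat.codes.
train_codes = train_cat.cat.codes.tolist()
test_codes = test_cat.cat.codes.tolist()

# For contrast: converting the test frame on its own yields a different mapping.
test_codes_independent = X_test["color"].astype("category").cat.codes.tolist()

print(train_codes, test_codes, test_codes_independent)
```

With the shared dtype, "yellow" maps to 2 in both frames; converted independently, the test frame maps it to 1. LightGBM's Python package is documented to treat pandas `category` columns as categorical features by default, so a frame prepared as in option 1 could in principle be passed to `lgbm.Dataset` directly.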

Thanks, it works!

laurazh on 30 Oct 2017

But label encoding can misrepresent categorical variables. Suppose you have 'red', 'yellow' and 'blue' as three categories, label encoded to 1, 2 and 3 (int64) respectively. This can imply an ordering in which blue is the highest and red the lowest. How can we deal with that?

spaceVStab on 21 Dec 2017

👍1

From what I understand, tree-based algorithms are really robust, and it is OK to give them categorical data that has been label encoded. The algorithm will deal with it. I don't know the math behind it, though.

laurazh on 21 Dec 2017
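
The ordering concern can be made concrete with a toy enumeration in plain Python. It is a simplified sketch, not LightGBM's actual split-finding code: it lists which groupings of the three label-encoded colors a single numeric threshold can produce, versus the groupings available when the feature is treated as categorical (for example by passing it through `categorical_feature`, as in the original snippet). The encoding is the one from the question above.

```python
from itertools import combinations

# Label encoding from the question above.
encoding = {"red": 1, "yellow": 2, "blue": 3}
colors = frozenset(encoding)

def partition(left):
    """Represent a split as an unordered pair {left group, right group}."""
    left = frozenset(left)
    return frozenset({left, colors - left})

# Splits of the form `code <= t`: only prefixes of the code ordering can go left.
ordered = sorted(encoding, key=encoding.get)
numeric_splits = {partition(ordered[:i]) for i in range(1, len(ordered))}

# Categorical splits: any non-empty proper subset of the colors can go left.
categorical_splits = {
    partition(subset)
    for r in range(1, len(ordered))
    for subset in combinations(ordered, r)
}

# Partitions that one threshold on the integer codes cannot express.
missing = categorical_splits - numeric_splits
for split in missing:
    print(sorted(sorted(group) for group in split))  # [['blue', 'red'], ['yellow']]
```

A single threshold on the codes cannot separate "yellow" from the other two colors, but two successive splits can (`code <= 1`, then `code <= 2`). That is one reason label-encoded inputs often still work with trees, as the reply above suggests, at the cost of extra depth.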
