Lightgbm: Categorical features not accepted

Created on 27 Oct 2017 · 7 comments · Source: microsoft/LightGBM

## Environment info
Operating System: Win 10
CPU: i4
C++/Python/R version:
Python 3

Error Message:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields

Reproducible example

On a dataset with object and float64 dtypes, I use the following code:

```python
import numpy as np
import lightgbm as lgbm

lgb_train = lgbm.Dataset(X_train, y_train)
lgb_test = lgbm.Dataset(X_test, y_test, reference=lgb_train)

params = {'objective': 'multiclass'}
params['metric'] = 'auc'
params['num_class'] = 4

# indices of the non-float (categorical) columns
categorical_features = np.where(X_train.dtypes != float)[0]

# train
gbm = lgbm.train(params,
                 lgb_train,
                 num_boost_round=100,
                 valid_sets=[lgb_train, lgb_test],
                 categorical_feature=categorical_features)
```

All 7 comments

Object dtypes are not supported yet; you need to convert them to int. We might support them in a later version.

The object columns are my categorical data, so should I change their dtype to pandas "category", or should I LabelEncode them?

@laurazh you can do either; LabelEncoder is more reliable.
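For reference, a minimal sketch of the two options mentioned above, using a toy DataFrame (the column names here are hypothetical, not from the original issue):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'yellow', 'blue', 'red'],
                   'value': [1.0, 2.5, 3.3, 0.7]})

# Option 1: pandas "category" dtype.
df_cat = df.copy()
df_cat['color'] = df_cat['color'].astype('category')

# Option 2: LabelEncoder, which maps each distinct category to an int.
le = LabelEncoder()
df_enc = df.copy()
df_enc['color'] = le.fit_transform(df_enc['color'])
# LabelEncoder assigns codes in sorted order of the class labels,
# so 'blue' -> 0, 'red' -> 1, 'yellow' -> 2 here.
```

Either way, the object dtype is gone afterwards, so the `ValueError` above no longer triggers.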

Thanks, it works!

But label encoding can misrepresent categorical variables. If the three categories 'red', 'yellow' and 'blue' are label-encoded to 1, 2 and 3 (int64) respectively, this can create a spurious ordering in which blue is "greater than" red. How can we deal with that?

From what I understood, tree-based algorithms are quite robust, and it is fine to give them label-encoded categorical data; the algorithm will deal with it. I don't know the math behind it, though.
