## Environment info
Operating System: Win 10
CPU: i4
C++/Python/R version:
Python 3
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields
On a dataset with object, float64 types, I use the following code:
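```python
import numpy as np
import lightgbm as lgbm

# X_train / X_test contain both object and float64 columns
lgb_train = lgbm.Dataset(X_train, y_train)
lgb_test = lgbm.Dataset(X_test, y_test, reference=lgb_train)

params = {'objective': 'multiclass'}
params['metric'] = 'auc'
params['num_class'] = 4

# indices of every non-float column (np.float was removed in NumPy 1.24, so compare against np.float64)
categorical_features = np.where(X_train.dtypes != np.float64)[0]

# train
gbm = lgbm.train(params,
                 lgb_train,
                 num_boost_round=100,
                 valid_sets=[lgb_train, lgb_test],
                 categorical_feature=categorical_features)
```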
The object dtype is not supported yet; you need to convert those columns to int. We might support it in a later version.
The object columns are my categorical data, so should I change their dtype to pandas "category", or should I LabelEncode them?
@laurazh You could do both things. Here is the working example:
https://github.com/Microsoft/LightGBM/blob/d94ec89b68b46a1af484773fbd771b71f312a754/tests/python_package_test/test_engine.py#L481-L500
@laurazh You can do either, but LabelEncode is more reliable.
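For reference, here is a minimal sketch of the two options mentioned above, using only pandas. The toy frame and the column names `color` and `size` are made up for illustration and are not from the original dataset; the integer codes from `.cat.codes` play the same role as LabelEncode output.

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train: one string column, one float column.
X_train = pd.DataFrame({
    "color": ["red", "yellow", "blue", "red"],
    "size": [1.5, 2.0, 0.5, 3.0],
})

# Non-numeric columns are the ones that trigger the ValueError.
object_cols = [c for c in X_train.columns
               if not pd.api.types.is_numeric_dtype(X_train[c])]

# Option 1: convert to the pandas "category" dtype.
X_cat = X_train.copy()
for col in object_cols:
    X_cat[col] = X_cat[col].astype("category")

# Option 2: label-encode to plain integer codes
# (categories are sorted, so blue=0, red=1, yellow=2 here).
X_int = X_train.copy()
for col in object_cols:
    X_int[col] = X_int[col].astype("category").cat.codes

print(X_cat.dtypes)
print(X_int)
```

Either frame can then be passed to `lgbm.Dataset` in place of the original `X_train`. With option 2, the encoded columns still need to be listed in `categorical_feature` if they should be treated as categories rather than numbers.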
Thanks, it works!
But LabelEncoding can misrepresent categorical variables. Say you have 'red', 'yellow' and 'blue' as three categories, label-encoded to 1, 2 and 3 (int64) respectively. This can imply an ordering in which blue is the highest and red the lowest. How can we deal with that?
From what I understand, tree-based algorithms are quite robust, and it is OK to give them label-encoded categorical data. The algorithm will deal with it. I don't know the math behind it, though.
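A rough way to see why the integer order is not binding for a tree: each split is only a threshold, and stacking two thresholds can single out any one code, including the middle one. The sketch below is plain Python with the hypothetical encoding from the question. It illustrates the idea only and is not how LightGBM implements its splits.

```python
# Hypothetical label encoding of the colours from the question.
codes = {"red": 1, "yellow": 2, "blue": 3}

def in_range(code, low, high):
    """Two stacked threshold splits: code > low, then code <= high."""
    return low < code <= high

# One threshold cuts the ordered codes into two contiguous groups.
left_of_first_split = [name for name, c in codes.items() if c <= 1.5]

# A second threshold further down the tree isolates the middle value.
only_yellow = [name for name, c in codes.items() if in_range(c, 1.5, 2.5)]

print(left_of_first_split)
print(only_yellow)
```

The `categorical_feature` argument in the original snippet avoids relying on this at all: it tells LightGBM to treat the listed integer columns as categories rather than as ordered numbers.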