I have a data set with one categorical dependent variable and 7 categorical features, with 12,987 samples.
here is my code
import lightgbm as lgb
import pandas as pd
import numpy as np
import graphviz as graph
from sklearn.model_selection import train_test_split
df = pd.read_csv('data2.csv')
y = df.Pathology
X = df.drop('Pathology', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {'num_leaves': 5,'metric': ('l1', 'l2'),'verbose': 0}
evals_result = {} # to record eval results for plotting
print('Start training...')
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
I get this
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
C:\Users\Sara\Anaconda2\lib\site-packages\lightgbm\basic.py:1042: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['Age', 'Laterality', 'Marital', 'Race', 'Sex', 'site']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
Traceback (most recent call last):
File "
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
File "C:\Anaconda2\lib\site-packages\lightgbm\engine.py", line 183, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1307, in __init__
train_set.construct().handle,
File "C:\Users\Sara\Anaconda2\lib\site-packages\lightgbm\basic.py", line 860, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 652, in _lazy_init
data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data, feature_name, categorical_feature, self.pandas_categorical)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 273, in _data_from_pandas
raise ValueError(msg + ', '.join(bad_fields))
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields Age, Sex, Laterality, Race , Marital, site
Hi @SaraMorsy !
Your data contains string(?) values. You should set categorical columns in pandas so they are treated properly. After that, there is no need to pass them again in the categorical_feature param.
Please see this code for how to set categorical columns in pandas:
https://github.com/Microsoft/LightGBM/blob/3400e3899d1c45997337bdbd0c8a2f729e805cbc/tests/python_package_test/test_engine.py#L487-L489
https://github.com/Microsoft/LightGBM/blob/3400e3899d1c45997337bdbd0c8a2f729e805cbc/tests/python_package_test/test_engine.py#L477-L496
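A minimal sketch of what the linked test code does (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical string-valued feature columns, like the ones in the error.
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'M'],
    'site': ['A', 'B', 'A', 'C'],
})

# Convert string columns to the pandas 'category' dtype; LightGBM then
# detects and handles them natively when the Dataset is constructed.
for col in ['Sex', 'site']:
    df[col] = df[col].astype('category')

print(df.dtypes)
```

Once the dtypes are `category`, constructing `lgb.Dataset(df, y)` no longer raises the "must be int, float or bool" error for those columns.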
Thanks, but that only solved the categorical features; it could not handle the categorical y.
here is my code
df = pd.read_csv('data2.csv')
for col in ['Sex', 'site', 'Race', 'Laterality', 'Marital', 'Pathology']:
    df[col] = df[col].astype('category')
y = df.Pathology
X = df.drop('Pathology', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
params = { 'objective': 'multiclassova', 'metric': 'multi_logloss', 'verbose': -1, 'num_class':10}
evals_result = {} # to record eval results for plotting
print('Start training...')
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
gbm = lgb.train(params,lgb_train)
gbm = lgb.train(params,lgb_train)
C:\Users\Anaconda2\lib\site-packages\lightgbm\basic.py:685: UserWarning: categorical_feature in param dict is overridden.
warnings.warn('categorical_feature in param dict is overridden.')
Traceback (most recent call last):
File "
gbm = lgb.train(params,lgb_train)
File "C:\Anaconda2\lib\site-packages\lightgbm\engine.py", line 183, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1307, in __init__
train_set.construct().handle,
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 860, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 722, in _lazy_init
self.set_label(label)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1110, in set_label
label = list_to_1d_numpy(label, name='label')
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 84, in list_to_1d_numpy
return data.values.astype(dtype)
File "C:\Anaconda2\lib\site-packages\pandas\core\categorical.py", line 455, in astype
return np.array(self, dtype=dtype, copy=copy)
File "C:\Anaconda2\lib\site-packages\pandas\core\categorical.py", line 1210, in __array__
return np.asarray(ret, dtype)
File "C:\Anaconda2\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: Glioblastoma
You should convert it to int for a multi-class label.
@SaraMorsy
Since you're using train_test_split, I assume that you're familiar with scikit-learn (or at least have it installed). So, you can use this:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
http://scikit-learn.org/stable/modules/preprocessing_targets.html
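A short sketch of the LabelEncoder approach from the linked docs, using made-up class names like the ones in the traceback:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, like the Pathology column.
y = np.array(['Glioblastoma', 'Meningioma', 'Glioblastoma', 'Astrocytoma'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)           # integers 0..n_classes-1
print(y_encoded)                          # [1 2 1 0] (classes sorted alphabetically)
print(le.inverse_transform(y_encoded))    # back to the original class names
```

The encoded integers can be passed to `lgb.Dataset` as the label, and `inverse_transform` recovers the original names whenever you need to present results.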
Alternatively, you can switch to LGBMClassifier where we have built-in label encoding routine.
As I stated before, I used label encoding and one-hot encoding, but the tree plot will look like this:
[tree plot image showing numeric leaf_index labels]
The medical community will not understand "leaf index: 5"; I want these leaves to be named by the class (e.g. Glioblastoma). That's why I want to use the category dtype, as it is the only method that preserves the class names.
I will try to use LGBMClassifier.
@SaraMorsy I apologize for the delay!
The leaf_index field represents indices, as stated in the name; they are always 0, 1, 2, ...
The name of the class can be restored from the leaf_value field. Please read #1360, #1675, #1552 for better understanding.
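One hedged sketch of recovering class names at prediction time (not the leaf_value mapping from the linked issues, but the common workaround): keep the fitted LabelEncoder around and map predicted class indices back through it. The probability array below is a stand-in for what `gbm.predict(X_test)` would return with a multiclass objective.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Astrocytoma', 'Glioblastoma', 'Meningioma'])

# With a multiclass objective, predict() returns one probability per class;
# argmax gives the class index, which the encoder maps back to a name.
proba = np.array([[0.1, 0.8, 0.1]])   # stand-in for gbm.predict(X_test)
names = le.inverse_transform(proba.argmax(axis=1))
print(names)  # ['Glioblastoma']
```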
I have to agree with @SaraMorsy here.
LightGBM's support for pandas category types is great. It lets me interact with the original categories in plain text while integers are used under the hood to build the tree. It's the best of both worlds: plain text for intuition, integers for the math.
It seems reasonable to me that when plotting the tree, one expects to find the original category names. It feels inconsistent to show the underlying integers because the entire purpose of the category type is to abstract them away from the user.
I would also like to see the actual category names in situations where something is being explained about the model. Otherwise, what's the benefit of supporting the category type?
@pietz Here are some benefits. See:
"however, scikit-learn implementation does not support categorical variables for now." from https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart)
Section "Categorical Feature Support" in https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
Alternatively, you can switch to LGBMClassifier where we have built-in label encoding routine.
Can you share an example of how to do it?
I tried to use LGBMClassifier with a categorical target with the values yes and no,
but I get an error message.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.