I have a data set with one categorical dependent variable and 7 categorical features, with 12,987 samples.
here is my code
import lightgbm as lgb
import pandas as pd
import numpy as np
import graphviz as graph
from sklearn.model_selection import train_test_split
df = pd.read_csv('data2.csv')
y = df.Pathology
X = df.drop('Pathology', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {'num_leaves': 5,'metric': ('l1', 'l2'),'verbose': 0}
evals_result = {} # to record eval results for plotting
print('Start training...')
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
I get this
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
C:\Users\Sara\Anaconda2\lib\site-packages\lightgbm\basic.py:1042: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['Age', 'Laterality', 'Marital', 'Race', 'Sex', 'site']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
Traceback (most recent call last):
File "
gbm = lgb.train(params,lgb_train,categorical_feature=['Marital','site','Laterality','Sex','Age','Race'])
File "C:\Anaconda2\lib\site-packages\lightgbm\engine.py", line 183, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1307, in __init__
train_set.construct().handle,
File "C:\Users\Sara\Anaconda2\lib\site-packages\lightgbm\basic.py", line 860, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 652, in _lazy_init
data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data, feature_name, categorical_feature, self.pandas_categorical)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 273, in _data_from_pandas
raise ValueError(msg + ', '.join(bad_fields))
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields Age, Sex, Laterality, Race , Marital, site
Hi @SaraMorsy !
Your data contains string(?) values. You should set categorical columns in pandas so they are treated properly. After that, there is no need to pass them again in the categorical_feature param.
Please see this code for how to set categorical columns in pandas:
https://github.com/Microsoft/LightGBM/blob/3400e3899d1c45997337bdbd0c8a2f729e805cbc/tests/python_package_test/test_engine.py#L487-L489
https://github.com/Microsoft/LightGBM/blob/3400e3899d1c45997337bdbd0c8a2f729e805cbc/tests/python_package_test/test_engine.py#L477-L496
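A minimal sketch of what the linked test code does (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical string-valued feature columns, like the ones in the error.
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'M'],
    'site': ['A', 'B', 'A', 'C'],
})

# Convert string columns to the pandas 'category' dtype; LightGBM then
# detects and handles them natively when the Dataset is constructed.
for col in ['Sex', 'site']:
    df[col] = df[col].astype('category')

print(df.dtypes)
```

Once the dtypes are `category`, constructing `lgb.Dataset(df, y)` no longer raises the "must be int, float or bool" error for those columns.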
Thanks, but that only solved the categorical features; it could not handle the categorical y.
here is my code
df = pd.read_csv('data2.csv')
for col in ['Sex', 'site', 'Race', 'Laterality', 'Marital', 'Pathology']:
    df[col] = df[col].astype('category')
y = df.Pathology
X = df.drop('Pathology', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
params = { 'objective': 'multiclassova', 'metric': 'multi_logloss', 'verbose': -1, 'num_class':10}
evals_result = {} # to record eval results for plotting
print('Start training...')
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
gbm = lgb.train(params,lgb_train)
gbm = lgb.train(params,lgb_train)
C:\Users\Anaconda2\lib\site-packages\lightgbm\basic.py:685: UserWarning: categorical_feature in param dict is overridden.
warnings.warn('categorical_feature in param dict is overridden.')
Traceback (most recent call last):
File "
gbm = lgb.train(params,lgb_train)
File "C:\Anaconda2\lib\site-packages\lightgbm\engine.py", line 183, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1307, in __init__
train_set.construct().handle,
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 860, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 722, in _lazy_init
self.set_label(label)
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 1110, in set_label
label = list_to_1d_numpy(label, name='label')
File "C:\Anaconda2\lib\site-packages\lightgbm\basic.py", line 84, in list_to_1d_numpy
return data.values.astype(dtype)
File "C:\Anaconda2\lib\site-packages\pandas\core\categorical.py", line 455, in astype
return np.array(self, dtype=dtype, copy=copy)
File "C:\Anaconda2\lib\site-packages\pandas\core\categorical.py", line 1210, in __array__
return np.asarray(ret, dtype)
File "C:\Anaconda2\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: Glioblastoma
You should convert it to int for a multi-class label.
@SaraMorsy
Since you're using train_test_split, I assume that you're familiar with scikit-learn (or at least have it installed). So, you can use this:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
http://scikit-learn.org/stable/modules/preprocessing_targets.html
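A short sketch of the LabelEncoder approach from the linked docs, using made-up class names like the ones in the traceback:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, like the Pathology column.
y = np.array(['Glioblastoma', 'Meningioma', 'Glioblastoma', 'Astrocytoma'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)           # integers 0..n_classes-1
print(y_encoded)                          # [1 2 1 0] (classes sorted alphabetically)
print(le.inverse_transform(y_encoded))    # back to the original class names
```

The encoded integers can be passed to `lgb.Dataset` as the label, and `inverse_transform` recovers the original names whenever you need to present results.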
Alternatively, you can switch to LGBMClassifier where we have built-in label encoding routine.
As I stated before, I used label encoding and one-hot encoding, but the tree plot will look like this:
[tree plot image showing numeric leaf_index labels]
The medical community will not understand "leaf index: 5"; I want these leaves to be named by the class (e.g. Glioblastoma). That's why I want to use the category dtype, as it is the only method that preserves the class names.
I will try to use LGBMClassifier.
@SaraMorsy I apologize for the delay!
The leaf_index field represents indices, as stated in the name; they are always 0, 1, 2, ...
The name of the class can be restored from the leaf_value field. Please read #1360, #1675, #1552 for better understanding.
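One hedged sketch of recovering class names at prediction time (not the leaf_value mapping from the linked issues, but the common workaround): keep the fitted LabelEncoder around and map predicted class indices back through it. The probability array below is a stand-in for what `gbm.predict(X_test)` would return with a multiclass objective.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Astrocytoma', 'Glioblastoma', 'Meningioma'])

# With a multiclass objective, predict() returns one probability per class;
# argmax gives the class index, which the encoder maps back to a name.
proba = np.array([[0.1, 0.8, 0.1]])   # stand-in for gbm.predict(X_test)
names = le.inverse_transform(proba.argmax(axis=1))
print(names)  # ['Glioblastoma']
```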
I have to agree with @SaraMorsy here.
LightGBM's support for pandas category types is great. It lets me interact with the original categories in plain text while integers are used under the hood to build the tree. It's the best of both worlds: plain text for intuition, integers for the math.
It seems reasonable to me that when plotting the tree, one expects to find the original category names. It feels inconsistent to show the underlying integers because the entire purpose of the category type is to abstract them away from the user.
I would also like to see the actual category names in situations where something is being explained about the model. Otherwise, what's the benefit of supporting the category type?
@pietz Here are some benefits. See:
"however, scikit-learn implementation does not support categorical variables for now." from https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart)
Section "Categorical Feature Support" in https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
Alternatively, you can switch to LGBMClassifier where we have built-in label encoding routine.
Can you share an example of how to do it?
I tried to use LGBMClassifier with a categorical target with the values yes and no,
but I get an error message.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.