LightGBM: Support complex data types in categorical columns of pandas DataFrame

Created on 29 Apr 2019  ·  7 Comments  ·  Source: microsoft/LightGBM


I am converting one or more float64 columns into categorical bins to speed up convergence and force the boundaries of the decision points. Attempting to bin the float columns with pd.cut or pd.qcut raises the error below when the model is fitted.

Environment info

Operating System: Windows 10

CPU/GPU model: Intel Core i7

C++/Python/R version: Python 3.6, Anaconda, Jupyter Notebook, pandas 0.24.2

LightGBM version or commit hash: LightGBM 2.2.2

Error message

ValueError: Circular reference detected

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-207-751795a98846> in <module>
      4                            metric         = ['binary_logloss'])
      5 
----> 6 lgbmc.fit(X_train,y_train)
      7 
      8 prob_pred = lgbmc.predict(X_test)

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    740                                         verbose=verbose, feature_name=feature_name,
    741                                         categorical_feature=categorical_feature,
--> 742                                         callbacks=callbacks)
    743         return self
    744 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    540                               verbose_eval=verbose, feature_name=feature_name,
    541                               categorical_feature=categorical_feature,
--> 542                               callbacks=callbacks)
    543 
    544         if evals_result:

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    238         booster.best_score[dataset_name][eval_name] = score
    239     if not keep_training_booster:
--> 240         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    241     return booster
    242 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2064                 ptr_string_buffer))
   2065         ret = string_buffer.value.decode()
-> 2066         ret += _dump_pandas_categorical(self.pandas_categorical)
   2067         return ret
   2068 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in _dump_pandas_categorical(pandas_categorical, file_name)
    299     pandas_str = ('\npandas_categorical:'
    300                   + json.dumps(pandas_categorical, default=json_default_with_numpy)
--> 301                   + '\n')
    302     if file_name is not None:
    303         with open(file_name, 'a') as f:

~\AppData\Local\conda\conda\envs\py36\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

ValueError: Circular reference detected

Reproducible examples

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rows = 100
fcols = 5
ccols = 5
# Let's define some ascii readable names for convenience
fnames = ['Float_'+str(chr(97+n)) for n in range(fcols)]
cnames = ['Cat_'+str(chr(97+n)) for n in range(ccols)]

# The dataset is built by concatenation of the float and the int blocks
dff = pd.DataFrame(np.random.rand(rows,fcols),columns=fnames)
dfc = pd.DataFrame(np.random.randint(0,20,(rows,ccols)),columns=cnames)
df = pd.concat([dfc,dff],axis=1)
# Target column with random output
df['Target'] = (np.random.rand(rows)>0.5).astype(int)

# Conversion into categorical
df[cnames] = df[cnames].astype('category')
df['Float_a'] = pd.cut(x=df['Float_a'],bins=10)

# Dataset split
X = df.drop('Target',axis=1)
y = df['Target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Model instantiation
lgbmc = lgb.LGBMClassifier(objective      = 'binary',
                           boosting_type  = 'gbdt',
                           is_unbalance   = True,
                           metric         = ['binary_logloss'])

lgbmc.fit(X_train,y_train)

Steps to reproduce

  1. Copy code in a Jupyter Notebook cell
  2. Execute
  3. If the line df['Float_a'] = pd.cut(x=df['Float_a'],bins=10) is removed, there is no error

Labels: feature request, help wanted

All 7 comments

I realized that the JSON serializer has an issue with the Interval dtype. Changing the categories to strings, e.g. df['Float_a'].cat.categories = ["%6.3f-%6.3f"%(x.left,x.right) for x in df['Float_a'].cat.categories], makes the issue disappear.
This is strange, because the pandas JSON serializer has no problem with Interval types.
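
The workaround above can be sketched as follows. This is a minimal illustration, not an official fix: it uses `rename_categories` instead of assigning to `.cat.categories`, on the assumption that a newer pandas release is installed, where direct assignment is no longer accepted. The random data and seed are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen only for the example
df = pd.DataFrame({'Float_a': rng.rand(100)})

# pd.cut yields a categorical column whose categories are pandas.Interval objects
df['Float_a'] = pd.cut(x=df['Float_a'], bins=10)

# Build one plain-string label per Interval category, in the original order
labels = ["%6.3f-%6.3f" % (iv.left, iv.right)
          for iv in df['Float_a'].cat.categories]

# rename_categories returns a new categorical with the same codes and order
df['Float_a'] = df['Float_a'].cat.rename_categories(labels)

category_types = {type(c) for c in df['Float_a'].cat.categories}
print(category_types)
```

After the renaming, every category is a plain string, which is one of the simple types the thread says LightGBM can handle.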

I tried pd.cut, and it returns the interval ranges, not the binned int values, so we cannot use it for training.
However, the error message is a little confusing.
@StrikerRUS Can we refine the error message for unexpected pandas DataFrames?
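
For comparison, pd.cut can also return integer bin indices when called with labels=False. The sketch below only shows the difference between the two return types on made-up random data; whether integer codes suit a given model is left to the reader.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen only for the example
values = pd.Series(rng.rand(100), name='Float_a')

# Default: a categorical Series whose categories are Interval objects
as_intervals = pd.cut(x=values, bins=10)

# labels=False: the integer index of the bin each value falls into
as_codes = pd.cut(x=values, bins=10, labels=False)

print(as_intervals.dtype)
print(as_codes.dtype, as_codes.min(), as_codes.max())
```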

Actually, my thinking was that an interval range would work as a category, not
as a value.

@mallibus

Actually my thinking was that interval range would work as a category not
as a value.

Yeah, you're right! Interval values are treated as categorical. Unfortunately, LightGBM supports only simple types of categories, e.g. int, float, string.

During training, LightGBM dumps pandas categories to JSON. It uses the standard json.dumps() function with our simple numpy array serializer:
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L264
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L310-L317
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/compat.py#L52-L59

In your case, the categories are pandas.Interval objects, which cannot be serialized in this manner.
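
The "Circular reference detected" message can be reproduced with the standard library alone. The sketch below assumes that the serializer passed as `default` hands unsupported objects back unchanged (my reading of the linked helper, not a verified fact); `FakeInterval` and `passthrough_default` are hypothetical names used only for illustration.

```python
import json


class FakeInterval:
    """Hypothetical stand-in for a complex category such as pandas.Interval."""

    def __init__(self, left, right):
        self.left = left
        self.right = right


def passthrough_default(obj):
    # Assumption: mimics a serializer that returns unknown objects unchanged
    return obj


def try_dump(categories):
    try:
        return json.dumps(categories, default=passthrough_default)
    except ValueError as exc:
        return "ValueError: {}".format(exc)


# json sees the same object come back from `default` and reports a cycle
result = try_dump([[FakeInterval(0.0, 0.1)]])
print(result)
```

The encoder remembers every object it hands to `default`; when the very same object comes back, it treats that as a cycle, which would explain why the message says nothing about the unsupported type.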

For instance, the same error can be reproduced by passing data whose categories are Timestamps.

import numpy as np
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame([pd.to_datetime('01/{0}/2019'.format(i % 12 + 1)) for i in range(100)], columns=['a'])
df['Target'] = (np.random.rand(100) > 0.5).astype(int)
df['a'] = df['a'].astype('category')
X = df.drop('Target', axis=1)
y = df['Target'].astype(int)
lgb_data = lgb.Dataset(X, y)
lgb.train({}, lgb_data)
print(type(lgb_data.pandas_categorical[0][0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Removing pd.to_datetime() results in successful training.

I see several ways to fix this issue.

  • Leave everything as is 😄.
  • Raise a more user-friendly error for unsupported types in categories (I have no idea what types we should check).
  • Replace complicated objects by their __repr__ string during dumping, so that categories become simple strings. However, this will not allow us to restore the original objects during loading, and the original DataFrame will be modified.
  • Utilize the pandas to_json() method. It will bring an unwanted dependency to the library and the need to carefully maintain it. BTW, the pandas team is planning a huge refactoring of JSON support: https://github.com/pandas-dev/pandas/issues/12004#issuecomment-437926069.

Maybe someone else has other ideas?
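
The second and third options listed above could look roughly like this. It is only a sketch using the standard json module: `FakeInterval`, `strict_default` and `repr_default` are hypothetical names, not LightGBM code.

```python
import json


class FakeInterval:
    """Hypothetical stand-in for a complex category object."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return "FakeInterval({!r}, {!r})".format(self.left, self.right)


def strict_default(obj):
    # Friendlier error: fail early and name the offending type
    raise TypeError(
        "Unsupported category type '{}': convert categories to "
        "int, float or str first".format(type(obj).__name__))


def repr_default(obj):
    # Repr fallback: degrade the object to its string representation
    return repr(obj)


categories = [[FakeInterval(0.0, 0.5), FakeInterval(0.5, 1.0)]]

try:
    json.dumps(categories, default=strict_default)
    strict_message = None
except TypeError as exc:
    strict_message = str(exc)

dumped = json.dumps(categories, default=repr_default)
restored = json.loads(dumped)

print(strict_message)
print(restored)
```

Loading the dumped text gives back plain strings rather than the original objects, which is the limitation mentioned for the __repr__ option.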

As a user my preference would be to combine two of the proposals:

  • convert complex categories into their string representation, possibly
    keeping their ordering properties (e.g. with leading zeros in numbers so
    they keep the same ordering as strings).
  • raise a warning to inform about the conversion, explaining a bit of
    the background.

Otherwise, the minimum would be a better error message with the suggestion
to convert categories into strings.

Thank you!
Marcello
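
A rough sketch of the combined proposal, using only the standard library. The helper names, the padding width and the warning text are assumptions made for illustration, and the ordering guarantee holds only for non-negative bounds that fit in the chosen width.

```python
import warnings


def interval_label(left, right, width=8, decimals=3):
    # Zero-padded, fixed-width numbers sort the same way as text and as
    # numbers (for non-negative values that fit in `width` characters)
    pad = "{:0{w}.{d}f}"
    return "{}-{}".format(pad.format(left, w=width, d=decimals),
                          pad.format(right, w=width, d=decimals))


def stringify_categories(bounds):
    # Tell the user about the conversion, as suggested in the comment above
    warnings.warn(
        "Complex categories were converted to strings; "
        "the original objects cannot be restored from the model.",
        UserWarning)
    return [interval_label(left, right) for left, right in bounds]


bounds = [(0.0, 5.0), (5.0, 10.0), (10.0, 100.0)]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    padded = stringify_categories(bounds)

# Without padding, "10.0-100.0" sorts before "5.0-10.0"
naive = ["{}-{}".format(left, right) for left, right in bounds]

print(padded)
print(sorted(naive) == naive, sorted(padded) == padded)
```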

@mallibus Thanks a lot for your feedback!

Closing in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
