I am converting one or more float64 columns into categorical bins to speed up convergence and to force the boundaries of the decision points. I am binning the float columns with pd.cut or pd.qcut.
Operating System: Windows 10
CPU/GPU model: Intel Core i7
C++/Python/R version: Python 3.6, Anaconda, Jupyter Notebook, pandas 0.24.2
LightGBM version or commit hash: LightGBM 2.2.2
ValueError: Circular reference detected
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-207-751795a98846> in <module>
4 metric = ['binary_logloss'])
5
----> 6 lgbmc.fit(X_train,y_train)
7
8 prob_pred = lgbmc.predict(X_test)
~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
740 verbose=verbose, feature_name=feature_name,
741 categorical_feature=categorical_feature,
--> 742 callbacks=callbacks)
743 return self
744
~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
540 verbose_eval=verbose, feature_name=feature_name,
541 categorical_feature=categorical_feature,
--> 542 callbacks=callbacks)
543
544 if evals_result:
~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
238 booster.best_score[dataset_name][eval_name] = score
239 if not keep_training_booster:
--> 240 booster.model_from_string(booster.model_to_string(), False).free_dataset()
241 return booster
242
~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
2064 ptr_string_buffer))
2065 ret = string_buffer.value.decode()
-> 2066 ret += _dump_pandas_categorical(self.pandas_categorical)
2067 return ret
2068
~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in _dump_pandas_categorical(pandas_categorical, file_name)
299 pandas_str = ('\npandas_categorical:'
300 + json.dumps(pandas_categorical, default=json_default_with_numpy)
--> 301 + '\n')
302 if file_name is not None:
303 with open(file_name, 'a') as f:
~\AppData\Local\conda\conda\envs\py36\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
236 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
237 separators=separators, default=default, sort_keys=sort_keys,
--> 238 **kw).encode(obj)
239
240
~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
ValueError: Circular reference detected
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
rows = 100
fcols = 5
ccols = 5
# Let's define some ascii readable names for convenience
fnames = ['Float_'+str(chr(97+n)) for n in range(fcols)]
cnames = ['Cat_'+str(chr(97+n)) for n in range(ccols)]
# The dataset is built by concatenation of the float and the int blocks
dff = pd.DataFrame(np.random.rand(rows,fcols),columns=fnames)
dfc = pd.DataFrame(np.random.randint(0,20,(rows,ccols)),columns=cnames)
df = pd.concat([dfc,dff],axis=1)
# Target column with random output
df['Target'] = (np.random.rand(rows)>0.5).astype(int)
# Conversion into categorical
df[cnames] = df[cnames].astype('category')
df['Float_a'] = pd.cut(x=df['Float_a'],bins=10)
# Dataset split
X = df.drop('Target',axis=1)
y = df['Target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Model instantiation
lgbmc = lgb.LGBMClassifier(objective='binary',
                           boosting_type='gbdt',
                           is_unbalance=True,
                           metric=['binary_logloss'])
lgbmc.fit(X_train,y_train)
If the line df['Float_a'] = pd.cut(x=df['Float_a'], bins=10) is removed, there is no error. I realized that the JSON serializer has some issue with the Interval dtype. After changing the categories to strings, like df['Float_a'].cat.categories = ["%6.3f-%6.3f" % (x.left, x.right) for x in df['Float_a'].cat.categories], the issue disappears.
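The workaround above can be sketched end-to-end. This is a minimal, self-contained example; the column name and label format are illustrative, and rename_categories is used instead of assigning to cat.categories (the assignment form is deprecated in recent pandas versions):

```python
import numpy as np
import pandas as pd

# Bin a float column with pd.cut, producing Interval-typed categories.
df = pd.DataFrame({'Float_a': np.random.rand(100)})
df['Float_a'] = pd.cut(x=df['Float_a'], bins=10)

# Workaround: replace the Interval categories with plain string labels
# so the categories become JSON-serializable during LightGBM training.
df['Float_a'] = df['Float_a'].cat.rename_categories(
    ["%6.3f-%6.3f" % (iv.left, iv.right) for iv in df['Float_a'].cat.categories])

# The categories are now ordinary strings instead of Interval objects.
print(all(isinstance(c, str) for c in df['Float_a'].cat.categories))
```

With string categories the column can be passed to LGBMClassifier.fit() without triggering the serialization error.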
This is strange because the pandas json serializer has no problem with Interval types.
I tried pd.cut, and it returns the interval ranges, not the binned int values, so we cannot use it for training.
However, the error message is a little bit confusing.
@StrikerRUS Can we refine the error message when meeting unexpected pandas dataframes?
Actually my thinking was that the interval range would work as a category, not as a value.
@mallibus
Actually my thinking was that interval range would work as a category not as a value.
Yeah, you're right! Interval values are treated as categorical. Unfortunately, LightGBM supports only simple types of categories, e.g. int, float, string.
During training, LightGBM dumps pandas categories to JSON. It uses the standard json.dumps() function with our simple numpy array serializer:
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L264
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L310-L317
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/compat.py#L52-L59
In your case, categories are pandas.Interval objects which cannot be serialized in such manner.
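This can be confirmed in isolation: the standard json module cannot encode a pandas.Interval. A minimal sketch (not LightGBM code; depending on the Python/pandas version the failure may surface as a TypeError rather than the ValueError seen in the traceback above):

```python
import json
import pandas as pd

# Try to serialize a nested list of Interval objects, which is roughly
# the shape of pandas_categorical that _dump_pandas_categorical() dumps.
iv = pd.Interval(0.0, 0.1)
try:
    json.dumps([[iv]])
    serialized = True
except (TypeError, ValueError):  # Interval is not JSON-serializable
    serialized = False
print(serialized)  # False
```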
For instance, the same error can be reproduced by trying to pass the data where categories are Timestamps.
import numpy as np
import pandas as pd
import lightgbm as lgb
df = pd.DataFrame([pd.to_datetime('01/{0}/2019'.format(i % 12 + 1)) for i in range(100)], columns=['a'])
df['Target'] = (np.random.rand(100) > 0.5).astype(int)
df['a'] = df['a'].astype('category')
X = df.drop('Target', axis=1)
y = df['Target'].astype(int)
lgb_data = lgb.Dataset(X, y)
lgb.train({}, lgb_data)
print(type(lgb_data.pandas_categorical[0][0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Removing pd.to_datetime() results in successful training.
I see several ways to fix this issue.
- Leave everything as is.
- Raise a more user-friendly error for unsupported types in category (I have no idea what types we should check).
- Replace complicated objects by their __repr__ string during dumping, so that categories become simple strings. However, it will not allow us to restore original objects during loading, and the original DataFrame will be modified.
- Utilize the pandas to_json() method. It will bring an unwanted dependency to the library and the need to carefully maintain it. BTW, the pandas team is planning a huge refactoring of json support: https://github.com/pandas-dev/pandas/issues/12004#issuecomment-437926069.

Maybe someone else has other ideas?
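The __repr__ proposal could be sketched as a pre-dump conversion step. The helper below is hypothetical, not actual LightGBM code:

```python
import json
import pandas as pd

def categories_to_strings(pandas_categorical):
    """Hypothetical helper: replace non-primitive category values with
    their repr() so the nested list becomes JSON-serializable."""
    primitive = (int, float, str, bool, type(None))
    return [[c if isinstance(c, primitive) else repr(c) for c in cats]
            for cats in pandas_categorical]

# Interval categories now dump without error (as repr strings).
cats = [[pd.Interval(0.0, 0.1), pd.Interval(0.1, 0.2)]]
print(json.dumps(categories_to_strings(cats)))
```

As noted above, the drawback is that loading the model back cannot reconstruct the original Interval objects from these strings.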
As a user my preference would be to combine two of the proposals: replace the complicated objects with strings that keep their ordering properties (e.g. with leading zeros in numbers so they keep the same ordering as strings).
Otherwise the minimum would be a better error message with the suggestion of converting categories into strings.
Thank you!
Marcello
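The ordering idea above can be sketched as follows: fixed-width, zero-padded labels sort lexicographically in the same order as the underlying bins. The label format is illustrative:

```python
import numpy as np
import pandas as pd

# Bin a numeric series and build zero-padded string labels for the bins.
s = pd.cut(pd.Series(np.linspace(0, 100, 50)), bins=12)
labels = ['%07.3f-%07.3f' % (iv.left, iv.right) for iv in s.cat.categories]
s = s.cat.rename_categories(labels)

# Because every label has the same width, lexicographic order of the
# strings matches the numeric order of the intervals.
print(sorted(labels) == labels)  # True
```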
@mallibus Thanks a lot for your feedback!
Closing in favor of #2302. We decided to keep all feature requests in one place.
You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.