Pandas: 0.24.2
Catboost: 0.13.1
#!/usr/bin/env python3
import catboost
from catboost import datasets
from sklearn.model_selection import train_test_split
(train_df, test_df) = catboost.datasets.amazon()
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1).astype('category')
# .astype('category') matters: without it there is no error. By the way, with .astype('category') the Pool is created much more slowly. Why?
cat_features = list(range(9))
# strangely, this works
pool = catboost.Pool(X, y, cat_features=cat_features) # OK
# but after splitting the dataset there is an error
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
train = catboost.Pool(x_train, y_train, cat_features=cat_features) # ERROR
valid = catboost.Pool(x_valid, y_valid, cat_features=cat_features) # ERROR
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/test.py", line 17, in <module>
train = catboost.Pool(x_train, y_train, cat_features=cat_features) # ERROR
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/catboost/core.py", line 291, in __init__
self._init(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/catboost/core.py", line 646, in _init
self._init_pool(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
File "_catboost.pyx", line 1942, in _catboost._PoolBase._init_pool
File "_catboost.pyx", line 1973, in _catboost._PoolBase._init_pool
File "_catboost.pyx", line 1842, in _catboost._PoolBase._init_features_order_layout_pool
File "_catboost.pyx", line 1489, in _catboost._set_features_order_data_pd_data_frame
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 868, in __getitem__
result = self.index.get_value(self, key)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4360, in get_value
iloc = self.get_loc(key)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
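The innermost frames above show a pandas Series being indexed with the key 0 and the integer index failing to find that label. The snippet below is a minimal sketch of that pandas behaviour only. It uses a toy Series (made-up values, not the Amazon data) whose index lacks the label 0, as can happen to x_train after a shuffled split. Whether this is the complete cause inside CatBoost 0.13.1 is an assumption, not something the traceback proves.

```python
import pandas as pd

# Toy stand-in for one categorical column of x_train. After a shuffled split
# the rows keep their original integer labels, so label 0 may be missing.
column = pd.Series(["a", "b", "c"], index=[7, 3, 5], dtype="category")

try:
    column[0]  # on an integer index, [] looks up the LABEL 0, not position 0
    raised = False
except KeyError:
    raised = True

# Positional access does not depend on which labels survived the split.
first_by_position = column.iloc[0]

print(raised, first_by_position)
```

On the unsplit frame the label 0 exists, which would be consistent with the first Pool call succeeding.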
Yes, I have a similar problem. There seems to be a bad interaction between CatBoost and pandas category columns in my dataframe:
not passing cat_features fails with an error like _catboost.CatboostError: Bad value for num_feature[0,3]="GROCERY": Cannot convert ... to float
passing cat_features produces a KeyError like the one above
However, if I leave those columns as string columns and pass cat_features, CatBoost seems to work fine. BTW, I'm even able to pass cat_features as string column names (which doesn't appear to be in the documentation yet).
@indigoviolet I temporarily installed the previous version of pandas (0.23.4), and it works fine for me.
Just convert the categorical variables to strings with .astype(str).
When you check the data types with .dtypes, you will see them as "object".
Then pass cat_features as a list of indices. That works fine.
Don't convert the columns to the category dtype with .astype('category').
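As a rough sketch of the workaround described above, the helper below casts every category column to strings and returns the positional indices to pass as cat_features. The helper name categories_to_str and the toy values are hypothetical and used only for illustration. The Pool call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd


def categories_to_str(df):
    """Return (copy of df with category columns cast to str, their column indices)."""
    cat_cols = list(df.select_dtypes(include="category").columns)
    out = df.copy()
    for col in cat_cols:
        out[col] = out[col].astype(str)  # string values instead of category dtype
    cat_idx = [out.columns.get_loc(col) for col in cat_cols]
    return out, cat_idx


# Toy frame with made-up values and a shuffled index, mimicking a split subset.
frame = pd.DataFrame(
    {"RESOURCE": [39353, 17183, 36724], "ROLE_CODE": [117908, 118539, 117880]},
    index=[2, 0, 1],
).astype("category")

converted, cat_idx = categories_to_str(frame)
print(converted.dtypes)
print(cat_idx)

# With CatBoost installed, the converted frame would be used as in the thread:
#   train = catboost.Pool(converted, y_train, cat_features=cat_idx)
```

The original frame is left untouched, so the same helper can be applied to x_train and x_valid separately after the split.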
This has been fixed in CatBoost==0.16.