Catboost: Support pandas DataFrame type 'category'

Created on 2 Apr 2019  ·  4 Comments  ·  Source: catboost/catboost

Pandas: 0.24.2
Catboost: 0.13.1

#!/usr/bin/env python3
import catboost
from catboost import datasets
from sklearn.model_selection import train_test_split

(train_df, test_df) = catboost.datasets.amazon()
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1).astype('category')
# .astype('category') matters: without it there is no error. By the way, with .astype('category') the Pool is created much more slowly. Why?
cat_features = list(range(9))

# strangely, this works
pool = catboost.Pool(X, y, cat_features=cat_features)  # OK

# but after splitting the dataset it fails
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
train = catboost.Pool(x_train, y_train, cat_features=cat_features)  # ERROR
valid = catboost.Pool(x_valid, y_valid, cat_features=cat_features)  # ERROR

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/test.py", line 17, in <module>
    train = catboost.Pool(x_train, y_train, cat_features=cat_features)  # ERROR
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/catboost/core.py", line 291, in __init__
    self._init(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/catboost/core.py", line 646, in _init
    self._init_pool(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
  File "_catboost.pyx", line 1942, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 1973, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 1842, in _catboost._PoolBase._init_features_order_layout_pool
  File "_catboost.pyx", line 1489, in _catboost._set_features_order_data_pd_data_frame
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4360, in get_value
    iloc = self.get_loc(key)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

Label: planned

All 4 comments

Yes, I have a similar problem. It seems like there is some bad interaction with having pandas category columns in my dataframe:

  • not passing cat_features breaks with an error like _catboost.CatboostError: Bad value for num_feature[0,3]="GROCERY": Cannot convert ... to float
  • passing cat_features produces a KeyError like the one above

However, if I leave those columns as string columns and pass cat_features in, CatBoost seems to work fine. BTW, I'm even able to pass cat_features in as string column names (which doesn't appear to be in the documentation yet)
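A possible reading of the KeyError: 0 in the traceback: the last Python frame before pandas is Series.__getitem__, which suggests a label-based lookup of key 0 on a column whose index was shuffled by train_test_split, so label 0 may no longer be present. This is an inference from the traceback, not something confirmed in this thread. The pandas-only sketch below uses a made-up three-row Series to show that failure mode in isolation.

```python
import pandas as pd

# Made-up data: after a shuffled split, rows keep their original labels,
# so label 0 may be missing from the index.
s = pd.Series(["a", "b", "c"], index=[3, 1, 2], dtype="category")

try:
    s[0]  # label-based lookup on an integer index: there is no label 0
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = s.iloc[0]        # positional lookup does not depend on labels
s_reset = s.reset_index(drop=True)   # labels become 0..n-1 again
first_after_reset = s_reset[0]       # label 0 exists again
```

If this reading is right, calling reset_index(drop=True) on x_train and x_valid before building the Pool might also avoid the error on the affected versions. That is untested here.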

@indigoviolet I temporarily installed the previous version of pandas (0.23.4), and it works fine for me.

Just convert the categorical variables to strings with .astype(str).
When you check the data types with .dtypes, you will see them as "object".
Then pass cat_features as a list of indices. That works fine.

Don't convert the columns to the category dtype with .astype('category').
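A minimal sketch of the workaround described above, assuming a small made-up frame in place of the amazon dataset (the column values are illustrative only). It converts every category column to strings and collects the positional indices to pass as cat_features. The Pool call is left as a comment so the snippet runs without CatBoost installed.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real features.
X = pd.DataFrame({
    "RESOURCE": [39353, 17183, 36724],
    "ROLE_CODE": [117908, 118539, 117880],
}).astype("category")

# Find the category columns and convert each one to plain strings.
cat_cols = X.select_dtypes(include="category").columns
X_str = X.astype({col: str for col in cat_cols})

# Positional indices of the converted columns, for cat_features.
cat_features = [X_str.columns.get_loc(col) for col in cat_cols]

# With CatBoost installed, the Pool would then be built as in the issue:
# pool = catboost.Pool(X_str, y, cat_features=cat_features)
```

On the pandas versions discussed in this thread, X_str.dtypes reports the converted columns as object. Newer pandas releases may report a dedicated string dtype instead.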

This has been fixed in CatBoost==0.16.
