Feature Request:
CatBoost should handle NaN values in categorical features, e.g. by mapping them to a special category "Unknown" or by mode imputation.
The current behavior produces an error:
CatBoostClassifier().fit(pd.DataFrame({'cat_feature':['USA','ILS','CAD',pd.np.nan]}), [1,1,0,0], cat_features=[0])
Traceback (most recent call last):
File "_catboost.pyx", line 1141, in _catboost._PoolBase._set_data_from_generic_matrix
File "_catboost.pyx", line 840, in _catboost.get_id_object_bytes_string_representation
_catboost.CatboostError: bad object for id: nan
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/hanans/.conda/envs/retarget/lib/python3.6/site-packages/catboost/core.py", line 2153, in fit
silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval)
File "/home/hanans/.conda/envs/retarget/lib/python3.6/site-packages/catboost/core.py", line 1090, in _fit
train_pool = _build_train_pool(X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, column_description)
File "/home/hanans/.conda/envs/retarget/lib/python3.6/site-packages/catboost/core.py", line 656, in _build_train_pool
group_weight=group_weight, subgroup_id=subgroup_id, pairs_weight=pairs_weight, baseline=baseline)
File "/home/hanans/.conda/envs/retarget/lib/python3.6/site-packages/catboost/core.py", line 286, in __init__
self._init(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
File "/home/hanans/.conda/envs/retarget/lib/python3.6/site-packages/catboost/core.py", line 637, in _init
self._init_pool(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
File "_catboost.pyx", line 1032, in _catboost._PoolBase._init_pool
File "_catboost.pyx", line 1038, in _catboost._PoolBase._init_pool
File "_catboost.pyx", line 1161, in _catboost._PoolBase._set_data_and_feature_names
File "_catboost.pyx", line 1143, in _catboost._PoolBase._set_data_from_generic_matrix
_catboost.CatboostError: Invalid type for cat_feature[3,0]=nan : cat_features must be integer or string, real number values and NaN values should be converted to string.
catboost version: '0.11.1'
Operating System: Linux
CPU: Intel
GPU: None
We want the algorithm to work the same way whether you train from a file or from a matrix. All values of a categorical feature are treated as strings, including NaN. To treat a value the same way in all cases, we need a single, unambiguous way to convert it to a string. We cannot do this for NaN or for floating-point values. You must convert these values to strings yourself; otherwise our conversion might differ from what you do when you write a file.
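A minimal sketch of doing that conversion yourself before building the pool, using only the standard library. The helper names (`is_missing`, `to_cat_strings`, `impute_mode`) and the "Unknown" token are illustrative assumptions, not part of CatBoost:

```python
import math
from collections import Counter


def is_missing(value):
    """True for None and float NaN (hypothetical helper, not a CatBoost API)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_cat_strings(values, missing_token="Unknown"):
    """Convert every value to a string, mapping missing values to one explicit token."""
    return [missing_token if is_missing(v) else str(v) for v in values]


def impute_mode(values):
    """Convert every value to a string, replacing missing values with the most frequent one."""
    present = [str(v) for v in values if not is_missing(v)]
    if not present:
        raise ValueError("cannot impute the mode of a column with no observed values")
    mode = Counter(present).most_common(1)[0][0]
    return [mode if is_missing(v) else str(v) for v in values]


column = ["USA", "ILS", "CAD", float("nan")]
as_unknown = to_cat_strings(column)

column_with_mode = ["USA", "USA", "CAD", float("nan")]
as_mode = impute_mode(column_with_mode)
```

With a pandas column, `df["cat_feature"].fillna("Unknown").astype(str)` should have the same effect as `to_cat_strings`. Whichever token you choose, the same string has to be written to any file you later train or predict from.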
It's pretty straightforward to add this to the Python wrapper. Am I mistaken? I can contribute if needed.
It's not as simple as it looks. The problem is that when you read your dataset into a data frame in Python, all of your 'NaN', 'nan', 'NA' and similar strings are converted to a nan value. But when you make predictions from a file, they are treated the same way as all other strings: we calculate a hash of each one. So you would get wrong predictions without any warning.
For now we are staying on the safe side.
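The mismatch described above can be illustrated with a small sketch using only the standard library. Here `read_nan_aware` is a hypothetical stand-in for a data-frame loader, and its marker set is an assumption that is far smaller than what real readers recognise; none of this is CatBoost or pandas code:

```python
import csv
import io
import math

# A one-column file in which three rows hold different "missing-looking" strings.
RAW = "country\nUSA\nNA\nnan\nNaN\n"

# Assumed marker set for the simulated loader (illustrative only).
NA_MARKERS = {"NA", "nan", "NaN"}


def read_raw(text):
    """Read the column as plain strings, the way a file-based reader would see it."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[0] for row in rows[1:]]  # skip the header row


def read_nan_aware(text):
    """Simulate a loader that turns missing-looking strings into float NaN."""
    return [math.nan if v in NA_MARKERS else v for v in read_raw(text)]


raw_values = read_raw(RAW)
parsed_values = read_nan_aware(RAW)

distinct_raw = set(raw_values)
n_missing = sum(isinstance(v, float) and math.isnan(v) for v in parsed_values)
```

Read as raw strings, 'NA', 'nan' and 'NaN' stay three separate values, and each would be hashed as its own category. After the NaN-aware read they collapse into a single missing value, so a model trained on one representation and applied to the other would see different categories.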
https://catboost.ai/docs/concepts/faq.html#why-float-and-nan-values-are-forbidden-for-cat-features - here are the docs about NaN and float values in categorical features.
Here is the description of the proposed solution, which is a great way to contribute to CatBoost:
https://github.com/catboost/catboost/blob/master/open_problems/open_problems.md
(adding the flag 'allow_nan_categories').
So I'm adding the help wanted and good first issue labels for adding this flag.