Catboost: How does catboost convert categorical features in test dataset to numeric features

Created on 10 Dec 2018 · 4 Comments · Source: catboost/catboost

According to the article on the CatBoost website (https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/), CatBoost uses the response to convert categorical features to numerical features. I understand how that works for training data, where the response is available, but how does it work at prediction time, when there are no responses?

Thanks

documentation

zhenyiy

Most helpful comment

Also, in my opinion it would be important to document how to retrieve from a fitted model, for each level of each categorical feature, the value used at test time. This would show how CatBoost quantized the categorical features, which is useful both as a sanity check and as an aid to model interpretability. Beyond documentation or a tutorial, I would also suggest giving the classifier/regressor a method that returns this information directly to the user; that would be a very useful addition.

Thank you :-)

lmassaron on 24 May 2019

👍4

All 4 comments

For regression we do the following: the label values are quantized. The quantization mode and the number of buckets are set in the starting parameters. For each selected border we create a numerical feature: all labels to the left of the border are treated as a negative response, and all labels to the right of it as a positive response.
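As a rough illustration of the binarization described above, here is a minimal sketch in plain Python. It is not CatBoost's internal code: the helper name `binarize_labels`, the example border values, and the choice of a strict `>` comparison are all assumptions made for the example.

```python
def binarize_labels(labels, borders):
    """Hypothetical helper: turn each regression label into one 0/1
    response per border (1 if the label lies to the right of the border)."""
    return [[1 if y > b else 0 for b in borders] for y in labels]

labels = [1.0, 2.0, 3.0, 4.0, 5.0]
borders = [2.5, 4.5]  # made-up borders; CatBoost selects its own during training

binary = binarize_labels(labels, borders)
for y, row in zip(labels, binary):
    print(y, row)
```

Each column of `binary` then plays the role of a binary response, so the same kind of statistic used for classification can be computed once per border.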

annaveronika on 11 Dec 2018

Hi @annaveronika, I am also confused here.

For the example in https://github.com/catboost/tutorials/blob/master/competition_examples/kaggle_paribas.ipynb:

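from catboost import CatBoostClassifier
# train_df, labels and cat_features_ids come from the tutorial notebook
clf = CatBoostClassifier(learning_rate=0.1, iterations=1000, random_seed=0, logging_level='Silent')
clf.fit(train_df, labels, cat_features=cat_features_ids)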

I want to know what trick is applied to cat_features: is it the one-hot encoding mentioned here?

Sorry, I don't get the point of this:

"For the selected border all values located on the left of it are considered negative response, and all labels on the right are considered positive response."

For the example shown in https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html:

Does this mean the borders are generated during training (if simple_ctr is specified)? If so, how is the test data x treated? Are all the rows in each border used to calculate f_n?

Sorry for my dumb questions, thank you!

penolove on 27 May 2019

@zhenyiy

I was struggling with this too, and I found the answer in their paper:

It means that, although the target statistic (TS) for the training set is computed sequentially (each example uses only the examples that precede it in a random permutation), the TS for the test set (or any other set) uses all available x and y from the training set, where x is the categorical feature, for each instance that is being scored.
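To make the difference concrete, here is a minimal sketch of a smoothed target statistic, computed in order for training rows and from the whole training set for test rows. It is only an illustration under stated assumptions: the function names are hypothetical, the rows are assumed to already be in permutation order, a single permutation is used, and the smoothing form (sum + a * prior) / (count + a) follows the paper's description rather than CatBoost's actual code.

```python
from collections import defaultdict

def ordered_ts_train(cats, ys, prior, a=1.0):
    """TS for each training row, using only the rows that precede it."""
    sums, counts = defaultdict(float), defaultdict(int)
    ts = []
    for c, y in zip(cats, ys):
        ts.append((sums[c] + a * prior) / (counts[c] + a))
        # Only now does the current row become visible to later rows.
        sums[c] += y
        counts[c] += 1
    return ts, dict(sums), dict(counts)

def ts_test(cats, sums, counts, prior, a=1.0):
    """TS for rows being scored, using totals over the whole training set."""
    return [(sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a) for c in cats]

train_cats = ["a", "b", "a", "a", "b"]
train_ys = [1, 0, 1, 0, 1]
prior = 0.5  # assumed prior value for the example

train_ts, sums, counts = ordered_ts_train(train_cats, train_ys, prior)
test_ts = ts_test(["a", "b", "c"], sums, counts, prior)
print(train_ts)
print(test_ts)
```

Note that no test labels are needed: a test row only looks up the per-category totals collected from the training data, and an unseen category such as "c" falls back to the prior.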

Reference

https://arxiv.org/pdf/1706.09516.pdf