CatBoost: How does CatBoost convert categorical features in the test dataset to numeric features?

Created on 10 Dec 2018 · 4 Comments · Source: catboost/catboost

According to the article on the CatBoost website (https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/), CatBoost uses the response (label) to convert categorical features to numerical features. I understand how that works for training data, which has a response, but how does it work at prediction time, when no response is available?

Thanks

documentation

All 4 comments

For regression we do the following: quantization is performed on the label value. The quantization mode and the number of buckets are set in the starting parameters. For each selected border we create a numerical feature. For a given border, all label values to the left of it are treated as a negative response, and all label values to the right of it are treated as a positive response.
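
A minimal sketch of the label binarization described above, not CatBoost's actual implementation. The helper name `binarize_labels`, the border values, and the choice to count a label exactly on a border as negative are all assumptions made for illustration; in CatBoost the borders come from the quantization mode and bucket count given in the training parameters. Each border turns the numeric label into a 0/1 response, and each of those responses can then back its own target-based numerical feature.

```python
from typing import Dict, List


def binarize_labels(labels: List[float], borders: List[float]) -> Dict[float, List[int]]:
    """For each border, map every label to 1 if it lies right of the border, else 0.

    Hypothetical helper: labels equal to the border are counted as 0 here,
    which is an assumption of this sketch.
    """
    return {b: [1 if y > b else 0 for y in labels] for b in borders}


labels = [0.5, 1.2, 2.7, 3.1, 4.8]
borders = [1.0, 3.0]  # hypothetical borders; CatBoost selects its own

responses = binarize_labels(labels, borders)
for border, binary in responses.items():
    print(f"border={border}: {binary}")
```

With two borders the single regression label yields two binary responses, so a categorical feature can produce one numerical feature per border.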

Also, in my opinion it would be important to document how to retrieve from a fitted model, for each level of each categorical feature, the value used at test time. That would make it possible to check how CatBoost quantized the categorical features, both as a sanity check and as an aid to model interpretability. Beyond documentation or a tutorial, I would also dare to say that giving the classifier/regressor a method that returns this information directly would be a very useful addition.

Thank you :-)

Hi @annaveronika, I am also confused here.

  1. For the example in https://github.com/catboost/tutorials/blob/master/competition_examples/kaggle_paribas.ipynb:
clf = CatBoostClassifier(learning_rate=0.1, iterations=1000, random_seed=0, logging_level='Silent')
clf.fit(train_df, labels, cat_features=cat_features_ids)

I would like to know which transformation is applied to cat_features. Is it the one_hot_encoding mentioned here?

  2. Sorry, I do not get the point of:
For the selected border all values located on the left of it are considered negative response, and all labels on the right are considered positive response.

For the example shown in https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
(image from the linked page)
does this mean the borders are generated during training (if simple_ctr is specified)? If so, how is the test data x treated? Are all the rows in each border used to calculate f_n?

Sorry for my dumb questions, and thank you!

@zhenyiy

I was struggling with this too, and I found the answer in their paper:

It means that, although the target statistic (TS) for the training set is computed sequentially, the TS for the test set (or any other set) is computed from all available x and y in the training set, where x is the categorical feature, for each instance being scored. No test labels are needed.

Reference

https://arxiv.org/pdf/1706.09516.pdf
