I commented earlier on #2761, but since it's closed I'm not sure I'll get visibility from the contributors.
TL;DR: Suppose I train a model with lgb.train(), passing categorical features from a pandas DataFrame with dtype='category'. When the model is applied to a new dataset, does it make decisions based on the underlying category codes or on the labels? This matters to me because my train and deploy sets don't share all of the categories, so I'm wondering whether the model is tricked into wrong decisions, since the cat.codes no longer necessarily point to the same label.
Copy/paste from my comment:
I'm running into a similar problem. Let's say I have a pandas DataFrame with a categorical column 'Country'. What happens if there are more categories in the training set than in the deploy one?
Take for example the following case:
X_train = pd.Series(['Germany','France','Italy','Spain','Belgium'],
name='X', dtype='category').to_frame()
X_deploy = pd.Series(['Germany','France','Spain','Belgium'],
name='X', dtype='category').to_frame()
# Pandas mapping for each df:
dict( enumerate(X_train['X'].cat.categories ) )
# {0: 'Belgium',
# 1: 'France',
# 2: 'Germany',
# 3: 'Italy',
# 4: 'Spain'}
dict( enumerate(X_deploy['X'].cat.categories ) )
# {0: 'Belgium',
# 1: 'France',
# 2: 'Germany',
# 3: 'Spain'}
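The mismatch is visible directly in the per-row integer codes pandas assigns (a minimal reproduction of the two frames above):

```python
import pandas as pd

X_train = pd.Series(['Germany', 'France', 'Italy', 'Spain', 'Belgium'],
                    name='X', dtype='category').to_frame()
X_deploy = pd.Series(['Germany', 'France', 'Spain', 'Belgium'],
                     name='X', dtype='category').to_frame()

# Same labels, different integer codes: Spain is 4 in train but 3 in deploy
print(X_train['X'].cat.codes.tolist())   # [2, 1, 3, 4, 0]
print(X_deploy['X'].cat.codes.tolist())  # [2, 1, 3, 0]
```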
Thus, if I now train my model, save it to .json, and look into the decisions it made, I'd find something like:
"right_child": {
"split_index": 22,
"split_feature": 1,
"split_gain": 0.06720539927482605,
"threshold": "0||2||3",
"decision_type": "==",
"default_left": false,
"missing_type": "None",
"internal_value": -0.00810006,
"internal_weight": 771,
"internal_count": 771,
I see that it made a decision based on my country feature (specifically, whether Country is Belgium, Germany, or Italy), but it displays the category codes.
Thus, if I now apply this model to my deploy set with .predict(), since code 3 now refers to Spain instead of Italy, will it wrongly compute the result for entries in Spain, now that they have category code 3?
If so, how can I avoid this?
Basically, this reverses the question the OP posed, asking it in the general sense.
Thank you for your hard work.
Hey @ggmblr !
Thanks for your question!
LightGBM will re-apply the categorical codes from the training phase during prediction (if the mappings differ). Refer to
https://github.com/microsoft/LightGBM/blob/e676af236625fcfc548ca60d0b89f360c736a744/python-package/lightgbm/basic.py#L345-L346
So internally the prediction data will use
# {0: 'Belgium',
# 1: 'France',
# 2: 'Germany',
# 3: 'Italy',
# 4: 'Spain'}
i.e. all the original codes from the training dataset.
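In pandas terms, the effect of that re-alignment is analogous to calling `cat.set_categories` with the category list recorded at training time (a sketch of the behavior, not LightGBM's exact internal code):

```python
import pandas as pd

# Category list recorded at training time (from the X_train example above)
train_categories = ['Belgium', 'France', 'Germany', 'Italy', 'Spain']

X_deploy = pd.Series(['Germany', 'France', 'Spain', 'Belgium'],
                     name='X', dtype='category').to_frame()

# Before re-alignment Spain carries code 3, which meant Italy during training
print(X_deploy['X'].cat.codes.tolist())  # [2, 1, 3, 0]

# Re-align the deploy column to the training mapping
X_deploy['X'] = X_deploy['X'].cat.set_categories(train_categories)
print(X_deploy['X'].cat.codes.tolist())  # [2, 1, 4, 0], Spain is 4 again
```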
Thank you @StrikerRUS, your pointer to where to look was very helpful!
Just for future reference, and to make sure I understood the workflow of the code, I'll try to give a brief summary of how everything is handled to get to that point.
First we have class Dataset(object):
Its method _lazy_init records the categorical features used, extracting them from the training set via the function _data_from_pandas. As for the test set, you should reference the training set when initializing it, which assigns the training value for pandas_categorical. Great! First hurdle passed: train and test set share the same mapping from label to cat.code!
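As a rough illustration of what gets recorded at this stage (the helper name here is hypothetical, not LightGBM's actual function):

```python
import pandas as pd

def extract_category_lists(df):
    # Hypothetical helper: collect one list of category labels per
    # categorical column, like the pandas_categorical attribute holds
    return [list(df[col].cat.categories)
            for col in df.select_dtypes(include='category').columns]

X_train = pd.Series(['Germany', 'France', 'Italy', 'Spain', 'Belgium'],
                    name='X', dtype='category').to_frame()
print(extract_category_lists(X_train))
# [['Belgium', 'France', 'Germany', 'Italy', 'Spain']]
```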
The next step is creating our model, via the class Booster(object):
Here it gets created/initialized by the method __init__, which inherits the value of pandas_categorical from the training Dataset (or reads it in from a saved model you are loading; at this step it is enforced that you either load a model or pass the training data as a Dataset).
Now onto predicting values for new data.
It uses the method predict from the Booster class, which in turn calls the method _to_predictor from the same class. This creates a new object of the class _InnerPredictor. It inherits all the necessary info from the trained Booster, with the exception of pandas_categorical, which is set to None... but wait! In the _to_predictor method it then gets swiftly updated with the value from the Booster. (That was a close one.)
This new _InnerPredictor object also has a method called predict, which reads in our new data and passes it to the function we used before, _data_from_pandas, this time also passing in the training values for pandas_categorical. And here is where the code snippet @StrikerRUS linked to comes into play: it remaps data[col].cat.codes from the default codes pandas generates when the column is declared categorical to the code mapping used in the training set, whenever the two mappings are not equal.
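A simplified sketch of that remapping step, including the "only when the mappings differ" guard (assumed from the linked lines; function and variable names are illustrative):

```python
import pandas as pd

def remap_to_training_codes(data, cat_cols, pandas_categorical):
    # Illustrative sketch: for each categorical column, if its current
    # category list differs from the one saved at training time,
    # re-align it so the integer codes match the training mapping
    for i, col in enumerate(cat_cols):
        if list(data[col].cat.categories) != pandas_categorical[i]:
            data[col] = data[col].cat.set_categories(pandas_categorical[i])
    return data

X_deploy = pd.Series(['Germany', 'France', 'Spain', 'Belgium'],
                     name='X', dtype='category').to_frame()
train_mapping = [['Belgium', 'France', 'Germany', 'Italy', 'Spain']]

X_deploy = remap_to_training_codes(X_deploy, ['X'], train_mapping)
print(X_deploy['X'].cat.codes.tolist())  # [2, 1, 4, 0]
```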
I hope this is correct; if not, please correct me where necessary. If it is, I hope it helps someone follow the workflow of the code in the future (at least it will serve as a reference for myself).
Thanks!
@ggmblr Awesome explanation! It seems that you're absolutely correct! We will definitely use your summary as the answer for future questions about how categorical mappings are handled internally. Thank you very much!
For future readers: here we are discussing the case where some categorical levels are missing at the prediction stage. The case where new categories appear at prediction is more complicated and was discussed in #2761.
Great! I'm happy I could convey the idea correctly, and hopefully it helps someone else out.
I'm closing this issue. Thanks for the help.