I have a training pandas.DataFrame like
target,feat1,feat2
0,cat,0.5
1,dog,0.8
and a validation frame which contains
target,feat1,feat2
0,cow,0.3
feat1 is of type pandas.Categorical.
A) How should the categorical column be passed to LightGBM properly?
B) How can I ensure that predictions for observations with unseen categorical values are handled gracefully, i.e. by using the rest of the available columns?
For A: in Python, the strings are converted to int type, and a mapper is saved.
For B: unseen categories are handled correctly; you don't need to worry about it. However, if there are many unseen categories, accuracy may suffer.
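To make the "string to int plus saved mapper" idea concrete, here is a minimal pandas-only sketch. It is not LightGBM's code; it assumes the wrapper keeps the training category list and re-aligns new data to it, which is my understanding rather than something confirmed in this thread. The variable names are made up for illustration.

```python
import pandas as pd

# Training and validation values for feat1, as in the question.
train_feat = pd.Series(pd.Categorical(["cat", "dog"]))
valid_feat = pd.Series(pd.Categorical(["cow"]))

# The "mapper": the ordered category list seen at training time.
train_categories = list(train_feat.cat.categories)

# Integer codes for the training values (position in the category list).
train_codes = train_feat.cat.codes.tolist()

# Re-align validation data to the training categories (assumed behaviour).
# Values outside the training categories become missing, with code -1.
valid_aligned = valid_feat.cat.set_categories(train_categories)
valid_codes = valid_aligned.cat.codes.tolist()

print(train_categories, train_codes, valid_codes)
```

Under that assumption, "cow" carries no category information at prediction time, so the remaining columns drive the prediction.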
@guolinke Which parameters should be used to pass a pandas.DataFrame with categoricals? Could you please link an example? https://github.com/Microsoft/LightGBM/issues/699 mentions several recently introduced parameters. Assuming one column, e.g. address, contains a large number of distinct categorical labels (levels), which parameters would you suggest?
@guolinke Also, in https://github.com/Microsoft/LightGBM/issues/751 you mention that it should be of type int. Which is the correct type: int, or string as mentioned here?
@geoHeil The Python wrapper handles the categorical conversion (string -> int) for you. This is not the case when using the CLI.
@geoHeil Refer to http://lightgbm.readthedocs.io/en/latest/Advanced-Topic.html#categorical-feature-support.
Is there any chance that someone could elaborate on exactly how unseen categorical data is handled in LightGBM?
That it will be handled "correctly" is not very informative.
So, in the case where the model in production is faced with cow (say, labelled 3), and the model was trained on cat and dog (labelled 1 and 2), how would LightGBM interpret a cow?
@nicolaiiversen92
It learns "is cat" and "is dog". A cow is treated as "not cat" and "not dog".
@guolinke
So let's say that a given tree splits on species: dogs in one branch, cats in another. Where exactly will the cow be put?
@nicolaiiversen92 For categorical features, a split does not have a separate condition for each branch.
It is like:

    if is A or is B or ...:
        go to left branch
    else:
        go to right branch
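The rule above can be sketched in plain Python. This is only an illustration of the logic described here, not LightGBM's implementation; `route` and the choice of `{"dog"}` as the left-branch set are hypothetical.

```python
def route(category, left_categories):
    """Return the branch for a categorical split: listed categories go left, all others right."""
    return "left" if category in left_categories else "right"


# Hypothetical split learned on training data that contained only cat and dog.
left_categories = {"dog"}

branches = {animal: route(animal, left_categories) for animal in ["cat", "dog", "cow"]}
print(branches)
```

Because the split only lists the categories that go left, an unseen value such as "cow" fails the membership test and follows the `else` branch, together with "cat" in this example.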
@guolinke Thanks