How are factor variables treated?
Are they converted to numeric codes following the order of their levels?
One-hot encoded?
Or kept as categorical, with categorical splits?
I do not know what you mean by vector. xgboost treats every input feature as numerical, with support for missing values and sparsity. The decision is left to the user.
So if you want ordered variables, you can transform them into numerical levels (say, age). Or, if you prefer to treat a variable as categorical, do one-hot encoding.
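A minimal sketch of the two options, in plain Python rather than any xgboost API. The helper names (`ordinal_encode`, `one_hot_encode`) and the sample data are made up for illustration; the resulting numeric columns are what you would then pass to xgboost.

```python
def ordinal_encode(values, ordered_levels):
    """Map each level to its position in a user-supplied order."""
    index = {level: i for i, level in enumerate(ordered_levels)}
    return [float(index[v]) for v in values]


def one_hot_encode(values):
    """Return (levels, rows) with one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


# Hypothetical sample data: one ordered factor, one unordered factor.
age_band = ["young", "old", "middle", "young"]
colour = ["red", "blue", "red", "green"]

# Ordered factor: a single numeric column that keeps the ordering.
age_numeric = ordinal_encode(age_band, ["young", "middle", "old"])

# Unordered factor: one indicator column per level.
colour_levels, colour_onehot = one_hot_encode(colour)
```

The ordinal version lets a tree split on "age band <= middle", while the one-hot version only allows "is this level or not" splits, which is usually what you want when the levels have no natural order.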
tqchen (competing as crowwork) converted categorical variables to numeric variables in the Criteo competition by computing smoothed conditional probabilities of a click, given the level of the factor.
numeric variable = (number of clicks within the level + (mean click rate) * ballast) / (number of records within the level + ballast)
This is similar to what R gbm does, except that gbm does not use ballast and limits the number of levels in the factor to 1024. I think this is a good approach for factors which have too many levels for one-hot encoding.
Automating this would be a good future enhancement; I don't have time right now but at some point I plan to clone the repository and add some features from my wish list.
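For anyone who wants to try the smoothed click-rate formula above, here is a rough sketch in plain Python. The function name, the sample data, and the ballast value are all assumptions for illustration; the thread does not say what ballast was actually used in the competition.

```python
from collections import defaultdict


def smoothed_click_rate(levels, clicks, ballast=20.0):
    """Return {level: smoothed click rate} using the formula quoted above."""
    mean_rate = sum(clicks) / len(clicks)  # overall mean click rate
    records = defaultdict(int)             # number of records within each level
    clicked = defaultdict(int)             # number of clicks within each level
    for level, click in zip(levels, clicks):
        records[level] += 1
        clicked[level] += click
    return {
        level: (clicked[level] + mean_rate * ballast) / (records[level] + ballast)
        for level in records
    }


# Hypothetical data: a factor column and a 0/1 click label.
site = ["a", "a", "a", "b", "b", "c"]
click = [1, 0, 1, 0, 0, 1]

encoding = smoothed_click_rate(site, click, ballast=2.0)
site_numeric = [encoding[s] for s in site]  # numeric column replacing the factor
```

A larger ballast pulls rare levels more strongly towards the overall mean click rate. To avoid leaking the label, the encoding should be computed on the training data only and then applied unchanged to the test data.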
jfkingiii,
this would be very interesting, but I think it would require major changes to tree growing, to how the model is saved, and of course to the prediction methods.
Support for factors isn't trivial.
All of these are possible pre-processors. They can be implemented as a model that wraps xgboost: before train/predict, run the pre-processing and feed the processed data to xgboost. So it is not hard.
This is also the reason why I do not explicitly support factors in the tree construction algorithm. There are many ways of doing so, and in all of them, an algorithm optimized for sparse matrices handles the processed data efficiently.
A normal tree growing algorithm only supports dense numerical features, and has to support one-hot encoded factors explicitly for reasons of computational efficiency.
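The wrapper idea described above could look roughly like the sketch below. It is not part of xgboost: `EncodedModel` and `EchoFirstColumn` are hypothetical names, and the wrapped model is assumed to be any object with `fit(X, y)` and `predict(X)` methods. The encoding used is the smoothed rate formula quoted earlier in the thread, with an arbitrary ballast.

```python
class EncodedModel:
    """Hypothetical wrapper: learn an encoding in fit, reuse it in predict."""

    def __init__(self, model, column, ballast=2.0):
        self.model = model      # any object with fit(X, y) and predict(X)
        self.column = column    # index of the categorical column
        self.ballast = ballast  # smoothing strength (assumed value)
        self.encoding = {}
        self.prior = 0.0

    def _transform(self, rows):
        out = []
        for row in rows:
            row = list(row)
            # Levels not seen during fit fall back to the overall mean target.
            row[self.column] = self.encoding.get(row[self.column], self.prior)
            out.append(row)
        return out

    def fit(self, rows, y):
        self.prior = sum(y) / len(y)
        totals, counts = {}, {}
        for row, target in zip(rows, y):
            level = row[self.column]
            totals[level] = totals.get(level, 0.0) + target
            counts[level] = counts.get(level, 0) + 1
        self.encoding = {
            level: (totals[level] + self.prior * self.ballast)
            / (counts[level] + self.ballast)
            for level in counts
        }
        self.model.fit(self._transform(rows), y)
        return self

    def predict(self, rows):
        return self.model.predict(self._transform(rows))


class EchoFirstColumn:
    """Stand-in learner so the sketch runs without xgboost installed."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [row[0] for row in X]


wrapped = EncodedModel(EchoFirstColumn(), column=0).fit(
    [["a", 1.0], ["a", 2.0], ["b", 3.0]], [1, 0, 0]
)
predictions = wrapped.predict([["a", 0.0], ["z", 0.0]])
```

The point of the design is that the encoding is learned once at training time and stored with the wrapper, so prediction applies exactly the same mapping, including a defined fallback for levels that never appeared in training.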
Is it possible to get a pointer to the method described here for converting categorical data to numeric data?
@tqchen LightGBM recently got support for Categorical Features. For columns with many categorical values (thousands), where one-hot-encoding is hard, I got massive improvements. xgboost, without categorical features support, is not even a possibility.