Machinelearning: How to mix categorical and numerical features in LightGBM?

Created on 9 Jul 2018 · 7 Comments · Source: dotnet/machinelearning

I am implementing a LightGBM example where I have a mix of categorical and numerical features, and I can't figure out how this should be done in ML.NET.

In Python, LightGBM accepts a 'categorical_feature' parameter, which makes it possible to specify whether a feature should be handled as categorical or as numerical/ordinal. I cannot find this parameter in the ML.NET version. Could it be added?

enhancement need info

All 7 comments

Hi @petterton, thanks for your question!

The lack of a 'categorical_feature' parameter on LightGBM was a deliberate design choice that we made.

In ML.NET, we have a philosophy that the learner (like LightGBM) should not be in the business of feature engineering, unless it does something very special to it.

As far as I understand, LightGBM doesn't do anything special to categorical values. You could replicate its handling of categorical inputs by applying CategoricalOneHotVectorizer to your categorical-value columns, then joining them with your numeric columns and feeding that into the LightGBM trainer.
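To make the suggested workaround concrete, here is a minimal sketch in plain Python of one-hot encoding a categorical column and joining it with numeric columns into a single feature vector. It does not use ML.NET or CategoricalOneHotVectorizer; the data, the column names, and the helper `one_hot_and_concat` are hypothetical and only illustrate the idea.

```python
# Hypothetical toy rows: one categorical column and two numeric columns.
rows = [
    {"color": "red", "width": 1.5, "height": 2.0},
    {"color": "blue", "width": 0.5, "height": 4.0},
    {"color": "red", "width": 3.0, "height": 1.0},
]


def one_hot_and_concat(rows, categorical_key, numeric_keys):
    """Return (vocabulary, vectors); each vector is one-hot slots followed by numeric slots."""
    # A sorted vocabulary gives a stable slot order across runs.
    vocabulary = sorted({row[categorical_key] for row in rows})
    slot_of = {value: i for i, value in enumerate(vocabulary)}
    vectors = []
    for row in rows:
        one_hot = [0.0] * len(vocabulary)
        one_hot[slot_of[row[categorical_key]]] = 1.0
        numeric = [float(row[key]) for key in numeric_keys]
        # All slots are floats, so the vector has a single data type.
        vectors.append(one_hot + numeric)
    return vocabulary, vectors


vocabulary, features = one_hot_and_concat(rows, "color", ["width", "height"])
```

Every slot in the resulting vector has the same type, which is why this approach fits a single 'Features' column.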

Actually, I stand corrected. The categorical features do get a special treatment in our FastTree learner, and likely in LightGBM as well. So the question is valid. It could and should be added.

While you are waiting for this, consider using the above mechanism (CategoricalOneHotVectorizer) and the FastTree learner with CategoricalSplit parameter enabled.

Thanks for your answer @Zruty0! Yes, LightGBM seems to give special treatment to categorical features during training. More details can be found here: https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support. In particular, one-hot encoding seems to be sub-optimal for such a tree, but it is of course OK as a temporary solution.

That page also states:

Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647). It is best to use a contiguous range of integers

I am not sure if this complicates the implementation in ML.NET... at least every example I have seen in ML.NET has ended with a feature vector of a single data type, while a problem with a mixture of numerical and categorical features would (if I understand this correctly) give a feature vector with multiple data types.
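The integer encoding that the quoted requirement asks for could look like the following sketch in plain Python. The helper `encode_categories` and the sample column are hypothetical; the sketch only shows how distinct values map to a contiguous range of non-negative integers, not how either LightGBM or ML.NET does it internally.

```python
def encode_categories(values):
    """Map each distinct value to a contiguous non-negative integer code."""
    code_of = {}
    codes = []
    for value in values:
        # Assign the next unused integer the first time a value is seen.
        if value not in code_of:
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    return code_of, codes


colors = ["red", "blue", "red", "green"]  # hypothetical categorical column
code_of, codes = encode_categories(colors)
```

Codes assigned this way start at 0 and have no gaps, which matches the "contiguous range" recommendation.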

Yes, we do not allow vector columns with mixed types, so this would not be feasible.

We could solve it in two ways:
1) Make LightGBM accept more than one feature column. There's nothing preventing us from doing this; it's just not a common thing for a learner.

2) Do it the FastTree way. There, we essentially 'one-hot-decoded' the categorical features and applied FastTree to the original features, even though they were bundled with the rest of the features in the 'Features' column.

I am not sure which of the 2 approaches will be more appropriate at the moment.
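As an illustration of the second option, the sketch below "one-hot-decodes" a bundled feature vector in plain Python. It is not the FastTree implementation: the helper `decode_one_hot` and the `(start, end)` slot ranges standing in for the column metadata are assumptions made for this example.

```python
def decode_one_hot(vector, categorical_ranges):
    """Collapse each one-hot slot range back to a single category index.

    categorical_ranges is a list of (start, end) pairs, end exclusive,
    marking which slots of the vector belong to one categorical feature.
    """
    decoded = []
    position = 0
    for start, end in sorted(categorical_ranges):
        # Copy the numeric slots that sit before this categorical range.
        decoded.extend(vector[position:start])
        slots = vector[start:end]
        # Index of the hot slot, or -1 if no slot is set.
        decoded.append(slots.index(1.0) if 1.0 in slots else -1)
        position = end
    # Copy any numeric slots after the last categorical range.
    decoded.extend(vector[position:])
    return decoded


features = [0.0, 1.0, 0.0, 1.5, 2.0]  # three one-hot slots, then two numeric slots
ranges = [(0, 3)]                     # hypothetical metadata: slots 0..2 are one category
decoded = decode_one_hot(features, ranges)
```

After decoding, the categorical feature occupies a single slot again, so a tree learner could split on category membership rather than on individual indicator slots.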

@Zruty0 : Any news on when this will be added?

@petterton Unfortunately I don't have any news on that at the moment. This work item is on our backlog, and right now we're handling higher-priority things (mostly related to the public API surface).

I would apply the categorical transform to your categorical features, combine the result with your numerical features, and use FastTree/LightGBM with useCat = true. The features vector will carry metadata that tells the trainer which of its features are categorical, and those features will get special treatment.
