LightGBM: [Enhancement] Better Regularization for Categorical features

Created on 2 Jan 2019 · 12 comments · Source: microsoft/LightGBM

It seems current split finding algorithm for categorical features often results in over-fitting.
We need a better solution to reduce the over-fitting.

Labels: feature request, help wanted

Most helpful comment

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

All 12 comments

Code of the current solution:
https://github.com/Microsoft/LightGBM/blob/fb28070e1daa500b087d3102145ae48988030195/src/treelearner/feature_histogram.hpp#L112-L234

The process:

  1. sort the bins in the feature histogram by the sum_gradients / sum_hessians
  2. find the split points in the sorted histogram
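The two steps above can be sketched in Python (a simplified toy version: the gain formula is the standard GBDT leaf objective g²/(h + λ), while the real C++ implementation in feature_histogram.hpp handles many more details):

```python
def best_categorical_split(bins, lambda_l2=0.0):
    """bins: list of (sum_gradients, sum_hessians), one entry per category.
    Returns (best_gain, list_of_category_indices_in_left_child)."""
    # Step 1: sort categories by sum_gradients / sum_hessians.
    order = sorted(range(len(bins)), key=lambda i: bins[i][0] / bins[i][1])

    total_g = sum(g for g, _ in bins)
    total_h = sum(h for _, h in bins)

    def leaf_gain(g, h):
        # standard GBDT leaf objective: g^2 / (h + lambda)
        return g * g / (h + lambda_l2)

    # Step 2: scan split points left-to-right in the sorted order,
    # exactly as for a numerical feature.
    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0
    for k in range(len(order) - 1):
        g, h = bins[order[k]]
        left_g += g
        left_h += h
        gain = (leaf_gain(left_g, left_h)
                + leaf_gain(total_g - left_g, total_h - left_h)
                - leaf_gain(total_g, total_h))
        if gain > best_gain:
            best_gain, best_left = gain, order[:k + 1]
    return best_gain, best_left
```

For example, with bins [(10, 5), (-8, 4), (1, 2)] the sorted order is [1, 2, 0], and the best split puts category 1 alone in the left child.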

Current Regularizations:

  1. convert to one-vs-rest when num_cats <= config->max_cat_to_onehot
  2. add cat_smooth to sum_hessians before sorting the histogram
  3. enumerate at most max_cat_threshold split points, to avoid searching too many candidates.
  4. use min_data_per_group to avoid groups with too little data at each split point.
  5. an additional cat_l2 for L2 regularization.
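A hedged sketch of where regularizations 2 and 5 enter the search (parameter names taken from the list above; the exact formulas in LightGBM's C++ code may differ):

```python
def sort_key_with_smoothing(sum_grad, sum_hess, cat_smooth):
    # Regularization 2: cat_smooth is added to sum_hessians before
    # sorting, shrinking rare categories toward 0 in the ordering.
    return sum_grad / (sum_hess + cat_smooth)

def regularized_leaf_gain(sum_grad, sum_hess, lambda_l2, cat_l2):
    # Regularization 5: cat_l2 is an extra L2 term applied only when
    # evaluating categorical splits.
    return sum_grad * sum_grad / (sum_hess + lambda_l2 + cat_l2)
```

With cat_smooth = 10, a rare category with (sum_grad, sum_hess) = (1, 1) gets key 1/11 ≈ 0.09 while a frequent one with (10, 10) gets 0.5, even though both have the same raw mean gradient of 1.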

New Regularizations (proposal):

  1. Add some randomization to the sorting
  2. Don't use categories (bins) with too little data as split points.
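The two proposals could look roughly like this (a sketch only; `noise_scale` and `min_data_in_cat` are hypothetical parameter names, and data counts are approximated by sum_hessians):

```python
import random

def noisy_sort_order(bins, noise_scale=0.1, seed=0):
    """Proposal 1: perturb the sort key with noise so the category
    ordering is not fully optimized on the training data."""
    rng = random.Random(seed)
    keys = [g / h + rng.gauss(0.0, noise_scale) for g, h in bins]
    return sorted(range(len(bins)), key=lambda i: keys[i])

def filter_small_bins(bins, min_data_in_cat):
    """Proposal 2: exclude categories with too little data from the
    split-point candidates (data size proxied here by sum_hessians)."""
    return [i for i, (_, h) in enumerate(bins) if h >= min_data_in_cat]
```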

Your ideas are very welcome here 😄.

ping @Laurae2 @StrikerRUS @henry0312

The first thing that comes to my mind after the word "categorical" is CatBoost, with its target encoding and feature interactions: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/, https://tech.yandex.com/catboost/doc/dg/concepts/speed-up-training-docpage/#max-ctr-, and the CTR settings sections in https://tech.yandex.com/catboost/doc/dg/concepts/cli-reference_train-model-docpage/#cli-reference_train-model__options

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

@guolinke

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

That's right.

To be honest, I don't understand the current approach very well; however, I think fewer parameters are better, because more parameters lead to overfitting.
There is also a famous solution for categorical features (I can't remember it right now).
I'll tell you later.

As you know, CART is one of the famous algorithms, and it is the one used in scikit-learn.
(A Step by Step CART Decision Tree Example may be useful for understanding.)

scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now.

from https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.

However, scikit-learn doesn't support categorical variables 😢

In some experiments with private data, the regularization hyperparameters for the categorical variables (max_cat_threshold, cat_l2, cat_smooth and max_cat_to_onehot) can dramatically change the bias and/or variance of the final model if the dataset contains lots of categorical features. However, I have not found an obvious way to tune these hyperparameters (other than exhaustive grid search).

@guolinke

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

If you look at https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html, it says "_Before each split is selected in the tree..._", so it's not really a preprocessing step: the encoding is performed before each split, which means it needs to be implemented inside the GBDT/RF algorithm. Because the encoding is computed on different data splits, it reduces overfitting.
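The "before each split" idea in CatBoost's docs refers to their ordered target statistics. A rough sketch of the per-row encoding (heavily simplified: one permutation, a simple additive prior; CatBoost's actual CTR computation is more elaborate):

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, seed=0):
    """Each row is encoded using only the rows that precede it in a random
    permutation, so a row's own label never leaks into its encoding."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        # prior-smoothed mean of the targets seen so far for this category
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded
```

Because each row sees different "preceding" data, identical raw category values get different encoded values, which is the overfitting-reduction mechanism described above.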

I think one of the reasons for overfitting categorical features is that the categories are ordered by weight before finding splits. This means that the gain from a split on a categorical feature will in general be larger than for a numerical feature, and categorical features are disproportionately likely to be chosen for a split vs numerical features.

For example, if we make only one split in some categorical feature (after this ordering has been done), then the feature is already optimally divided into top n and bottom num_bins - n weights. But if instead we consider the feature as numerical (using some random integer encoding of the categories), then in general the weight of the bins will not be monotone increasing, so it will take several splits to achieve this same separation in weights. Therefore the gains will in general be larger from splitting on categorical vs numerical features, so categorical features will more often be chosen as the best split by the algorithm.
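This asymmetry can be demonstrated on a toy histogram (made-up gradients and hessians; plain g²/h leaf objective): the best single split under the optimized categorical ordering beats the best single split when the same bins are scanned in a fixed "numerical" order.

```python
def best_split_in_order(bins, order):
    """Best single-split gain when bins must be scanned in a fixed order
    (as a numerical feature is)."""
    total_g = sum(g for g, _ in bins)
    total_h = sum(h for _, h in bins)
    gain_of = lambda g, h: g * g / h
    best, left_g, left_h = float("-inf"), 0.0, 0.0
    for i in order[:-1]:
        g, h = bins[i]
        left_g += g
        left_h += h
        best = max(best, gain_of(left_g, left_h)
                   + gain_of(total_g - left_g, total_h - left_h)
                   - gain_of(total_g, total_h))
    return best

bins = [(10.0, 5.0), (-8.0, 4.0), (1.0, 2.0), (-3.0, 3.0)]
cat_order = sorted(range(len(bins)), key=lambda i: bins[i][0] / bins[i][1])
gain_cat = best_split_in_order(bins, cat_order)      # optimized ordering
gain_num = best_split_in_order(bins, [0, 1, 2, 3])   # fixed encoding order
assert gain_cat > gain_num
```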

L2 regularization (lambda_l2, cat_l2, cat_smooth) helps somewhat with this problem, since it reduces the magnitude of the weights (especially for small categories) and the potential gain of each split. But it might also be useful to have an L1 regularization for categorical features (i.e. a cat_l1), which would directly penalize the number of splits on a categorical feature.
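`cat_l1` is only a proposed name here, but the standard GBDT L1 soft-thresholding (e.g. XGBoost's `alpha`) shows how such a penalty would suppress weak categorical splits (a sketch):

```python
def leaf_gain_l1(sum_grad, sum_hess, l1, l2=0.0):
    # L1 soft-thresholds the gradient sum before computing the gain;
    # leaves with small |sum_grad| (e.g. tiny categories) get zero gain.
    g = max(abs(sum_grad) - l1, 0.0)
    return g * g / (sum_hess + l2)
```

A leaf with |sum_grad| ≤ l1 then contributes zero gain, so splits that only carve off such leaves stop looking attractive to the search.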

@btrotta I agree with the "overfitting categorical features" part, and I also think it's because we already optimize the order for splitting. But the example is not clear yet. Comparing "ordering categorical features" against a "random integer encoding of the categories" is somewhat unfair; it should be a comparison between "ordering categorical features" and some true numerical features. Numerical features have an inherent ordering of their own: for example, with an "age" feature, sorting by age makes sense and could be considered a form of ordering, though it's not optimized the way the categorical ordering is.

@chaupmcs Yes, that's a fair point. A true numerical feature would probably have bin weights with longer monotone intervals, and fewer turning points, compared to a random encoding of a categorical feature. But still, unless the numerical feature is completely monotone, the categorical feature offers easier gains from splitting.

Closing in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions to this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
