LightGBM: [Enhancement] Better Regularization for Categorical features

Created on 2 Jan 2019 · 12 comments · Source: microsoft/LightGBM

It seems current split finding algorithm for categorical features often results in over-fitting.
We need a better solution to reduce the over-fitting.

Labels: feature request, help wanted

Most helpful comment

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

All 12 comments

Code of the current solution:
https://github.com/Microsoft/LightGBM/blob/fb28070e1daa500b087d3102145ae48988030195/src/treelearner/feature_histogram.hpp#L112-L234

The process:

  1. sort the bins in the feature histogram by the sum_gradients / sum_hessians
  2. find the split points in the sorted histogram
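The two steps above can be sketched in Python (a simplified toy version: the gain formula is the standard GBDT leaf objective g²/(h + λ), while the real C++ implementation in feature_histogram.hpp handles many more details):

```python
def best_categorical_split(bins, lambda_l2=0.0):
    """bins: list of (sum_gradients, sum_hessians), one entry per category.
    Returns (best_gain, list_of_category_indices_in_left_child)."""
    # Step 1: sort categories by sum_gradients / sum_hessians.
    order = sorted(range(len(bins)), key=lambda i: bins[i][0] / bins[i][1])

    total_g = sum(g for g, _ in bins)
    total_h = sum(h for _, h in bins)

    def leaf_gain(g, h):
        # standard GBDT leaf objective: g^2 / (h + lambda)
        return g * g / (h + lambda_l2)

    # Step 2: scan split points left-to-right in the sorted order,
    # exactly as for a numerical feature.
    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0
    for k in range(len(order) - 1):
        g, h = bins[order[k]]
        left_g += g
        left_h += h
        gain = (leaf_gain(left_g, left_h)
                + leaf_gain(total_g - left_g, total_h - left_h)
                - leaf_gain(total_g, total_h))
        if gain > best_gain:
            best_gain, best_left = gain, order[:k + 1]
    return best_gain, best_left
```

For example, with bins [(10, 5), (-8, 4), (1, 2)] the sorted order is [1, 2, 0], and the best split puts category 1 alone in the left child.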

Current Regularizations:

  1. convert to one-vs-rest when num_cats <= config->max_cat_to_onehot
  2. add cat_smooth to sum_hessians before sorting the histogram
  3. enumerate at most max_cat_threshold split points, to avoid searching too many candidates.
  4. use min_data_per_group to avoid groups with too little data at each split point.
  5. an additional cat_l2 for L2 regularization.
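A hedged sketch of where regularizations 2 and 5 enter the search (parameter names taken from the list above; the exact formulas in LightGBM's C++ code may differ):

```python
def sort_key_with_smoothing(sum_grad, sum_hess, cat_smooth):
    # Regularization 2: cat_smooth is added to sum_hessians before
    # sorting, shrinking rare categories toward 0 in the ordering.
    return sum_grad / (sum_hess + cat_smooth)

def regularized_leaf_gain(sum_grad, sum_hess, lambda_l2, cat_l2):
    # Regularization 5: cat_l2 is an extra L2 term applied only when
    # evaluating categorical splits.
    return sum_grad * sum_grad / (sum_hess + lambda_l2 + cat_l2)
```

With cat_smooth = 10, a rare category with (sum_grad, sum_hess) = (1, 1) gets key 1/11 ≈ 0.09 while a frequent one with (10, 10) gets 0.5, even though both have the same raw mean gradient of 1.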

New Regularizations (proposal):

  1. Add some randomization to the sorting
  2. Don't use categories (bins) with too little data as split points.
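The two proposals could look roughly like this (a sketch only; `noise_scale` and `min_data_in_cat` are hypothetical parameter names, and data counts are approximated by sum_hessians):

```python
import random

def noisy_sort_order(bins, noise_scale=0.1, seed=0):
    """Proposal 1: perturb the sort key with noise so the category
    ordering is not fully optimized on the training data."""
    rng = random.Random(seed)
    keys = [g / h + rng.gauss(0.0, noise_scale) for g, h in bins]
    return sorted(range(len(bins)), key=lambda i: keys[i])

def filter_small_bins(bins, min_data_in_cat):
    """Proposal 2: exclude categories with too little data from the
    split-point candidates (data size proxied here by sum_hessians)."""
    return [i for i, (_, h) in enumerate(bins) if h >= min_data_in_cat]
```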

Your ideas are very welcome here 😄.

ping @Laurae2 @StrikerRUS @henry0312

The first thing that comes to my mind after the word "categorical" is CatBoost, with its target encoding and feature interactions: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/, https://tech.yandex.com/catboost/doc/dg/concepts/speed-up-training-docpage/#max-ctr-, and the CTR settings sections in https://tech.yandex.com/catboost/doc/dg/concepts/cli-reference_train-model-docpage/#cli-reference_train-model__options

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

@guolinke

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

That's right.

To be honest, I don't understand the current approach very well; however, I think fewer parameters are better, because more parameters lead to overfitting.
There is also a famous solution for categorical features (I can't remember it right now).
I'll tell you later.

As you know, CART is one of the famous algorithms, and it is the one used in scikit-learn.
(A Step by Step CART Decision Tree Example may be useful for understanding.)

scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now.

from https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart.

However, scikit-learn doesn't support categorical variables 😢

In some experiments with private data, the regularization hyperparameters for the categorical variables (max_cat_threshold, cat_l2, cat_smooth and max_cat_to_onehot) can dramatically change the bias and/or variance of the final model if the dataset contains lots of categorical features. However, I have not found an obvious way to tune these hyperparameters (other than exhaustive grid search).

@guolinke

I remember CatBoost's solutions aren't related to GBDT's algorithm itself, as they preprocess the features. Thus, these solutions could be used in other GBDT tools as well.

If you look at https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html, it says "_Before each split is selected in the tree..._", so it's not really a preprocessing step: the encoding is performed before each split, which means it needs to be implemented inside the GBDT/RF algorithm. Because the encoding is computed on different data splits, it reduces overfitting.
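The "before each split" idea in CatBoost's docs refers to their ordered target statistics. A rough sketch of the per-row encoding (heavily simplified: one permutation, a simple additive prior; CatBoost's actual CTR computation is more elaborate):

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, seed=0):
    """Each row is encoded using only the rows that precede it in a random
    permutation, so a row's own label never leaks into its encoding."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        # prior-smoothed mean of the targets seen so far for this category
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded
```

Because each row sees different "preceding" data, identical raw category values get different encoded values, which is the overfitting-reduction mechanism described above.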

I think one of the reasons for overfitting categorical features is that the categories are ordered by weight before finding splits. This means that the gain from a split on a categorical feature will in general be larger than for a numerical feature, and categorical features are disproportionately likely to be chosen for a split vs numerical features.

For example, if we make only one split in some categorical feature (after this ordering has been done), then the feature is already optimally divided into top n and bottom num_bins - n weights. But if instead we consider the feature as numerical (using some random integer encoding of the categories), then in general the weight of the bins will not be monotone increasing, so it will take several splits to achieve this same separation in weights. Therefore the gains will in general be larger from splitting on categorical vs numerical features, so categorical features will more often be chosen as the best split by the algorithm.
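This asymmetry can be demonstrated on a toy histogram (made-up gradients and hessians; plain g²/h leaf objective): the best single split under the optimized categorical ordering beats the best single split when the same bins are scanned in a fixed "numerical" order.

```python
def best_split_in_order(bins, order):
    """Best single-split gain when bins must be scanned in a fixed order
    (as a numerical feature is)."""
    total_g = sum(g for g, _ in bins)
    total_h = sum(h for _, h in bins)
    gain_of = lambda g, h: g * g / h
    best, left_g, left_h = float("-inf"), 0.0, 0.0
    for i in order[:-1]:
        g, h = bins[i]
        left_g += g
        left_h += h
        best = max(best, gain_of(left_g, left_h)
                   + gain_of(total_g - left_g, total_h - left_h)
                   - gain_of(total_g, total_h))
    return best

bins = [(10.0, 5.0), (-8.0, 4.0), (1.0, 2.0), (-3.0, 3.0)]
cat_order = sorted(range(len(bins)), key=lambda i: bins[i][0] / bins[i][1])
gain_cat = best_split_in_order(bins, cat_order)      # optimized ordering
gain_num = best_split_in_order(bins, [0, 1, 2, 3])   # fixed encoding order
assert gain_cat > gain_num
```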

L2 regularization (lambda_l2, cat_l2, cat_smooth) helps somewhat with this problem, since it reduces the magnitude of the weights (especially for small categories) and the potential gain of each split. But it might also be useful to have an L1 regularization for categorical features (i.e. a cat_l1), which would directly penalize the number of splits on a categorical feature.
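`cat_l1` is only a proposed name here, but the standard GBDT L1 soft-thresholding (e.g. XGBoost's `alpha`) shows how such a penalty would suppress weak categorical splits (a sketch):

```python
def leaf_gain_l1(sum_grad, sum_hess, l1, l2=0.0):
    # L1 soft-thresholds the gradient sum before computing the gain;
    # leaves with small |sum_grad| (e.g. tiny categories) get zero gain.
    g = max(abs(sum_grad) - l1, 0.0)
    return g * g / (sum_hess + l2)
```

A leaf with |sum_grad| ≤ l1 then contributes zero gain, so splits that only carve off such leaves stop looking attractive to the search.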

@btrotta I agree with the "overfitting categorical features" part, and I also think it's because we already optimize the order for splitting. But the example is not clear yet. Comparing "ordering categorical features" against a "random integer encoding of the categories" is somewhat unfair; it should be a comparison between "ordering categorical features" and some true numerical features. Numerical features have an inherent ordering of their own: for example, with an "age" feature, sorting by age makes sense and could be considered a form of ordering, though it's not optimized the way the categorical ordering is.

@chaupmcs Yes, that's a fair point. A true numerical feature would probably have bin weights with longer monotone intervals, and fewer turning points, compared to a random encoding of a categorical feature. But still, unless the numerical feature is completely monotone, the categorical feature offers easier gains from splitting.

Closing in favor of being in #2302. We decided to keep all feature requests in one place.

Contributions to this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
