LightGBM: max number of categories for categorical features

Created on 21 Mar 2018 · 11 Comments · Source: microsoft/LightGBM

Suppose that I have a dataset with a categorical variable encoded as int. In LightGBM's train function, I declare this variable as categorical:

lgb.train(..., categorical_feature=['my_categorical_feature'], ...)

I would like to know:

  • What is the maximum number of different categories that LightGBM can handle?

  • Is it related to the max_bin parameter?

  • If yes, what happens if max_bin is set to 32 and there are 256 000 unique categories for 'my_categorical_feature'?

Thanks

(I posted the question on StackOverflow a week ago, but there are few views and 0 answers / comments)

All 11 comments

It depends on the memory of your machine.

Thank you for your prompt reply!

But can you please be more explicit, especially about the 3 points?

Let's say I am working on a Windows 7 (64-bit) machine with a 3 TB HDD and 8 GB of RAM.

What is the maximum number of different categories that LightGBM can handle?

It depends on the available memory and the #data of your dataset.

Is it related to the max_bin parameter?

It is related, but doesn't have much impact. The bins of a categorical feature aren't fully controlled by max_bin.

8 GB of memory may not be enough if your #data is large.

Thanks
How are the bins of categorical features controlled, then?
With max_bin = 32 and 256 000 categories, won't LightGBM group different categories into the same bin?

@fl2o it depends on how many different categories there are in each feature.

It's still unclear what role max_bin plays for categorical features, and what the link is between the number of categorical features, the number of categories per feature, and memory.
Maybe some examples could highlight the behaviour?

The details are in the code: https://github.com/Microsoft/LightGBM/blob/master/src/io/bin.cpp#L207-L389
max_bin can limit the #bin for numerical features.
For categorical features, refer to the code here: https://github.com/Microsoft/LightGBM/blob/master/src/io/bin.cpp#L331-L350 .
In short, the loop condition is used_cnt < cut_cnt || num_bin_ < max_bin, where cut_cnt = 0.99 * #data.
So when #category is smaller than max_bin, the #bin is smaller than max_bin. Otherwise, it uses the most frequent categories and stops once they cover 99% of the data.
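
To illustrate the rule described above, here is a minimal Python sketch. It is not LightGBM's actual implementation (that lives in bin.cpp): the function name select_categorical_bins and the Counter-based counting are assumptions made for illustration, and details such as NaN handling are left out. It only mirrors the quoted condition: categories are taken from most to least frequent while used_cnt < cut_cnt or num_bin < max_bin.

```python
from collections import Counter


def select_categorical_bins(values, max_bin, coverage=0.99):
    """Hypothetical helper: return the categories that would get their own bin."""
    counts = Counter(values)
    # Most frequent categories first; ties broken by category value.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    cut_cnt = int(len(values) * coverage)  # 0.99 * #data in the comment above

    kept = []
    used_cnt = 0
    num_bin = 0
    i = 0
    # Keep adding categories while either part of the condition still holds.
    while i < len(ordered) and (used_cnt < cut_cnt or num_bin < max_bin):
        category, count = ordered[i]
        kept.append(category)
        used_cnt += count
        num_bin += 1
        i += 1
    return kept


# 1000 rows: four frequent categories plus five categories seen once each.
values = [0] * 500 + [1] * 300 + [2] * 150 + [3] * 45 + [4, 5, 6, 7, 8]

print(select_categorical_bins(values, max_bin=2))        # [0, 1, 2, 3]
print(len(select_categorical_bins(values, max_bin=32)))  # 9
```

In this sketch, max_bin=2 still yields four bins, because the loop only stops once the kept categories cover 99% of the rows. With max_bin=32, all nine categories are kept, since there are fewer categories than max_bin.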

The memory is easy to calculate.
For example, when max_bin < 16, one entry (one feature value in a row) costs 1/2 byte. It costs 1 byte when max_bin < 256, 2 bytes when max_bin < 65536, and 4 bytes when max_bin < 2^32.

So, when most of your categorical features have fewer than 255 bins, the total memory cost is about #data x #feature x 1 byte.
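
As a rough worked example of the arithmetic above, the following Python sketch turns those thresholds into an estimate. The function names bytes_per_entry and estimate_dataset_bytes are hypothetical, and the estimate covers only the binned feature values, not any other memory LightGBM uses during training.

```python
def bytes_per_entry(num_bin):
    """Hypothetical helper: storage cost of one feature value, per the thresholds above."""
    if num_bin < 16:
        return 0.5
    if num_bin < 256:
        return 1
    if num_bin < 65536:
        return 2
    if num_bin < 2 ** 32:
        return 4
    raise ValueError("bin count is outside the ranges listed above")


def estimate_dataset_bytes(num_data, bins_per_feature):
    """Sum #data x bytes-per-entry over all features."""
    return sum(num_data * bytes_per_entry(b) for b in bins_per_feature)


# 1,000,000 rows and 10 features with 255 bins each: about 10 MB.
print(estimate_dataset_bytes(1_000_000, [255] * 10))  # 10000000

# A single feature that ended up with 256 000 bins costs 4 bytes per row.
print(estimate_dataset_bytes(1_000_000, [256_000]))   # 4000000
```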

Thank you very much!
It is now very clear
"used_cnt < cut_cnt , where cut_cnt=0.99*#data " was the point I was missing :)

Have a nice day

I can't find the code in the source file that limits the number of bins based on memory usage. Could you please give me the link?

One more question:

  • How is the data in the remaining 1% of minor categories treated? Is it treated the same as NaN?

@fujii it isn't explicitly limited by memory. You will get an error if there isn't enough memory for the categorical features.
Yes, the low-frequency categories will be treated as NaN.
