Now that the cnt field in the histogram has been removed, the overhead of using int64_t as data_size_t will be smaller.
There are several todo items:
Use data_size_t consistently throughout the code, and use int64_t indices directly in DataPartition, to achieve the same performance as before when data_size_t fits in int32_t. I have a branch (https://github.com/microsoft/LightGBM/tree/sparse_bin_int64) working on this; I will revisit it when I have time.
Hey, can I help with this one?
@Adam1679 yeah, you are very welcome to.
Sorry, this is my first time making a public contribution. Any suggestions on how to begin? Should I fork your branch and get familiar with the code and your modifications?
Yes, you can fork the repo and work in your clone first, then create a PR here when it is almost ready.
If there are any problems/questions, you can ask me directly in this issue.
BTW, it is better to start from master, not sparse_bin_int64, as there are many new changes in the master branch.
OK, I will try my best.
Hey, I read commit "5a5b26d" on your branch and saw that you changed data_size_t from int32_t to int64_t directly. You also changed some "int" to "data_size_t" and some "data_size_t" to "int". I understand casting some "data_size_t" to "int" now that "data_size_t" is bigger, but why did you change some "int" to "data_size_t"? I don't understand this part.
Also, it would be much appreciated if you could elaborate on "the overhead of int64_t as data_size_t will be smaller." Is it because int32_t is not big enough in some cases, but you had to compromise and use it since the overhead of using int64_t in the cnt-related parts was unbearable?
@Adam1679 Some of the ints should actually be data_size_t, so I fixed them.
For the overheads:
1) The cnt in the histogram: this has been removed, so you don't need to worry about it any more.
2) The data_size_t in src/tree_learner/DataPartition.hpp. DataPartition stores the used row indices for each leaf, and these indices need to be accessed in ConstructHistogram and Split. Changing them to int64_t would increase the memory cost, so I think we should use a template here: for small data, LightGBM would automatically use int32_t (or even int16_t).
Sorry that I was away for a long time. I'm trying to figure out a solution for this kind of template, but after a lot of searching on Stack Overflow I found that there is no such thing as a runtime template (templates are resolved at compile time). So I think what we need is polymorphism: for example, construct a base class and subclass it with different index sizes. But this may cause other problems, e.g. the base class cannot have any index fields because their type is undetermined. Am I on the right track?