Now that the cnt field in the histogram has been removed, the overhead of using int64_t as data_size_t will be smaller.
There are several todo items:
Use data_size_t consistently throughout the code, and use int64_t indices directly in DataPartition, to achieve the same performance as before when data_size_t fits in int32_t. I have a branch (https://github.com/microsoft/LightGBM/tree/sparse_bin_int64) working on this; I will revisit it when I have time.
Hey, can I help with this one?
@Adam1679 yeah, you are very welcome to.
Sorry, this is my first time making a public contribution. Any suggestions on how to begin? Should I fork your branch and get familiar with the code and your modifications?
Yes, you can fork the repo and work in your clone first, then create a PR here when it is almost ready.
If there are any problems/questions, you can ask me directly in this issue.
BTW, it is better to start from master, not sparse_bin_int64, as there are many new changes in the master branch.
OK, I will try my best.
Hey, I read commit "5a5b26d" on your branch and saw that you changed data_size_t from int32_t to int64_t directly. You also changed some "int" to "data_size_t" and some "data_size_t" to "int". I understand casting some "data_size_t" to "int" now that "data_size_t" is bigger, but why did you change some "int" to "data_size_t"? I don't understand this part.
Also, it would be much appreciated if you could elaborate on "the overhead of int64_t as data_size_t will be smaller." Is it because int32_t is not big enough in some cases, but you had to compromise and use it since the overhead of using int64_t in the cnt-related parts was unbearable?
@Adam1679 Some of the ints should actually be data_size_t, so I fixed them.
For the overheads:
1) The cnt in the histogram: this has been removed, so you don't need to worry about it any more.
2) The data_size_t in src/tree_learner/DataPartition.hpp. DataPartition stores the used row indices for each leaf, and these indices need to be accessed in ConstructHistogram and Split. Changing them to int64_t would increase the memory cost, so I think we should use a template here: for small data, LightGBM would automatically use int32_t (or even int16_t).
Sorry that I was away for a long time. I'm trying to figure out a solution for this kind of template, but after a lot of searching on Stack Overflow I found that there is no such thing as a runtime template (templates are resolved at compile time). So I think what we need is polymorphism: for example, construct a base class and subclass it with different index sizes. But this may cause other problems, e.g. the base class cannot have any index fields because their type is undetermined. Am I on the right track?