LightGBM: [feature request] Support int64_t data_size_t

Created on 24 Feb 2020 · 9 comments · Source: microsoft/LightGBM

Now that the cnt field in the histogram has been removed, the overhead of using int64_t as data_size_t will be smaller.

There are several TODO items:

  • [ ] remove data_size_t from the code entirely and use int64_t directly (see the sketch after this list)
  • [ ] template the index types in DataPartition, so that performance matches the current implementation when the row indices fit in int32_t (or smaller)
  • [ ] update the related c_api and the Python / R interfaces
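
For context, here is where the change lands: at the time of writing, data_size_t is a one-line typedef in include/LightGBM/meta.h. A simplified sketch of the current definition and the proposed widening:

```cpp
#include <cstdint>

// include/LightGBM/meta.h (simplified)
// Current definition: a signed 32-bit row index, which caps a single
// Dataset at roughly 2.1 billion rows.
typedef int32_t data_size_t;

// Proposed: widen to 64 bits, or drop the alias and use int64_t directly.
// typedef int64_t data_size_t;
```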
Labels: feature request, help wanted, in progress

All 9 comments

I have a branch (https://github.com/microsoft/LightGBM/tree/sparse_bin_int64) working on this; I will revisit it when I have time.

hey, can I help with this one?

@Adam1679 yeah, you're very welcome to

sorry, this is my first time making a public contribution. Any suggestions on how to begin? Should I fork your branch and get familiar with the code and your modifications?

yes, you can fork the repo and work in your own clone first, then create a PR here when it is almost ready.
If there are any problems or questions, you can ask me directly in this issue.
BTW, it is better to start from master, not sparse_bin_int64, as there are many new changes in the master branch.

ok, I will try my best.

hey, I read commit "5a5b26d" on your branch and saw that you changed data_size_t from int32_t to int64_t directly. You also changed some "int" to "data_size_t" and some "data_size_t" to "int". I understand casting some "data_size_t" to "int", since "data_size_t" is now the bigger type. But why did you change some "int" to "data_size_t"? I don't understand this part.

Also, it would be much appreciated if you could elaborate on "the overhead of int64_t as data_size_t will be smaller." Is it that "int32_t" is not big enough in some cases, but you had to compromise and use it because the overhead of int64 in the cnt-related parts was unbearable?

@Adam1679 some of the ints should actually be data_size_t, so I fixed them.

For the overheads:
1) the cnt in the histogram: this has been removed, so you don't need to worry about it anymore.
2) the data_size_t in src/treelearner/data_partition.hpp: DataPartition stores the row indices used by each leaf, and these indices are accessed during ConstructHistogram and Split. Changing them to int64_t would double that memory cost, so I think we should use a template here; for small data, LightGBM would automatically use int32_t (or even int16_t).
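
To illustrate why the index type matters: with 100 million rows, int32_t indices cost roughly 400 MB, while int64_t indices would cost roughly 800 MB. Below is a minimal, hypothetical sketch (not LightGBM's actual signature, which also handles hessians, ordered gradients, etc.) of histogram construction templated on the index type:

```cpp
#include <cstdint>

// Hypothetical, simplified sketch: templating the row-index type lets the
// same hot loop run over int32_t indices for ordinary datasets and fall
// back to int64_t only when there are more than ~2.1 billion rows.
template <typename INDEX_T>
void ConstructHistogram(const INDEX_T* data_indices, INDEX_T num_data,
                        const uint8_t* bins, const float* gradients,
                        double* grad_hist) {
  for (INDEX_T i = 0; i < num_data; ++i) {
    const INDEX_T row = data_indices[i];     // per-leaf row index
    grad_hist[bins[row]] += gradients[row];  // accumulate gradient per bin
  }
}
```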

Sorry that I was away for a long time. I've been trying to figure out a solution for this kind of template, but after a lot of searching on Stack Overflow I found that there is no such thing as a runtime template (templates are resolved at compile time). So I think what we need is polymorphism: for example, define a base class and subclass it for each index size. But this may lead to other problems, e.g. the base class cannot hold such fields itself because their type is undetermined. Am I on the right track?
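
For reference, the usual way to reconcile a compile-time template with a runtime size decision is exactly the pattern described above: keep a non-templated abstract interface, move every field whose type depends on the index width into templated subclasses, and choose the instantiation once in a factory based on the row count. A minimal sketch, with hypothetical names (DataPartitionInterface and CreateDataPartition are not LightGBM symbols):

```cpp
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Non-templated interface: callers never see the concrete index type.
class DataPartitionInterface {
 public:
  virtual ~DataPartitionInterface() = default;
  virtual void Split(int leaf) = 0;  // simplified placeholder signature
};

// Templated subclass: all fields whose type depends on the index width
// live here, so the base class needs none (addressing the concern above).
template <typename INDEX_T>
class DataPartitionImpl : public DataPartitionInterface {
 public:
  explicit DataPartitionImpl(INDEX_T num_data) : indices_(num_data) {}
  void Split(int leaf) override {
    // real splitting logic over indices_ would go here
  }

 private:
  std::vector<INDEX_T> indices_;  // per-leaf row indices
};

// Runtime dispatch: the template parameter is fixed once, from num_data.
std::unique_ptr<DataPartitionInterface> CreateDataPartition(int64_t num_data) {
  if (num_data <= std::numeric_limits<int16_t>::max()) {
    return std::make_unique<DataPartitionImpl<int16_t>>(
        static_cast<int16_t>(num_data));
  } else if (num_data <= std::numeric_limits<int32_t>::max()) {
    return std::make_unique<DataPartitionImpl<int32_t>>(
        static_cast<int32_t>(num_data));
  }
  return std::make_unique<DataPartitionImpl<int64_t>>(num_data);
}
```

The virtual call happens only at the interface boundary; the hot loops inside each instantiation stay fully typed, so small datasets keep int32_t (or even int16_t) speed and memory.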

