I have built many models, some of which are large: 300 MB files or bigger (around 10k trees). In such cases the predict phase is slow. That is tolerable on its own, but when using stacking with cross-validation it can slow prediction down badly (~2 hours).
I found out that the problem is not calling `model.predict()`; that is reasonably fast.
The problem is loading the model from disk:
```python
model = lg.Booster(model_file=workingDir + '/modely/model_' + str(cv) + '_' + str(sc) + '.txt')
```
Is there any way to speed this up?
There is a `save_binary` method, but only for datasets.
I am saving models with:
```python
model.save_model(working_dir + '/modely/model_' + str(cv) + '_' + str(sc) + '.txt', num_iteration=model.best_iteration)
```
Thanks for any help.
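One partial workaround, until loading itself gets faster, is to make sure each model file is parsed only once per run when stacking re-uses the same fold models many times. This is a hypothetical sketch: `expensive_parse` is a stand-in for `lg.Booster(model_file=...)`, and the file name is made up.

```python
# Sketch of a memoized model loader: each distinct file is parsed once;
# later requests for the same path return the cached object.
from functools import lru_cache

parse_calls = []

def expensive_parse(path):
    # Stand-in for the slow step, e.g. lg.Booster(model_file=path).
    parse_calls.append(path)
    return {"path": path}

@lru_cache(maxsize=None)
def load_model(path):
    return expensive_parse(path)

m1 = load_model("model_0_0.txt")
m2 = load_model("model_0_0.txt")  # cache hit: no second parse
```

This only helps with repeated loads of the same model within one process; the first cold load is still as slow as before.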
@guolinke Do you think a binary format for models is appropriate? (and to add to the API)
This is what xgboost does for speeding up model saving/loading.
Just for comparison:
I tested on a private dataset with 10,000 iterations and 256 leaves, using a 2 GB/s PCIe SSD to make sure the SSD is not a bottleneck. Each run was repeated 10 times; times are reported in milliseconds:
| Model | Binary | Time to load | Time to save | Prediction time | File size |
| --- | ---: | ---: | ---: | ---: | ---: |
| xgboost (.save) | Yes (any) | 598 | 411 | 3383 | 127,273,371 bytes |
| xgboost (.RDS) | Yes (R) | 789 | 6592 | 3568 | 46,122,319 bytes |
| LightGBM (.save) | No | 9206 | 24289 | 2147 | 133,351,088 bytes |
| LightGBM (.RDS) | Yes (R) | 15451 | 34731 | 2146 | 47,424,458 bytes |
N.B.: xgboost's RDS model showed a consistent prediction speed loss (tested 50 times for both .save and .RDS prediction), but it is usable as-is, unlike LightGBM's, which has to go through .save/.load indirectly via RDS to be re-usable.
@Laurae2 yes, the binary model format is needed. But I have been busy with other things recently.
So I will put out a call for contributions first.
Contributions welcome 😄
@guolinke
Do you have any suggestions about the binary file format? Is this worth opening as a discussion issue?
There is no flag indicating whether binary or text mode is required in the current interface of the `GBDT::SaveModelToFile(int num_iteration, const char* filename)` method.
How should the code decide which format to use?
@limexp
we only have the text format now. A binary format is a to-do item.
@guolinke
I understand this, and I want to clarify the task and estimate its impact on existing programs before starting to code. It is better to decide beforehand than to modify afterwards.
It is hard to change a binary file format in the future without breaking existing saved models; otherwise it would require adding version support, which makes the code complex. Of course, there are universal solutions like protobuf.
The decision about the interface is less critical, but it has a direct impact on the codebase, tests and compatibility.
I'm not asking for a final solution, just looking for a direction if you have one.
AFAIU, concerning the benchmark:
One of the possible reasons behind the high "Time to save" is stringstream performance. It looks like it can be ~10x slower than `+=` (even though this is a bit counter-intuitive). This might vary from OS to OS and from compiler to compiler, though (that benchmark used gcc 4.6). `+=` also doesn't suffer from the "2GB limit" in MSVS.
The existing save/load code is single-threaded. It should be fairly easy to parallelize (easier than designing a binary format). More than 99% of the file is occupied by the Tree parts, and it should be easy to create them in parallel. (No one is going to use LightGBM to create just one insanely huge tree, right?) Still, it would be hard to beat xgboost this way.
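The per-tree parallelization idea can be sketched roughly like this (a Python stand-in for the real C++ serialization; `tree_to_string` and the tree tuples are placeholders, not LightGBM internals):

```python
# Sketch: serialize independent tree sections in parallel, then join them
# in order. Ordering is preserved because executor.map yields results in
# input order regardless of completion order.
from concurrent.futures import ThreadPoolExecutor

def tree_to_string(tree):
    # Placeholder for per-tree text serialization.
    index, num_leaves = tree
    return "Tree=%d\nnum_leaves=%d\n\n" % (index, num_leaves)

trees = [(i, 256) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(tree_to_string, trees))
model_text = "".join(parts)  # concatenate sections in tree order
```

The key property is that each tree's section is independent, so only the final concatenation has to respect ordering.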
The existing "conversion to string" code is also used for other purposes (e.g. here), not just file saving. It might be a good idea to use the new serialization method in those cases as well, but that would make things even more complicated. (If it were only for saving to files, HDF5 might look like a good choice.)
Is anyone interested in helping test loading and saving models with protobuf?
It's in this branch: https://github.com/wxchan/LightGBM/tree/proto. You can run cmake with `-DUSE_PROTO=ON` to build with protobuf support, and add `model_format=proto` to the config to load and save models with protobuf.
A simple test (text -> proto):
- save: 0.612300s -> 0.023627s
- load: 0.471999s -> 0.024712s
- size: 13M -> 5.9M
I personally find the text format to be one of the best features of LightGBM. You can easily check things like how many trees are being used without any additional tools, which is much more complicated with binary models.
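For example, because the text format writes one `Tree=<n>` header per tree, a rough tree count needs nothing beyond the standard library. This is a sketch based on that format assumption; the toy file below is made up for demonstration.

```python
# Sketch: count trees in a LightGBM text model by counting per-tree headers.
def count_trees(path):
    with open(path) as f:
        return sum(1 for line in f if line.startswith("Tree="))

# Tiny stand-in model file, mimicking the per-tree "Tree=" headers.
with open("toy_model.txt", "w") as f:
    f.write("num_class=1\nTree=0\nnum_leaves=3\n\nTree=1\nnum_leaves=3\n")

n = count_trees("toy_model.txt")  # 2 trees in the toy file
```

With a binary format, this kind of quick inspection would need a dedicated tool or library call.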
It seems the protobuf model format has been merged, so this issue can be closed?
It might be reverted; we are looking for a better solution. @AbdealiJK
I think model read/write is much faster now. Please have a try.
yes, it is much faster now... great job
Any solution to this? I have trained and saved a LGB model and the file is almost 18GB.