I have built many models, some of which are large: 300 MB files or bigger (around 10k trees). In such cases the predict phase is slow. That is tolerable on its own, but when using stacking with cross-validation it can slow prediction down badly (~2 hours).
I found out that the problem is not calling `model.predict()`; that is reasonably fast.
The problem is loading the model from disk:
```python
model = lg.Booster(model_file=workingDir + '/modely/model_' + str(cv) + '_' + str(sc) + '.txt')
```
Is there any way to speed this up?
There is a `save_binary` method, but only for datasets.
I am saving models with:
```python
model.save_model(working_dir + '/modely/model_' + str(cv) + '_' + str(sc) + '.txt', num_iteration=model.best_iteration)
```
Thanks for any help.
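One partial workaround, until loading itself gets faster, is to make sure each model file is parsed only once per run when stacking re-uses the same fold models many times. This is a hypothetical sketch: `expensive_parse` is a stand-in for `lg.Booster(model_file=...)`, and the file name is made up.

```python
# Sketch of a memoized model loader: each distinct file is parsed once;
# later requests for the same path return the cached object.
from functools import lru_cache

parse_calls = []

def expensive_parse(path):
    # Stand-in for the slow step, e.g. lg.Booster(model_file=path).
    parse_calls.append(path)
    return {"path": path}

@lru_cache(maxsize=None)
def load_model(path):
    return expensive_parse(path)

m1 = load_model("model_0_0.txt")
m2 = load_model("model_0_0.txt")  # cache hit: no second parse
```

This only helps with repeated loads of the same model within one process; the first cold load is still as slow as before.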
@guolinke Do you think a binary format for models is appropriate? (and to add to the API)
This is what xgboost does for speeding up model saving/loading.
Just for comparison:
I tested on a private dataset with 10,000 iterations and 256 leaves, using a 2 GB/s PCIe SSD to make sure the SSD is not a bottleneck. Each run was repeated 10 times; times are reported in milliseconds:
| Model | Binary | Time to load | Time to save | Prediction time | File size |
| --- | ---: | ---: | ---: | ---: | ---: |
| xgboost (.save) | Yes (any) | 598 | 411 | 3383 | 127,273,371 bytes |
| xgboost (.RDS) | Yes (R) | 789 | 6592 | 3568 | 46,122,319 bytes |
| LightGBM (.save) | No | 9206 | 24289 | 2147 | 133,351,088 bytes |
| LightGBM (.RDS) | Yes (R) | 15451 | 34731 | 2146 | 47,424,458 bytes |
N.B.: xgboost's RDS model showed a consistent prediction speed loss (tested 50 times for both .save and .RDS prediction), but it is usable as-is, unlike LightGBM's, which has to go through .save/.load indirectly via RDS to be re-usable.
@Laurae2 yes, the binary model format is needed. But I have been busy with other things recently.
So I will put out a call for contributions first.
Contributions welcome 😄
@guolinke
Do you have any suggestions about the binary file format? Is this worth opening as a discussion issue?
There is no flag indicating whether binary or text mode is required in the current interface of the `GBDT::SaveModelToFile(int num_iteration, const char* filename)` method.
How should the code decide which format to use?
@limexp
we only have the text format now. A binary format is a to-do item.
@guolinke
I understand this, and I want to clarify the task and estimate its impact on existing programs before starting to code. It is better to decide beforehand than to modify afterwards.
It is hard to change a binary file format in the future without breaking existing saved models; otherwise it would require adding version support, which makes the code complex. Of course, there are universal solutions like protobuf.
The decision about the interface is less critical, but it has a direct impact on the codebase, tests and compatibility.
I'm not asking for a final solution, just looking for a direction if you have one.
AFAIU, concerning the benchmark:
One of the possible reasons behind the high "Time to save" is stringstream performance. It looks like it can be ~10x slower than `+=` (even though this is a bit counter-intuitive). This might vary from OS to OS and from compiler to compiler, though (that benchmark used gcc 4.6). `+=` also doesn't suffer from the "2GB limit" in MSVS.
The existing save/load code is single-threaded. It should be fairly easy to parallelize (easier than designing a binary format). More than 99% of the file is occupied by the Tree parts, and it should be easy to create them in parallel. (No one is going to use LightGBM to create just one insanely huge tree, right?) Still, it would be hard to beat xgboost this way.
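The per-tree parallelization idea can be sketched roughly like this (a Python stand-in for the real C++ serialization; `tree_to_string` and the tree tuples are placeholders, not LightGBM internals):

```python
# Sketch: serialize independent tree sections in parallel, then join them
# in order. Ordering is preserved because executor.map yields results in
# input order regardless of completion order.
from concurrent.futures import ThreadPoolExecutor

def tree_to_string(tree):
    # Placeholder for per-tree text serialization.
    index, num_leaves = tree
    return "Tree=%d\nnum_leaves=%d\n\n" % (index, num_leaves)

trees = [(i, 256) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(tree_to_string, trees))
model_text = "".join(parts)  # concatenate sections in tree order
```

The key property is that each tree's section is independent, so only the final concatenation has to respect ordering.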
The existing "conversion to string" code is also used for other purposes (e.g. here), not just file saving. It might be a good idea to use the new serialization method in those cases as well, but that would make things even more complicated. (If it were only for saving to files, HDF5 might look like a good choice.)
Is anyone interested in helping test loading and saving models with protobuf?
It's in this branch: https://github.com/wxchan/LightGBM/tree/proto. You can run cmake with `-DUSE_PROTO=ON` to build with protobuf support, and add `model_format=proto` to the config to load and save models with protobuf.
A simple test (text -> proto):
- save: 0.612300s -> 0.023627s
- load: 0.471999s -> 0.024712s
- size: 13M -> 5.9M
I personally find the text format to be one of the best features of LightGBM. You can easily check things like how many trees are being used without any additional tools, which is much more complicated with binary models.
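For example, because the text format writes one `Tree=<n>` header per tree, a rough tree count needs nothing beyond the standard library. This is a sketch based on that format assumption; the toy file below is made up for demonstration.

```python
# Sketch: count trees in a LightGBM text model by counting per-tree headers.
def count_trees(path):
    with open(path) as f:
        return sum(1 for line in f if line.startswith("Tree="))

# Tiny stand-in model file, mimicking the per-tree "Tree=" headers.
with open("toy_model.txt", "w") as f:
    f.write("num_class=1\nTree=0\nnum_leaves=3\n\nTree=1\nnum_leaves=3\n")

n = count_trees("toy_model.txt")  # 2 trees in the toy file
```

With a binary format, this kind of quick inspection would need a dedicated tool or library call.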
It seems the protobuf model format has been merged, so this issue can be closed?
It might be reverted; we are looking for a better solution. @AbdealiJK
I think model read/write is much faster now. Please have a try.
yes, it is much faster now... great job
Any solution to this? I have trained and saved a LGB model and the file is almost 18GB.