Lightgbm: [feature request] Chunked training dataset creation

Created on 27 Dec 2017 · 24 comments · Source: microsoft/LightGBM

Currently, LightGBM optimizes the memory footprint of input feature values by transforming float32 / float64 values into their discrete representations (e.g. int8 for max_bin=255). This is very helpful for fitting very large datasets into RAM for training. Theoretically, one can fit 1TB worth of float32 data into 256GB of actual RAM.

However, as far as I can see, the current data loading paradigm prevents the user from utilising this capacity (at least in the Python module) by requiring them to create a single training Dataset, which necessitates having all of the initial float32 data in RAM in the first place in order to pass it to the Dataset constructor.

The general solution would involve allowing the user to create several Dataset objects to be passed to the booster together for training. This way the user could load a chunk of data from disk, compress it by creating the Dataset, free the raw data, process the next chunk, etc. I understand this is currently impossible as a consequence of the binning approach: we need the initial data in its entirety to discretize features. I suppose some smart approximation could be done for this general case.

However, as a good first-order workaround, I suggest the bins be generated from the first chunk only, with the subsequent chunks of training data simply binned according to the existing bins. I understand this is how validation data is processed at the moment (see the set_reference() function in the Dataset), so it would merely be a question of passing reference=chunk_0 to chunk_1, chunk_2, .... If the user has a RAM issue, their dataset is really big, which implies that the first chunk will probably provide a very close approximation to the true quantiles of the feature distribution (certainly for ~255 quantiles / max_bin!). This should also be very easy to implement.
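A minimal sketch of what this would look like from the Python side (load_chunk, load_labels and n_chunks are hypothetical placeholders; the final concatenation step is exactly the piece that does not exist today and is what is being requested here):

import lightgbm as lgb

# Bin boundaries are learned from chunk 0 only; later chunks reuse them via
# reference=, the same mechanism validation sets already use.
chunk_0 = lgb.Dataset(load_chunk(0), label=load_labels(0), free_raw_data=True)
chunk_0.construct()  # learn bin boundaries from the first chunk

chunks = [chunk_0]
for i in range(1, n_chunks):
    c = lgb.Dataset(load_chunk(i), label=load_labels(i),
                    reference=chunk_0, free_raw_data=True)
    c.construct()  # binned with chunk_0's boundaries; the raw floats can be freed
    chunks.append(c)

# Missing piece: something like a Dataset.concat(chunks) that yields a single
# training Dataset.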


All 24 comments

It's easier to write the data to a file and use the CLI in this case, with two-round loading. Also, you could even use R/Python for training, as you can load binary datasets from the wrappers.

@Laurae2 Yes, that could probably work, but the second, third, etc. chunks would still have to use the bins of the first chunk; otherwise it will be impossible to combine them into a single training Dataset, as far as I can see.

If you think this is doable with current wrappers, could you please share a code snippet (possibly with a bit of pseudocode thrown in...) saving the compressed (discretized) Dataset to file and loading several chunks saved this way into a single large dataset?

@aftersought I think the two_round solution is better than the chunk solution.
If memory is not enough, the CLI/Python/R should all load the dataset from a file. In that case, you can set two_round=True in CLI/Python/R.

For the details of two round loading, you can refer to this: https://github.com/Microsoft/LightGBM/issues/1138#issuecomment-353857567
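For reference, in the Python package two_round is passed through the ordinary dataset params, so a minimal sketch looks like the following (file name hypothetical; a LibSVM-format text file is assumed so the label is part of each line):

import lightgbm as lgb

# two_round scans the text file twice instead of buffering it whole,
# trading loading speed for a lower peak memory footprint.
train_data = lgb.Dataset('train.svm.txt',
                         params={'two_round': True, 'max_bin': 255})
booster = lgb.train({'objective': 'binary'}, train_data)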

@guolinke I am not sure I completely get this. Suppose I do

import lightgbm as lgb
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')

Does the binary file now contain only bin indexes, or is it still the full floating-point data?

Also, even if I split the data across several compressed (discretized) datasets and manage to align the bins, how do I tell the Booster (or Dataset) to concatenate several such files for training?

I think the problem here is that there is no reasonably detailed technical discussion of how the binning process works. From what I can see, the following happens:

  1. When Dataset is called, the engine copies the data over to some internal structures, doubling RAM usage. The raw data going in can then be deleted (free_raw_data=True) to bring RAM back to the expected 1x level.

  2. When Booster is called, it constructs the bin_mapper according to max_bin. If two_round is not set, this is done using a lot of additional RAM; otherwise the histogram is constructed first and the data is discretized line by line. The original data is not deleted, even though it won't be used for training.

This is just a wild guess, and I would appreciate it if you could provide a description of this kind, tracing the significant RAM allocations, and suggest an optimal pathway for large initial feature data on disk.

@aftersought The binary data then contains only bin indexes.

We cannot support concatenating multiple datasets now.

I think the memory blow-up is on the Python side, not in the LightGBM core algorithm, e.g. https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L226 or https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L744-L748

For LightGBM's actual memory cost, you should use the CLI version to load data from a file.
If two_round is not set, the additional memory is the size of the data file.

Ok, now that I see that Dataset takes params as well, this becomes much clearer.

While Python (and numpy/pandas in particular) can be memory-inefficient, if one uses a numpy 2D array (your second link) there should not be any overhead for float32 / float64 data, as reshape does not generally copy data (i.e. reshaping is merely an interface change; the actual numpy data is always a single long array in memory). Indeed, I do not observe any overhead in practice at this stage (and besides, it would not be critical if data could be loaded in chunks).
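For what it's worth, this is easy to verify with numpy itself:

import numpy as np

a = np.arange(12, dtype=np.float32)
b = a.reshape(3, 4)
print(np.shares_memory(a, b))  # True: reshape returned a view, not a copy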

I don't understand why concatenating several binary files (discretized according to the same grid) should be such a problem. In fact, if it is just a plain binary file, one can presumably concatenate it on disk? Without this feature, I am not sure, even with the CLI version, how one is supposed to work with, say, 500GB of floating-point data while having only 256GB of RAM. The natural workflow would be to load the data in ~200GB chunks (on the user side), pass it to Dataset chunk by chunk to compress, free each floating-point chunk after compression, and concatenate all the discrete datasets afterwards (possibly with OS tools, if it is just discrete bin indexes as you say). Are there any other suggestions for working with larger-than-RAM floating-point data?

If you have 500GB of fp32 or fp64 full data in a file, it will take significantly less memory when loaded properly in R or Python. For instance, the Higgs 11M dataset (8GB CSV file) can be loaded and trained quickly using R on a 4GB potato laptop.

@Laurae2 8GB CSV file is not the same thing as 8GB binary file containing floats. Or did you mean something else?

P.S.: To elaborate on this, clearly when I say "500GB of floating-point data" I mean the data as it will be represented in RAM, in a contiguous C array of floats / doubles. Naturally, it can take more or less space when stored on disk, depending on many factors, but unless you support disk-mapped numpy arrays or some functional equivalent, I do not see how this would be relevant?

I don't know how you can have a 250GB binary file. Even my 3.5TB CSV files are under 50GB in LightGBM binary file format.

@Laurae2 As far as I can see, LightGBM uses 1 byte to encode a bin index (up to 255 bins). That means 10^9 records with ~250 features would take around 250GB, which is the actual size of the dataset I would like to work with.

Needless to say, CSV would be rather poor storage choice for such a dataset :)

Using two_round is the most memory-efficient solution. As a result, I think the concat solution is not needed.

@guolinke Could you please elaborate? To use two_round, I need to call Dataset, so I need to provide a single numpy array (or a single file for the CLI), which does not fit into RAM.

A numpy array with two_round cannot help. You should save it to csv/lightsvm first, then load from the file with two_round.

You can save multiple files if main memory is not enough, and concatenating multiple csv/lightsvm files is easy.
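As a minimal sketch of that workaround (iter_chunks() is a hypothetical generator yielding in-memory (X, y) chunks; set label_column in the dataset params if the label is not where LightGBM expects it by default):

import numpy as np

# Dump each chunk to CSV with the label as the first column, then let the OS
# concatenate the pieces; the combined file is loaded once with two_round.
for i, (X, y) in enumerate(iter_chunks()):
    np.savetxt('chunk_%d.csv' % i, np.column_stack([y, X]), delimiter=',')

# shell: cat chunk_*.csv > train_full.csv
# then:  train_data = lgb.Dataset('train_full.csv', params={'two_round': True})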

Ok, I notice the Python doc says:

_data (string, numpy array or scipy.sparse) – Data source of Dataset. If string, it represents the path to txt file._

Clearly csv / txt files are not a good option for a sufficiently large dataset. Can you elaborate on the lightsvm format, as it does not seem to be covered in the docs or elsewhere on the internet?

@aftersought You can't bypass the wrappers' limitations if you want to use the wrappers. As said from the beginning, convert your file first, then train with the converted file second. If memory usage is still too large, then use the two_round parameter to lower RAM usage. If memory usage is again too large, you can limit LightGBM's memory usage for training (look in the parameter docs for that). If it is still too large, use the CLI directly.

@aftersought SVMLight format is not a LightGBM proprietary format. You can find documentation about it online (it is well documented and very simple); it is used for sparse datasets.

@Laurae2 Is this the one? It looks like another text format, which seems to be an even worse choice than csv for reasonably large datasets. Is there a binary variation that I have missed?

I am not against converting to any suitable format myself before loading to LightGBM of course, but it has to be a format suitable for large data files, at minimum a binary format.

Yes, it is this one. If your dataset is dense enough you will actually increase the dataset size.
In any case, to be read by LightGBM, a file must be either on disk (in the appropriate format) or in memory. If your dataset is very dense (not sparse enough), then you must find a way to convert your binary dataset to a supported format (otherwise LightGBM cannot read it). CSV/SVMLight, or binary LightGBM are possible choices.

You can also rent an AWS/GCP instance to perform the transformation if you cannot do it locally.

@Laurae2 This is all pretty straightforward, and as I said, I am not against doing a reasonable conversion myself (and I am not sure why I would ever need AWS/GCP for that sort of thing; the memory footprint should be non-existent if done properly in C).

However, and this is quite uncontroversial as I see it, text formats are a rather poor choice for big-data applications. There are many alternatives out there (from simple binary formats to sophisticated formats like HDF5) that are widely used in industry and academia.

Would you not agree, then, that it would make sense to support something other than text formats, especially in a library supposedly geared towards heavy-duty workloads?

Edit: I see you mentioned that "binary LightGBM is a possible choice". Does it store the original floats or the resulting bin indices? Because if it is the latter, then I am not sure how it can be used if I want to rely on LightGBM to do the discretization (in the two-pass algorithm mentioned earlier).

@aftersought
Though text data is not storage-efficient, it is a workaround solution for your problem.

As for binary formats, I didn't see a commonly accepted one. Most communities are using text formats, and most open datasets are in text format as well.

IO is always the heaviest part of an algorithm, and we don't have the resources to support a new IO format that is not widely needed.

The LightGBM dataset binary format only stores the bins.
Here is the code of save_binary: https://github.com/Microsoft/LightGBM/blob/master/src/io/dataset.cpp#L512-L597

You can follow the code to merge multiple binary files manually if you really need it:

  1. you should merge num_data
  2. you should merge metadata
  3. you should merge feature_groups

@guolinke
While it is true there is no binary standard as commonplace as csv, _any_ binary choice would offer better performance.

I am surprised you say binary IO is hard to develop / maintain, but they are your resources, so let's leave it at that.

The code is self-explanatory; the problem as I see it is that it is impossible to merge the binned data if they do not align on bin thresholds, and it is impossible to generate those thresholds correctly if one does not sample from the whole data array in a two-pass algorithm. This is the reason I am asking for some efficient IO in the first place.

I will have a look into how you read your data in the two-pass algorithm and consider creating a local fork with binary IO, which I suspect will be the easier and cleaner option (depending on how the IO is structured in the C code in the first place).

@guolinke While we are on the topic of RAM optimisation, I notice behaviour that does not look consistent with Dataset keeping only bin indices in RAM. Basically, I am doing train_set=Dataset(...), then valid_set=Dataset(...), then booster = Booster(train_set, valid_set), and print out the app's RAM usage in between:

Creating training dataset ...
156 GB RAM used
Creating validation dataset ...
193 GB RAM used
Creating booster ...
60 GB RAM used

The first two values are consistent with the full floating-point data hanging around somewhere in memory, while the last reading is consistent with the binned representation. So it appears that the compression only happens when the Booster is created. Is that so, or is this an artefact of the Python wrapper?

P.S.: Naturally, I set free_raw_data=True in the Dataset constructor and do lgbTrain.raw_data = None, raw_data = None and gc.collect() afterwards, though I agree Python can still theoretically disregard all this.

@aftersought
The I/O part isn't hard; it is just not trivial to make it highly efficient, and it often needs a lot of code. You can see there is a lot of IO code in LightGBM, while the core algorithm is quite simple.

You can use the reference parameter in Dataset, or create_valid, to ensure the alignment.

The Dataset is initialized lazily: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L625-L734

You can call Dataset.construct() to construct it immediately (but don't forget to set reference for the validation dataset).

@guolinke Thanks, that was actually very helpful suggestion. Dataset.construct is the key.

As a summary of the thread, here are a number of general suggestions for those looking to optimize their memory footprint when reading from text files is not a practical option for performance reasons.

  1. Read your data as float32, not float64 (float, not double, in C) - you probably don't need the precision, and you will squeeze twice as much into the available RAM.

  2. Construct the dataset with free_raw_data=True as the docs suggest. Basic training does not need a copy of the data (and if some of your other functions need it, you will get an error).

  3. Delete your own copy of the data after calling the Dataset constructor.

  4. Call Dataset.construct() to actually construct the dataset from the data, which compresses the data into bins and deletes the internal copy. You probably have to be careful with the order of (2) and (3) in C, but in Python the data will only be removed after Dataset deletes its own copy.

  5. Process the validation set in a similar fashion, but only after you have compressed the training set / deleted the original data (steps 1-4 above) and reduced your RAM footprint. As the training set is typically the largest single dataset among all the datasets you will be working with, you should start with it to have the most memory available.

  6. Once you have processed all your datasets in this fashion and start training, your RAM usage will be a fraction of your maximum available RAM (and of the peak RAM usage during dataset construction), due to LightGBM's discretization / compression of the data.

Apart from these basic steps, the reasonable thing to do is just to write your own IO, as the underlying C code seems to be pretty straightforward. I will edit this statement later with more specific guidance when I have had a chance to do it myself and see the potential pitfalls.
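As a compact sketch of steps 1-5 above (a minimal illustration rather than an official recipe; the .npy file names and the binary objective are stand-ins for whatever IO and task you actually have):

import gc
import numpy as np
import lightgbm as lgb

params = {'objective': 'binary', 'max_bin': 255}

# (1) read the features as float32, not float64
X_train = np.load('X_train.npy').astype(np.float32)
y_train = np.load('y_train.npy')

# (2) free_raw_data=True lets the Dataset drop its internal reference once binned
train_set = lgb.Dataset(X_train, label=y_train, params=params, free_raw_data=True)

# (4) force binning now so the float copy held by the Dataset is released
train_set.construct()

# (3) drop our own copy of the raw floats
del X_train
gc.collect()

# (5) only now build the validation set, binned against the training set's bins
X_valid = np.load('X_valid.npy').astype(np.float32)
y_valid = np.load('y_valid.npy')
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set,
                        free_raw_data=True)
valid_set.construct()
del X_valid
gc.collect()

# (6) train; from here on RAM mostly holds the binned (1 byte per value) data
booster = lgb.train(params, train_set, valid_sets=[valid_set])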
