Xgboost: Data too big to load into memory

Created on 4 Feb 2015 · 10 comments · Source: dmlc/xgboost

I tried to load a 5 GB CSV file into a numpy.array and feed it to DMatrix; of course it doesn't work.
It barely works when I divide the data into 10 chunks and use only one of them, which is around 500 MB. My machine has 8 GB of memory; am I doing anything wrong?
Also, is there an easy way to divide the data into 10 chunks and handle them one by one with xgboost on a single machine?

python

All 10 comments

There will be a duplicated memory copy between numpy and xgb.DMatrix. If you want to save memory, try converting the data into LibSVM format; xgb.DMatrix supports loading directly from that text format without involving numpy. Or simply use a machine with more RAM.

Since some internal data structures will be allocated, the required memory can be bigger than the file size.

Unlike SGD and other online-updating algorithms for linear models, you currently need all the data to be in the DMatrix to learn the model, and it is hard to build an algorithm that learns by looking at sub-chunks.
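
A minimal sketch of the LibSVM route described above (the file names, the label column position, and the assumption of a header-less, purely numeric CSV are mine, not from the thread):

```python
import xgboost as xgb

def csv_to_libsvm(csv_path, libsvm_path, label_col=0):
    """Stream a CSV to LibSVM format line by line, so the full
    dataset never has to sit in memory at once (assumes no header
    row and purely numeric columns)."""
    with open(csv_path) as src, open(libsvm_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            label = fields[label_col]
            feats = fields[:label_col] + fields[label_col + 1:]
            # LibSVM uses 1-based feature indices; zero entries can be omitted.
            pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats)
                             if float(v) != 0.0)
            dst.write(f"{label} {pairs}\n")

csv_to_libsvm("train.csv", "train.libsvm")

# DMatrix can read the text file directly, skipping the numpy copy.
# Recent xgboost versions want an explicit format hint in the URI;
# older ones accepted the bare path.
dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
```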

@YunshuLiu FYI, I use the R package of xgboost and am able to use a 2.5 GB CSV without issue on an 8 GB laptop, without using the LibSVM format or any temp file. The only part requiring care is the dummification (one-hot encoding) of the categorical data, which requires smart treatment (everything is done in RAM). If you know the R language, it may be a solution for you.

Kind regards,
Michaël

@YunshuLiu @tqchen Maybe blaze can help? https://github.com/ContinuumIO/blaze

@datnamer It has nothing to do with an external-memory data structure; it has to do with the learning algorithm, which requires all the data in a batch fashion.

@tqchen I know Rcpp can share memory between R and C++. Is that also true for the xgboost R package? Does it work with big files because xgboost doesn't duplicate the data, unlike in Python? (Just curious to understand.)

@pommedeterresautee the data is always duplicated internally, because the internal data structure is different from the matrix data structure in R. I think the Python one should also work fine, but the difference could be due to bad memory management in Python.

Thanks Tianqi. I converted the CSV dataset into LibSVM format directly and fed its path into DMatrix, and it is much better now. It seems 1-1.5 GB of data fits into memory without using swap at all. For the 5 GB data, in addition to all 8 GB of memory, 13 GB of swap is used, so training is slower. I guess I will need a workstation with enough memory to do cross-validation.

I have never used gradient boosted decision trees before; does it always help to convert categorical features into binary ones with one-hot encoding? The thing about the dataset I am currently loading is that even though it has several numerical features, they only take values from a finite set (e.g. one feature only takes the values 2, 3, 5, 10, 12). So for the algorithm you implemented, which way is better: to treat them as numerical values, or as categorical values and encode them as binary ones?

By the way, in
xgmat_train = xgb.DMatrix(X_train, label=y_train, missing=-999)
what format of label is expected? I tried to feed it the path of a CSV file with a single column of 1s and 0s, and it wouldn't work; it says it needs float values.
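
The label argument expects a numeric array rather than a file path, so one way around the error is to load the column into numpy first. A minimal sketch, where labels.csv is a hypothetical one-column file of 0/1 values and the features are placeholders:

```python
import numpy as np
import xgboost as xgb

# Hypothetical one-column file of 0/1 values, read as floats.
y_train = np.loadtxt("labels.csv", dtype=np.float32)
# Placeholder features, just to make the example self-contained.
X_train = np.random.rand(y_train.size, 5)

xgmat_train = xgb.DMatrix(X_train, label=y_train, missing=-999)
```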

@YunshuLiu it depends on the meaning of the numbers. Does it make sense to say 5 > 3 > 2? Or are these numbers just different IDs? If the numerical values are real measures of things, you should not convert them to categorical data. That way xgboost can learn rules like Feature X > 3, which would include 5, 10 and 12. If they are IDs, you should convert them, because it has no real meaning to say Feature X > 3 for IDs (IDs are just one example of categorical values).
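
A short sketch of the two treatments (the column names quantity and store_id are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [2, 3, 5, 10, 12],   # real measure: keep numeric, "X > 3" is meaningful
    "store_id": [2, 3, 5, 10, 12],   # arbitrary IDs: one-hot encode instead
})

# One-hot encode only the ID-like column; the numeric one stays as-is.
encoded = pd.get_dummies(df, columns=["store_id"], prefix="store")
print(encoded.head())
```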

This thread is related to #244.
