Xgboost: Data too big to load into memory

Created on 4 Feb 2015 · 10 comments · Source: dmlc/xgboost

I tried to load a 5 GB CSV file into a numpy.array and feed it to DMatrix; of course it doesn't work.
It barely works when I divide the data into 10 chunks and use only one of them, which is around 500 MB. My machine has 8 GB of memory; am I doing anything wrong?
Also, is there an easy way to divide the data into 10 chunks and handle them one by one with xgboost on a single machine?

python

All 10 comments

There will be a duplicated memory copy between numpy and xgb.DMatrix. If you want to save memory, try converting the data into LibSVM format; xgb.DMatrix supports loading directly from that text format without involving numpy. Or simply use a machine with more RAM.

Since some internal data structures will be allocated, the required memory can be bigger than the file size.

Unlike SGD and other online-updating algorithms for linear models, you currently need all the data to be in the DMatrix to learn the model, and it is hard to build an algorithm that learns by looking at sub-chunks.
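
A minimal sketch of the LibSVM route described above (the file names, the label column position, and the assumption of a header-less, purely numeric CSV are mine, not from the thread):

```python
import xgboost as xgb

def csv_to_libsvm(csv_path, libsvm_path, label_col=0):
    """Stream a CSV to LibSVM format line by line, so the full
    dataset never has to sit in memory at once (assumes no header
    row and purely numeric columns)."""
    with open(csv_path) as src, open(libsvm_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            label = fields[label_col]
            feats = fields[:label_col] + fields[label_col + 1:]
            # LibSVM uses 1-based feature indices; zero entries can be omitted.
            pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats)
                             if float(v) != 0.0)
            dst.write(f"{label} {pairs}\n")

csv_to_libsvm("train.csv", "train.libsvm")

# DMatrix can read the text file directly, skipping the numpy copy.
# Recent xgboost versions want an explicit format hint in the URI;
# older ones accepted the bare path.
dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
```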

@YunshuLiu FYI, I use the R package of xgboost and am able to use a 2.5 GB CSV without issue on an 8 GB laptop, without using the LibSVM format or any temp file. The only part requiring care is the dummification (one-hot encoding) of the categorical data, which requires smart treatment (everything is done in RAM). If you know the R language, it may be a solution for you.

Kind regards,
Michaël

@YunshuLiu @tqchen Maybe blaze can help? https://github.com/ContinuumIO/blaze

@datnamer It has nothing to do with an external-memory data structure; it has to do with the learning algorithm, which requires all the data in a batch fashion.

@tqchen I know Rcpp can share memory between R and C++. Is that also true for the xgboost R package? Does it work with big files because xgboost doesn't duplicate the data, unlike in Python? (Just curious to understand.)

@pommedeterresautee the data is always duplicated internally, because the internal data structure is different from the matrix data structure in R. I think the Python one should also work fine, but the difference could be due to bad memory management in Python.

Thanks Tianqi. I converted the CSV dataset into LibSVM format directly and fed its path into DMatrix, and it is much better now. It seems 1-1.5 GB of data fits into memory without using swap at all. For the 5 GB data, in addition to all 8 GB of memory, 13 GB of swap is used, so training is slower. I guess I will need a workstation with enough memory to do cross-validation.

I have never used gradient boosted decision trees before; does it always help to convert categorical features into binary ones with one-hot encoding? The thing about the dataset I am currently loading is that even though it has several numerical features, they only take values from a finite set (e.g. one feature only takes the values 2, 3, 5, 10, 12). So for the algorithm you implemented, which way is better: to treat them as numerical values, or as categorical values and encode them as binary ones?

By the way, in
xgmat_train = xgb.DMatrix(X_train, label=y_train, missing=-999)
what format of label is expected? I tried to feed it the path of a CSV file with a single column of 1s and 0s, and it wouldn't work; it says it needs float values.
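
The label argument expects a numeric array rather than a file path, so one way around the error is to load the column into numpy first. A minimal sketch, where labels.csv is a hypothetical one-column file of 0/1 values and the features are placeholders:

```python
import numpy as np
import xgboost as xgb

# Hypothetical one-column file of 0/1 values, read as floats.
y_train = np.loadtxt("labels.csv", dtype=np.float32)
# Placeholder features, just to make the example self-contained.
X_train = np.random.rand(y_train.size, 5)

xgmat_train = xgb.DMatrix(X_train, label=y_train, missing=-999)
```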

@YunshuLiu it depends on the meaning of the numbers. Does it make sense to say 5 > 3 > 2? Or are these numbers just different IDs? If the numerical values are real measures of things, you should not convert them to categorical data. That way xgboost can learn rules like Feature X > 3, which would include 5, 10 and 12. If they are IDs, you should convert them, because it has no real meaning to say Feature X > 3 for IDs (IDs are just one example of categorical values).
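
A short sketch of the two treatments (the column names quantity and store_id are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [2, 3, 5, 10, 12],   # real measure: keep numeric, "X > 3" is meaningful
    "store_id": [2, 3, 5, 10, 12],   # arbitrary IDs: one-hot encode instead
})

# One-hot encode only the ID-like column; the numeric one stays as-is.
encoded = pd.get_dummies(df, columns=["store_id"], prefix="store")
print(encoded.head())
```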

This thread is related to #244.
