If one has categorical variables, they need to be one-hot encoded, and the resulting matrix can be 100x or even 1000x bigger if a dense rather than a sparse representation is used. It seems mxnet currently supports only dense input, so it's easy to hit RAM limits even with fairly small datasets on the largest EC2 GPU box, g2.8xlarge, with 60GB RAM.
For example, a 10M-row sample of the well-known airline dataset is ~2GB in sparse representation, but over 60GB in dense.
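To make the dense vs. sparse gap concrete, here is a rough back-of-the-envelope sketch. The column counts below are made-up assumptions for illustration, not measurements from the airline data, and the helper names `dense_bytes` and `csr_bytes` are hypothetical. It assumes 4-byte values and 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Bytes needed for a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes=4, index_bytes=4):
    """Bytes needed for CSR storage: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


# Hypothetical shape, NOT measured from the airline dataset:
n_rows = 10_000_000
n_features = 8        # assumed number of categorical columns (one nonzero each per row)
n_onehot_cols = 700   # assumed total number of distinct levels after encoding

nnz = n_rows * n_features
dense = dense_bytes(n_rows, n_onehot_cols)
sparse = csr_bytes(n_rows, nnz)

print(f"dense : {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.1f} GB")
print(f"ratio : {dense / sparse:.0f}x")
```

The ratio grows with the number of distinct levels, since the dense size scales with the number of one-hot columns while the sparse size only scales with the number of categorical columns per row.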
More info here:
https://github.com/szilard/benchm-ml/issues/30
and some code here:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R
@tqchen: you are fast :)
This could be helpful for some data science tasks. We had a CPU version in cxxnet, which could possibly be adopted; we can track it with this issue.
We are working on a Pickle iterator now. With it, you can load the data stored in sparse numpy ndarray format and expand it to dense for each batch. This is just a workaround for now, until mx ndarray supports sparse operations.
@winstywang Expanding it to dense is too expensive. I suggested saving the CSR indices and values in NDArray, and making a cuSPARSE op to compute with NDArrays.
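A minimal sketch of that workaround, in plain Python so it runs without any dependencies: keep the one-hot data as a CSR triple and only materialize one dense batch at a time. This is not the Pickle iterator or any mxnet API; `one_hot_csr` and `dense_batches` are hypothetical names, and the toy data is made up.

```python
def one_hot_csr(rows, cardinalities):
    """Encode rows of integer category codes as a one-hot CSR triple.

    rows: list of lists, rows[i][j] is the code of feature j in row i
    cardinalities: number of levels of each feature
    """
    # Column offset where each feature's block of one-hot columns starts.
    offsets = [0]
    for c in cardinalities[:-1]:
        offsets.append(offsets[-1] + c)

    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, code in enumerate(row):
            indices.append(offsets[j] + code)  # column of the single 1 for feature j
            data.append(1.0)
        indptr.append(len(indices))            # end of this row's nonzeros
    return data, indices, indptr, sum(cardinalities)


def dense_batches(data, indices, indptr, n_cols, batch_size):
    """Yield dense batches (lists of rows), expanding one batch at a time."""
    n_rows = len(indptr) - 1
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        batch = [[0.0] * n_cols for _ in range(stop - start)]
        for r in range(start, stop):
            for k in range(indptr[r], indptr[r + 1]):
                batch[r - start][indices[k]] = data[k]
        yield batch


# Toy data (assumed): 3 rows, 2 categorical features with 3 levels each.
rows = [[0, 2], [1, 0], [2, 1]]
data, indices, indptr, n_cols = one_hot_csr(rows, [3, 3])
batches = list(dense_batches(data, indices, indptr, n_cols, batch_size=2))
```

Only `batch_size * n_cols` dense values are alive at any time, which is the point of the workaround; the cost is re-expanding every batch on every pass.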
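For reference, a sketch of what computing directly on the CSR arrays looks like, so that nothing is ever expanded to dense. This is a plain-Python CPU illustration of a sparse-times-dense product, not cuSPARSE and not an mxnet operator; `csr_matmul` is a hypothetical name.

```python
def csr_matmul(data, indices, indptr, weights):
    """Compute X @ W, where X is given in CSR form and W is a dense list of rows."""
    n_out = len(weights[0])
    out = []
    for r in range(len(indptr) - 1):
        acc = [0.0] * n_out
        # Only the stored nonzeros of row r contribute to the output row.
        for k in range(indptr[r], indptr[r + 1]):
            w_row = weights[indices[k]]
            v = data[k]
            for j in range(n_out):
                acc[j] += v * w_row[j]
        out.append(acc)
    return out


# X = [[1, 0, 2],
#      [0, 3, 0]] stored as CSR
data = [1.0, 2.0, 3.0]
indices = [0, 2, 1]
indptr = [0, 2, 3]
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

result = csr_matmul(data, indices, indptr, W)
```

The work is proportional to the number of nonzeros times the output width, rather than to the full dense shape, which is why a first layer on one-hot input benefits so much from a sparse op.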
@antinucleon Yes, that is a more elegant solution. That is why I call it a workaround... At least, we could solve the memory issue first. We only need to store one batch of dense data.
I'd love to see sparse support in the R package too (using Matrix)
I'd be happy to help write related R code, if needed.
What do you think of this PR, https://github.com/BVLC/caffe/pull/2364, about sparse matrices in Caffe? I tested the sparse inner product layer on sparse data with more than 200000 feature dimensions. It gave about a 300x speedup and far less storage overhead.
The CSR matrix format is indeed what we should go with.
Hi guys, what's the status of this issue now? Does mxnet support sparse data? Thanks.
@tqchen Are there plans to allow the R package to support sparse input too?
I would like to tackle this issue if it is open.
@rohit12 I have opened a detailed thread to discuss this in #1524