Incubator-mxnet: Support Sparse Matrix Input

Created on 2 Dec 2015 · 12 Comments · Source: apache/incubator-mxnet

If one has categorical variables, they need to be one-hot encoded, and the resulting matrix can be 100x or even 1000x bigger in a dense representation than in a sparse one. It seems mxnet currently supports only dense input, so it's easy to hit RAM limitations with fairly small datasets, even on the largest EC2 GPU box (g2.8xlarge, with 60GB RAM).

For example, a 10M-row sample of the well-known airline dataset is ~2GB in sparse representation, but over 60GB in dense.
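To make the size gap concrete, here is a rough back-of-the-envelope sketch in plain Python. The shape used (8 categorical columns expanding to 700 one-hot columns) and the 4-byte value and index widths are assumptions chosen for illustration, not the actual layout of the airline dataset.

```python
def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Bytes needed to store every cell of an n_rows x n_cols matrix."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes=4, index_bytes=4):
    """Bytes for CSR storage: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


# Hypothetical shape (assumption, not taken from the airline data):
n_rows = 10_000_000        # 10M rows
n_categorical = 8          # categorical columns before encoding
n_onehot_cols = 700        # total columns after one-hot encoding

# One-hot encoding leaves exactly one nonzero per categorical column per row.
nnz = n_rows * n_categorical

dense = dense_bytes(n_rows, n_onehot_cols)
sparse = csr_bytes(n_rows, nnz)

print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
print(f"ratio:  {dense / sparse:.0f}x")
```

The ratio grows with the number of distinct category levels, since dense storage scales with the one-hot width while sparse storage scales only with the number of categorical columns.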

More info here:
https://github.com/szilard/benchm-ml/issues/30
and some code here:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R

Call for Contribution · Feature request

szilard

All 12 comments

@tqchen: you are fast :)

szilard on 2 Dec 2015

This could be helpful for some data science tasks. We had a CPU version in cxxnet, which could possibly be adopted. We can track it with this issue.

tqchen on 2 Dec 2015

We are working on a Pickle iterator now. With it, you can load the data stored in a sparse numpy ndarray format and expand it to dense for each batch. This is just a workaround for now, until mx ndarray supports sparse operations.

winstywang on 2 Dec 2015
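A minimal sketch of the per-batch workaround described above, in plain Python. It assumes the data is held as three CSR arrays (`indptr`, `indices`, `data`); the function name `dense_batches` is hypothetical and is not part of mxnet or the Pickle iterator.

```python
def dense_batches(indptr, indices, data, n_cols, batch_size):
    """Yield dense row batches (lists of lists) from a CSR matrix.

    Only one batch is ever held in dense form, so peak dense memory is
    batch_size * n_cols values rather than n_rows * n_cols.
    """
    n_rows = len(indptr) - 1
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        batch = []
        for row in range(start, stop):
            dense_row = [0.0] * n_cols
            # Scatter this row's stored entries into the dense row.
            for k in range(indptr[row], indptr[row + 1]):
                dense_row[indices[k]] = data[k]
            batch.append(dense_row)
        yield batch


# CSR form of the 3 x 4 matrix
# [[1, 0, 0, 0],
#  [0, 0, 2, 3],
#  [0, 4, 0, 0]]
indptr = [0, 1, 3, 4]
indices = [0, 2, 3, 1]
data = [1.0, 2.0, 3.0, 4.0]

batches = list(dense_batches(indptr, indices, data, n_cols=4, batch_size=2))
for batch in batches:
    print(batch)
```

In practice each yielded batch would be copied into a dense array and fed to the network; the densification cost is paid again on every pass over the data.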

@winstywang expanding it to dense is too expensive. I suggest saving the CSR indices and values in NDArrays, and writing a cuSPARSE op that computes directly on those NDArrays.

antinucleon on 2 Dec 2015

👍2
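For reference, this is roughly the computation such a sparse op would perform for a fully connected layer: a CSR input multiplied by a dense weight matrix, without ever densifying the input. The code below is a plain-Python CPU sketch for illustration only; it does not call cuSPARSE or any mxnet API, and the name `csr_matmul_dense` is hypothetical.

```python
def csr_matmul_dense(indptr, indices, data, weights):
    """Compute X @ W, with X in CSR form (n_rows x n_in) and W dense (n_in x n_out).

    Work is proportional to nnz * n_out instead of n_rows * n_in * n_out.
    """
    n_out = len(weights[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * n_out
        # Only the stored (nonzero) entries of this row contribute.
        for k in range(indptr[row], indptr[row + 1]):
            x = data[k]
            w_row = weights[indices[k]]
            for j in range(n_out):
                acc[j] += x * w_row[j]
        out.append(acc)
    return out


# CSR form of the 3 x 4 matrix
# [[1, 0, 0, 0],
#  [0, 0, 2, 3],
#  [0, 4, 0, 0]]
indptr = [0, 1, 3, 4]
indices = [0, 2, 3, 1]
data = [1.0, 2.0, 3.0, 4.0]

# Dense 4 x 2 weight matrix (arbitrary example values).
weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 0.0],
]

result = csr_matmul_dense(indptr, indices, data, weights)
print(result)
```

For one-hot inputs every stored value is 1, so each output row reduces to a sum of the weight rows selected by the column indices.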

@antinucleon Yes, that is a more elegant solution. That is why I call it a workaround... At least, we could solve the memory issue first. We only need to store one batch of dense data.

winstywang on 2 Dec 2015

I'd love to see sparse support in the R package too (using Matrix)

I'd be happy to help write related R code, if needed.

zachmayer on 2 Dec 2015

👍1

What do you think of this PR, https://github.com/BVLC/caffe/pull/2364, about sparse matrices in Caffe? I tested the sparse inner product layer on sparse data with more than 200,000 feature dimensions. It was about 300x faster and had far less storage overhead.