Incubator-mxnet: Support Sparse Matrix Input

Created on 2 Dec 2015 · 12 comments · Source: apache/incubator-mxnet

If one has categorical variables, they need to be one-hot encoded, and the resulting matrix can be 100x or even 1000x bigger in a dense representation than in a sparse one. mxnet currently seems to support only dense input, so it's easy to hit RAM limits even with fairly small datasets on the largest EC2 GPU box, g2.8xlarge with 60 GB of RAM.

For example, a 10M-row sample of the well-known airline dataset is ~2 GB in sparse representation, but over 60 GB in dense.
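To make the blow-up concrete, here is a small sketch (using scipy/numpy purely for illustration; the sizes are hypothetical, not the airline dataset) comparing the memory of a one-hot encoded column in CSR form against its dense equivalent:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical example: 1M rows, a single categorical column with 1000 levels.
n_rows, n_categories = 1_000_000, 1000
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)

# CSR one-hot encoding: exactly one nonzero per row.
onehot = sp.csr_matrix(
    (np.ones(n_rows, dtype=np.float32), codes, np.arange(n_rows + 1)),
    shape=(n_rows, n_categories),
)

sparse_bytes = onehot.data.nbytes + onehot.indices.nbytes + onehot.indptr.nbytes
dense_bytes = n_rows * n_categories * 4  # same matrix as dense float32

print(f"sparse: {sparse_bytes / 1e6:.0f} MB, dense: {dense_bytes / 1e9:.0f} GB")
```

With one nonzero per row, the sparse form costs a few tens of megabytes while the dense form costs about 4 GB, i.e. a gap of two orders of magnitude, in line with the 100x-1000x figure above.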

More info here:
https://github.com/szilard/benchm-ml/issues/30
and some code here:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R

Labels: Call for Contribution, Feature request


All 12 comments

@tqchen: you are fast :)

This could be helpful for some data science tasks. We had a CPU version in cxxnet which could possibly be adapted; we can track it with this issue.

We are working on a Pickle iterator now. With it, you can load data stored in sparse numpy ndarray format and expand it to dense one batch at a time. This is a workaround just for now, before mx ndarray supports sparse operations.
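The workaround described here can be sketched with scipy (a minimal illustration; `dense_batches` and the matrix sizes are my own, not mxnet's iterator API): keep the whole dataset sparse and densify only the current batch, so peak memory is one batch rather than the full matrix.

```python
import numpy as np
import scipy.sparse as sp

def dense_batches(X_csr, batch_size):
    """Yield dense float32 batches from a sparse CSR matrix, one at a time."""
    for start in range(0, X_csr.shape[0], batch_size):
        yield X_csr[start:start + batch_size].toarray().astype(np.float32)

# Hypothetical sparse dataset: 10k rows, 500 features, 1% nonzeros.
X = sp.random(10_000, 500, density=0.01, format="csr", random_state=0)
first = next(dense_batches(X, 256))
print(first.shape)  # first dense batch
```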

@winstywang Expanding it to dense is too expensive. I suggest saving the CSR index and data in NDArrays, and adding a cuSparse op to compute on those NDArrays.
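To illustrate the suggestion (with scipy/numpy standing in for NDArrays and `csr_matmul` standing in for the proposed cuSparse op; none of these names are mxnet API): a CSR matrix is fully described by three ordinary dense arrays, `data`, `indices`, and `indptr`, which is exactly what a cuSPARSE csrmm-style routine consumes.

```python
import numpy as np
import scipy.sparse as sp

def csr_matmul(data, indices, indptr, W):
    """Row-by-row CSR-times-dense product, the computation the op would run."""
    out = np.zeros((len(indptr) - 1, W.shape[1]), dtype=W.dtype)
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            out[row] += data[k] * W[indices[k]]  # scale one row of W
    return out

X = sp.random(4, 6, density=0.5, format="csr", random_state=0)
W = np.arange(12, dtype=np.float64).reshape(6, 2)

# X.data, X.indices, X.indptr are the three dense arrays that would be
# stored in NDArrays under this proposal.
result = csr_matmul(X.data, X.indices, X.indptr, W)
assert np.allclose(result, X @ W)  # matches the reference sparse product
```

The point of the design is that no dense copy of X is ever materialized: the three component arrays cost O(nnz) memory, and the op reads them directly.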

@antinucleon Yes, that is a more elegant solution; that is why I called mine a workaround. At least it solves the memory issue for now, since we only need to store one batch of dense data.

I'd love to see sparse support in the R package too (using the Matrix package).

I'd be happy to help write related R code, if needed.

What do you think of this PR about sparse matrices in Caffe: https://github.com/BVLC/caffe/pull/2364? I tested the sparse InnerProduct layer with sparse data that has more than 200,000 feature dimensions. It gives roughly a 300x speedup and far less storage overhead.
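A rough sense of why an inner-product layer benefits so much (an illustrative scipy sketch with made-up sizes, not the Caffe benchmark itself): with ~200k input features and only a handful of nonzeros per row, the sparse operand is orders of magnitude smaller, and the multiply touches only the nonzeros.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical layer input: 1000 rows, 200k features, ~100 nonzeros per row.
n, d, h = 1000, 200_000, 32
X = sp.random(n, d, density=0.0005, format="csr", random_state=0)
W = np.random.default_rng(0).standard_normal((d, h))

out = X @ W  # sparse forward pass of an inner-product (fully connected) layer

dense_bytes = n * d * 8  # the same input stored dense, float64
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(out.shape, dense_bytes // sparse_bytes)
```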

The CSR matrix format is indeed the way we should go.

Hi guys, what's the status of this issue now? Does mxnet support sparse data? Thanks.

@tqchen Are there plans to allow the R package to support sparse input too?

I would like to tackle this issue if it is open.

@rohit12 Have opened a detailed thread to discuss this in #1524

