Incubator-mxnet: Sparse Matrix Specification

Created on 23 Feb 2016 · 18 comments · Source: apache/incubator-mxnet

Since there is some demand for this (#919, #773), this issue is for the specification of sparse matrix support for mxnet, so interested devs can go ahead and do the job.

Data Format and Object

  • For now, we will be supporting the CSR matrix format (see the sketch after this list).

    • indptr : ndarray of uint64

    • indices : ndarray of uint32

    • values : ndarray of float
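
For concreteness, here is a minimal scipy sketch of that triplet; scipy.sparse.csr_matrix stores exactly these three arrays (scipy defaults to int32 indices, so the casts below only illustrate the dtypes proposed above):

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[0., 2., 0.],
                            [1., 0., 3.]]))

indptr  = x.indptr.astype(np.uint64)    # [0, 1, 3]: row i spans indices[indptr[i]:indptr[i+1]]
indices = x.indices.astype(np.uint32)   # [1, 0, 2]: column of each non-zero
values  = x.data.astype(np.float32)     # [2., 1., 3.]: value of each non-zero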

I think the first stage of wrapping will be done on the language native side.

First Step Operator

  • SparseInputEmbedding(data_indptr = indptr, data_indices = indices, data_value = values, weight=weight, bias = bias)

    • Same as a dense matrix input, but explicitly specifies three (dense) inputs.

  • A wrapping can be done on the language native side, as sketched below.
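
A hypothetical sketch of that language-native wrapping (SparseInputEmbedding is only the operator name proposed above, not an existing mxnet symbol): the wrapper unpacks a scipy CSR matrix so users never pass the three arrays by hand.

def sparse_input_embedding(csr, weight, bias):
    # csr is a scipy.sparse.csr_matrix; forward its triplet as the three
    # dense inputs of the proposed operator.
    return SparseInputEmbedding(data_indptr=csr.indptr,
                                data_indices=csr.indices,
                                data_value=csr.data,
                                weight=weight, bias=bias)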

Difference from Dense Matrix Op

  • Gradient won't be propagated back to the sparse matrix side.

  • Usually addto on gradient is much cheaper than writeto.

  • This can limit the types of optimizations: plain SGD can be fast if it only touches the non-missing entries, while SGD with momentum can be slow.

  • SGD has to be specially implemented for sparse matrices (see the sketch below).
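
A minimal numpy sketch of such a specialized update, assuming the gradient arrives as a set of row indices plus the gradient rows for just those indices:

import numpy as np

def sparse_sgd_update(weight, indices, grad_rows, lr):
    # Touch only the rows that actually received gradient, so the cost scales
    # with the number of non-zero rows rather than with the full weight matrix.
    # np.subtract.at accumulates correctly when an index repeats, matching the
    # "addto on gradient" behaviour noted above.
    np.subtract.at(weight, indices, lr * grad_rows)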

Transparent Interchangeability with Dense Matrix

I think the basic answer is no, mainly because it is hard to optimize for general sparse patterns (sparse-sparse addition). We want to be able to handle sparse input really well.

Design Choices

This is what we would like to discuss:

  • Choice 1: Make SparseInputEmbedding take a dense input weight, and output a dense weight gradient.

    • This benefits from the generality of the program, and can make use of existing optimizers.

    • This suffers a lot from not fully utilizing the sparse pattern when the batch size is small.

  • Choice 2: Make SparseUpdatingEmbedding take the weight as an auxiliary variable, and also include parameters such as learning rate and weight decay; it will do an explicit SGD update on the weight.

    • This benefits a lot from utilizing the sparse pattern, especially on CPU.

    • It is a bit hard to do data parallelism on multiple GPUs.

We have observed the same thing in EmbeddingOp for NLP, where users complain about its memory utilization due to choice 1. So I would expect different choices to work for different cases, and maybe we even need both, with a clear naming convention for the second one.
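
As a back-of-envelope illustration of the gap between the two choices (dim = 128 is an assumed value; the 300M-feature, ~300-non-zero workload is taken from the comments below):

vocab, dim = 300_000_000, 128             # 300M features; dim is assumed
dense_grad_bytes  = vocab * dim * 4       # choice 1: full fp32 gradient, ~153.6 GB per step
sparse_grad_bytes = 300 * dim * 4         # choice 2: ~300 touched rows, ~150 KB per step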

Call for Contribution Roadmap


All 18 comments

@antinucleon will be leading the discussion on this issue.

TensorFlow has sparse tensors and IndexedSlices tensors, which are sparse in the first dimension but dense in the other dimensions.

I think it would be easier to support the latter first, and it satisfies most use cases.

SGD with momentum can also benefit from sparse updates by recording the iteration count of the last update.
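
A sketch of one way to read that trick (an assumed formulation; for brevity it ignores the weight movement that pure momentum decay would have produced during the skipped steps): momentum decays geometrically between touches, so a row untouched for k steps can catch up with a single mu**k factor.

import numpy as np

def lazy_momentum_update(weight, mom, last_step, rows, grad_rows, lr, mu, t):
    k = t - last_step[rows]                           # steps since each row was last touched
    mom[rows] = (mu ** k)[:, None] * mom[rows] + grad_rows
    weight[rows] -= lr * mom[rows]
    last_step[rows] = t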

One other feature we are missing is stride and general slicing.

I vote for the first choice. However, I am thinking about wrapping indptr, indices, and values into a CSR matrix directly, and making the conversion to scipy.csr transparent.

@antinucleon #1 doesn't work for a large vocabulary unless we support row-sparse (or IndexedSlices) tensors.

We have around 300M features but each sample has only around 300 non-zeros.
If the batch size is small, a dense weight and dense gradient will be wasteful.


What's the progress of this issue?

no progress at all?

@ProgramCaiCai I've tried to support sparsity in MShadow. We may need two key functions, csrmm and csr2csc (refer to http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csrmm and http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csr2csc). However, I met some problems when trying to call the cuSPARSE APIs directly, since the sparse matrix there is stored row-major while the dense matrices are column-major. For this reason I haven't continued the implementation. The following is the declaration of the csrmm function.

/*!
 * \brief CPU/GPU: dst = alpha * op(A) * op(B) + beta * dst;
                   op(A) = trans_A ? A^T : A; op(B) = trans_B ? B^T : B;
                   A is stored in Compressed Sparse Row (CSR) format.
                   Refer to https://en.wikipedia.org/wiki/Sparse_matrix
 * \param A_val values of the nnz elements in A
 * \param A_row_ptr row pointer array of A (length = number of rows + 1)
 * \param A_col_ind column indices of the nnz elements in A
 * \param B dense matrix
 * \param alpha scale applied to op(A) * op(B)
 * \param beta scale applied to the existing contents of dst
 * \param dst destination; during computation, dst will be transposed
 */
template<bool trans_A, bool trans_B, typename DType>
inline void csrmm(const Tensor<gpu, 1, DType> &A_val,
                  const Tensor<gpu, 1, int> &A_row_ptr,
                  const Tensor<gpu, 1, int> &A_col_ind,
                  const Tensor<gpu, 2, DType> &B,
                  DType alpha,
                  DType beta,
                  Tensor<gpu, 2, DType> dst);
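
For reference, a minimal scipy sketch of the semantics this declaration is meant to implement (ignoring the transpose flags):

import numpy as np
import scipy.sparse as sp

A = sp.random(4, 5, density=0.3, format='csr', dtype=np.float32)  # CSR operand
B = np.random.rand(5, 3).astype(np.float32)                       # dense operand
dst = np.zeros((4, 3), dtype=np.float32)

alpha, beta = 2.0, 0.5
dst = alpha * (A @ B) + beta * dst    # dst = alpha * op(A) * op(B) + beta * dst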

no progress at all?

Suggestion on the interface SparseInputEmbedding(data_indptr = indptr, data_indices = indices, data_value = values, weight=weight, bias = bias):
data_value should be optional. If not specified, assume all values are 1 -- this is a very common case.

We need to think about reading sparse inputs from disk too, because each record will require variable storage on disk, depending on the number of non-zero entries. I see a couple of choices here:

1) We adopt a convention like "index = -1 means missing data" and require that all records be padded with -1 out to the length of the longest record. This is simple to implement, and general, but space-inefficient.

2) Write custom InputIter classes that understand how to read variable-length records.

I/O option #1 (padding with -1's) gets much more attractive if the files are gzip encoded. That's nice in that it's also super general. (Thanks for the suggestion NB.)

When the input vector is sparse, the vector's length varies with each input. However, the in_args (NDArrayHandle *) buffers in the MXExecutorBindEX function are pre-allocated and have fixed length. Obviously, the two contradict each other.

hi, @tqchen @antinucleon, I understand that the initialization process of the static graph allocates memory for input and output variables, i.e. memory is allocated statically. When the input is sparse, the memory in use varies, so I wonder whether static memory allocation can support sparse input. I recently wanted to add sparse input support; can you help me?

@moveforever I agree these two ideas conflict: 1) sparse inputs have variable length, because the actual vector length is the number of non-zero entries in that sparse record; 2) pre-allocated fixed-size memory buffers. Also 3) minibatches, which act a lot like (2).

I think a good way to solve this is to adopt the convention that a value of -1 in the index input indicates the entry should be treated as blank (sparse). This means sparse input data will have to be padded out to a fixed size equal to the maximum number of non-zero entries; a sketch follows.
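
A minimal numpy sketch of that padding convention (the record contents are made up for illustration):

import numpy as np

records = [[3, 17, 42], [7], [1, 2, 3, 4, 5]]    # variable-length index lists
max_len = max(len(r) for r in records)

padded = np.full((len(records), max_len), -1, dtype=np.int64)  # -1 = missing
for i, r in enumerate(records):
    padded[i, :len(r)] = r

# padded:
# [[ 3 17 42 -1 -1]
#  [ 7 -1 -1 -1 -1]
#  [ 1  2  3  4  5]]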

+1 Sparse Matrix support is also needed for Keras. #4173
