Incubator-mxnet: Sparse Matrix Specification

Created on 23 Feb 2016 · 18 comments · Source: apache/incubator-mxnet

Since there is some demand for this (#919, #773), this issue is for the specification of sparse matrix support for mxnet, so interested devs can go ahead and do the job.

Data Format and Object

  • For now, we will be supporting the CSR matrix format (see the sketch after this list).

    • indptr : ndarray of uint64

    • indices : ndarray of uint32

    • values : ndarray of float
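
For concreteness, here is a minimal scipy sketch of that triplet; scipy.sparse.csr_matrix stores exactly these three arrays (scipy defaults to int32 indices, so the casts below only illustrate the dtypes proposed above):

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[0., 2., 0.],
                            [1., 0., 3.]]))

indptr  = x.indptr.astype(np.uint64)    # [0, 1, 3]: row i spans indices[indptr[i]:indptr[i+1]]
indices = x.indices.astype(np.uint32)   # [1, 0, 2]: column of each non-zero
values  = x.data.astype(np.float32)     # [2., 1., 3.]: value of each non-zero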

I think the first stage of wrapping will be done on the language native side.

First Step Operator

  • SparseInputEmbedding(data_indptr = indptr, data_indices = indices, data_value = values, weight=weight, bias = bias)

    • Same as a dense matrix input, but explicitly specifies three (dense) inputs.

  • A wrapping can be done on the language native side, as sketched below.
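
A hypothetical sketch of that language-native wrapping (SparseInputEmbedding is only the operator name proposed above, not an existing mxnet symbol): the wrapper unpacks a scipy CSR matrix so users never pass the three arrays by hand.

def sparse_input_embedding(csr, weight, bias):
    # csr is a scipy.sparse.csr_matrix; forward its triplet as the three
    # dense inputs of the proposed operator.
    return SparseInputEmbedding(data_indptr=csr.indptr,
                                data_indices=csr.indices,
                                data_value=csr.data,
                                weight=weight, bias=bias)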

Difference from Dense Matrix Op

  • Gradient won't be propagated back to the sparse matrix side.

  • Usually addto on gradient is much cheaper than writeto.

  • This can limit the types of optimizations: plain SGD can be fast if it only touches the non-missing entries, while SGD with momentum can be slow.

  • SGD has to be specially implemented for sparse matrices (see the sketch below).
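
A minimal numpy sketch of such a specialized update, assuming the gradient arrives as a set of row indices plus the gradient rows for just those indices:

import numpy as np

def sparse_sgd_update(weight, indices, grad_rows, lr):
    # Touch only the rows that actually received gradient, so the cost scales
    # with the number of non-zero rows rather than with the full weight matrix.
    # np.subtract.at accumulates correctly when an index repeats, matching the
    # "addto on gradient" behaviour noted above.
    np.subtract.at(weight, indices, lr * grad_rows)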

Transparent Interchangeability with Dense Matrix

I think the basic answer is no, mainly because it is hard to optimize for general sparse patterns (sparse-sparse addition). We want to be able to handle sparse input really well.

Design Choices

This is what we would like to discuss:

  • Choice 1: Make SparseInputEmbedding take a dense input weight, and output a dense weight gradient.

    • This benefits from the generality of the program, and can make use of existing optimizers.

    • This suffers a lot from not fully utilizing the sparse pattern when the batch size is small.

  • Choice 2: Make SparseUpdatingEmbedding take the weight as an auxiliary variable, and also include parameters such as learning rate and weight decay; it will do an explicit SGD update on the weight.

    • This benefits a lot from utilizing the sparse pattern, especially on CPU.

    • It is a bit hard to do data parallelism on multiple GPUs.

We have observed the same thing in EmbeddingOp for NLP, where users complain about its memory utilization due to choice 1. So I would expect different choices to work for different cases, and maybe we even need both, with a clear naming convention for the second one.
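
As a back-of-envelope illustration of the gap between the two choices (dim = 128 is an assumed value; the 300M-feature, ~300-non-zero workload is taken from the comments below):

vocab, dim = 300_000_000, 128             # 300M features; dim is assumed
dense_grad_bytes  = vocab * dim * 4       # choice 1: full fp32 gradient, ~153.6 GB per step
sparse_grad_bytes = 300 * dim * 4         # choice 2: ~300 touched rows, ~150 KB per step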

Call for Contribution Roadmap


All 18 comments

@antinucleon will be leading the discussion on this issue.

TensorFlow has sparse tensors and IndexedSlices tensors, which are sparse in the first dimension but dense in the other dimensions.

I think it would be easier to support the latter first, and it satisfies most use cases.

SGD with momentum can also benefit from sparse updates by recording the iteration count of the last update.
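
A sketch of one way to read that trick (an assumed formulation; for brevity it ignores the weight movement that pure momentum decay would have produced during the skipped steps): momentum decays geometrically between touches, so a row untouched for k steps can catch up with a single mu**k factor.

import numpy as np

def lazy_momentum_update(weight, mom, last_step, rows, grad_rows, lr, mu, t):
    k = t - last_step[rows]                           # steps since each row was last touched
    mom[rows] = (mu ** k)[:, None] * mom[rows] + grad_rows
    weight[rows] -= lr * mom[rows]
    last_step[rows] = t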

One other feature we are missing is stride and general slicing.

I vote for the first choice. However, I am thinking about wrapping indptr, indices, and values into a CSR matrix directly, and making the conversion to scipy.csr transparent.

@antinucleon #1 doesn't work for a large vocabulary unless we support row-sparse (or IndexedSlices) tensors.

We have around 300M features but each sample has only around 300 non-zeros.
If the batch size is small, a dense weight and dense gradient will be wasteful.


What's the progress of this issue?

no progress at all?

@ProgramCaiCai I've tried to support sparsity in MShadow. We may need two key functions, csrmm and csr2csc (refer to http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csrmm and http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csr2csc). However, I met some problems when trying to call the cuSPARSE APIs directly, since the sparse matrix there is stored row-major while the dense matrices are column-major. For this reason I haven't continued the implementation. The following is the declaration of the csrmm function.

/*!
 * \brief CPU/GPU: dst = alpha * op(A) * op(B) + beta * dst;
                   op(A) = trans_A ? A^T : A; op(B) = trans_B ? B^T : B;
                   A is stored in Compressed Sparse Row (CSR) format.
                   Refer to https://en.wikipedia.org/wiki/Sparse_matrix
 * \param A_val values of the nnz elements in A
 * \param A_row_ptr row pointer array of A (length = number of rows + 1)
 * \param A_col_ind column indices of the nnz elements in A
 * \param B dense matrix
 * \param alpha scale applied to op(A) * op(B)
 * \param beta scale applied to the existing contents of dst
 * \param dst destination; during computation, dst will be transposed
 */
template<bool trans_A, bool trans_B, typename DType>
inline void csrmm(const Tensor<gpu, 1, DType> &A_val,
                  const Tensor<gpu, 1, int> &A_row_ptr,
                  const Tensor<gpu, 1, int> &A_col_ind,
                  const Tensor<gpu, 2, DType> &B,
                  DType alpha,
                  DType beta,
                  Tensor<gpu, 2, DType> dst);
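
For reference, a minimal scipy sketch of the semantics this declaration is meant to implement (ignoring the transpose flags):

import numpy as np
import scipy.sparse as sp

A = sp.random(4, 5, density=0.3, format='csr', dtype=np.float32)  # CSR operand
B = np.random.rand(5, 3).astype(np.float32)                       # dense operand
dst = np.zeros((4, 3), dtype=np.float32)

alpha, beta = 2.0, 0.5
dst = alpha * (A @ B) + beta * dst    # dst = alpha * op(A) * op(B) + beta * dst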

no progress at all?

Suggestion on the interface SparseInputEmbedding(data_indptr = indptr, data_indices = indices, data_value = values, weight=weight, bias = bias):
data_value should be optional. If not specified, assume all values are 1 -- this is a very common case.

We need to think about reading sparse inputs from disk too, because each record will require variable storage on disk, depending on the number of non-zero entries. I see a couple of choices here:

1) We adopt a convention like "index = -1 means missing data" and require that all records be padded with -1 out to the length of the longest record. This is simple to implement, and general, but space-inefficient.

2) Write custom InputIter classes that understand how to read variable-length records.

I/O option #1 (padding with -1's) gets much more attractive if the files are gzip encoded. That's nice in that it's also super general. (Thanks for the suggestion NB.)

When the input vector is sparse, the vector's length varies with each input. However, the in_args (NDArrayHandle *) buffers in the MXExecutorBindEX function are pre-allocated and have fixed length. Obviously, the two contradict each other.

hi, @tqchen @antinucleon, I understand that the initialization process of the static graph allocates memory for input and output variables, i.e. memory is allocated statically. When the input is sparse, the memory in use varies, so I wonder whether static memory allocation can support sparse input. I recently wanted to add sparse input support; can you help me?

@moveforever I agree these two ideas conflict: 1) sparse inputs have variable length, because the actual vector length is the number of non-zero entries in that sparse record; 2) pre-allocated fixed-size memory buffers. Also 3) minibatches, which act a lot like (2).

I think a good way to solve this is to adopt the convention that a value of -1 in the index input indicates the entry should be treated as blank (sparse). This means sparse input data will have to be padded out to a fixed size equal to the maximum number of non-zero entries; a sketch follows.
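
A minimal numpy sketch of that padding convention (the record contents are made up for illustration):

import numpy as np

records = [[3, 17, 42], [7], [1, 2, 3, 4, 5]]    # variable-length index lists
max_len = max(len(r) for r in records)

padded = np.full((len(records), max_len), -1, dtype=np.int64)  # -1 = missing
for i, r in enumerate(records):
    padded[i, :len(r)] = r

# padded:
# [[ 3 17 42 -1 -1]
#  [ 7 -1 -1 -1 -1]
#  [ 1  2  3  4  5]]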

+1 Sparse Matrix support is also needed for Keras. #4173
