Start a thread to gather inputs/comments/suggestions. @pluskid @tqchen @piiswrong @mli @jermainewang @szha
There are two potential ways to represent sparse ndarrays: either create a new class, or add a type to the existing ndarray class. We need to figure out which one is more desirable.
Two Python classes:
- NDArray: normal NDArray with data stored in dense format.
- SparseNDArray: NDArray with data stored in sparse format. For example:
sparse_mat = mx.sparse_nd(sparse_type = 'coo', indices = indices, values = values, shape = shape)
out = mx.sparse_nd.sparse_tensor_dense_matmul(sparse_mat, dense_mat)
dense_mat = mx.sparse_nd.to_dense(sparse_mat)
NDArray with sparse_type
Under the hood, the NDArray class will handle sparse-sparse, sparse-dense, and dense-dense operators. If an operator is not implemented for the sparse data type, the tensor will be converted to dense format first. Warnings will be thrown to notify users that the sparse_type of the ndarray has changed.
Operators to support include:
- add, sub, mul, sparse_dense_matrix_mul, etc.
- asnumpy, etc.
- getitem, setitem
- empty, zeros, etc.
- equal, etc.
- sparse_type, etc.
We'll further discuss which operators to prioritize in another issue, since there are many to implement.
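For comparison with the mx.sparse_nd example above, here is a hypothetical sketch of how this single-class option might look from Python; the constructor and attribute names (mx.nd.sparse_array, sparse_type) are illustrative assumptions rather than a settled API.

```python
import numpy as np
import mxnet as mx

dense_mat = mx.nd.array(np.random.rand(3, 4))        # sparse_type would be 'dense'

# Hypothetical constructor mirroring mx.sparse_nd(...) above.
indices = np.array([[1, 0], [3, 2]])
values = np.array([1.0, 3.0])
sparse_mat = mx.nd.sparse_array(indices=indices, values=values,
                                shape=(4, 3), sparse_type='coo')

out = mx.nd.dot(sparse_mat, dense_mat)   # sparse * dense: a sparse kernel is used if registered
tmp = sparse_mat + sparse_mat            # no sparse kernel: converted to dense, with a warning
print(tmp.sparse_type)                   # 'dense' after the fallback conversion
```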
Sparse support is added to the original NDArray to avoid v-table and API changes. A sparse tensor is represented by two dense tensors, index and data. There are two possible sparse formats:
Let K denote the number of tensor dimensions.
Let M be the number of rows containing non-zero elements.
Let N be the number of non-zero values in all dimensions of the tensor.
Row-sparse tensor (similar to IndexedSlices in TensorFlow), where each row of data, data[i], corresponds to X[index[i]]. Rows with all zeros are omitted. Each stored row is a dense tensor.
-- shape: (K)
-- index shape: (M)
-- data shape: (M, shape[1], ..., shape[K-1])
COO sparse tensor where X[tuple(index[i])] = data[i]
-- shape: (K)
-- index shape: (N, K)
-- data shape: (N)
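As a concrete NumPy-only illustration of the two layouts for the 4x3 matrix X = [[0,0,0],[1,2,0],[0,0,0],[0,0,3]] (variable names here are for exposition only):

```python
import numpy as np

shape = (4, 3)      # the full tensor X has K = 2 dimensions

# Row-sparse: keep only the M = 2 rows that contain non-zeros, each stored densely.
rs_index = np.array([1, 3])                      # index shape: (M)
rs_data = np.array([[1., 2., 0.],                # data[i] == X[rs_index[i]]
                    [0., 0., 3.]])               # data shape: (M, shape[1])

# COO: keep the N = 3 individual non-zeros and their K-dimensional coordinates.
coo_index = np.array([[1, 0], [1, 1], [3, 2]])   # index shape: (N, K)
coo_data = np.array([1., 2., 3.])                # data shape: (N)
```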
Instead of creating another NDArray class, we could introduce ChunkType to help store the indices for sparse tensors. This avoids vtable lookups and API changes.
class NDArray {
 public:
  struct Chunk {
    // auxiliary storage handle (e.g. indices for sparse data)
    Storage::Handle aux_shandle;
    // storage handle from the storage engine
    Storage::Handle shandle;
    ...
  };
  enum ChunkType {
    kDense,
    kRowSparse,
    kCOOSparse,
    // kMKL, etc.
  };
  ChunkType chunk_type();
  TBlob data() {
    // Error when called on a sparse ndarray
    CHECK(chunk_type_ == kDense); ...
  }
  TBlob sparse_data() {
    // Blocks because length_ can change
    this->WaitToRead(); ...
  }
  TBlob sparse_index() {
    this->WaitToRead(); ...
  }
  void SparseResize() {
    // Change the size of the actual data. Reallocate when upsizing.
    // The caller must hold the write lock.
    data_->Reallocate(); ...
  }
  NDArray to_dense();
 private:
  ChunkType chunk_type_;
  shared_ptr<Chunk> data_;
};
The reason for introducing ChunkType and keeping the underlying Chunk opaque is to abstract away auxiliary memory allocated for an NDArray (e.g. indices for sparse data, mkl_mem_, etc.).
NDArray op registration uses the following interface:
using FComputeNDArray = std::function<void(
const nnvm::NodeAttrs& attrs,
const OpContext& ctx,
const std::vector<NDArray>& inputs,
const std::vector<OpReqType>& req,
const std::vector<NDArray>& outputs)>;
For imperative execution, in MXImperativeInvoke, if at least one of the input NDArrays is sparse, try to use FComputeNDArray. If no such operator is registered but an input tensor is sparse, call to_dense on all inputs and fall back to the dense path.
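A minimal, plain-Python sketch of this dispatch rule (no MXNet code; the kernel registry and the to_dense hook are hypothetical stand-ins): prefer a registered FComputeNDArray kernel when any input is sparse, otherwise densify and fall back.

```python
import warnings

def dispatch(op_name, inputs, sparse_kernels, to_dense):
    """inputs: list of (array, is_sparse) pairs; sparse_kernels maps op names to
    FComputeNDArray-style kernels. Both arguments are hypothetical stand-ins."""
    if any(is_sparse for _, is_sparse in inputs):
        if op_name in sparse_kernels:
            return sparse_kernels[op_name], inputs   # sparse-aware path
        warnings.warn("%s has no sparse kernel; converting inputs to dense" % op_name)
        inputs = [(to_dense(arr) if is_sparse else arr, False)
                  for arr, is_sparse in inputs]
    return None, inputs   # None -> fall through to the regular dense FCompute
```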
Weights are initialized in dense format; gradients are initialized with the specified dense/sparse types.
The memory for sparse tensors cannot be planned ahead of time and is allocated during op execution. For a graph with sparse tensor operations, we need to infer whether any node in the graph takes sparse input, and try to share memory at execution time. There are two ways to infer the chunk type:
- Add an InferChunkType pass, which infers chunk types based on the input chunk types. Ctx information will help infer MKL chunks. (A rough sketch of such a pass is given below.)
- Compose InferChunkType, InferShape, and InferType into a single InferNDProperty pass, which performs all three. Type, shape, and chunk type are wrapped into a struct, and an additional NDArray constructor taking this struct will be added.
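A rough Python sketch of how an InferChunkType-style forward pass could work; the per-op rule table and the graph representation here are illustrative assumptions only.

```python
DENSE, ROW_SPARSE, COO = 'dense', 'row_sparse', 'coo'

# Hypothetical per-op rules: list of input chunk types -> output chunk type.
infer_rules = {
    'sparse_dense_matmul': lambda in_types: DENSE,
    'elemwise_add': lambda in_types: (ROW_SPARSE if all(t == ROW_SPARSE for t in in_types)
                                      else DENSE),
}

def infer_chunk_types(topo_order, graph, input_types):
    """graph: node -> (op name, list of input nodes); input_types: chunk types of graph inputs."""
    types = dict(input_types)
    for node in topo_order:
        if node in types:                              # a graph input, already typed
            continue
        op, deps = graph[node]
        in_types = [types[d] for d in deps]
        rule = infer_rules.get(op, lambda _: DENSE)    # unknown ops default to dense
        types[node] = rule(in_types)
    return types

# e.g. a row-sparse weight w and a dense input x feeding a sparse_dense_matmul node y:
graph = {'y': ('sparse_dense_matmul', ['w', 'x'])}
print(infer_chunk_types(['w', 'x', 'y'], graph, {'w': ROW_SPARSE, 'x': DENSE}))
```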
Extend FComputeExecutor to execute FComputeNDArray in addition to FCompute. If the input chunk type is sparse while FCompute expects dense input, the sparse data will be converted to dense format when OpExecutor::Run is invoked.
Since sparse tensors request memory at execution time, we should use a memory pool so that operators can allocate/reallocate memory for their output buffers.
KVStore would take a sparse ndarray and perform the update; we can simply implement the SGD op with the FComputeNDArray interface. For communication, the keys should be the row ids of the parameters.
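A rough usage sketch of this KVStore flow; the sparse constructors, tostype, and row_sparse_pull calls shown here are assumptions about what the eventual Python interface could look like, not part of this proposal.

```python
import mxnet as mx

kv = mx.kv.create('local')
shape = (10000, 128)                                   # e.g. an embedding weight matrix
kv.init('weight', mx.nd.zeros(shape).tostype('row_sparse'))
kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))  # SGD update applied on push

# The gradient only touches a few rows, so it is pushed as a row-sparse ndarray.
row_ids = mx.nd.array([0, 17, 42], dtype='int64')
grad = mx.nd.sparse.row_sparse_array((mx.nd.ones((3, 128)), row_ids), shape=shape)
kv.push('weight', grad)

# Pull back only the rows this worker needs, keyed by row id.
out = mx.nd.sparse.zeros('row_sparse', shape)
kv.row_sparse_pull('weight', out=out, row_ids=row_ids)
```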
Suggestion for a name change:
FComputeNDArray -> FCompute
The most common format is CSR (or CSC) for matrices (and the sliced indices for tensors), so this should be the top priority. The second priority is COO, which is usually converted to CSR before processing for efficiency.
For many algorithms, the SpMV (sparse matrix * dense vector) operation is very common. The second most common is SpGEMM (sparse matrix * sparse matrix).
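For reference only (SciPy, not MXNet API), a small example of the two kernels mentioned above, SpMV on a CSR matrix and SpGEMM:

```python
import numpy as np
from scipy import sparse

# A 3x4 CSR matrix with 4 non-zeros (data / column indices / row pointers).
A = sparse.csr_matrix((np.array([1., 2., 3., 4.]),
                       np.array([0, 3, 1, 2]),
                       np.array([0, 2, 3, 4])), shape=(3, 4))
x = np.ones(4)
y = A.dot(x)      # SpMV: sparse matrix * dense vector -> dense vector
B = A.dot(A.T)    # SpGEMM: sparse matrix * sparse matrix -> sparse matrix
```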
To add to this, the inputs (features) can also be sparse vectors (such as bag-of-words), which can then be transformed into dense form via a random mapping. So that would be (dense matrix * sparse vector) computations.
Is there any reference where the random mapping actually works on sparse features followed by deep learning? We run sparse logistic... we tried factorization/SVD etc. to generate a dense mapping for the sparse features, but it's not effective... things like locality-sensitive hashing may work, but then we can't go back to the feature dimension...
Sparse features have been merged thanks to the hard work led by @eric-haibin-lin and team members. Thanks!