Chapel: Sparse Matrix Product Performance Bottleneck

Created on 2 May 2018 · 36 comments · Source: chapel-lang/chapel

There seem to be some issues with `dot` at scale. The first issue is that transposition seems to be killing performance. In a test with a ~2.5 mil by ~2.5 mil sparse matrix with ~3.3 mil nonzeros, the performance for various operations was as follows:

````
Time to Transpose (and handle) the Matrix: 12.2229
Time to Sum A + AT: 15.9925

Sanity Checks
AT's Number of Nonzeros: 3300832
A + AT's Number of Nonzeros: 6597028

Time to Square A: 15.7839
Nonzeros in A^2: 7963457
Time to Transpose A^2: 31.7755
Time to Square AT: 1779.8
Nonzeros in AT^2: 7963457
````

Second, attempts to calculate (A + A^T)^2, AA^T, and A^T A failed to reach completion (`Killed`).

Labels: Libraries / Modules, Performance, user issue

All 36 comments

I suspect the transpose is so much more expensive because it has significantly more nonzero rows than the original matrix. There are additional `LinearAlgebra.Sparse.dot()` performance improvements that I would consider low-hanging fruit; that may be a first step to tackling this.

We'll also want a performance test that generates a random matrix that can replicate these results, so that we can measure our improvements and compare to other linear algebra libraries for reference.

@ben-albrecht just revisiting this. Now it seems that even the squaring operation is hanging too

Hi guys, this ticket is very important to us. Do you have any sense of scheduling for it?

Do you have any sense of scheduling for it?

@buddha314 - Maybe a good question for @bradcray or @benjamin-robbins

@Tshimanga - can you get us a reproducer in the meantime?

We'll write a generator for a big, sparse matrix and take its transpose, then multiply it. Would that help?

@ben-albrecht reproducing the problem is just a matter of having a large matrix and squaring it. Here is a "generator", but I don't expect it's the best way to build these things. That said, you guys have big machines that'll probably handle it fine.

````chapel
proc generateRandomSparseMatrix(size: int, sparsity: real) {
  const D: domain(2) = {1..size, 1..size};
  var SD: sparse subdomain(D) dmapped CS();
  var R: [SD] real;
  var da: domain(1) = {1..size};
  var array: [da] int = 1..size;
  var dom: domain(1) = {1..0};
  var indices: [dom] 2*int;
  var N = floor(size*(1-sparsity)): int;
  forall i in array {
    forall j in array {
      indices.push_back((i,j));
    }
  }
  shuffle(indices);
  var sparseids = indices[1..N];
  SD.bulkAdd(sparseids);
  forall (i,j) in sparseids {
    R(i,j) = 1;
  }
  return R;
}
````

yeah, like that. the "Royal We" did that...

We'll write a generator for a big, sparse matrix and take its transpose, then multiply it.

Yes, if that's all it takes to reproduce. I suspected that the number of nonzero rows played a significant role in the slowdown, so ideally something that generates matrices similar to the matrices you're trying to square.

@Tshimanga - what are the inputs you would give to the above example? I tried the following and got an array index out of bounds error:

````chapel
use LayoutCS;
use Random;

proc generateRandomSparseMatrix(size: int, sparsity: real) {
  const D: domain(2) = {1..size, 1..size};
  var SD: sparse subdomain(D) dmapped CS();
  var R: [SD] real;
  var da: domain(1) = {1..size};
  var array: [da] int = 1..size;
  var dom: domain(1) = {1..0};
  var indices: [dom] 2*int;
  var N = floor(size*(1-sparsity)): int;
  forall i in array {
    forall j in array {
      indices.push_back((i,j));
    }
  }
  shuffle(indices);
  var sparseids = indices[1..N];
  SD.bulkAdd(sparseids);
  forall (i,j) in sparseids {
    R(i,j) = 1;
  }
  return R;
}

// 2.5 mil x 2.5 mil
var N = (2.5e6):int;
// 3.3 mil nonzeros, sparsity = 1 - density
var s = 1 - (3.3e6 / ((2.5e6)**2));

var A = generateRandomSparseMatrix(N, s);
````

Yeah, those inputs should work. I'm not sure why it's erroring; I wrote that generator 4 months ago and have only used it a couple of times. What line did it error at? @ben-albrecht

what line did it error at?

`sparseMultiply.chpl:15: error: halt reached - array index out of bounds: (12)`, which is the `indices.push_back((i,j))` call. Does the above example work for you?

Oh, you might need to change out the foralls for for loops (`push_back` isn't safe to call from multiple tasks at once).

@Tshimanga - your generator creates a 1D array with domain 1..size*size for building the indices, which would end up being around 100TB of memory if you were to use the suggested problem size of 2.5e6. I think you have overestimated how big our machines are :)

Can you provide an example that you've confirmed both works on your end and reproduces the performance issue?

Can you tell us how to do those estimates? Probably a basic question for you.

Can you tell us how to do those estimates?

Sure, `indices` ends up being `indices: [1..size*size] 2*int`:

2.5e6 * 2.5e6 (size*size) * 2*8 bytes (int(64), int(64)) = 1e14 bytes = 100 TB
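The estimate above can be sanity-checked with a few lines of Python (the variable names here are ours, not Chapel's):

```python
# Back-of-envelope footprint of the dense index array in the generator above
size = int(2.5e6)
bytes_per_index = 2 * 8                        # a 2*int tuple: two 64-bit ints
dense_index_bytes = size * size * bytes_per_index
print(dense_index_bytes)                       # 100000000000000 (1e14 bytes)
print(dense_index_bytes / 1e12)                # 100.0 (TB)
```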

So just constructing the domain hits this size, right? that's before we populate the matrix.

So just constructing the domain hits this size, right?

Nope, it's the construction of the array. The domain memory footprint is constant w.r.t. domain size.

All right, so if the matrix has sparsity p, does the footprint become 100TB * p ? Or is that too simple?

CSR array memory usage is O(2*nnz + (numRows+1)), so the CSR representation of this matrix (numRows = 2.5e6, nnz = 3.3e6) should only be ~73 MB, assuming a real(64) element type.

The matrix generator above shouldn't need to create a size*size dense array to create the sparse array.
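For reference, the ~73 MB figure follows from summing the three CSR component arrays (a quick check in Python; the breakdown into values / column indices / row pointers is the standard CSR layout):

```python
nnz = int(3.3e6)
num_rows = int(2.5e6)

# CSR stores: values (real(64)), column indices (int(64)), row pointers (int(64))
values_bytes = nnz * 8
col_index_bytes = nnz * 8
row_ptr_bytes = (num_rows + 1) * 8

csr_bytes = values_bytes + col_index_bytes + row_ptr_bytes
print(csr_bytes / 1e6)   # ~72.8 (MB)
```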

Just looking at it now. Can you suggest a better strategy?

@ben-albrecht yeah I've never used that generator at scale, nor in the past 3 or 4 months. The matrices that I normally use are our own data. Do you guys have any such matrices lying around? Or a good routine to build them?

My suggestion would be to use the iterator over i and j and check every entry, only adding to SD with probability p. Ben?
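As a sketch of that idea in Python (the function name is ours): visit every entry and keep it with probability p, so memory stays O(nnz) even though the loop is still O(size²) coin flips. That per-entry cost is why this approach is only practical for modest sizes.

```python
import random

def bernoulli_sparse_indices(size, p, seed=1):
    """Yield each (i, j) in a size x size grid with probability p.
    Only O(nnz) memory is used; no dense index array is built."""
    rng = random.Random(seed)
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            if rng.random() < p:
                yield (i, j)

# Expected nnz is size**2 * p = 1000, give or take sampling noise
idxs = list(bernoulli_sparse_indices(1000, 0.001))
```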

Check out MatrixUtils.chpl from the LinearAlgebra performance test suite for an example of a sparse matrix generator.

There's probably a number of improvements that could be made to that as well, but it has been good enough for my purposes so far.

@ben-albrecht sparse matrix generation is a bit of a digression here, though. If the generator you linked works at scale, then we can get back to the main issue of figuring out the `dot()` performance bottleneck, right? We only need a working generator to feed your reproducer.

We only need a working generator to feed your reproducer.

Right, that's essentially what I'm asking for -- a standalone program that reproduces the performance issue, so that whoever ends up tackling this issue doesn't need to spend a lot of effort trying to figure that out.

I completely concur with what Ben's saying here. It's not enough to say "We wrote some code and it doesn't perform well — see if you can manage to write some code that approximates what we've done and then fix it for us." A good issue also needs to include some sort of reproducer of the issue so that a developer can dive into the problem with the assurance that we're both looking at the same thing.

Sure thing. The actual code we wrote looked kinda like

````chapel
var P: [SD] real;     // SD is a sparse subdomain of D = {1..2.5M, 1..2.5M}
// fill P with real numbers to a sparsity p of about 1e-7; nothing special here, this completes
var Q = P.dot(P.T);
````

and it does not complete. So any matrix of that size / sparsity should reproduce the error. It's possible we don't have the horsepower to pull it off. Ben seems to have a more performant sparse matrix generator than we have.

@ben-albrecht tried to incorporate the populate proc you linked as follows:

````chapel
module SparseDot {
  use LinearAlgebra,
      LinearAlgebra.Sparse,
      Time,
      Random;

config const size: int = 2500000;
config const nnz: int = 3300000;

var Adom: domain(2) = {1..size, 1..size};
var Asps = CSRDomain(Adom);
var A: [Asps] real;

proc populate(ref A, ref ADom, nnz: int, seed: int) where isCSArr(A) {
  //const nnz = (ADom._value.parentDom.size * sparsity): int;
  var indices: [1..nnz] 2*int;
  var randomIndices = makeRandomStream(eltType=int, seed=seed);
  for idx in indices {
    // Ensure no duplicates
    var newIdx = idx;
    while indices.find(newIdx)(1) {
      newIdx = (randomIndices.getNext(ADom.dim(1).low, ADom.dim(1).high),
                randomIndices.getNext(ADom.dim(2).low, ADom.dim(2).high));
    }
    idx = newIdx;
  }

  ADom += indices;

  var randomReals = makeRandomStream(eltType=real, seed=seed);
  for idx in ADom {
    A[idx] = 2.0; //randomReals.getNext();
  }
  delete randomIndices;
  delete randomReals;
}

var t: Timer;
t.start();
populate(A, Adom, nnz, seed=1);
t.stop();
writeln("Time to Generate Matrix: ", t.elapsed());

var t1: Timer;
t1.start();
var A2 = A.dot(A);
t1.stop();
writeln("Time to Square A: ", t1.elapsed());

}
````

However, I got the following error. Any thoughts on how to mend this?

````
bk@bk:/projects/reproducers$ make sparse-dot
chpl -Msrc -M/projects/cdo/src -M/projects/numsuch/src -M/projects/charcoal/src -M/projects/chingon/src  -I/usr/local/Cellar/openblas/0.2.20/include -I/usr/include/postgresql -L/usr/local/Cellar/openblas/0.2.20/lib -lblas -o test/test test/SparseDot.chpl; \
./test/test;  \
rm test/test
test/SparseDot.chpl:21: In function 'populate':
test/SparseDot.chpl:35: error: Cannot add indices to a rectangular domain
/bin/sh: 2: ./test/test: not found
rm: cannot remove 'test/test': No such file or directory
Makefile:14: recipe for target 'sparse-dot' failed
make: *** [sparse-dot] Error 1
````

@Tshimanga - You'll need to pass in the sparse domain to the populate procedure - try this:

````chapel
populate(A, Asps, nnz, seed=1);
````

@ben-albrecht Thanks Ben! So this should reproduce the problem, buuuuut, at size=10,000 it takes 13 seconds for me to generate the matrix prior to squaring, so I don't think I'll actually be able to confirm that it reproduces the problem, since at size=1,000,000 it'll take roughly 10,000 times longer.

````chapel
module SparseDot {
  use LinearAlgebra,
      LinearAlgebra.Sparse,
      Time,
      Random;

config const size: int = 100000;
config const nnz: int = 200000;

var Adom: domain(2) = {1..size, 1..size};
var Asps = CSRDomain(Adom);
var A: [Asps] real;

proc populate(ref A, ref ADom, nnz: int, seed: int) where isCSArr(A) {
  //const nnz = (ADom._value.parentDom.size * sparsity): int;
  var indices: [1..nnz] 2*int;
  var randomIndices = makeRandomStream(eltType=int, seed=seed);
  for idx in indices {
    // Ensure no duplicates
    var newIdx = idx;
    while indices.find(newIdx)(1) {
      newIdx = (randomIndices.getNext(ADom.dim(1).low, ADom.dim(1).high),
                randomIndices.getNext(ADom.dim(2).low, ADom.dim(2).high));
    }
    idx = newIdx;
  }

  ADom += indices;

  var randomReals = makeRandomStream(eltType=real, seed=seed);
  for idx in ADom {
    A[idx] = 2.0; //randomReals.getNext();
  }
  delete randomIndices;
  delete randomReals;
}

var t: Timer;
t.start();
populate(A, Asps, nnz, seed=1);
t.stop();
writeln("Time to Generate Matrix: ", t.elapsed());

var t1: Timer;
t1.start();
var A2 = A.dot(A);
t1.stop();
writeln("Time to Square A: ", t1.elapsed());

}
````

it takes 13 seconds for me to generate the matrix prior to squaring so I don't think I'll actually be able to confirm that it'll reproduce the problem

Alright, so we're going to need a faster sparse matrix generator. If we don't actually need the indices to be random, that opens up a lot of options. I opted for something pretty simple - iterate over the rows repeatedly, adding indices at random columns:

````chapel
use Time,
    Random;

use LinearAlgebra,
    LinearAlgebra.Sparse;

config const size: int = 2.5e6:int;
config const nnz: int = 3.3e6:int;

const D = {1..size, 1..size};

var t: Timer;
writeln('Input size: ', size);
writeln('Input nnz : ', nnz);

writeln('Generating sparse matrix');

t.start();
var spsD = randomCSRDom(D, nnz, seed=1);
var A: [spsD] real = 1.0;
t.stop();
writeln('Time to Generate Matrix: ', t.elapsed());
t.clear();

writeln('Actual nnz: ', A.size);

t.start();
var AT = transpose(A);
t.stop();
writeln('Time to transpose(A): ', t.elapsed());
t.clear();

t.start();
A.dot(A);
t.stop();
writeln('Time to A.dot(A): ', t.elapsed());
t.clear();

t.start();
AT.dot(AT);
t.stop();
writeln('Time to AT.dot(AT): ', t.elapsed());

proc randomCSRDom(D, nnz: int, seed: int) {
  var spsD = CSRDomain(D);
  var indices: [1..nnz] 2*int;
  var randomIndices = makeRandomStream(eltType=int, seed=seed);

  // Duplicates are improbable enough that we're not going to worry about them
  var i = 1;
  for idx in 1..nnz {
    var newIdx = (i, randomIndices.getNext(spsD.dim(2).low, spsD.dim(2).high));
    indices[idx] = newIdx;
    i += 1;
    if i > spsD.dim(1).high {
      i = 1;
    }
  }

  spsD += indices;
  delete randomIndices;
  return spsD;
}
````
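The "duplicates are improbable" comment can be quantified (this analysis is ours, not from the thread): rows are visited cyclically, so only the first nnz % size rows receive two entries, and a duplicate requires a row's two random columns to coincide, which happens with probability 1/size per such row.

```python
size = int(2.5e6)
nnz = int(3.3e6)

# Rows visited cyclically: the first (nnz % size) rows get 2 entries, rest get 1
rows_with_two = nnz % size             # 800000 rows
expected_dups = rows_with_two / size   # each collides with probability 1/size
print(expected_dups)                   # 0.32
```

An expected ~0.32 duplicates is consistent with the single duplicate visible in the "Actual nnz: 3299999" output.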

Compiling and running this on my macbook pro (2.8 GHz Intel Core i7) with MKL and master branch, I get the following output/timings:

````
> chpl sparseDot.chpl --fast
> ./sparseDot
Input size: 2500000
Input nnz : 3300000
Generating sparse matrix
Time to Generate Matrix: 2.89073
Actual nnz: 3299999
Time to transpose(A): 0.728601
Time to A.dot(A): 2.14971
Time to AT.dot(AT): 1.87488
````

@Tshimanga - Can you confirm you get similar timings on your hardware?

@ben-albrecht Hmm, yes that did work (and the mystery grows...) :

````
bk@bk:/projects/reproducers$ make sparse-dot
chpl -Msrc -M/projects/cdo/src -M/projects/numsuch/src -M/projects/charcoal/src -M/projects/chingon/src  -I/usr/local/Cellar/openblas/0.2.20/include -I/usr/include/postgresql -L/usr/local/Cellar/openblas/0.2.20/lib -lblas -o test/test test/SparseDot.chpl; \
./test/test;  \
rm test/test
Input size: 2500000
Input nnz : 3300000
Generating sparse matrix
Time to Generate Matrix: 15.5935
Actual nnz: 3299999
Time to transpose(A): 14.8048
Time to A.dot(A): 9.72117
Time to AT.dot(AT): 9.92459
bk@bk:/projects/reproducers$
````

@ben-albrecht I also checked the result sizes and built what was initially our target (A.plus(AT)).dot(A.plus(AT)):

````
bk@bk:/projects/reproducers$ make sparse-dot
chpl -Msrc -M/projects/cdo/src -M/projects/numsuch/src -M/projects/charcoal/src -M/projects/chingon/src  -I/usr/local/Cellar/openblas/0.2.20/include -I/usr/include/postgresql -L/usr/local/Cellar/openblas/0.2.20/lib -lblas -o test/test test/SparseDot.chpl; \
./test/test;  \
rm test/test
Input size: 2500000
Input nnz : 3300000
Generating sparse matrix
Time to Generate Matrix: 14.7007
Actual nnz: 3299999
Time to transpose(A): 12.8871
Time to A.dot(A): 9.0121
Number of Nonzeros: 4356069
Time to AT.dot(AT): 9.25516
Number of Nonzeros: 4356069
Time to AT.dot(A): 12.6028
Number of Nonzeros: 3431520
Time to A.dot(AT): 14.402
Number of Nonzeros: 6857870
Time to (A.plus(AT)).dot(A.plus(AT)): 71.9546
Number of Nonzeros: 17169994
````

So it seems the issue isn't as simple as "`.dot()` doesn't work". Any ideas? The first thing that comes to mind is that the matrices I am using aren't random, so maybe the additional structure is causing problems, but I don't know :confused:

So it seems the issue isn't as simple as ".dot() doesn't work"

This is why we generally ask for a standalone reproducer for performance issues - it eliminates the iterative guesswork on our end.

so maybe the additional structure is causing problem

Sparse matrix multiplication performance can vary depending on matrix properties such as nnz, rows, density, nnz/row, etc. See Table 1 / Figure 4 from this paper as an example of this.

The easiest solution would be to provide us with the production matrix that exhibits the performance issue, but I understand that's not an option since you're working with proprietary data.

I'd suggest tuning the pseudo random matrix generator to reproduce some of the aforementioned matrix properties of your production matrix, e.g. nnz/row, to find a matrix that reproduces the issue. This would be preferred, as it would help identify the cause of the slowdown.

Alternatively, it might be possible to modify your production matrix such that you are able to share it with us privately, e.g. by setting all nonzero elements to 1.0 and randomizing the column indices of each row.
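That anonymization procedure can be sketched in a few lines of Python over a plain-list CSR representation (the function name and representation are ours, not part of Chapel's LinearAlgebra API):

```python
import random

def anonymize_csr(indptr, indices, data, ncols, seed=1):
    """Set every nonzero to 1.0 and replace each row's column indices with
    fresh random distinct columns. This preserves nnz/row (the property that
    drives sparse dot performance) while hiding the values and structure."""
    rng = random.Random(seed)
    new_indices = []
    for r in range(len(indptr) - 1):
        row_nnz = indptr[r + 1] - indptr[r]
        new_indices.extend(sorted(rng.sample(range(ncols), row_nnz)))
    return indptr, new_indices, [1.0] * len(data)

# Tiny 3x3 example: rows with 2, 1, and 2 nonzeros respectively
indptr, indices, data = anonymize_csr([0, 2, 3, 5], [0, 2, 1, 0, 2],
                                      [3.5, 7.1, 2.2, 9.0, 4.4], ncols=3)
```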

This is why we generally ask for a standalone reproducer for performance issues - it eliminates the iterative guess work on our end.

Indeed :+1:

@Tshimanga - with the matrix you sent me, I ran this test and got the following timings:

````
Reading and creating A: 2.09944
A.dot(A): 0.678393
transpose(A): 0.533113
AT.dot(AT): 12.1283
````

Again, I am compiling and running this on my macbook pro (2.8 GHz Intel Core i7) with MKL on the master branch.

Here is my compilation line in case that's helpful:

````
chpl issue-9445.chpl --fast -lmkl_core -lmkl_intel_lp64 -lmkl_sequential \
      -lpthread -lm -ldl --ccflags -Wno-enum-conversion -sisBLAS_MKL=true
````

@ben-albrecht I can confirm that --fast solves the issue. Glad to know that it exists now. Thanks for the help!

