This issue resulted from another question of mine at Stack Overflow.
Rather than repeating the entire description of the problem here, I will just provide a link to the original SO question:
https://stackoverflow.com/questions/51714000/is-there-a-way-to-customize-the-default-parallelization-behavior-of-whole-array
TL;DR: the performance of whole-array operations is significantly worse than the performance one obtains when replacing such operations by _serial_ for loops over the arrays. The performance penalty ranges from more than an order of magnitude for arrays of size 300x300, to about a factor of two for array sizes of 600x600x600 (always compared to _serial_ for loops; compared to properly parallelized loop nests it is even larger, see the SO link above).
This is a show-stopper for me, as it precludes porting to Chapel a Python/Numpy code of mine that makes extensive use of array operations (the Numpy version being ten times faster than a Chapel version would be)!
From Brad's original answer to the original SO question it emerged that this is known behavior for the multidimensional zippered loop iteration, which is used to implement whole-array operations in Chapel.
I have to wonder, then, why this algorithm (with its known problems, and regardless of how much academic promise it holds or whether it might be fixed sometime in the future) is still used as the default for implementing whole-array operations in Chapel.
Performance-wise, this can hardly get worse than it presently is. Why aren't whole-array operations implemented by default using simple serial for loops, or by parallelizing only the outermost loop index (see the SO link)? This would immediately make Chapel a factor of 2-25 faster on whole-array operations!
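For concreteness, here is a minimal sketch of the three forms being compared; the array names and sizes are illustrative, not the actual SO benchmark:

config const n = 300;
const D = {1..n, 1..n};
var A, B, C: [D] real;
B = 1.0; C = 2.0;

// 1) whole-array form (promoted, parallel by default) -- the slow case reported above
A = B + C;

// 2) explicit *serial* loop nest -- reportedly much faster at these sizes
for (i, j) in D do
  A[i, j] = B[i, j] + C[i, j];

// 3) parallelism over the outermost index only, with a simple serial inner loop
forall i in 1..n do
  for j in 1..n do
    A[i, j] = B[i, j] + C[i, j];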
The reason it can't be a plain sequential for loop is that there is more to this than parallelism: you also have to factor in locality. Serial iterators over distributed arrays result in streams of really slow PUT/GET accesses to remote memory, whereas forall iterators can and will spawn remote tasks on each locale to eliminate the majority of the communication.
While there is nontrivial overhead to parallel iterators, it's a constant that can be fine-tuned over time, whereas the overhead of communication would grow with the problem size. Perhaps one way to solve this is to fall back to serial iterators for smaller, non-distributed arrays in cases like this.
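As a rough sketch of what such a size-based fallback could look like with today's language, Chapel's serial statement already accepts a condition (the threshold and names below are arbitrary, just for illustration):

config const serialThreshold = 128 * 128;
var A, B, C: [1..100, 1..100] real;
B = 1.0; C = 2.0;

// suppress parallelism for the whole-array operation whenever the array is "small"
serial (A.size <= serialThreshold) {
  A = B + C;
}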
Thanks for your reply.
Regardless of how a solution to this problem will look in the end (whether using serial iterators or some other approach), the present situation is unbearable and massively hurts the adoption of the language.
I am not sure what Cray's policy is in this respect. Perhaps they are content serving their supercomputer customers from the DoD and DoE, but that is not going to help the adoption of the language.
For this, Chapel would have to capture some fraction of the large numbers of programmers who are presently using Python/Julia or Fortran. And with such issues being present for non-supercomputer-sized problems, this is simply not going to happen.
Some time ago (with v1.15 or 1.16), I also experienced a slowdown for whole-array operations with short extents (where the array shape is (long, short) and the loop runs over the short dimension), and it went away when I used explicit loops. Though I guess it may not be straightforward to automatically switch between serial and parallel execution, I am wondering whether it would be possible to introduce some "syntactic sugar" with a : notation (similar to that in Fortran and Python/Numpy), such that it is simply translated to the corresponding serial explicit for loop. For example,
c[i, :] = a[i, :] + 2.0 * b[i, :];
is automatically translated (by the Chapel compiler, internally) to
for j in <appropriate range here> {
  c[i, j] = a[i, j] + 2.0 * b[i, j];
}
Then, we could use either serial (:) or parallel (..) whole-array operations depending on our needs. Is this possible somehow at the compilation stage...? (To avoid complications, I guess the syntactic sugar could be restricted rather strongly, e.g., such that the two notations cannot be mixed in the same statement.)
Thanks for your suggestion. The thing that I am looking for, though, is to have statements like
A = B + C
(without any indexing) perform the same as arrays indexed by for loops. The reason for this is that my Numpy code is written to be generic, i.e. to not depend on the array rank, and introducing a notation like
A[:,:] = B[:,:] + C[:,:]
would destroy this property.
In that case, another approach might be to introduce a decorator or a scope-wide directive that changes whole-array operations in a given scope to serial execution, for example...
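The closest existing analog to that idea (not a new directive, just a sketch with current Chapel) is the serial block, which suppresses parallelism for everything in its dynamic scope, including promoted whole-array operations:

var A, B, C: [1..64, 1..64] real;
B = 1.0; C = 2.0;

// inside this block, forall loops (and therefore promoted whole-array
// operations) execute with a single task
serial {
  A = B + C;
  A += B * C;
}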
Anyway, although I realize my posts above may be naive, I agree with @cfdtech that it would be really great (and probably extremely useful for wider adoption) if whole-array operations could somehow deliver the highest speed possible (particularly in serial cases, without resorting to explicit loops).
(IIRC, Julia uses @. to automatically transform a whole-array operation into an element-wise for loop, and one can use A[:] to traverse a multi-dimensional array. The syntax might have changed since, though...)
Edit: My posts above are based on my previous experiences (tests with N-body codes), and I haven't yet tried the code in the SO question above (sorry...)
I'll have to leave things like language design to @bradcray, who is currently on vacation. However, I agree that Chapel needs quite a bit more work; it's incrementally getting there.
Some things I have tried...
Jacobi_1 with 'serial do A = B + C' completes in 30 seconds
Jacobi_1 with 'A = B + C' completes in 3 seconds
Jacobi_2 completes in ~1 second
Jacobi_3 completes in ~0.5 seconds
So from what I can gather, it isn't as simple as serializing it; it's more about the algorithm used (or, as @bradcray put it, the multidimensional zippering).
I think maybe the root of the problem can be fixed by fixing multidimensional zippering.
Edit: Performance taken on a Cray-XC50 compute node (44-core Broadwell)
I agree with @LouisJenkinsCS. In all the tests that I have done that avoided multi-dimensional zippering (using whatever other algorithm I tried for indexing the arrays) the performance became predictable.
[j in 1..n] X[n+1, j] = 1.0;
for 1..niter {
  [(i,j) in {1..n, 1..n}] XNew[i,j] = (X[i-1,j] + X[i+1,j] + X[i,j+1] + X[i,j-1]) / 4.0;
  [(i,j) in {1..n, 1..n}] X[i,j] = XNew[i,j];
  iteration += 1;
}
The above results in an equivalent kernel. Yes, you can't use whole-array operations yet, but using the syntactic sugar for forall loops (where forall elt in elts do something(); is equivalent to [elt in elts] something();), it can stay fairly concise for the time being.
@cfdtech - I can tell that you're frustrated by this issue. The Chapel project has a lot of competing priorities and unfortunately it seems you've run into a problem that hasn't been fixed yet. Some of us have been aware of this issue and have wanted to fix it but so far other efforts such as error handling, memory management, initializers, ... have been higher priority. Of course by bringing up the issue you are encouraging us to address it soon, so we appreciate the feedback.
In the meantime, I've been able to confirm that, in the programs you posted, Jacobi1 is much slower than Jacobi2 on a system with more cores.
In this snippet of code from the SO post:
XNew[1..n, 1..n] =
  ( X[0..n-1, 1..n] + X[2..n+1, 1..n] +
    X[1..n, 2..n+1] + X[1..n, 0..n-1] ) / 4.0;
there are a bunch of promoted whole-array operations on 2D arrays. @bradcray commented on your S.O. post that there are issues with inlining of zippered multidimensional array iteration. I think I've confirmed that by compiling the serial Jacobi 2 version with (undocumented) flags that turn off iterator inlining:
chpl --fast jacobi2.chpl --no-optimize-loop-iterators --no-inline-iterators
which reproduces the slow performance (it actually becomes slower than the Jacobi 1 version). The lack of iterator inlining also explains why you were not seeing vectorization: the C compiler won't vectorize loops with the more complex non-inlined-iterator control flow.
If the lack of iterator inlining really is the problem, as far as I know it's not a fundamental issue but rather an implementation problem within the compiler.
Unfortunately the people who know the most about this issue - @bradcray and @ronawho - are away this week. Perhaps I'll be able to help more with the issue in the meantime, but we might need to wait for them to get back.
@mppf - Thanks for your reply, and good that you could reproduce the problem.
Waiting for another week won't make a difference for me because I've spent already more time on this issue than I originally planned (I am an applications developer, and not really a language buff). Thanks for taking care of this, though!
I'm getting caught up after being on vacation and understand (and share) your frustration. Your account is now listed as @ghost, though, so maybe you've already moved on... As Michael says, we always tend to have more work than we can get done, which has prevented us from addressing the performance problems for multidimensional whole-array operations. In part, this case has been easy for us to put off for now, because most whole-array cases can be rewritten in a more explicit manner that achieves the desired performance. This isn't as nice from a readability perspective as the whole-array form, and it isn't where we want the language to be eventually, but it has permitted us to focus our efforts on issues that have been more pressing for the time being.
Commenting on a few of the things that have been said above:
introducing a notation like
A[:,:] = B[:,:] + C[:,:]
would destroy this property.
I wanted to point out that Chapel supports rank-independent programming even when using explicit indexing via tuple indices. For example, the following array declarations and loop:
var A, B, C: [D] real;
forall ind in D do
  A[ind] = B[ind] + C[ind];
will work whether D is a 1D, 2D, 3D, nD domain (not to mention an associative domain, a sparse domain, etc.). Specifically, if D is 3D, ind will be a 3-tuple of integers which can be used to index into a 3D array like A, B, or C.
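To make that concrete, here is a small self-contained sketch (the procedure name, sizes, and declarations are mine, not from the thread) showing the same rank-independent loop applied to both 2D and 3D arrays:

// generic over rank: the domain D and element type t are queried from the actual argument
proc addInto(ref A: [?D] ?t, B: [D] t, C: [D] t) {
  forall ind in D do
    A[ind] = B[ind] + C[ind];
}

var A2, B2, C2: [{1..4, 1..4}] real;        // 2D arrays
var A3, B3, C3: [{1..4, 1..4, 1..4}] real;  // 3D arrays
B2 = 1.0; C2 = 2.0;
B3 = 1.0; C3 = 2.0;

addInto(A2, B2, C2);   // ind is a 2-tuple here...
addInto(A3, B3, C3);   // ...and a 3-tuple here
writeln(A2[1, 1], " ", A3[1, 1, 1]);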
While this example is fairly simple, these features have also been used in far more complex settings like a rank-independent AMR framework.
If you have a specific pattern that you're having trouble figuring out how to write in an efficient rank-independent manner in Chapel with loops and indexing, let us know what it is, as we'd be happy to help out.
For those wanting to understand the technical reasons why loops like these (implementing multidimensional whole-array ops or zippering) are so much slower than explicit looping and indexing, the challenge is the following: each array potentially has its own parallel iterators, which define how to iterate over a subsection of that array and yield its elements. In general, these iterators are completely independent pieces of code, each of which may contain one or more loops and one or more yield statements. The compiler's challenge in getting good performance is therefore to knit all of these loops together into a single loop nest that preserves the semantics of each iterator while implementing the fused operation implied by the zippering. When it's unable to figure out how to knit them together, it falls back on a more general, but far less efficient, implementation: it represents each iterator as a class implementing a getNext()/isDone()-style iterator interface and performs the zippering by asking each of them to advance in lockstep.
In the 1D case, an existing optimization kicks in for most parallel iterators because they have a single loop and a single yield statement, so it's understood that the two loops must have the same trip count and can therefore be fused.
In the multidimensional case, our iterators use more complex rank-independent loop idioms which thwart the existing optimizations. [@ronawho, please correct me if you think I've mischaracterized any of the above]
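To connect this back to the original example: a promoted whole-array statement behaves like a zippered parallel iteration over the arrays involved, roughly along these lines (a user-level sketch of the equivalence, not compiler output):

var A, B, C: [1..8, 1..8] real;
B = 1.0; C = 2.0;

// the promoted whole-array form...
A = B + C;

// ...behaves like this zippered forall, which is where the
// multidimensional zippered-iteration machinery described above comes in
forall (a, b, c) in zip(A, B, C) do
  a = b + c;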
We've discussed improving this in two ways:
(a) by improving the compiler's ability to fuse more complex loop patterns
(b) by improving the multidimensional parallel iterator interface itself to make the per-dimension loops more explicit and obvious to the compiler, simplifying the fusion process.
FWIW, I believe that ultimately, we need (b) in order to do a good job with general user-defined iterators, but I also suspect that there's more that could be done for (a) to improve the common case of our standard built-in iterators on rectangular arrays and domains.