Julia: support shared-memory parallelism (multithreading)

Created on 20 Dec 2012  ·  104 comments  ·  Source: JuliaLang/julia

I realize that this would require major effort, but I'm hoping that you will keep this on your radar screen: the lack of native support for shared-memory parallelism is a huge shortcoming of Julia.

The reason to support shared-memory parallelism is that it is usually far, far easier to program than distributed-memory parallelism. Especially if you adopt the same infrastructure (already in gcc) as OpenMP or Cilk+, in which the programmer indicates which function calls and loops can be executed in parallel and the runtime system decides how many threads to spawn and how to distribute the load (e.g. by work stealing to parallelize irregular problems, as in Cilk).

Julia is attractive because it combines ease of programming with performance near that of C, but the latter is no longer true if you can simply put "#pragma parallel for" in your C loop and get 10x speedup on your 12-core machine. Distributed arrays cannot compete with this in ease-of-use, especially for problems in which extensive inter-processor communication is required.

(Using the OpenMP runtime system is also nice in that this way the runtime system will share the same worker threads with C libraries like OpenBLAS and FFTW, so that Julia and external C code won't fight over the CPUs.)

Labels: multithreading, parallel

All 104 comments

You certainly have a point here. Composability with parallel libraries is especially interesting. Work is underway to provide OpenMP support in LLVM, and if that happens it will be easier to interface with the openmp runtime. Otherwise, would it be sufficient to target a particular popular implementation like libgomp?

At a higher level, it is true that programming distributed memory is much harder, but our thinking has been that once you have a distributed memory implementation it is relatively easy to make it take advantage of shared memory for performance without changing the programming model. Then if you can somehow make distributed memory programming easier, everything will be fine.

It would be very valuable to have shared memory parallelism available, but not 100% satisfying just to tack on another parallel programming model. We could take the approach of adding runtime support, then punting the rest to the user or future API designs though.

It is fairly easy to support the equivalent of pragma parallel for in julia, and we already have @parallel (reduce) for that. It could certainly be made faster on shared memory. The current inconvenience with @parallel is that it works with darrays and multiple julia processes, and our darray implementation still has a ways to go. Making @parallel work on regular sequential arrays would give a lot of immediate parallelism for a lot of folks, without modifying the rest of their code. Of course, that would mean having some kind of a threading model within julia.
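For concreteness, here is a hedged sketch of that @parallel reduction idiom (modern spelling; the coin-flip workload is just an illustration):

  nheads = @parallel (+) for i in 1:200_000_000
      Int(rand(Bool))   # each worker sums its share; partial sums are combined with +
  end

Today this distributes the range across separate julia processes; the proposal here is to get the same one-liner running on threads within a single process.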

Is the LLVM blocks runtime (used by Apple in Grand Central Dispatch) worth considering as an alternative to openmp?

http://libdispatch.macosforge.org
http://compiler-rt.llvm.org
http://llvm.org/svn/llvm-project/compiler-rt/trunk/BlocksRuntime/

Yes, the requirement to use distributed arrays and multiple processes is the whole difficulty -- the key advantage of shared memory is that you have only a single kind of memory to work with. This becomes even more important when you go beyond simple data-parallel circumstances. e.g. parallelizing a sparse-matrix multiply, or a finite-difference time-domain simulation, or an FFT, are all fairly easy in shared memory [at least, before you get into heavy optimization], but are a lot more complicated in distributed memory. And the history of computer science is littered with the corpses of languages that tried to make distributed-memory programming as transparent as shared-memory.

OpenMP seems to be by far the most popular technique for parallelizing compiled C/C++/Fortran code on shared-memory systems, so there is a lot to be said for playing nicely with the OpenMP runtime. But the runtime system may be something of an implementation detail that in principle could even be switched at compile-time (or even at runtime). [This may be easier said than done; if you have to pick one, I would go with OpenMP.]

As I understand it, the main issue right now is that Julia's garbage collection is not multithreaded. As a first step, it wouldn't be so terrible if garbage collection were serialized in a multithreaded program, with the understanding that people writing multithreaded code should try to optimize things so that garbage collection can be deferred to happen infrequently.

As Jeff says, I think the key thing is to support shared memory in the runtime with some kind of OpenMP-like semantics via low-level primitives. Julia's macro system is flexible enough that people can then experiment with different syntaxes (although you will probably eventually want to standardize on one syntax to build into Base).
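To illustrate the macro point: a user could wrap whatever primitive the runtime exposes behind their own syntax. A minimal sketch, using the existing distributed @parallel reduction as a stand-in for a future threaded primitive (the @psum name is made up):

  macro psum(loop)
      esc(:(@parallel (+) $loop))
  end

  total = @psum for i in 1:10_000
      i^2   # each worker sums its portion of the range
  end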

Here's an old issue on a related topic:
https://github.com/JuliaLang/julia/pull/1002

For the record, I'd love this too, but not if it creates disasters.

Intel recently open-sourced its OpenMP library under a BSD license.

http://www.openmprtl.org

OpenMP support is forthcoming in clang - although the approach taken is to do it in the clang frontend rather than in the LLVM IR. See the presentation linked to in this article.

http://www.phoronix.com/scan.php?page=news_item&px=MTM2NjE

Since Julia is unlikely to adopt the OpenMP #pragma syntax (I hope), the Clang OpenMP support should not really be relevant; what matters is access to the runtime library.

Right, Julia is unlikely to adopt the OpenMP syntax, but clang having OpenMP capability makes it possible for us to compile FFTW and OpenBLAS with OpenMP on Mac, where we use clang as the default compiler. This is nice to have, when it happens.

Last I checked, it was a bad idea to compile FFTW with clang; gcc gave significantly better performance. I don't know about OpenBLAS. For your precompiled Mac binaries, you might want to look into this.

(Fortunately, the Intel OpenMP library is supposedly ABI-compatible with libgomp, so you can hopefully compile FFTW with gcc and --enable-openmp and still use the Intel runtime.)

It seems that perhaps there are really three distinct (but related) issues being raised here:

  1. Leveraging shared memory communication on multi-core machines.
  2. Supporting a thread-based programming model.
  3. Compatibility with OpenMP libraries.

In regards to issue 1, I don't see why Julia would need to change its programming model to support shared memory communication. (Perhaps I've just revealed my own ignorance regarding Julia.) For example, MPI was designed with distributed computing in mind; however, many MPI implementations (e.g. Open MPI) are still able to pass messages directly in memory when the sender and receiver are on the same machine. Hopefully, Julia will be able to (if it doesn't already) optimize its communication strategy in a similar manner.

In regards to issue 2, I don't think it's unanimous that thread-based programming models are always the best way to write high-performance parallel algorithms. In fact, I believe the designers of the D programming language decided to implement a message passing model to avoid many of the pitfalls of thread-based programming (see http://www.informit.com/articles/article.aspx?p=1609144 for details). Furthermore, with the exception of very simple programs, getting good performance out of OpenMP usually requires more effort than adding a few pragmas. As far as the simple cases are concerned, I think implementing parallel loop macros might be sufficient to make Julia as convenient as OpenMP.

Unfortunately, I don't have much to offer regarding issue 3.

I hope this helps.

It seems that there has been progress made with OpenMP and CLANG. See below article:

http://www.phoronix.com/scan.php?page=news_item&px=MTQ0NjQ

and

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/065169.html

I have actually been watching the progress of Julia and this issue in particular before jumping into learning a whole new toolset. I genuinely believe this is a must-have for a language/tool like Julia and second the reasoning of stevengj above that this is just so much easier to code for than thinking of distributed memory.

@bsilbaugh, no one is arguing that "thread-based programming models are _always_ the best way to write high-performance parallel algorithms," just that they are significantly simpler in many common cases. The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task), you _can_ often get good performance (even if not optimal) just by adding a few pragmas. (Even the difficulty of load-balancing irregular problems has been essentially solved.) These cases come up extremely often in practice in my experience.

And "implementing parallel loop macros" is _not_ sufficient, precisely because it doesn't deal with the issue of _memory_. The problem is not parallelizing the loop, the problem is that in distributed-memory systems you need to decide in advance where things are stored and deal with explicit communication. This is what shared-memory avoids, and this is where automated language tools have been an abject failure in distributed-memory systems for 20 years. This is _not_ a simple problem.

@stevengj Breathe. Maybe have some camomile tea. Look, a cute puppy.

First, I think we agree on a few things:

  1. Julia should do whatever she needs to do under the hood (or skirt?) to optimize inter-process communication. This includes exploiting shared memory and/or system threads when two (or more) processes are running on the same machine.
  2. Simple tasks should be kept simple, and hard things doable.
  3. Julia needs to be competitive with existing technologies such as OpenMP, CUDA, and our beloved (or hated) grandpa MPI.
  4. This is not a simple problem.
  5. That was a darn cute puppy.

Now, let's consider the main point:

The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task)...

For simple cases (loops), being able to use OpenMP is nice. No argument. But is it worth exposing another parallel programming model to the user? Another model that the user needs to think about when composing parallel libraries? (Developer A may decide he will only use threads for his library, and developer B will decide that he will only use one-sided comm for his library. But someday, developer C is going to need to put library A together with B, and will have to try to reason about their interaction.) I think this is perhaps where we disagree. We disagree about the idea of bolting on yet another parallel programming model to the language itself. And that's okay.

Out of curiosity, have you tried automatic parallelism for some of the use cases you're talking about? Maybe compare a few test cases between OpenMP and Intel's automatic parallelism just to get a sense of what is possible. (I would be willing to do some of the leg work if you would be willing to dig up some good test cases.) If the present state of the art in automatic parallelism is good enough for the kind of problems you're talking about, then this might be a way to get shared memory support into Julia without actually requiring Julia users to learn multiple programming models. Obviously, automatic parallelism will take some effort to implement, but then again, the parallel programming model of the future is probably no model; i.e. I suspect multi-threading and message passing will eventually go the way of assembly programming.

@bsilbaugh, Intel's auto-parallelization still requires the language to support shared-memory parallelism (and in fact, Intel's parallelizer is built on _top_ of OpenMP). They probably didn't build it on top of distributed-memory parallelism precisely because it is so difficult to automate memory distribution.

@stevengj Automatic parallelization requires compiler support (which may use OpenMP libraries internally), but it does not require language support. (Otherwise, it wouldn't be automatic.)

You're right that Intel does not support automatic parallelism for distributed memory architectures (nor does OpenMP). But, for the simple cases you alluded to, perhaps it's enough.

The point of my earlier post (which I think was in line with @ViralBShah's and @JeffBezanson's original comments) was that you may not need to change the current parallel model (i.e. the one-sided communication API supported by the standard library) simply for performance reasons. For example, calling fetch could just as easily dereference a pointer to a block of shared memory instead of pulling data over the network. Depending on Julia's JIT capabilities, perhaps even some of these calls (fetch, remote_call, etc.) can get optimized out.

@bsilbaugh, by "language support", I mean support in the Julia runtime, which is separate from the compiler. The main obstacle in this whole discussion is that the Julia runtime is not thread-safe. (And the simple cases I alluded to are no longer so simple in distributed-memory situations.)

Yes, you can certainly implement distributed-memory primitives on top of shared memory, but that doesn't address the difference in ease of programming between shared and distributed _programming models_ (regardless of implementation).

For shared memory parallelism, it would be worth looking at the Cilk model too. There is ongoing work to add it to LLVM (http://cilkplus.github.io/). Cilk avoids some of the composition problems that OpenMP has (notably nested parallelism). Though there's no free lunch -- OpenMP also has certain advantages. Another candidate worth understanding is Deterministic Parallel Java (http://dpj.cs.uiuc.edu/DPJ/Home.html). Maybe some of its techniques can be applied in Julia. I think the important thing is to understand the tradeoffs.

@ArchRobison, the semantics of OpenMP have been converging towards those of Cilk for some time now. OpenMP now has #pragma omp task, similar to cilk's spawn model, and it has #pragma omp for schedule(guided) similar to the divide-and-conquer load-balancing technique for loop parallelism in Cilk+. Of course, the _syntax_ is quite different, but no one is proposing to adopt OpenMP syntax in Julia.

So, while I agree that Cilk has a good conceptual model for shared-memory parallelism, that question is somewhat orthogonal to what runtime threading library we use. (LLVM support is apparently somewhat irrelevant here since our syntax is not implemented in LLVM; we just need the runtime library.)

But again, the stumbling block is thread safety in the Julia runtime library.

That is true, but I'm just as worried about thread safety in every other julia library that might be out there.

I'm not sure I understand the latter concern fully. For example, are there really that many functions in base that make use of non-constant global variables? I'm not saying there aren't any---I've written some of them myself---but I don't tend to think of it as a major feature of our code base. Of course with packages there are additional possibilities for conflict, but at least in my own packages I think it's pretty typical of what's in base---a small percentage might need some redesign for thread-safety.

Though OpenMP has adopted tasking, there are fundamental semantic differences with Cilk that impact composability and performance. Tasking in OpenMP is tied to parallel regions. The big advantage and big disadvantage (depending on context) is that the number of threads executing a parallel region must be bound when the region is entered, before the amount of work or potential parallelism is known. (I work with the Cilk/OpenMP/TBB groups at Intel. We've considered lazy schemes to try to circumvent the issue in OpenMP, but the OpenMP standard has pesky features that get in the way.)

I agree that the big stumbling block is the run-time library and existing Julia libraries. Maybe a lint-like tool could inspect Julia libraries for "okay", "definitely not okay", or "a human should take a look"? From my beginner's experience, Julia seems to have much less of the alias analysis misery that stymies such tools for C/C++.

I am indeed worried about packages, and the various native libraries they might call. Thread safety is a pretty significant demand to put on all those things, especially since the failure mode is not an error message (or poor performance) but random horrible behavior.

Julia code is designed to compose together quite promiscuously, so it is hard to say "my code is threaded, so I need to make sure I only use thread-safe libraries" --- one could easily pass a function from one package to an implicitly parallel higher-order function from another package.

@ArchRobison great to have somebody from the Cilk group here.

@ArchRobison, thanks for the clarification, that's very helpful.

Another issue to consider is the generality of the threading model versus ability to detect races. Automatic race detectors can reduce the pain of dealing with threading bugs. Examples are Helgrind, Intel Inspector, and Cilk screen. (See http://supertech.csail.mit.edu/papers/tushara-meng-thesis.pdf for the theory behind one for Cilk.) The efficiency of a race detector is somewhat inverse to the generality of the parallelism model, so it's something to consider in choosing the parallelism model. JIT compilation may be able to reduce the cost somewhat since it can instrument only the memory accesses that might constitute races. (In the jungle of C/C++ binaries that Inspector deals with, we have to instrument just about every access since binary stack frames don't give us much info to do thread escape analysis.)

Is there a large number of non-thread-safe Julia C functions? It would be really cool to know which functions are thread safe and which are not. Then one could make these thread safe using a locking mechanism. The following naive example segfaults although the gc is off:

  jl_gc_disable();
  #pragma omp parallel for
  for(i=0;i < 100; i++)
  {
    // call some jl_... functions
  } 

Nothing is thread safe.

Ok, got that. And I am trying to understand how much effort it is to change that. So in how many places are globals touched? If there are only some low-level functions that touch globals, one could make these thread safe so that the high-level functions automatically become thread safe.

I do not say: let's make all jl_* functions thread safe. I just want to get some feeling for whether this is feasible to investigate. In general, I think that in order to meet the goal of this issue one has to:
a) make libjulia thread safe
b) introduce a language feature to make use of this from Julia.

An "easy" step that might suffice for many uses is #4939. The concept is explained more thoroughly in #4580, especially in one of the comments.

Of course, I'm not saying this replaces all possible uses of threading, just that it's a non-disruptive alternative.

@timholy: Yes this is certainly a very interesting approach. It would be interesting to know what one "loses" when using a multi-processing approach (with shared memory) compared to a multi-threaded approach. Still, if I get that right, one still has to copy between the SharedArray and the regular Array.

I don't think #4939 is a solution to this issue (though it is not a bad feature in itself). That proposal doesn't provide any way to parallelize operations on ordinary Arrays and other data structures, so it still has the complication that you need to specify the data distribution in advance, so standard library functions cannot be faster without intervention and dynamic load-balancing is hard etc. It doesn't change the fact that if you want multi-threaded library routines (BLAS etc.) you have to write them in C.

Let me put it another way. If you don't think we need multi-threaded operations on ordinary Array types with no user intervention, how would you feel if we disabled multithreaded BLAS and LAPACK operations? (Conversely, if they are valuable for BLAS, why aren't they valuable elsewhere?)

@tknopp, there's no copying required (unless for some reason you do so as part of the initialization). The key part of the proposal is that it maintains an Array in the local process but serializes just the "handle", so you can pass an arbitrary SharedArray argument to a function but do not have to copy any data.

One significant disadvantage of the multiprocessing approach is more memory consumption, since there will be multiple instances of Julia. There is almost certainly going to be more overhead associated with IPC than there would be with threading (the pcall_bw was essentially an attempt to get around that).

@stevengj, agreed with the fact that it doesn't help with Arrays, although if SharedArrays are easy to use then one might just start using them for any data of sufficient size. However, due to the overhead of IPC we definitely don't have an approach in-the-works that is viable for smaller arrays (unless computation time happens to be large even for a small array). With regards to the data distribution being pre-determined, in certain respects this is incidental: since all processes have complete & fast access to all elements, you could pass an alternate set of indices if you want.

But your main point is what I meant by saying that it doesn't replace all possible uses of threading. This proposal is essentially a way of getting a subset of the benefits without requiring invasive changes to Julia's code base.
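For readers unfamiliar with the proposal, a hedged sketch of the intended usage (0.4-era spelling; in later Julia this moved to the SharedArrays stdlib, with @distributed replacing @parallel):

  addprocs(4)                            # worker processes on the same machine
  S = SharedArray(Float64, (10_000,))    # backed by shmem; only the handle is serialized
  @sync @parallel for i in 1:length(S)
      S[i] = sqrt(i)                     # each worker writes its chunk in place, no copies
  end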

@stevengj, @timholy: I know that multithreading is a somewhat controversial topic (kind of like asking a Python hacker to remove the GIL...) and I actually did not want to start a discussion about the pros and cons. I am just interested in whether this is something that makes sense to have a look at.

I don't think #4939 is a solution to this issue (though it is not a bad feature in itself). That proposal doesn't provide any way to parallelize operations on ordinary Arrays and other data structures, so it still has the complication that you need to specify the data distribution in advance, so standard library functions cannot be faster without intervention and dynamic load-balancing is hard etc. It doesn't change the fact that if you want multi-threaded library routines (BLAS etc.) you have to write them in C.

To me this is the fundamental problem with the SharedArrays approach. There are situations where you want to do lots of operations in parallel on a regular array – in fact, I would argue that this is the _most_ common situation. I've been advocating for beginning by making array comprehensions in Julia implicitly parallel as a first concurrency step for a fairly long time. I know that this means that anything you do in the comprehension step had better be thread-safe, which means that we're going to need a big lock around cg and gc (code gen and garbage collection), but I think that combined with a safer distributed-lite model like SharedArrays would get us pretty far.

@tknopp, it's a very interesting topic and one worth discussing.

It's worth noting that you can, in rare circumstances, get away with ccalls to the pthread routines on functions defined by cfunction. Primarily you have to make sure that (1) all code has already been compiled, and (2) no memory is allocated by any julia routines that you call. The latter is not always easy to achieve.

@stevengj, it looks like your previous comment has been edited since I wrote my reply. To respond to the new version, I didn't say that multithreading of arrays is unnecessary---I was intending to be pretty clear in proposing it as "an alternative approach that might work for some people in some circumstances." Many people and/or applications may not be among those, in which case one should carry on as usual. But for those people that it might help, it's not an irrelevant option to know about, so the cross-links to the other issues are appropriate to have here.

While it's not really relevant to your main point, it may be worth noting that even your example about BLAS is not as far out of reach as it might seem. It would be easy to write a tiled gemm routine using SharedArrays. A subset of BLAS routines are largely or exclusively embarrassingly-parallel, which is where SharedArrays shine. Things that require detailed synchronization among threads/processes would have more overhead and would surely be better done with real threads.
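A hedged sketch of that idea, in a simpler block-column form rather than full tiling (0.4-era syntax, toy sizes):

  A = rand(200, 200); B = rand(200, 200)    # plain Arrays, copied to each worker
  C = SharedArray(Float64, (200, 200))
  @sync @parallel for j in 1:size(B, 2)
      C[:, j] = A * B[:, j]                 # workers fill disjoint columns of C
  end

Making A and B SharedArrays as well would avoid the copies to each worker.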

In a crazy world, we would just replace Arrays with SharedArrays. That's crazy, because the memory overhead of a SharedArray is much larger, and Jeff has emphasized in the past how important it is to keep it to a minimum.

@tknopp: as an intermediate first step, if you want to try this, you could attempt to "objectify" the Julia global runtime state so that you can have more than one Julia instance in a single process which can run independently of each other, but each one must be single-threaded. @JeffBezanson has voiced concerns about the performance hit that might introduce, but at least it's only a code generation hit, not a code execution hit (since we always run code natively).

@StefanKarpinski: So by objectifying, you mean that one puts all globals into some struct and uses this as a Julia thread state, right?

My hope is, as you say, that code execution would only rarely need locking, so that a global interpreter lock might be feasible without too large a performance hit. In Python, the GIL is touched all the time when executing Python code.

Yes, that's correct. There shouldn't be a lock, because I'm talking about having multiple independent Julia instances in the same process. If they don't share any of the same code generation infrastructure, then there's no reason to lock unless LLVM itself isn't thread-safe, which I don't think is the case (I may be wrong about that).

Ok, I am not entirely sure this is the simplest approach, as it would for instance require synchronizing the thread states before performing a parallel operation (making sure that local variables are present in all threads).
With a GIL, one would "just" have to introduce locks at all places where globals are touched. But of course the GIL approach has the potential of being slower.

This would be a necessary step before making things threadsafe anyway, unless you want to throw locks around everything, which strikes me as not the best approach.

I recommend investigating the multiple instances approach. That's the way a good shared-memory parallel run-time works. (E.g., the separate workers in OpenMP, Cilk, or TBB.) A GIL is going down the wrong path. I've seen an inordinate amount of time spent in attempts to circumvent the GIL in Python.

There exist non-standard compilers that can privatize globals. E.g. look at http://mpc.sourceforge.net/ . Though using it would introduce yet another dependence.

I am not entirely sure if one can compare Julia with Python here. Having a JIT should make an important difference. If the JITed code has only few calls into libjulia (e.g. to invoke the gc) the overhead of the GIL might not be that big. But as my knowledge of the Julia internals is still quite limited I am not so sure about that. Synchronizing Julia thread states also can have an overhead.

But even if one goes for a GIL, it can certainly improve the code structure if it is organized in such a way that the Julia state is held in a struct that could be potentially exchanged.

SharedArray is certainly a stopgap approach, and it would be great to be able to work on Array with multiple processors. I guess we could get closer to this with SharedArray, if we are somehow able to manage a memory pool for medium to large sized arrays that are allocated in a shmem segment. That would make it possible to do away with the copy, and we could perhaps even come up with faster ways to synchronize across the processes.

we could perhaps even come up with faster ways to synchronize across the processes.

pcall_bw in my original version of SharedArray is exactly that (about 40x faster in my tests). I also have a couple of guesses about where much of the remaining overhead comes from, but I haven't had time to check.

I have been playing around a little bit with threads and libjulia and got the following code to work:

  // assumes <julia.h> is included and jl_init() has already been called
  jl_eval_string("my_func(x) = 2*x");
  jl_function_t *func = jl_get_function(jl_current_module, "my_func");
  int i;

  // call the function once serially so that it is compiled before the parallel loop
  jl_value_t* arg = jl_box_float64(0.0);
  jl_call1(func, arg);

  jl_gc_disable();  // no collection may be triggered from here on
  #pragma omp parallel for
  for(i = 0; i < 4; i++)
  {
    jl_value_t* arg = jl_box_float64((double)i);
    jl_value_t* retJ = jl_call1(func, arg);
    double ret = jl_unbox_float64(retJ);

    printf("my_func() = %f\n", ret);
  }

which outputs (in different permutations)

  my_func() = 0.000000
  my_func() = 4.000000
  my_func() = 2.000000
  my_func() = 6.000000

In the code, it is essential to call the Julia function once in the serial code before calling it in parallel. Further, I had to comment out JL_GC_POP as I have no idea how to unwind the jl_pgcstack pointer when different threads access it.

So although this is of course a totally trivial experiment, it gives me some confidence that introducing locking in the code generation steps could enable lock-free execution of the generated code.

Thinking a little bit more about it, it could be a feasible approach to first precompile the Julia code before executing a parallel loop.

@vtjnash @JeffBezanson: I still don't really have an idea how to cope with JL_GC_PUSH/POP in a parallel environment. If I understand it correctly, the actual rooting is done in gc.c:670 ( gc_mark_stack(jl_pgcstack, offset, d) ). To make JL_GC_PUSH/POP work in parallel one would somehow need an identifier so that POP could pop a certain push. Ideas?

(To make this clear: This is only an experiment and I only want to explore how far I get with respect to multithreading. This is not a concrete proposal)

As threads have separate stacks, I believe you would have to run a separate task in each thread.
That would mean having pgcstack and current_task be thread-local, at least.

One very important use case for threads besides the "OpenMP" use case is GUI development. In all GUI toolkits I know one has one UI thread and for any computation that takes some time one has to start a new thread and run the computation asynchronously.

I have been experimenting a little with Gtk.jl and parallel tasks and it seems, in principle, to allow a non-blocking UI. Still, I think threading would be more natural here.

@tknopp, for the GUI use case, it seems like you could just use Julia's current distributed memory framework, i.e. separate processes, fairly easily (since the division of labor is known ahead of time and does not involve sharing of complicated data structures or dynamic load balancing).

That's what I mean by "parallel tasks". Still, my feeling is that such things can be done more easily with threads. With multiple processes, all async callbacks have to invoke callbacks in the main process. Although with threads one is in a similar situation (all GUI calls must usually be invoked from a single thread), it helps a lot to share the same variable scope.

@StefanKarpinski
I like the idea of having multiple julia instances in one OS process. This is kind of akin to the Erlang runtime (BEAM) which has one stack + heap (this is where it's different from threads like in most languages e.g. Java/C#) per Erlang "process" (which is just a thread). This would definitely be a robust solution since you do not have to worry about threads + locks.

Aside from data parallelism there's also task parallelism. Question: when running multiple tasks/coroutines in Julia: do they all get executed on a single OS thread/Core? If they all run on one core, it would be nice if at some point in the future they would get multiplexed onto some sort of "runner" pool that can run these on multiple OS threads/cores. What "runner" means I'll leave it to the core language implementers (internal Julia threads which aren't exposed to application level programmers, julia runtime instances in one process, multiple processes).

@ssagaert have you read through http://docs.julialang.org/en/latest/manual/parallel-computing/ and http://docs.julialang.org/en/latest/manual/control-flow/#man-tasks to get a feel for the multi-process parallelism Julia can do now?

What remains to be determined is the interaction of having both distributed-memory parallelism and shared-memory parallelism exposed in the language and how to design a model and interface that accommodates both. Hopefully we'll end up with something approaching the effectiveness of the MPI+OpenMP combination that is standard in HPC codes today, without as much hassle of worrying about message-passing buffers, compiler wrappers, and all the other headaches involved there.

@ssagaert, Erlang is explicitly designed to be purely functional, which makes having lots of shared data structures between threads trivial. That's much harder with mutation.

yes


true but you have the same problem if you do multiprocessing with shared memory right?


Not nearly as much since only selected objects are shared.

Would it be possible to have a threading model in Julia similar to what C++11 has? Specifically:

  • low-level primitives (std::thread, locks, mutexes, condition variables, futures, etc.),
  • atomics,
  • multi-threaded memory model.

Cilk, OpenMP, etc. are all fine to have, but they can be too constraining/limited and they can easily be built on top of a lower-level threading framework. IMO, in a modern high-performance programming language the possibility to manipulate low-level parallel primitives is essential.

And it is also essential not to be confined to message-passing only. Message-passing is a non-starter for applications in which memory access speed is a bottleneck.
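For the record, Julia did eventually grow primitives along these lines; a hedged sketch in the spelling of later 1.x releases (the Threads module, with ReentrantLock available alongside for larger critical sections):

  using Base.Threads

  counter = Atomic{Int}(0)      # atomic integer
  @threads for i in 1:1000
      atomic_add!(counter, 1)   # lock-free increment across threads
  end
  counter[]                     # reads the atomic value: 1000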

I'd be interested to hear about a situation that can't be handled easily with Cilk or OpenMP. It's unlikely that we're going to have people managing their own threads by default, although it's possible, though not certain, that this will be available as a lower-level API.

I am speaking as a C++ developer and I am just starting to read about Julia, so I do not know how much of this is applicable to Julia. But here are my problems with OpenMP:

  • it forces you to think in terms of array-like structures;
  • no support for parallel operations over things like dictionaries, linked lists, binary trees, etc. unless you resort to ugly hacks;
  • poor error handling (e.g., transporting exceptions across threads is not supported),
  • stuff like thread pooling, task queues, etc. is inherently awkward to do in OpenMP,
  • implemented through pragma extensions, so the parallelism happens separately from normal code parsing. You can't use all the information you have in the source code in pragmas (e.g., no support for generic programming, no support for iterators or user-defined types, etc.).

Cilk suffers from similar shortcomings. Personally, over the years I have come to dislike solutions to parallel programming that involve extensions to the language. In C++11, you can write OpenMP-looking code just by using lambdas and threading primitives.

I guess it all comes down to whether Julia wants to be an "array-oriented" type of language (e.g., see Fortran and Numpy, for which the OpenMP approach is sufficient) or also provide more general-purpose tools. Although arrays are so ubiquitous in technical computing, there are real use cases for less elementary data structures (e.g., sparse matrices, polynomials, etc. implemented on top of trees, hash tables, etc.).

Note that I am not saying that high-level, "easy" parallelizing capabilities should not be available. I am trying to argue that the language should give you the ability to go to a lower level should you need to do so, and that it might make sense to implement high-level parallelisation in terms of these low-level primitives.

"Managing your own threads" is tedious with a pthread-like interface, but
is much simplified when using C++11's async and futures. These provide a
functional (in the sense of lambda) and safe (no dead-locks) interface.
Other low-level primitives could be added, but with caution.

I recommend against advertising locks and promises to new users; I agree
that these are difficult to use, and should only be used when necessary,
and then properly encapsulated.

This API is very similar to what Julia currently offers via @schedule and
@spawn (and fetch), except that it is multi-threaded and node-local. In
fact, this is what makes Julia stand out of the crowd, and what got me
interested in it in the first place!

The problem with OpenMP is that most of the pragmas work only for arrays.
OpenMP tasks are more generic, but suffer from having too many barriers,
which greatly impedes parallelism. A typical OpenMP program switches forth
and back between serial and parallel regions, and this doesn't scale well.


But here are my problems with OpenMP:

  • it forces you to think in terms of array-like structures;
  • no support for parallel operations over things like dictionaries, linked lists, binary trees, etc. unless you resort to ugly hacks;
  • poor error handling (e.g., transporting exceptions across threads is not supported),
  • stuff like thread pooling, task queues, etc. is inherently awkward to do in OpenMP,
  • implemented through pragma extensions, so the parallelism happens separately from normal code parsing. You can't use all the information you have in the source code in pragmas (e.g., no support for generic programming, no support for iterators or user-defined types, etc.).

Cilk suffers from similar shortcomings.

Cilk does have OpenMP-like interfaces, but the core Cilk model is fine-grained asynchronous task queuing with work-stealing, which is in no way limited to arrays. The Cilk model _is_ thread pooling and task-queuing, so that's certainly not awkward in Cilk. (OpenMP these days actually has borrowed a lot of those features from Cilk, so this complaint doesn't hold for either system – but nobody really uses these OpenMP features, afaict.) Obviously, we'll make error handling work well and since Julia has first-class functions, closures and macros, there's no need for awkward stuff like pragmas.

In other words, these concerns, while completely reasonable, don't necessitate exposing pthread-style explicit thread management – they're all addressed by a Cilk style approach. Here's a thread where I described the threading model that we're planning on adopting, which may be helpful. Note that there's nothing special about arrays in this approach.

@eschnett – I basically agree with everything you say and I think that C++11's async and futures features are very close to what we're planning (which, not coincidentally, shares a lot of features with systems like Cilk and TBB, but with better language support, for obvious reasons).

@StefanKarpinski thanks for the explanation, I had minimal exposure to Cilk and all the OpenMP I have seen so far in the wild does not use any of the pooling/queuing stuff (as you point out).

If I read correctly the thread you pointed to, the plan for Julia does not include new special language keywords? The solution you are working towards looks neat (quite similar to C++11's async).

One concern I have with the pooling approach you explained, is that it seems to me it will not be deterministic with respect to which processor runs which task. Speaking again from the perspective of memory-bound computations, if you have a multisocket machine you want to exploit NUMA regions by pinning down specific threads to specific cores - e.g., usually with a first-touch policy you want a thread to init a memory area and be the only one to work on that area in order to get maximum performance. Do you think it would be possible to have support for thread binding and/or user-directed scheduling?

If I read correctly the thread you pointed to, the plan for Julia does not include new special language keywords? The solution you are working towards looks neat (quite similar to C++11's async).

Macros essentially act like user-defined keywords, so at least initially I don't anticipate any need for new keywords. Where you might write async f() in C++, you can write @async f() in Julia with a macro.
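A minimal sketch of that spelling with today's Task API (in recent Julia, fetch on a Task blocks and returns its result; current @async tasks are coroutines on one thread, and the plan above is the multi-threaded analogue):

  t = @async begin
      sleep(0.1)    # stand-in for real work
      6 * 7
  end
  fetch(t)          # waits for the task, returns 42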

One concern I have with the pooling approach you explained, is that it seems to me it will not be deterministic with respect to which processor runs which task. Speaking again from the perspective of memory-bound computations, if you have a multisocket machine you want to exploit NUMA regions by pinning down specific threads to specific cores - e.g., usually with a first-touch policy you want a thread to init a memory area and be the only one to work on that area in order to get maximum performance. Do you think it would be possible to have support for thread binding and/or user-directed scheduling?

We definitely need to expose control over affinity and how tasks map onto specific hardware cores to get maximum performance. It shouldn't be necessary to mess with this by default, but if you need this level of control, you will definitely be able to have it. The model of mapping tasks onto a fixed number of threads is in many ways better for this level of control than pthreads, which attempts to separate the kernel thread abstraction from the reality of hardware cores. By making work threads and CPU cores one-to-one, you remove that abstraction layer, making it easier to control where work actually gets done.

You can always use taskset and similar things to bind threads to CPUs without having Julia involved at all. Forcing processor affinity is an ugly hack that should be done only with explicit user intervention, and only reluctantly. It is basically only something you should do on dedicated compute nodes; on any kind of shared machine it is a disaster because then you can have threads from different processes fighting to "own" certain CPUs.

Balancing locality while maintaining composability is still a bit of an open issue, but I think that just insisting on one particular CPU is generally not a great approach. We'll have to figure out a better way to do that sort of thing.

See also this discussion regarding Cilk. The advantage of the Cilk approach where you break a task into at least ~10 times as many chunks as you have threads, and then threads "steal" work when they are idle, is good load balancing in most cases. You sacrifice some locality this way, but the low-level model where the user assigns work to a particular thread (which is bound to a particular CPU) is really only workable for very simple data-parallel problems where you can load-balance manually (e.g. parallelizing a loop in which every iteration performs identical work), and even then only on dedicated compute nodes where there are no other processes taking time slices on your CPUs.
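As an illustration of the chunking heuristic, a hedged sketch using the Threads.@spawn primitive that later landed in Julia 1.3 (threaded_sum is a made-up helper):

  using Base.Threads: @spawn, nthreads

  # split into ~10x more chunks than threads; idle threads pick up remaining tasks
  function threaded_sum(f, xs; chunks = 10 * nthreads())
      step = cld(length(xs), chunks)
      tasks = [@spawn sum(f, view(xs, i:min(i + step - 1, length(xs))))
               for i in 1:step:length(xs)]
      return sum(fetch, tasks)
  end

  threaded_sum(abs2, rand(10^6))   # sums in parallel, load-balanced by the scheduler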

In designing to allow too many low-level hacks, there's a certain danger of optimizing for artificial benchmarks rather than for real problems.

Cilk delivers good locality when used in its natural style, because then steals are rare, and each worker thread is doing its work serially. Serial execution is hard to beat for locality, at least when thread-private caches are involved. For shared caches, the story is more complicated, but work stealing still works well in practice. The optimal scheduling algorithm for shared caches is parallel depth-first scheduling, which alas seems impractical to implement outside of simulators. Work stealing has been the method of choice for Cilk, TBB, PPL, OpenMP tasking, etc.

I am not fond of the barrier style of programming made popular by OpenMP for writing reusable components, because barriers introduce composability issues. The problem is that the barrier style fundamentally assumes that you know how many parallel threads of execution you have at the point of starting parallel work. Finding out the answer to that question at any moment in a large system (where other components vie for threads) is exactly the sort of global question that programmers learn not to ask in a scalable parallel program.

Put another way, the "control freak" style of parallel programming ("I totally control the machine") delivers great benchmark results. But you can't marry two control freaks.

@bluescarni : It is not true that "Cilk suffers from similar shortcomings", at least in the Intel implementation. Here's the Cilk take on each point:

  • Cilk does not force thinking in terms of array-like structures. Indeed, divide-and-conquer is the natural style for Cilk; cilk_for was added later to simplify parallel iteration.
  • Divide-and-conquer can be used for parallel operations on dictionaries and binary trees. Linked lists are tough to get much speedup on anyway, though the following is as good as it gets:
    for( Node* p = root; p; p=p->next )
        cilk_spawn foo(p);
    cilk_sync;
  • Cilk _does_ transport an exception from a child task to a parent. It is true that if multiple children throw exceptions, only one is propagated to the parent. Whether this behavior is a bug or a feature is a point of debate, and could be changed.
  • Cilk _does_ support generic programming. The cilk_for construct works for random-access iterators, and we've researched ways to extend it to TBB's style of "splittable ranges". I've implemented a large fraction of the parallel algorithms specification using Intel's Cilk Plus (which has an extension to deal with vector parallelism).
  • Explicit thread pooling and task queues become pointless in Cilk, since Cilk does those implicitly.
  • Cilk _does_ support reductions on user-defined types, via hyperobjects.

Cilk does not solve all problems. It does not touch the subject of concurrency (as in mandatory thread concurrency). But it's a good approach for composable parallel programming.

Put another way, the "control freak" style of parallel programming ("I totally control the machine") delivers great benchmark results. But you can't marry two control freaks.

Love this quote!

I was just scrolling down to the bottom of this conversation to quote Arch, but then found that Stefan had beaten me to it! :P

This is an interesting discussion.

My first comment is that there are quite a lot of common workloads in which you know beforehand exactly how to split your workload in a balanced fashion. Linear algebra with hardware floats is the first obvious one that comes to mind. Splitting a large matrix in stripes/tiles/whatever, moving them in memory areas "close" to the separate cores and pinning the threads is going to be hard to beat with automatic load balancing.

Secondly, scenarios in which one has total control on the hardware and software configuration and maximum throughput is the goal are not uncommon. In my previous job I was working in the oil industry on reservoir simulators, and in that domain the goal is to achieve the highest possible performance on large, big iron clusters completely dedicated to the task of running geophysical simulations for weeks at a time. One of the bottlenecks is often the linear solver, which has to deal with very large sparse matrices. Typically the matrices representing the oil reservoir are split up-front in an (in some sense) "optimal" way and distributed over the nodes of the cluster via MPI. Process pinning and other low-level techniques are commonly adopted to increase performance.

Again, what I am trying to say is that there is value in providing access to low-level primitives, even if they might be of interest only to the 0.1% freaks out there :-) And especially if you are targeting a technical audience.

If you do evenly divided tasks, the degenerate case of work stealing is exactly the static work model, so this isn't really a tradeoff.

But work stealing is not going to give me a deterministic mapping of where a task is run, is it? I need to be able to assign a specific task to be run on a specific core, and this needs to be synchronized with the memory allocator as well.

The only case where you don't get deterministic mapping is when you have a load imbalance – otherwise there's never any work stolen. So if you are correct in the way you split up the work evenly, then that's not an issue. Memory allocation will always be thread-local or there's no chance of decent performance.

Maybe I don't understand, but here is what I am actually doing on a real parallel algorithm (sparse polynomial multiplication).

  • I have two polynomials a and b to multiply. a and b are small (a few tens of thousands of terms) but they are sparse, so when you multiply them you end up needing a few gigabytes of RAM to store the result;
  • to store the result, I am using an array (simplifying, it's a cache-friendly hash table, but that'll do), a big one;
  • I allocate the whole array from the main thread, but do not actually initialise anything. Separate threads pinned to different cores initialise concurrently the array by writing zeroes to it, so that the array ends up being actually stored in different NUMA regions according to the thread that touched that chunk of the array. Thread 0 inits the first chunk of the array, thread 1 the second one and so on...
  • using the same pinned threads, I start doing the computation and accumulate the result in the output array. I know in advance how to split up the multiplication so that each thread will write only in the chunk of the array that "belongs" to it.

So in this scenario I need to be able to bind the threads and to tell exactly which task gets consumed by which thread (as the memory initialisation and the actual multiplication are two different tasks). From what I have understood about the Cilk syntax, it does not seem like it would allow this.

Will the number of threads in the pool be equal to the number of cores (as said here) or to the number of hardware threads (i.e., the number of virtual processors that the OS gives you)? So on a typical quad-core desktop would you have 4 or 8 threads?

The number of threads in the pool is probably best left as a tweaking parameter. On Intel Xeon Phi coprocessors, which have 4 hardware threads per core, we've seen the ideal number of threads per core vary from one to four, depending on the workload. I've never figured out a way to predict the ideal number. The fundamental tension is between:

  • More threads per core can hide the latency of cache misses (and in the case of a Xeon Phi coprocessor, hide instruction decode).
  • More threads per core lets them evict each other's data, resulting in more cache misses.

For some kernels, it's possible to make the threads on a core constructively share cache (work on the same patch of data) instead of fighting over cache.

@bluescarni : You are correct that Cilk does not offer that level of direct control over machine resources. It would obviate any load balancing.

However, it would appear to be unnecessary to force a particular mapping of physical threads to portions of the array. The requirement, if I understand the problem correctly, is that each thread needs to work on the output page that it initializes. So if the Cilk workers were pinned to hardware threads (the Cilk people here have experimented with this), then the following might work:

  • Allocate the array without initializing it.
  • Run a cilk_for loop where each iteration produces one page of the output array.

Though that said, a common request is to run a second loop later and have a similar thread-to-iteration mapping. That's not supported in Cilk, though it is in TBB's work-stealing scheduler, loosely based on this paper. Still, in the TBB form, the strongest request that can be made is "please favor running this task where this other task ran before". This softer request allows some control while still allowing load balancing.

Other than OpenMP and Cilk, another approach to consider (for inspiration) is Halide (http://halide-lang.org), where the algorithm and the execution schedule are decoupled.

Julia's language features make it particularly suitable for such schemes.

Mostly done. I just need to integrate the threading bit.

@timholy what is done?

Most of the Halide syntax, specifically loop reordering, hoisting constants, and tiling. Performance on a couple of test examples actually beats (single-threaded) Halide by a bit, surprisingly.

That sounds fantastic. Adding the "threading bit" is precisely the topic of this issue. What are the showstoppers?
(and where are those tests, I see only blur at https://github.com/timholy/HalideCall.jl/tree/master/examples )

Looks like I forgot to commit the accum example (it's in the Makefile, though). Just fixed.

The main showstopper is splitting the code for each tile into an independently-callable function, so that threads can be dispatched. However, I planned ahead, and at least the code to allocate temporary memory for each tile is annotated, which might make this easier.

So where are we now? Have we decided on a model yet?

See the threads branch.

I think we can close this issue and have more specific issues related to the implementation.

@ViralBShah, I think you mean the jn/threading branch? I looked for the threads branch and accidentally created a branch of this name (and promptly deleted it) since no such branch exists.

I don't think this issue should be closed until threading is merged in some form. You can track an issue much more easily than you can track a branch.

There used to be a threads branch, and a newer threading branch, and eventually the threads branch was removed, IIRC. I believe @JeffBezanson and @kpamnany are getting this up to speed this week. Perhaps worth connecting with them.

We got a fair amount accomplished this week. Here's the status.

  • The latest branch is jn/threading, and it is up to date with master.
  • Threading works with the new GC. Still using a barrier, but @vtjnash has some good progress towards stopping threads with a signal.
  • Progress on signal handling by @vtjnash .
  • Several more things in the runtime are thread safe.
  • @kpamnany has started work on I/O. I believe the plan is to use a single I/O thread.
  • We have a JULIA_ENABLE_THREADING flag to allow inclusion of the threading code, but disabled.

The plan is to merge into master soon after 0.4 branches, with JULIA_ENABLE_THREADING off by default. We need to start making sure everything builds and works in this state on all platforms. We want to build and also run as much of the thread-related code as possible even when threads are disabled.

I'm sensing an impending Travis build matrix addition. :)

I can hear the server fans spinning up from here. Good thing that Travis is roflscale.

We will soon get dedicated hardware for you to play with. Let the build matrix grow with impunity.

Should we close this in favour of more specific issues dealing with multi-threading at this point?

You said that last October. Judging by

I don't think this issue should be closed until threading is merged in some form. You can track an issue much more easily than you can track a branch.

I think the answer there is "not until something gets merged."

Very exciting to hear about the progress here.

I can tell you're just trying to keep me following master rather than focusing on 0.4. Your devious strategy just might work.

The jn/threading branch doesn't seem to build at the moment. I get

  /home/jeff/src/julia/src/codegen.cpp: In function ‘void jl_extern_c(jl_function_t*, jl_value_t*, jl_value_t*, char*)’:
  /home/jeff/src/julia/src/codegen.cpp:1108:90: error: no matching function for call to ‘llvm::GlobalAlias::create(llvm::Type*, unsigned int, llvm::GlobalValue::LinkageTypes, char*&, llvm::Function*&, llvm::Module*)’
                               GlobalValue::ExternalLinkage, name, llvmf, llvmf->getParent());

and several other errors. @vtjnash Know anything about this? Am I doing something wrong?

@Keno says in https://github.com/JuliaLang/julia/pull/13410: "Just needs a rebase of the threading patches on top of llvm svn."

I fixed the compile errors by removing the #ifdef LLVM38 and letting the #ifdef LLVM37 branch be compiled, wherever there were errors.

it sounds like someone needs to rebase the patches on top of the llvm 3.7.0 release so that it can remain stable and buildable.

I'm using the jn/tlsrebaseagainagain branch of JuliaLang LLVM. Why would that stop working? Maybe I'm missing a step in trying to clean my build state?

Something got merged, so close? It's not enabled by default yet but seems to work at the moment. #9336 covers the prereq LLVM version, and other separate issues can be filed for the occasional bug or segfault that will get found in using the threading code.

It seems to me that this will be really fun to close when we do finally actually have a non-experimental threading interface.

We'll continue making progress at thread-safety with PRs like #17250, but I think the original issue is resolved and I don't see any sub-issues mentioned later (which should be opened as separate issues anyways).

The programming model of what you can do with @threads is still pretty limited, but that can be discussed and developed in future issues/PR's.
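For reference, a minimal example of that @threads interface as it was merged (threads are enabled via an environment variable, JULIA_NUM_THREADS in released versions):

  using Base.Threads

  a = zeros(Int, 8)
  @threads for i in 1:8
      a[i] = threadid()   # records which thread executed each iteration
  end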
