Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.
exec_blas is called by all the routines. The code pattern followed is setting up the work queue and calling exec_blas to do all the work through an openmp pragma. The LAPACK code appears to use the exec_blas_async functions.
The easiest way may be to modify the openmp threading backend, which seems amenable to something like the fftw partr backend. To start with, we should ignore lapack threading. We could probably just implement an exec_blas_async fallback that calls exec_blas (and make exec_blas_async_wait a no-op).
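To make the fallback idea concrete, here is a minimal sketch in C. Everything prefixed with toy_ is a hypothetical stand-in and not OpenBLAS's actual types or signatures; the serial loop in toy_exec_blas only stands in for the real threaded dispatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queue entry; the real OpenBLAS struct differs. */
typedef struct toy_queue {
    void (*routine)(void *args);   /* work to run */
    void *args;                    /* argument passed to routine */
    struct toy_queue *next;        /* next entry, or NULL */
} toy_queue_t;

/* Stand-in for exec_blas: runs up to `num` entries and returns once all are
 * done. It is serial here; the real function spreads entries across threads. */
int toy_exec_blas(long num, toy_queue_t *queue) {
    for (long i = 0; i < num && queue != NULL; i++, queue = queue->next)
        queue->routine(queue->args);
    return 0;
}

/* Fallback "async" launch: do all the work synchronously, right now. */
int toy_exec_blas_async(long num, toy_queue_t *queue) {
    return toy_exec_blas(num, queue);
}

/* Nothing is left in flight after the fallback launch, so waiting is a no-op. */
int toy_exec_blas_async_wait(long num, toy_queue_t *queue) {
    (void)num;
    (void)queue;
    return 0;
}

/* Example routine used to exercise the sketch. */
void toy_increment(void *args) {
    ++*(int *)args;
}
```

The cost of this fallback is that the caller's "other work" between the async launch and the wait no longer overlaps with the queued work, which is acceptable if LAPACK threading is ignored to start with.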
All of this should work on windows too, although going through the openmp build route may need some work on the makefiles.
The patch to FFTW should be indicative of something similar to be done for the openblas build.
We now have algorithms in DifferentialEquations.jl which utilize simultaneous implicit methods to enhance the parallelizability of small stiff ODEs and DAEs (i.e. <= 20 ODEs). Right now we'll just document that the user should probably set the BLAS threads to 1, but once this PR is in, this algorithm can serve as a very good test case / showcase of why PARTR mixed into BLAS is useful.
This is a fairly straightforward project for someone who doesn't mind diving in and seeing how it was done in FFTW. I will certainly try it out if nobody gives it a shot in a few weeks.
In the long run, it would be good if partr had a documented C API for spawn/wait, which would give us a lot more flexibility in integrating it with external libraries like this.
Do you think this is something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager perspective.
Yes, we probably have to work with OpenBLAS upstream.
I'm also implementing the FFTW strategy of a pluggable threading backend for Blosc (Blosc/c-blosc2#81).
I think we can make a strong argument to upstream developers that their libraries should use this kind of strategy where possible, because it allows easy composability not only with Julia's partr, but also with Intel's TBB and other threading schedulers. It also seems possible to do this with minimal patches in cases where they have already implemented their own threading.
I think it's attractive to implement this as a runtime option, in addition to existing threading options rather than instead of them, as I did for FFTW and Blosc. That is, we add a single if statement to the existing exec_blas functions:
exec_blas(num, queue) {
    if (threads_callback) {
        // pass work to the callback function
        return;
    }
    // parallelize normally
}
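A slightly fuller sketch of that single if statement, in compilable C, might look like the following. The toy_ names, the callback signature, and the setter function are all assumptions made for illustration; they are not OpenBLAS's (or FFTW's) actual API, and the serial loop merely stands in for the existing threading backend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item; the real OpenBLAS queue entry differs. */
typedef struct {
    void (*routine)(void *args);
    void *args;
} toy_work_t;

/* Hypothetical callback type: it receives the whole batch and must not
 * return until every item has run. */
typedef void (*toy_threads_callback_t)(long num, toy_work_t *queue, void *data);

static toy_threads_callback_t threads_callback = NULL;
static void *threads_callback_data = NULL;

/* Runtime switch: install (or clear, with NULL) the external scheduler hook. */
void toy_set_threads_callback(toy_threads_callback_t cb, void *data) {
    threads_callback = cb;
    threads_callback_data = data;
}

int toy_exec_blas(long num, toy_work_t *queue) {
    if (threads_callback) {
        /* pass work to the callback function */
        threads_callback(num, queue, threads_callback_data);
        return 0;
    }
    /* "parallelize normally": a serial loop stands in for the existing backend */
    for (long i = 0; i < num; i++)
        queue[i].routine(queue[i].args);
    return 0;
}

/* Example callback: runs the items itself and counts how often it was used.
 * A real one would hand each item to the host scheduler instead. */
void toy_counting_callback(long num, toy_work_t *queue, void *data) {
    for (long i = num - 1; i >= 0; i--)
        queue[i].routine(queue[i].args);
    ++*(long *)data;
}

/* Example routine used to exercise the sketch. */
void toy_increment(void *args) {
    ++*(int *)args;
}
```

Because the hook is checked at run time, clearing it with toy_set_threads_callback(NULL, NULL) restores the library's own threading without a rebuild.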
This has three advantages:
Regarding exec_blas_async and exec_blas_async_wait, my hope is that the LAPACK code that calls this could be refactored. My understanding is that it looks something like:
exec_blas_async(queue);
// do some other work
exec_blas_async_wait(queue);
I'm not sure why the "other work" can't simply be added to the queue of parallel tasks, and let the runtime worry about load-balancing.
I posted a very early draft of the requisite changes at xianyi/OpenBLAS#2255
Actually, I thought of an even easier way to implement exec_blas_async: the Julia callback can just spawn the tasks and return. The parallel tasks can set pthread mutex values to indicate that they are complete, just as they do now, and exec_blas_async_wait can wait on those mutexes as it does now, without modification to it or the LAPACK source code.
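Roughly, the spawn-and-return idea could be sketched like this in C. This is only an illustration under stated assumptions: the toy_ names and struct layout are hypothetical, pthread_create stands in for the Julia callback spawning a partr task, and a per-task flag guarded by a mutex and condition variable stands in for whatever completion signalling OpenBLAS actually uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task record; not OpenBLAS's real queue entry. */
typedef struct {
    void (*routine)(void *args);
    void *args;
    int done;                  /* completion flag, guarded by lock */
    int spawned;               /* 1 if a thread was created for this task */
    pthread_mutex_t lock;
    pthread_cond_t finished;
    pthread_t thread;          /* stand-in for a scheduler task handle */
} toy_task_t;

/* Body of each spawned task: run the work, then mark it complete. */
static void *toy_task_main(void *p) {
    toy_task_t *t = p;
    t->routine(t->args);
    pthread_mutex_lock(&t->lock);
    t->done = 1;
    pthread_cond_signal(&t->finished);
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Spawn every task and return immediately, without waiting for any of them. */
int toy_exec_blas_async(long num, toy_task_t *queue) {
    for (long i = 0; i < num; i++) {
        toy_task_t *t = &queue[i];
        t->done = 0;
        pthread_mutex_init(&t->lock, NULL);
        pthread_cond_init(&t->finished, NULL);
        t->spawned = (pthread_create(&t->thread, NULL, toy_task_main, t) == 0);
        if (!t->spawned)
            toy_task_main(t);  /* could not spawn: run inline instead */
    }
    return 0;
}

/* Block until every task has set its completion flag, then clean up. */
int toy_exec_blas_async_wait(long num, toy_task_t *queue) {
    for (long i = 0; i < num; i++) {
        toy_task_t *t = &queue[i];
        pthread_mutex_lock(&t->lock);
        while (!t->done)
            pthread_cond_wait(&t->finished, &t->lock);
        pthread_mutex_unlock(&t->lock);
        if (t->spawned)
            pthread_join(t->thread, NULL);
        pthread_cond_destroy(&t->finished);
        pthread_mutex_destroy(&t->lock);
    }
    return 0;
}

/* Example routine used to exercise the sketch. */
void toy_square(void *args) {
    int *x = args;
    *x = *x * *x;
}
```

The point of the design is that the waiting side only looks at the completion flags, so it does not need to know which scheduler ran the tasks.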
Removing milestone since this certainly wasn't release blocking for 1.3, nor will it be for 1.4 or 1.x.
I'm confused. I thought that now that we switched to a time-based release schedule with 1.x releases, _nothing_ is release-blocking, so should then all the remaining issues be removed from 1.4 milestone as well?
friendly bump on this one. new AMD processors have a ton of threads but I can't take much advantage of PARTR until it works nicely with OpenBLAS since my loops all have various LAPACK calls in them (and I also have standalone LAPACK calls outside of loops that ought to still use all threads)
Increasingly, a lot of libraries in Yggdrasil BB are using openmp, and many of them call BLAS. I suspect that we are increasingly going to see multi-threading clashes between julia threads, pthreaded libraries (openblas), and openmp. The fewer of these we can use the better! I also learnt that if MKL enters the picture, it is yet another library - tbb.
cc @kpamnany