The work is proportional to the number of elements.
For a mesh of 128,000 elements both a serial and a 1-thread
simulation carry out the computational work in 2.5 seconds.
For a mesh of 1,024,000 elements both a serial and a 1-thread
simulation carry out the computational work in around 20.0 seconds.
So, eight times more work, eight times longer.
Now comes the weird part. When I use 2 threads, so that each thread
works on 512,000 elements, the work per thread should take
10 seconds. However, the work procedure shows that it consumes
around 16.5 seconds.
When I use 4 threads, each thread works on 256,000 elements,
and consequently the work procedure should execute
in 5 seconds. However, the work procedure actually shows
that it consumes roughly 15.6 seconds.
With 8 threads, each thread works on 128,000 elements,
and the work procedure should only take 2.5 seconds.
However, it reports taking roughly 14 seconds.
The threaded execution therefore looks like this:
| Number of elements per thread | Number of threads | Execution time per thread (s) |
|---|---|---|
| 1,024,000 | 1 | 20 |
| 512,000 | 2 | 16.5 |
| 256,000 | 4 | 15.6 |
| 128,000 | 8 | 14 |
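To put the same numbers in terms of parallel efficiency (a small sketch that merely restates the table above; perfect scaling would give an efficiency of 1.0):

```julia
# Each thread handles 1/N of the elements, so ideally T_N = T_1 / N.
T1 = 20.0
for (nthreads, measured) in [(2, 16.5), (4, 15.6), (8, 14.0)]
    expected = T1 / nthreads
    println("$nthreads threads: expected $(expected) s, measured $(measured) s, ",
            "efficiency $(round(expected / measured, digits = 2))")
end
```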
The weird thing is I time the interior of the work procedure.
So that should exclude any overhead associated with threading.
However, as you can see, the number of threads actually affects
how much time the work procedure spends doing the work.
The total amount of time farming out the work to the threads is
very small. The total amount of time collecting the data with `wait`
is pretty much equal to the amount of time reported by
the work procedure. It is as if the overhead related to threading was
very small.
The whole thing can be exercised by

```
git clone https://github.com/PetrKryslUCSD/FinEtoolsDeforNonlinear.jl
cd FinEtoolsDeforNonlinear.jl
export JULIA_NUM_THREADS=8
julia
```

followed, at the Julia prompt, by

```julia
include("threaded_test.jl")
```
I'm sorry I don't have a more minimal working example!
So if I understand correctly, this is an instance of poor performance scaling when adding threads?
It is a little bit more interesting (and harder to understand). The numerical work scales (should scale) perfectly with the number of elements to process: 1/8 of the elements, 1/8 of the time. That is not the case here, however. For 1 million elements the numerical work takes 20 seconds, so for 1/8 of the elements it should take 2.5 seconds. It doesn't; it takes 14 seconds.
For a scaling problem one would expect the numerical work to take progressively less time as the number of threads increases, with the thread overhead eventually making additional threads pointless. What doesn't work here is that the time for the numerical work doesn't decrease the way it is expected to. It is as if the number of threads had something to do with the time it takes to do the numerical work, in some complicated way.
It is almost as if the numerical work for the threads was done in serial instead of in parallel.
It still seems you assume that the rate of work of one thread is independent of how many other threads are running?
Or, are you saying that for a fixed number of threads, doubling the number of elements increases the runtime by more than a factor of two?
I started from 1 million elements. Halving that by using two threads, each
operating on 500 thousand elements, should allow each thread to finish in
half the time. And so on. That is not observed in my simulations. The
amount of time spent computing in the work routine decreases with an
increase in the number of threads, but in some complicated way, not
matching the above predictions. For example, using 8 threads, each
operating on 1,000,000/8 elements, the work routine for each thread should
only spend 2.5 seconds computing. The measured time was 14 seconds (!)
spent inside the work routine of each thread.
The functions in your thread blocks include loops over elements where each iteration involves linear algebra with small (but not tiny) matrices. It appears that OpenBLAS uses a buffer pool protected by a system lock, so these will generate a lot of ~~expensive context switches~~ lock contention when multithreaded. If you can face refactoring crucial variables into StaticArrays, you could avoid this problem as well as some GC contention.
edit: it's just a Mutex
A quick check would be to set `OPENBLAS_NUM_THREADS=1`, so that BLAS calls do not get multi-threaded.
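The same can also be done from within a running session through the LinearAlgebra standard library:

```julia
using LinearAlgebra

# Restrict OpenBLAS to a single thread; equivalent to exporting
# OPENBLAS_NUM_THREADS=1 before starting Julia.
BLAS.set_num_threads(1)
```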
Thank you for your suggestions. Alas, no difference. I have

```julia
julia> ENV
Base.EnvDict with 77 entries:
  ...
  "JULIA_NUM_THREADS" => "8"
  ...
  "OPENBLAS_NUM_THREADS" => "1"
```

I check the number of threads with

```julia
julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
1
```

so it is set to 1.
@RalphAS : Would you know what the cutoff for using threading in BLAS is, in terms of the matrix dimension?
If I set the number of threads to 1, would BLAS thread synchronization be truly turned off, or is it possible that it is still happening (except for a single thread)? There is at least one place where I could find

```c
if (num_cpu > 0) exec_blas_async_wait(num_cpu, &queue[0]);
....
```
I have done the threading in two ways, but both get the same result, so I don't think I messed up writing the threaded execution. First, using `Threads.@spawn`:

```julia
tim = time()
tim1 = time()
tasks = []
for th in 1:length(threadbuffs)
    push!(tasks, Threads.@spawn begin
        tim2 = time()
        fill!(threadbuffs[th].assembler.F_buffer, 0.0)
        # Now add the restoring force from the subset of the mesh handled by this thread.
        restoringforce(threadbuffs[th].femm, threadbuffs[th].assembler, geom, un1, un, tn, dtn, true)
        println("Thread $(Threads.threadid()): $(time() - tim2)")
    end)
end
println("Farm out work: $(time() - tim1)")
# Wait for the tasks to finish, and then add the force from each
# thread buffer to the global force vector.
tim1 = time()
for th in 1:length(tasks)
    wait(tasks[th])
    Fn .+= threadbuffs[th].assembler.F_buffer
end
println("Collect results: $(time() - tim1)")
println("Total: $(time() - tim)")
```
Second, using a `Threads.@threads` loop:

```julia
tim = time()
tim1 = time()
Threads.@threads for th in 1:length(threadbuffs)
    tim2 = time()
    fill!(threadbuffs[th].assembler.F_buffer, 0.0)
    # Now add the restoring force from the subset of the mesh handled by this thread.
    restoringforce(threadbuffs[th].femm, threadbuffs[th].assembler, geom, un1, un, tn, dtn, true)
    println("Thread $(Threads.threadid()): $(time() - tim2)")
end
println("Farm out work: $(time() - tim1)")
# The @threads loop blocks until all iterations finish; now add the force
# from each thread buffer to the global force vector.
tim1 = time()
for th in 1:length(threadbuffs)
    Fn .+= threadbuffs[th].assembler.F_buffer
end
println("Collect results: $(time() - tim1)")
println("Total: $(time() - tim)")
```
The results have of course been checked for correctness: I get the same results with serial execution, and with an arbitrary number of threads.
I started from 1 million elements. Halving that by using two threads, each operating on 500 thousand elements should allow each thread to finish in half the time.
Only with perfect scaling, and there are many factors that can cause it not to be perfect. For element assembly I've observed pretty good scaling though (https://kristofferc.github.io/JuAFEM.jl/dev/examples/threaded_assembly/).
I was talking about the work routine only, i.e. without the thread overhead.
What do you mean without the thread overhead? I still don't understand how you are not assuming that the number of elements per second a thread can process is independent of the total number of threads running concurrently, e.g. when you say
Halving that by using two threads, each operating on 500 thousand elements should allow each thread to finish in half the time.
The number of elements processed in the WORK ROUTINE (the `restoringforce()`
in the code above) is independent (or should be) of the number of threads. For instance, when I run a single thread on a mesh of 256,000 elements, the work done is the same as in each thread when running four (4) threads on a 1,024,000-element mesh. The work routine should consume the same amount of time per thread. That is NOT what I see.
Now when the overhead of threading is added, perfect scaling is unattainable, but that is not my concern here.
Does it make sense now?
The work routine should consume the same amount of time per thread.
Why? The fact that other threads are running concurrently might make this not true?
Of course, this assumes that each thread can run uninterrupted (which may be the case for a sufficiently low load on the machine).
Remember that the work routine does purely numerical work, of precisely defined amount. No communication, no synchronization.
Remember that the work routine does purely numerical work, of precisely defined amount. No communication, no synchronization.
There are still a number of reasons why you won't get perfect scaling even in purely numerical workloads. You can have false sharing, oversubscription of threads (like the discussion about OpenBLAS), higher pressure on memory, relocation of threads to other cores, etc. Especially allocations I have found to be bad for scaling threaded workloads, so make sure you are allocating as little as possible and measure GC time.
Scaling like

| Number of elements per thread | Number of threads | Execution time per thread (s) |
|---|---|---|
| 1,024,000 | 1 | 20 |
| 512,000 | 2 | 16.5 |
| 256,000 | 4 | 15.6 |
| 128,000 | 8 | 14 |
doesn't seem too weird at all.
Basically, the comment
The weird thing is I time the interior of the work procedure.
So that should exclude any overhead associated with threading.
is just not true and seems to be the core (no pun intended) of this issue.
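One way to check the allocation and GC component mentioned above (a minimal sketch: `restoringforce` and its arguments are the ones from the snippets earlier in this thread, and the NamedTuple fields of `@timed` require Julia 1.5 or later; older versions return a plain tuple):

```julia
# @timed reports the wall time, bytes allocated, and time spent in GC.
stats = @timed restoringforce(femm, assembler, geom, un1, un, tn, dtn, true)
println("total: $(stats.time) s, GC: $(stats.gctime) s, allocated: $(stats.bytes) bytes")
```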
There are still a number of reasons why you won't get perfect scaling even in purely numerical workloads. You can have false sharing, oversubscription of threads (like the discussion about OpenBLAS), higher pressure on memory, relocation of threads to other cores, etc. Especially allocations I have found to be bad for scaling threaded workloads, so make sure you are allocating as little as possible and measure GC time.
Even with just two threads the timing does not work out. I go to great lengths to avoid allocations. Not sure what is going on with BLAS yet; setting the number of BLAS threads to 1 did not do anything.
@KristofferC : I think we are talking at cross purposes. It seems to me you mean a situation with all sorts of bad things happening. I mean a situation where things go well. There seems to be something going wrong in my simulations, which might explain the bad timings. The question is what needs to be fixed to make it better.
It would greatly help to have a minimal reproducer.
@PetrKryslUCSD these kinds of long-form discussions on how to optimize an application and get the best performance out of threading are better suited
for Discourse (https://discourse.julialang.org/c/user/perf/37) until we have established that there is an issue with Julia Base at play.
It is hard to judge what the scaling of your application should be and doing so requires some digging in. I would recommend doing some profiling.
@vchuravy : I don't see why this issue was closed. I think we have established that there is some interaction between threading at the Julia source-code level and BLAS. In my opinion that is part of Base, and as such it should be discussed here.
You can swap out the BLAS calls for generic non-threaded versions and see what happens.
Basically, so far, this has only been an example of an application with non-linear scaling, which is more the rule than the exception, so it is unclear what to do here.
This seems like a fairly concrete issue with a reproducer, even if not a minimal one. Reopening to at least get to more of a conclusion here.
Ok so what should be done here?
Now this is interesting: how do I swap out threaded BLAS for non-threaded?
I am replacing all BLAS calls with hand-coded routines. I will post the results soon.
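For instance, a `mul!(C, A, B)` (BLAS gemm) call could be swapped for a plain serial loop along these lines; this is a sketch, not the actual FinEtools code:

```julia
# Hand-coded, single-threaded, allocation-free matrix multiply:
# it never enters OpenBLAS, so there is no contention on its internal lock.
function matmul_serial!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    @assert size(C, 1) == size(A, 1) && size(C, 2) == size(B, 2) && size(A, 2) == size(B, 1)
    fill!(C, zero(eltype(C)))
    @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end
```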
One way would be to change `deps/blas.mk` in the Julia source to disable multi-threading, disable the binary builder, and build Julia from source. I thought setting OPENBLAS_NUM_THREADS to 1 was sufficient, but to be absolutely sure, you have to disable it altogether in the build, which does require you to wade through the Julia build process.
I replaced all (most? I'm not sure yet) BLAS calls with hand-coded Julia. The results are encouraging, but they also confuse the issues further, at least in some respects.
The last item is clearly a cause for concern. People won't be able to efficiently utilize Julia threads unless the interaction with lower-level libraries can be resolved.
Yes, Julia and BLAS do not know about each other's threads. We need to work through this: https://github.com/JuliaLang/julia/issues/32786
OpenBLAS has a fairly high threshold for multithreading: for example the version I've studied is serial up to 256K products in GEMM (e.g. 64x64 times 64x64). As I tried to say earlier, the issue here is not oversubscription of threads but competition for access to memory management (similar to GC in Julia itself). For larger matrices there is a plan to make BLAS multi-threading cooperative with Julia's runtime. For middle-sized matrices there may be enough work to make up for the lock contention.
BLAS is just not designed for small matrices; it works, but not well. For tiny matrices (up to 3x3) the Julia StdLib already has some fast code. For the small matrices you have here, good performance takes work: either hand-coding, interfacing to a package like StaticArrays, or using a library for "batch linear algebra".
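For illustration, the StaticArrays route might look like this (assumes the StaticArrays.jl package; the 8x8 size here is made up):

```julia
using StaticArrays

# SMatrix operations of small fixed size compile to unrolled,
# stack-allocated code and never call into BLAS, so concurrent
# threads do not contend on the OpenBLAS buffer lock.
A = @SMatrix rand(8, 8)
B = @SMatrix rand(8, 8)
C = A * B   # pure Julia, no allocation, no BLAS
```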
We do set the multi-threading threshold to 50. I assume that means multi-threading kicks in only for matrices larger than 50x50. It would be great to see those new interfaces get developed!
OpenBLAS has a fairly high threshold for multithreading: for example the version I've studied is serial up to 256K products in GEMM (e.g. 64x64 times 64x64). As I tried to say earlier, the issue here is not oversubscription of threads but competition for access to memory management (similar to GC
So, to verify that I understand: I think you're saying that BLAS allocates some workspace(s), and the library protects these with locks in a multi-threading environment, which then leads to a general slowdown of any code that uses BLAS in any form. Is that a correct understanding of your argument?
BLAS is just not designed for small matrices; it works, but not well. For tiny matrices (up to 3x3) the Julia StdLib already has some fast code. For the small matrices you have here, good performance takes work: either hand-coding, interfacing to a package like StaticArrays, or using a library for "batch linear algebra".
This sounds intriguing: what is "batch linear algebra"?
Thanks for your help.
Upon rewriting my code to avoid BLAS in any form, I achieved the following parallel efficiencies on a better Windows machine (32 cores, 128 GB of memory), run with the maximum number of threads set to 24 (and the default number of BLAS threads):
2 threads: 0.95
4 threads: 0.87
16 threads: 0.42
On a Linux machine (64 cores, 256 GB of memory, the maximum number of threads to use set to 32), the following parallel efficiencies could be achieved:
2 threads: 0.96
4 threads: 0.93
16 threads: 0.91
It would appear that the current thread scheduling is quite functional. What does need some improvement is the management of expectations when a user combines user-level threaded code with library-level multithreaded code (such as BLAS). Currently the documentation does not manage expectations well: there should be a clear statement that the current thread scheduler cannot coordinate with BLAS. From what I have seen, writing multithreaded code that happens to call BLAS will lead to disappointment. Furthermore, there should be a warning in the documentation that currently setting the number of BLAS threads to 1 will NOT improve the performance of multithreaded code by reducing contention within BLAS.
I'd be happy to draft something, should you agree.
We do set the multi-threading threshold to 50. I assume that means multi-threading kicks in only for matrices larger than 50x50.
Does it mean it'd be better to use a pure-Julia implementation for matrices smaller than 50x50? For example, we use a pure-Julia implementation in `mul!` for 2x2 and 3x3. Does it make sense to bump this up to 50?
@tkf does that make more sense than fixing https://github.com/JuliaLang/julia/issues/32786?
My understanding is that what I mentioned is orthogonal to #32786. #32786 is useful if you want to let OpenBLAS use multiple threads (when the problem is large enough). Using a pure-Julia implementation when the problem is not large enough is useful for avoiding the lock contention described in https://github.com/JuliaLang/julia/issues/34102#issuecomment-565864782.
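A sketch of what such a size-based cutoff could look like (illustrative only: `mymul!` and its cutoff handling are hypothetical; the real dispatch would live inside `mul!` itself):

```julia
using LinearAlgebra

# Illustrative size-based dispatch: a plain serial loop below the cutoff,
# (possibly threaded) BLAS via mul! above it.
function mymul!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}; cutoff::Int = 50)
    @assert size(C, 1) == size(A, 1) && size(C, 2) == size(B, 2) && size(A, 2) == size(B, 1)
    if max(size(A, 1), size(A, 2), size(B, 2)) < cutoff
        fill!(C, 0.0)
        @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
            C[i, j] += A[i, k] * B[k, j]
        end
        return C
    else
        return mul!(C, A, B)  # delegate to BLAS for large problems
    end
end
```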