@oschulz identified a fairly major performance regression with multithreading within SolidStateDetectors; I confirmed the regression, further minimized it, and pushed it to https://github.com/mbauman/MultithreadPerf32701. There's still room for further minimization, but in short:
shell> env JULIA_NUM_THREADS=1 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 1 threads
BenchmarkTools.Trial:
memory estimate: 160 bytes
allocs estimate: 6
--------------
minimum time: 15.537 μs (0.00% GC)
median time: 15.618 μs (0.00% GC)
mean time: 15.669 μs (0.00% GC)
maximum time: 33.078 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
shell> env JULIA_NUM_THREADS=20 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 20 threads
BenchmarkTools.Trial:
memory estimate: 160 bytes
allocs estimate: 6
--------------
minimum time: 3.963 μs (0.00% GC)
median time: 4.445 μs (0.00% GC)
mean time: 4.487 μs (0.00% GC)
maximum time: 6.338 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
shell> env JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 1 threads
BenchmarkTools.Trial:
memory estimate: 864 bytes
allocs estimate: 9
--------------
minimum time: 20.437 μs (0.00% GC)
median time: 20.601 μs (0.00% GC)
mean time: 20.992 μs (0.00% GC)
maximum time: 73.357 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
shell> env JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 20 threads
BenchmarkTools.Trial:
memory estimate: 14.94 KiB
allocs estimate: 142
--------------
minimum time: 60.928 μs (0.00% GC)
median time: 76.579 μs (0.00% GC)
mean time: 78.658 μs (1.86% GC)
maximum time: 7.549 ms (97.44% GC)
--------------
samples: 10000
evals/sample: 1
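The benchmark.jl script itself lives in the linked repository; purely as an illustration of this style of measurement (the kernel and names below are made up, not the minimized reproducer), one times a small `@threads` loop with BenchmarkTools:

```julia
using Base.Threads, BenchmarkTools

# Illustrative only: a tiny threaded kernel where task setup/scheduling,
# not the arithmetic, dominates the runtime.
function kernel!(out, x)
    @threads for i in eachindex(out, x)
        @inbounds out[i] = muladd(2.0, x[i], out[i])
    end
    return out
end

out, x = zeros(1024), rand(1024)
@benchmark kernel!($out, $x)
```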
I'm hoping #32477 will help this at least a bit.
Yes, it does help quite a bit — especially in the 1-thread case — but the multithreading case is still nowhere near 1.1.
$ JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 1 threads
BenchmarkTools.Trial:
memory estimate: 928 bytes
allocs estimate: 9
--------------
minimum time: 16.000 μs (0.00% GC)
median time: 16.115 μs (0.00% GC)
mean time: 16.530 μs (0.00% GC)
maximum time: 58.516 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
$ JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 20 threads
BenchmarkTools.Trial:
memory estimate: 14.97 KiB
allocs estimate: 140
--------------
minimum time: 17.536 μs (0.00% GC)
median time: 30.455 μs (0.00% GC)
mean time: 33.053 μs (3.91% GC)
maximum time: 5.826 ms (97.33% GC)
--------------
samples: 10000
evals/sample: 1
CC @lmh91, who brought this to our attention.
The second-most frequent class of threading questions seems to be simple loops with small computational load, often dominated by overhead, where this issue is a killer.
To further quantify the matter, I benchmarked various versions of `axpy!`, for `Float64` vectors. (The practical application is obviously not `axpy!`, but a similar thing without optimized library support.) My baseline is OpenBLAS running 8 threads on 8 cores. "bcast" in the plots is the obvious unthreaded broadcast statement.
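For reference, a sketch of the "bcast" variant and the BLAS baseline (assuming the same switched x/y convention as the threaded version below):

```julia
using LinearAlgebra

# "bcast": the unthreaded broadcast statement (updates x in place, like axpy_thread below)
axpy_bcast!(a, x, y) = (x .+= a .* y; x)

# OpenBLAS baseline: BLAS.axpy!(a, X, Y) computes Y .+= a .* X, so the arguments are swapped here
axpy_blas!(a, x, y) = BLAS.axpy!(a, y, x)
```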
Here is v1.3.0-alpha with #32477 applied (`N` is vector length, `nt` is thread count):
Observations:
Here is v1.1.1:
The old threading scheme was very competitive in the intermediate range.
Here is my threaded implementation for completeness:
```julia
using Base.Threads

function axpy_thread(a, x, y) # yes, I know I switched x and y
    n = length(x)
    if length(y) != n
        throw(ArgumentError("size mismatch"))
    end
    n8 = n >> 3
    # each @threads iteration handles a block of 8 elements
    @threads for j in 0:n8-1
        @inbounds begin
            @simd for i = 1:8
                x[8j+i] += a * y[8j+i]
            end
        end
    end
    # remainder loop for the last n - 8*n8 elements
    @inbounds for j in 8n8+1:n
        x[j] += a * y[j]
    end
end
```
This is not optimized; hardware-dependent tuning can probably shave off something like half for "small" `N` in Julia v1.1.
Just to set expectations, while there are certainly performance improvements that will come before 1.3 final, it's likely that 1.3 will not quite match 1.2's limited but well-tuned threading. It will get there in time, however. If you don't need composable threading and you cannot take any kind of performance hit, it is possible that you will need to skip 1.3 entirely.
Sure - we did encounter this issue with 1.2-rc2 though. :-)
Stefan's numbering is off by one. The 1.1 release (or 1.0) should be used as the comparison. By 1.4 it should be better tuned. v1.2 should be skipped entirely, and you'll need to evaluate whether v1.3 is sufficiently advanced by release (or submit patches to enhance it and accelerate the work).
> Stefan's numbering is off by one.
Ah, right, sorry. I just wanted to confirm that the startup delay we see is the same on v1.2-rc2 and v1.3-alpha, in case that information is helpful - but I guess that was clear to you guys already.
> v1.2 should be skipped entirely
We'll actually need v1.2 for some things, because of #31670 (thanks a lot for fixing that one, @vtjnash!). But I'm certainly not complaining - I do understand that the partr scheduler is a major change and that some foundations for it (please correct me if that's wrong) were already laid in the upcoming v1.2.
I'm sure we can work around the increased scheduling overhead for now (e.g. by running threads longer and implementing a barrier to sync). I didn't mean to convey impatience - I'm very excited by the new possibilities, and I understand that it may take some time to perfect this.
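A minimal sketch of that kind of workaround (illustrative only, not the code we actually use; the helper name is made up): hand each of a small number of long-running tasks a large contiguous chunk and synchronize once at the end, so the scheduling cost is amortized over much more work.

```julia
using Base.Threads

# Hypothetical helper: apply f! to contiguous chunks of eachindex(x),
# one task per thread, with a single synchronization point at the end.
function threaded_chunks!(f!, x)
    n = length(x)
    nt = nthreads()
    tasks = map(1:nt) do t
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        Threads.@spawn f!(x, lo:hi)
    end
    foreach(wait, tasks)   # "barrier": wait for all chunk tasks
    return x
end

# Example: double every element, one chunk per task
threaded_chunks!(rand(10_000)) do x, r
    @inbounds @simd for i in r
        x[i] = 2x[i]
    end
end
```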
> or submit patches to enhance it and accelerate the work
I would if I could - but I'm not sure I can (accelerate that work) :-). From what @JeffBezanson told me at JuliaCon, the inner workings of this are ... tricky (which may be an understatement).
I'm not in a position to offer patches, but here's some data to help those who are:
Profiling blames much of the overhead on this section of `multiq_deletemin` in `partr.c`:
```c
rn1 = cong(heap_p, cong_unbias, &ptls->rngseed);
rn2 = cong(heap_p, cong_unbias, &ptls->rngseed);
prio1 = jl_atomic_load(&heaps[rn1].prio);
prio2 = jl_atomic_load(&heaps[rn2].prio);
```
This is part of a block executed hundreds of times per task-startup[1]. Although the profiler points to `cong` (a simple random-number generator), it seems more plausible that the delay is actually in the atomic loads. In support of this claim, Linux `perf` reports more than 20% L1 cache misses (and more than 5% L2 misses) for the axpy example described above[2].
I think I've seen benchmarks showing that `cilk_for` is not so encumbered; IIUC it is also task-based. This makes me impatient in my hope that Julia's new threading model will be both versatile and efficient.
Edit: notes added for clarification:
[1] The "hundreds of times per task" comes from instrumenting partr.c with thread-friendly counters. From what I understand about the multiqueue algorithm, the large iteration count seems to indicate a (well-hidden) bug.
[2] The cache misses were measured on binaries without the counters.
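For anyone who wants to look at this without recompiling Julia, a rough sketch of collecting a profile that surfaces these C frames (the workload and names are illustrative; `Profile.print(C = true)` includes C frames such as `multiq_deletemin` in the report):

```julia
using Base.Threads, Profile

# Illustrative workload: many small @threads loops so scheduling overhead dominates.
function workload!(x, y, a, iters)
    for _ in 1:iters
        @threads for i in eachindex(x, y)
            @inbounds x[i] += a * y[i]
        end
    end
    return x
end

x, y = rand(10_000), rand(10_000)
workload!(x, y, 2.0, 10)          # warm-up / compilation
Profile.clear()
@profile workload!(x, y, 2.0, 10_000)
Profile.print(C = true)           # show C frames (partr.c etc.) alongside Julia frames
```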
Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application), in Julia v1.2.0 and v1.3.0-rc1. @lmh91 is running a few more tests to confirm.
Note that element-wise computations still do not perform ideally under >= 1.2, as I originally posted here:
```julia
using Base.Threads
using LinearAlgebra
using BenchmarkTools

const N = 100000
a = rand(N)
b = rand(N)

function foo(a, b)
    @threads for i ∈ eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
end

@btime foo($a, $b)
```
And a simple test:
~/codes » /home/opt/julia-1.1.1/bin/julia test.jl
75.924 μs (1 allocation: 32 bytes)

~/codes » /home/opt/julia-1.2.0/bin/julia test.jl
114.931 μs (133 allocations: 13.53 KiB)
It seems there is substantial new overhead just with the `@threads` call.
```julia
using BenchmarkTools

function foo(N)
    Threads.@threads for i in 1:N
    end
end

@btime foo(10000)
```
gives (the value of `N` does not matter):
v1.1: 427.692 ns (1 allocation: 32 bytes)
v1.2: 18.264 μs (28 allocations: 2.25 KiB)
v1.3-rc1: 15.304 μs (30 allocations: 2.83 KiB)
It does matter a fair bit for functions with small amounts of work per thread. I'm seeing approximately a 20% performance hit for some tasks. Not a big deal, but I look forward to if/when it is resolved.
If it's relevant, a lot of time on v1.2/v1.3-rc1 is spent in `/usr/lib/system/libsystem_kernel.dylib : __psynch_cvwait`, whereas practically no time is spent there on v1.1.
> Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application)
Have to correct myself, we still see a performance regression. We'll put together some benchmark numbers.
Looks like v1.5 does reduce the scheduler overhead quite a bit. Here's a new test, measuring `@threads` overhead on a 64-core CPU (AMD Epyc 7702P):
| Threads | v1.0 | v1.4 | v1.5.0-beta1.0 |
| --- | --- | --- | --- |
| 1 thread | 0.07 μs | 4 μs | 2 μs |
| 8 threads | 3 μs | 26 μs | 11 μs |
| 64 threads | 5 μs | 180 μs | 50 μs |
So the situation has definitely improved. We're still far from pre-partr times, and the scheduler overhead still scales strongly with the number of parallel tasks (though not quite linearly). Is that scaling behavior of the overhead inherent due to the way partr works, or could that possibly change in the future? I do realize that this is highly non-trivial, and that the flexibility that partr offers (wouldn't want to miss it) can't come for free.
Test code:
```julia
using Base.Threads
using BenchmarkTools

function foo!(A::AbstractArray, n::Integer)
    for i in 1:n
        @threads for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end
    A
end

function bar!(A::AbstractArray, n::Integer)
    for i in 1:n
        for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end
    A
end

A = rand(nthreads())

@time foo!(A, 10^2);
@time bar!(A, 10^2);

@benchmark foo!($A, 100)
@benchmark bar!($A, 100)
```
The difference between v1.4 and v1.5 seems to be due to the different way `@threads` is implemented.
Using the test below (with 64 threads on 64 cores), I get about 35 μs per iteration (35 ms in total for `n = 1000`) in `do_work` with both v1.4 and v1.5.0-beta1.0. However, I get 190 μs per iteration in `do_work2` with v1.4, compared to only 50 μs with v1.5.0-beta1.0.
Interestingly, `do_work` is still clearly faster than `do_work2` with v1.5.0-beta1.0 (35 μs vs. 50 μs per iteration), even though `do_work` is pretty much what the internal `threading_run` function does. So maybe there's still a little bit of room for optimization in `@threads` somewhere?
```julia
using Base.Threads
using BenchmarkTools

task_func(i::Integer) = let i = i
    () -> begin
    end
end

function do_work(n::Integer)
    for i in 1:n
        # ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]
            for i in eachindex(tasks)
                t = tasks[i]
                t.sticky = true
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i-1)
                schedule(t)
            end
            for i in eachindex(tasks)
                wait(tasks[i])
            end
        finally
            # ccall(:jl_exit_threaded_region, Cvoid, ())
        end
    end
end

function do_work2(n::Integer)
    for i in 1:n
        @threads for i in 1:nthreads()
        end
    end
end

@time do_work(10^3)
@time do_work2(10^3)

@benchmark do_work(10^3)
@benchmark do_work2(10^3)
```
> Interestingly, `do_work` is still clearly faster than `do_work2` with v1.5.0-beta1.0 (35 μs vs. 50 μs per iteration), even though `do_work` is pretty much what the internal `threading_run` function does. So maybe there's still a little bit of room for optimization in `@threads` somewhere?
The difference seems to be due to the `ccall(:jl_enter_threaded_region, Cvoid, ())` and `ccall(:jl_exit_threaded_region, Cvoid, ())` in `threading_run`. Is that something I should use when using sticky tasks, or spawned tasks in general?
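For what it's worth, here is a sketch of the `do_work` loop above with those calls enabled, i.e. the same bracketing that `threading_run` does (reusing `task_func` from the snippet above; whether user code should call these is exactly the open question):

```julia
using Base.Threads

# Variant of do_work with the threaded-region ccalls enabled,
# mirroring what Base.Threads.threading_run does around its sticky tasks.
function do_work_region(n::Integer)
    for _ in 1:n
        ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]
            for (i, t) in enumerate(tasks)
                t.sticky = true
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i - 1)
                schedule(t)
            end
            foreach(wait, tasks)
        finally
            ccall(:jl_exit_threaded_region, Cvoid, ())
        end
    end
end
```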