Julia: Multithreading performance regression

Created on 26 Jul 2019 · 16 comments · Source: JuliaLang/julia

@oschulz identified a fairly major performance regression with multithreading within SolidStateDetectors; I confirmed the regression, further minimized it, and pushed it to https://github.com/mbauman/MultithreadPerf32701. There's still room for further minimization, but in short:

shell> env JULIA_NUM_THREADS=1 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 1 threads
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  6
  --------------
  minimum time:     15.537 μs (0.00% GC)
  median time:      15.618 μs (0.00% GC)
  mean time:        15.669 μs (0.00% GC)
  maximum time:     33.078 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
shell> env JULIA_NUM_THREADS=20 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 20 threads
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  6
  --------------
  minimum time:     3.963 μs (0.00% GC)
  median time:      4.445 μs (0.00% GC)
  mean time:        4.487 μs (0.00% GC)
  maximum time:     6.338 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
shell> env JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 1 threads
BenchmarkTools.Trial: 
  memory estimate:  864 bytes
  allocs estimate:  9
  --------------
  minimum time:     20.437 μs (0.00% GC)
  median time:      20.601 μs (0.00% GC)
  mean time:        20.992 μs (0.00% GC)
  maximum time:     73.357 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
shell> env JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 20 threads
BenchmarkTools.Trial: 
  memory estimate:  14.94 KiB
  allocs estimate:  142
  --------------
  minimum time:     60.928 μs (0.00% GC)
  median time:      76.579 μs (0.00% GC)
  mean time:        78.658 μs (1.86% GC)
  maximum time:     7.549 ms (97.44% GC)
  --------------
  samples:          10000
  evals/sample:     1

All 16 comments

I'm hoping #32477 will help this at least a bit.

Yes, it does help quite a bit — especially in the 1-thread case — but the multithreading case is still nowhere near 1.1.

$ JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 1 threads
BenchmarkTools.Trial:
  memory estimate:  928 bytes
  allocs estimate:  9
  --------------
  minimum time:     16.000 μs (0.00% GC)
  median time:      16.115 μs (0.00% GC)
  mean time:        16.530 μs (0.00% GC)
  maximum time:     58.516 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
$ JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 20 threads
BenchmarkTools.Trial:
  memory estimate:  14.97 KiB
  allocs estimate:  140
  --------------
  minimum time:     17.536 μs (0.00% GC)
  median time:      30.455 μs (0.00% GC)
  mean time:        33.053 μs (3.91% GC)
  maximum time:     5.826 ms (97.33% GC)
  --------------
  samples:          10000
  evals/sample:     1

CC @lmh91, who brought this to our attention.

The second-most frequent class of threading questions seems to involve simple loops with a small computational load, often dominated by overhead, where this issue is a killer.

To further quantify the matter, I benchmarked various versions of axpy!, for Float64 vectors. (The practical application is obviously not axpy!, but a similar thing without optimized library support.) My baseline is OpenBLAS running 8 threads on 8 cores. "bcast" in the plots is the obvious unthreaded broadcast statement.
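
For reference, the two baselines are presumably along these lines (a sketch with illustrative helper names; the exact benchmark code is not shown in this thread, and the argument order follows the axpy_thread function below):

using LinearAlgebra

BLAS.set_num_threads(8)                    # OpenBLAS baseline: 8 threads on 8 cores

blas_axpy!(a, x, y) = BLAS.axpy!(a, y, x)  # x .= x .+ a .* y via OpenBLAS
bcast_axpy!(a, x, y) = (x .+= a .* y; x)   # "bcast": plain unthreaded broadcast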

Here is v1.3.0-alpha with #32477 applied (N is vector length, nt is thread count):
[plot: v1_3_axpy_bench — axpy! timings on v1.3.0-alpha with #32477, time vs. N for each nt]

Observations:

  • Unthreaded broadcast does as well as OpenBLAS for small N.
  • The current overhead of Tasks is crippling in the intermediate range.
  • On my system I need at least 4 threads to get nearly full memory bandwidth (the lower bound for large N), so broadcast falls behind there, but any multithreading scheme eventually benefits.

Here is v1.1.1:
[plot: v1_axpy_bench — axpy! timings on v1.1.1, time vs. N for each nt]

The old threading scheme was very competitive in the intermediate range.

Here is my threaded implementation for completeness:

using Base.Threads

function axpy_thread(a, x, y) # yes, I know I switched x and y
    n = length(x)
    if length(y) != n
        throw(ArgumentError("size mismatch"))
    end
    n8 = n >> 3                   # number of complete blocks of 8 elements
    @threads for j in 0:n8-1      # each iteration processes one block of 8
        @inbounds begin
            @simd for i=1:8
                x[8j+i] += a * y[8j+i]
            end
        end
    end
    @inbounds for j in 8n8+1:n    # handle the remainder serially
        x[j] += a * y[j]
    end
end

This is not optimized; hardware-dependent tuning can probably shave off something like half for "small" N in Julia v1.1.

Just to set expectations, while there are certainly performance improvements that will come before 1.3 final, it's likely that 1.3 will not quite match 1.2's limited but well-tuned threading. It will get there in time, however. If you don't need composable threading and you cannot take any kind of performance hit, it is possible that you will need to skip 1.3 entirely.

Sure - we did encounter this issue with 1.2-rc2 though. :-)

Stefan's numbering is off by one. The 1.1 release (or 1.0) should be used as the comparison. By 1.4 it should be better tuned. v1.2 should be skipped entirely, and you'll need to evaluate whether v1.3 is sufficiently advanced by release (or submit patches to enhance it and accelerate the work).

Stefan's numbering is off by one.

Ah, right, sorry. I just wanted to confirm that the startup delay we see is the same on v1.2-rc2 and v1.3-alpha, in case that information is helpful - but I guess that was clear to you guys already.

v1.2 should be skipped entirely

We'll actually need v1.2 for some things, because of #31670 (thanks a lot for fixing that one, @vtjnash!). But I'm certainly not complaining - I do understand that the partr scheduler is a major change and that some foundations for it (please correct me if that's wrong) were already laid in the upcoming v1.2.

I'm sure we can work around the increased scheduling overhead for now (e.g. by running threads longer and implementing a barrier to sync). I didn't mean to convey impatience - I'm very excited by the new possibilities, and I understand that it may take some time to perfect this.
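
One simple form of "running threads longer" is to give each thread a single contiguous chunk instead of many short iterations, so the per-task scheduling cost is paid only nthreads() times. A minimal sketch of that idea (not the code we actually use), applied to the axpy example above:

using Base.Threads

function axpy_chunked!(a, x, y)
    n = length(x)
    nt = nthreads()
    @threads for t in 1:nt
        # Thread t handles one long contiguous chunk of the index range.
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        @inbounds @simd for i in lo:hi
            x[i] += a * y[i]
        end
    end
    return x
end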

or submit patches to enhance it and accelerate the work

I would if I could - but I'm not sure I can (accelerate that work) :-). From what @JeffBezanson told me at JuliaCon, the inner workings of this are ... tricky (which may be an understatement).

I'm not in a position to offer patches, but here's some data to help those who are:

Profiling blames much of the overhead on this section of multiq_deletemin in partr.c:

        rn1 = cong(heap_p, cong_unbias, &ptls->rngseed);
        rn2 = cong(heap_p, cong_unbias, &ptls->rngseed);
        prio1 = jl_atomic_load(&heaps[rn1].prio);
        prio2 = jl_atomic_load(&heaps[rn2].prio);

This is part of a block executed hundreds of times per task-startup[1]. Although the profiler points to cong (a simple random-number generator), it seems more plausible that the delay is actually in the atomic loads. In support of this claim, Linux perf reports more than 20% L1 cache misses (and more than 5% L2 misses) for the axpy example described above[2].
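
For a Julia-level view of this cost, the raw overhead of scheduling and waiting on a single empty task (the path that ends up in the scheduler code above) can be timed with a trivial spawn/wait micro-benchmark; a sketch, assuming a 1.3-era build where Threads.@spawn is available:

using BenchmarkTools

# Approximate per-task overhead: spawn an empty task and wait for it.
spawn_wait() = wait(Threads.@spawn nothing)

@btime spawn_wait()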

I think I've seen benchmarks showing that cilk_for is not so encumbered; IIUC it is also task-based. This makes me impatient in my hope that Julia's new threading model will be both versatile and efficient.


Edit: notes added for clarification:
[1] The "hundreds of times per task" comes from instrumenting partr.c with thread-friendly counters. From what I understand about the multiqueue algorithm, the large iteration count seems to indicate a (well-hidden) bug.
[2] The cache misses were measured on binaries without the counters.

Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application), in Julia v1.2.0 and v1.3.0-rc1. @lmh91 is running a few more tests to confirm.

At least, element-wise computations still don't perform ideally under >= 1.2; here is what I originally posted:

using Base.Threads
using LinearAlgebra
using BenchmarkTools

const N = 100000
a = rand(N)
b = rand(N)

function foo(a, b)
    @threads for i ∈ eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
end

@btime foo($a, $b)

And a simple test:

~/codes » /home/opt/julia-1.1.1/bin/julia test.jl                                     pshi@discover
  75.924 μs (1 allocation: 32 bytes)
---------------------------------------------------------------------------------------------------------
~/codes » /home/opt/julia-1.2.0/bin/julia test.jl                                     pshi@discover
  114.931 μs (133 allocations: 13.53 KiB)

It seems there is substantial new overhead just with the @threads call.

function foo(N)
    Threads.@threads for i in 1:N
    end
end

using BenchmarkTools
@btime foo(10000)

gives (the value of N does not matter)

v1.1: 427.692 ns (1 allocation: 32 bytes)
v1.2: 18.264 μs (28 allocations: 2.25 KiB)
v1.3-rc1: 15.304 μs (30 allocations: 2.83 KiB)

It does matter a fair bit for functions with small amounts of work per thread. I'm seeing approximately a 20% performance hit for some tasks. Not a big deal, but I look forward to if/when it is resolved.

If it's relevant, a lot of time on v1.2/v1.3-rc1 is spent in /usr/lib/system/libsystem_kernel.dylib : __psynch_cvwait whereas practically no time is spent there on v1.1.
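
How that was measured isn't stated; one way to surface C frames such as __psynch_cvwait from within Julia is the built-in profiler with C-frame printing enabled, e.g. (profiling the foo(N) example above, which may not be what was actually measured):

using Profile

@profile foo(10000)      # foo(N) as defined a few comments up
Profile.print(C = true)  # include C frames such as __psynch_cvwait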

Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application)

Have to correct myself, we still see a performance regression. We'll put together some benchmark numbers.

Looks like v1.5 does reduce the scheduler overhead quite a bit. Here's a new test, measuring @threads overhead on a 64-core CPU (AMD Epyc 7702P):

| Threads | v1.0 | v1.4 | v1.5.0-beta1.0 |
| --- | --- | --- | --- |
| 1 thread | 0.07 μs | 4 μs | 2 μs |
| 8 threads | 3 μs | 26 μs | 11 μs |
| 64 threads | 5 μs | 180 μs | 50 μs |

So the situation has definitely improved. We're still far from pre-partr times, and the scheduler overhead still scales strongly with the number of parallel tasks (though not quite linearly). Is that scaling behavior of the overhead inherent due to the way partr works, or could that possibly change in the future? I do realize that this is highly non-trivial, and that the flexibility that partr offers (wouldn't want to miss it) can't come for free.

Test code:

using Base.Threads
using BenchmarkTools

function foo!(A::AbstractArray, n::Integer)
    for i in 1:n
        @threads for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end

    A
end

function bar!(A::AbstractArray, n::Integer)
    for i in 1:n
        for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end

    A
end

A = rand(nthreads())

@time foo!(A, 10^2);
@time bar!(A, 10^2);

@benchmark foo!($A, 100)
@benchmark bar!($A, 100)

The difference between v1.4 and v1.5 seems to be due to the different way @threads is implemented.

Using the test below (with 64 threads on 64 cores), I get about 35 us per iteration (35 ms in total for n = 1000) in do_work with both v1.4 and v1.5.0-beta1.0. However, I get 190 us per iteration in do_work2 with v1.4, compared to only 50 us with v1.5.0-beta1.0.

Interestingly, do_work is still clearly faster than do_work2 with v1.5.0-beta1.0 (35 us vs. 50 us per iteration), even though do_work is pretty much what the internal threading_run function does. So maybe there's still a little bit of room for optimization in @threads somewhere?

using Base.Threads
using BenchmarkTools

task_func(i::Integer) = let i = i
    () -> begin
    end
end

function do_work(n::Integer)
    for i in 1:n
        # ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]

            for i in eachindex(tasks)
                t = tasks[i]
                t.sticky = true                                       # pin the task to one thread
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i-1)   # assign a 0-based thread id
                schedule(t)
            end

            for i in eachindex(tasks)
                wait(tasks[i])
            end
        finally
            # ccall(:jl_exit_threaded_region, Cvoid, ())
        end    
    end
end

function do_work2(n::Integer)
    for i in 1:n
        @threads for i in 1:nthreads()
        end
    end
end


@time do_work(10^3)
@time do_work2(10^3)

@benchmark do_work(10^3)
@benchmark do_work2(10^3)

Interestingly, do_work is still clearly faster than do_work2 with v1.5.0-beta1.0 (35 us vs. 50 us per iteration), even though do_work is pretty much what the internal threading_run function does. So maybe there's still a little bit of room for optimization in @threads somewhere?

The difference seems to be due to the ccall(:jl_enter_threaded_region, Cvoid, ()) and ccall(:jl_exit_threaded_region, Cvoid, ()) in threading_run. Is that something I should use when using sticky tasks, or spawned tasks in general?
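
To test that hypothesis, one could re-enable those two ccalls around the sticky tasks from do_work above; a sketch of that variant (do_work_region is just an illustrative name, reusing task_func from above):

using Base.Threads

# Like do_work, but wraps the sticky tasks in the threaded region,
# mirroring what Base's internal threading_run does.
function do_work_region(n::Integer)
    for _ in 1:n
        ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]
            for (i, t) in enumerate(tasks)
                t.sticky = true                                        # pin the task to one thread
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i - 1)  # assign a 0-based thread id
                schedule(t)
            end
            foreach(wait, tasks)
        finally
            ccall(:jl_exit_threaded_region, Cvoid, ())
        end
    end
end

@time do_work_region(10^3)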
