Files: https://gist.github.com/fredrikekre/dbe530ecf5fe542fad6564eba25fa0d8
Versioninfo:
julia> versioninfo()
Julia Version 1.4.0-DEV.77
Commit 2021d03 (2019-08-30 12:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
When run with 1 thread everything works nicely:
$ export JULIA_NUM_THREADS=1 && julia-master --project run.jl
1.232857 seconds (413.48 k allocations: 33.195 MiB)
With 2 threads, scaling is almost perfect (consistently):
$ export JULIA_NUM_THREADS=2 && julia-master --project run.jl
0.646847 seconds (413.55 k allocations: 33.201 MiB)
With 3 or 4 threads you get either almost perfect scaling or a roughly 100x slowdown:
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
0.351274 seconds (413.67 k allocations: 33.213 MiB)
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
0.362900 seconds (413.67 k allocations: 33.213 MiB)
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
36.957078 seconds (2.40 G allocations: 61.752 GiB, 52.65% gc time)
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
39.942106 seconds (2.40 G allocations: 61.752 GiB, 50.58% gc time)
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
38.238709 seconds (2.40 G allocations: 61.752 GiB, 51.50% gc time)
$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl
0.375787 seconds (413.66 k allocations: 33.213 MiB)
Edit: This does not happen on Julia 1.1, but I see it on 1.2, 1.3 and master.
Threading in Julia:
Could you get a profile including C frames?
Updated the gist with a profile for one fast run and one slow.
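For anyone following along, a profile that includes C frames can be collected roughly as below. This is only a sketch: busy is a hypothetical stand-in for the real workload in run.jl, and the iteration count is an arbitrary choice to make the run long enough to collect samples.

```julia
using Profile

# Hypothetical stand-in for the real workload in run.jl.
function busy(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

busy(10)                   # run once so compilation is not part of the profile
Profile.clear()            # drop any samples collected earlier
@profile busy(50_000_000)  # sample the call
Profile.print(C = true)    # C = true keeps frames from C code (runtime, GC, codegen)
```

Comparing the output of a fast run with that of a slow run should show where the extra time goes.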
It looks like the slow case is running an unoptimized version of assemble_cell!. It might be falling back to that if one thread tries to run the function while another thread is compiling it? Probably some race of that form.
Okay, if I compile assemble_cell! first, I cannot reproduce this.
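A possible workaround along those lines is to call the function once on a single thread before entering the threaded loop, so that it is already compiled when the threads start. The sketch below assumes a hypothetical kernel! in place of assemble_cell! (the real function is in the gist); whether this avoids the slowdown in every case is not confirmed here.

```julia
using Base.Threads

# Hypothetical stand-in for assemble_cell!; the real function lives in the gist.
function kernel!(out, i)
    out[i] = sum(abs2, 1:i)
    return out
end

# Threaded driver loop: each iteration writes to its own slot of out.
function run_threaded!(out)
    @threads for i in eachindex(out)
        kernel!(out, i)
    end
    return out
end

out = zeros(Int, 1000)
kernel!(out, 1)            # warm-up call on one thread: forces compilation up front
@time run_threaded!(out)   # the threaded run now starts with kernel! already compiled
```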
If you remove the nested @threads (keeping only the innermost one), does it still hang?
That solves it, and 4 threads run at almost the same speed as one thread. That may be because the dataset is too cheap to process for parallelism to pay off.
Okay, so it seems to be something with the nested @threads then. I don't think it is related to this issue, though.
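To make the two variants concrete, here is a minimal sketch of a nested @threads loop next to a version that threads only the innermost loop. The matrix-filling workload and both function names are hypothetical; the code from the report is not shown in this thread.

```julia
using Base.Threads

# Hypothetical workload: fill a matrix, one entry per (i, j) pair.
function fill_nested!(A)
    @threads for i in axes(A, 1)
        @threads for j in axes(A, 2)   # nested: the inner loop is threaded again
            A[i, j] = i * j
        end
    end
    return A
end

function fill_single!(A)
    for i in axes(A, 1)
        @threads for j in axes(A, 2)   # only the innermost loop is threaded
            A[i, j] = i * j
        end
    end
    return A
end

A = fill_nested!(zeros(Int, 4, 5))
B = fill_single!(zeros(Int, 4, 5))
```

Both variants should produce the same matrix; only the scheduling differs, which is what a separate issue would need to isolate.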
Should I create a separate issue? I can test with two nested @threads instead of three and see whether it recovers.
Sounds good with a separate issue.