Julia: Random extreme slowdowns in multithreaded code

Created on 20 Sep 2019 · 10 Comments · Source: JuliaLang/julia

Files: https://gist.github.com/fredrikekre/dbe530ecf5fe542fad6564eba25fa0d8

Versioninfo:

julia> versioninfo()
Julia Version 1.4.0-DEV.77
Commit 2021d03 (2019-08-30 12:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

When run with 1 thread everything works nicely:

$ export JULIA_NUM_THREADS=1 && julia-master --project run.jl 
  1.232857 seconds (413.48 k allocations: 33.195 MiB)

With 2 threads, scaling is almost perfect (consistently):

$ export JULIA_NUM_THREADS=2 && julia-master --project run.jl 
  0.646847 seconds (413.55 k allocations: 33.201 MiB)

With 3 or 4 threads you get either almost perfect scaling or a 100x slowdown:

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
  0.351274 seconds (413.67 k allocations: 33.213 MiB)

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
  0.362900 seconds (413.67 k allocations: 33.213 MiB)

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
 36.957078 seconds (2.40 G allocations: 61.752 GiB, 52.65% gc time)

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
 39.942106 seconds (2.40 G allocations: 61.752 GiB, 50.58% gc time)

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
 38.238709 seconds (2.40 G allocations: 61.752 GiB, 51.50% gc time)

$ export JULIA_NUM_THREADS=4 && julia-master --project run.jl 
  0.375787 seconds (413.66 k allocations: 33.213 MiB)

Edit: This does not happen on Julia 1.1, but I see it on 1.2, 1.3 and master.

Labels: multithreading, performance

All 10 comments

Threading in Julia:

[Image: Screenshot 2019-09-20 at 13 34 39]

Could you get a profile including C frames?
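For anyone following along, a minimal sketch of collecting such a profile with the `Profile` standard library. The `work` function is a hypothetical stand-in for the real workload in the gist; `C = true` asks `Profile.print` to include frames from C code (runtime, GC, codegen) alongside Julia frames.

```julia
using Profile

# Hypothetical stand-in for the real workload in the gist.
function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

work(10)                 # run once so compilation is not part of the profile
Profile.clear()          # drop samples from any earlier run
@profile work(10^7)      # collect samples while the workload runs
Profile.print(C = true)  # print the call tree, including C frames
```

Comparing the output of a fast run and a slow run should show where the extra time goes.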

Updated the gist with a profile for one fast run and one slow.

It looks like the slow case is running an unoptimized version of assemble_cell!. It might be falling back to that if one thread tries to run the function while the other thread is compiling? Some race of that form probably.

Okay, when compiling assemble_cell! first I cannot reproduce this.
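A minimal sketch of that workaround, assuming a stand-in `assemble_cell!` (the real one is in the gist and is not reproduced here; `assemble_all!` is likewise a hypothetical name): call the function once on a single thread so it is compiled before the threaded loop starts.

```julia
using Base.Threads

# Hypothetical stand-in for `assemble_cell!` from the gist.
function assemble_cell!(out::Vector{Float64}, i::Int)
    out[i] = sum(sqrt(j) for j in 1:1000) * i
    return out
end

# Threaded driver; each iteration writes to its own slot, so there is no data race.
function assemble_all!(out::Vector{Float64})
    @threads for i in eachindex(out)
        assemble_cell!(out, i)
    end
    return out
end

out = zeros(8)
assemble_cell!(out, 1)   # warm-up call: compiles the method on one thread
fill!(out, 0.0)          # discard the warm-up result
assemble_all!(out)       # threaded run now uses the already compiled method
```

`precompile(assemble_cell!, (Vector{Float64}, Int))` is another way to request compilation ahead of time without running the function. This only sidesteps the slowdown; it does not fix the underlying race.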

If you remove the nested @threads (only keeping the innermost one), does it still hang?

That solves it, and with 4 threads it runs at almost the same speed as with one thread. That may be because the dataset is too small to benefit from parallelism.

Okay, seems to be something with the nested @threads then. I don't think it is related to this issue though.
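To illustrate the restructuring discussed here, a rough sketch with a hypothetical matrix-filling workload (the commenter's actual code is not shown in this thread, so the function names and loop bodies are assumptions). The first version nests `@threads`; the second keeps only the innermost one.

```julia
using Base.Threads

# Nested form: an outer and an inner @threads loop.
function fill_nested!(A::Matrix{Float64})
    @threads for j in axes(A, 2)
        @threads for i in axes(A, 1)
            A[i, j] = i + 10j
        end
    end
    return A
end

# Restructured form: plain outer loop, only the innermost loop is threaded.
function fill_inner_only!(A::Matrix{Float64})
    for j in axes(A, 2)
        @threads for i in axes(A, 1)
            A[i, j] = i + 10j
        end
    end
    return A
end

A = fill_nested!(zeros(4, 3))
B = fill_inner_only!(zeros(4, 3))
```

Both versions compute the same result; how the nested form is scheduled depends on the Julia version, so timing each one separately is the way to see whether the nesting is the problem.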

Should I create a separate issue? I can test with two nested @threads levels instead of three and see whether it recovers.

A separate issue sounds good.
