Julia: Multithreading race condition: `cannot eval a new struct type definition while defining another type`

Created on 6 Sep 2019 · 10 Comments · Source: JuliaLang/julia

It appears that when two Tasks attempt to eval a new struct type definition at the same time, one of them throws an exception, presumably because of a race condition.

I noticed this when running the `main()` function below under BenchmarkTools via `@btime main(5, 10)`:

julia> function work(i, v, n)
           out = v
           for i in 1:n
               mul = @eval (a) -> a*$v
               out = Base.invokelatest(mul, out)
           end
           return out
       end
work (generic function with 1 method)

julia> function main(nmuls, nqueries)
           @sync begin
               for i in 1:nqueries
                   Threads.@spawn begin
                       work(i, 2, nmuls)
                   end
               end
           end
       end
main (generic function with 1 method)

julia> for _ in 1:1000; main(5, 10); end   # This is where I called `@btime main(5,10)`
ERROR: TaskFailedException:
cannot eval a new struct type definition while defining another type
Stacktrace:
 [1] top-level scope at none:0
 [2] top-level scope at REPL[8]:1
 [3] eval at ./boot.jl:330 [inlined]
 [4] work(::Int64, ::Int64, ::Int64) at ./REPL[6]:4
 [5] macro expansion at ./REPL[7]:5 [inlined]
 [6] (::var"##185#186"{Int64,Int64})() at ./threadingconstructs.jl:113
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] main(::Int64, ::Int64) at ./REPL[7]:2
 [4] top-level scope at ./REPL[8]:1

It doesn't happen often; I needed to use a sufficiently large number of loops to trigger it.

I see this on both Julia 1.3 and a recent 1.4 from master:

julia> versioninfo()
Julia Version 1.4.0-DEV.79
Commit d3250fe005* (2019-09-02 19:07 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 3

(Note that in this case I was evaling new anonymous functions, not structs, which I had hoped would be entirely independent and thus free of contention. In fact they _did_ cause contention, since there appears to be a global mutex around compilation, but that mutex doesn't seem to cover enough places to protect against this error.)

multithreading

All 10 comments

Is there any workaround for this issue? Something similar happens to me with a `@threads for` loop. With JULIA_NUM_THREADS=1 it works fine. I see there is progress in the PR above.

(The use case is a genetic algorithm: when evaluating the fitness function, an anonymous function is eval'd for every individual.)
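When the generated function differs only in the values it captures, as in the examples in this thread, one possible way around the error is to skip `eval` and build an ordinary closure instead. The sketch below assumes the fitness function can be written that way; the names `make_mul` and `run_all` are hypothetical and do not come from the thread. It will not help when the function body itself is assembled from arbitrary expressions at run time.

```julia
using Base.Threads

# Hypothetical helper: returns a closure over `v` instead of eval'ing a new function.
make_mul(v) = a -> a * v

function run_all(nqueries)
    results = Vector{Int}(undef, nqueries)
    @threads for i in 1:nqueries
        mul = make_mul(i)      # no eval, so nothing is defined at top level while threads run
        results[i] = mul(i)    # ordinary call; invokelatest is not needed
    end
    return results
end
```

Because no task calls `eval`, the tasks never enter the top-level type-definition path that raises the error.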

+1 :)

We have been hitting this error again more often now that we're trying to seriously start using multithreading in our system.

I actually forgot I'd reported this issue a month ago, and I had spent a day already trying to track it down again! 😂 Haha thanks so much for the _timely_ ping, @cserteGT3! 😂

It looks like @JeffBezanson has been working on a fix, starting 10 days ago: https://github.com/JuliaLang/julia/pull/33553. Thanks Jeff :)


In the much shorter term, since #33553 looks like a large refactoring PR, of the sort that makes backporting unlikely, is it possible we can fix this issue by just adding some more locks? I'd happily trade more contention for not erroring.

As is, this issue will probably be a blocker for us seriously using multithreading in the Julia 1.3 release. :'(

> I actually forgot I'd reported this issue a month ago, and I had spent a day already trying to track it down again! 😂 Haha thanks so much for the _timely_ ping, @cserteGT3! 😂

You're welcome 😂

> In the much shorter term, since #33553 looks like a large refactoring PR, of the sort that makes backporting unlikely, is it possible we can fix this issue by just adding some more locks? I'd happily trade more contention for not erroring.

I had the same idea for a workaround and created an MWE based on my use case.

test.jl:

using Base.Threads
using Base: Semaphore, acquire, release

function work(v)
   a = gensym()
   expr = Expr(:function,
      Expr(:tuple, a),
      Expr(:block, Expr(:call, *, a, v)))
   mul = eval(expr)
   out = Base.invokelatest(mul, v)
   return out
end

function main(nqueries)
   @threads for i in 1:nqueries
       work(i)
   end
end

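function work2(v, sem)
   a = gensym()
   expr = Expr(:function,
      Expr(:tuple, a),
      Expr(:block, Expr(:call, *, a, v)))
   # Serialize eval across threads; release in `finally` so an error
   # inside eval cannot leave the semaphore held.
   acquire(sem)
   mul = try
      eval(expr)
   finally
      release(sem)
   end
   out = Base.invokelatest(mul, v)
   return out
end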

function main2(nqueries)
   s = Semaphore(1)
   @threads for i in 1:nqueries
       work2(i, s)
   end
end

results:

julia> using Base.Threads

julia> nthreads()
4

julia> include("test.jl")
main2 (generic function with 1 method)

julia> main(10)

julia> main(100_000)
ERROR: TaskFailedException:
cannot eval a new struct type definition while defining another type
Stacktrace:
 [1] top-level scope at none:0
 [2] top-level scope at REPL[5]:1
 [3] eval at .\boot.jl:330 [inlined]
 [4] eval(::Expr) at .\client.jl:433
 [5] work(::Int64) at C:\Users\cstamas\Documents\GIT\MasterThesis\racecondition.jl:11
 [6] macro expansion at C:\Users\cstamas\Documents\GIT\MasterThesis\racecondition.jl:18 [inlined]
 [7] (::var"#2#threadsfor_fun#3"{UnitRange{Int64}})(::Bool) at .\threadingconstructs.jl:61
 [8] (::var"#2#threadsfor_fun#3"{UnitRange{Int64}})() at .\threadingconstructs.jl:28
Stacktrace:
 [1] wait(::Task) at .\task.jl:251
 [2] macro expansion at .\threadingconstructs.jl:69 [inlined]
 [3] main(::Int64) at C:\Users\cstamas\Documents\GIT\MasterThesis\racecondition.jl:17
 [4] top-level scope at REPL[5]:1

julia> main2(10)

julia> main2(100_000)

julia> @time main(100)
  0.440021 seconds (53.01 k allocations: 3.235 MiB)

julia> @time main2(100)
  0.454643 seconds (58.63 k allocations: 3.557 MiB)

This works for me on v1.3.0-rc4.1. I also tried `main2(400_000)`, which worked too, so I guess this can serve as one workaround until the proper fix lands.
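The same idea can be sketched with a `ReentrantLock` and the `lock(f, l)` do-block form, which releases the lock even if `eval` throws. This is only a variation on the semaphore workaround above under the same assumptions, not a fix for the underlying race; `EVAL_LOCK`, `work_locked`, and `main_locked` are hypothetical names.

```julia
using Base.Threads

# Assumption: a single global lock that every task takes before calling eval.
const EVAL_LOCK = ReentrantLock()

function work_locked(v)
    a = gensym()
    expr = Expr(:function,
        Expr(:tuple, a),
        Expr(:block, Expr(:call, *, a, v)))
    # Only one task at a time evaluates a new definition;
    # the do-block form unlocks on both normal and error exits.
    mul = lock(EVAL_LOCK) do
        eval(expr)
    end
    return Base.invokelatest(mul, v)
end

function main_locked(nqueries)
    @threads for i in 1:nqueries
        work_locked(i)
    end
end
```

Only the `eval` call is serialized; the `invokelatest` call still runs outside the lock, as in `work2`.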

:) Oh, I think you're missing the definition for main2?

Yeah, sorry, updated my comment.

@JeffBezanson would it be a sufficient fix (at least temporarily) to just add locks around all of the typedef functions, as I have done here?
https://github.com/JuliaLang/julia/compare/master...NHDaly:nhdaly-interpreter-typedef-lock

If that seems reasonable to you, I can open a PR for it. It doesn't seem to have much of a performance impact from what I can tell. In a (smaller) benchmark like the one in the OP above, which I can run on master without triggering the exception, performance is actually (surprisingly) unchanged when I add the locks.

If performance is a concern, we could probably tighten the locking by using a `uv_rwlock_t`, taking the write lock only in the small section that currently sets `inside_typedef = 1` and a read lock for the rest, but I was nervous about introducing deadlocks and/or bad interactions with GC. I'm interested in your thoughts! 😊 I'd love to be able to backport the fix for this to v1.3 (maybe even sneaking it into rc5?), and I'm worried that the changes you have going in #33553 look too large to backport.

Over at #33553 @JeffBezanson said

> This PR tries to clean that up, removing the special cases and hopefully eventually leading to fixing #33183.

Which I guess may have inadvertently caused GitHub to close this?

That PR does in fact fix the example in this issue --- I probably should have edited that text. There might be other issues with threads at the top level, but `cannot eval a new struct type definition while defining another type` is gone, so we can probably close this.

Awesome! Onwards and upwards. :) Thanks again!

Yeah, you're right that there may be other top-level issues. I just reran the code from the top comment, and now I hit a world-age issue instead. With a smaller number of iterations, I don't seem to hit it:

julia> for _ in 1:1000; main(5, 10); end   # This is where I called `@btime main(5,10)`

ERROR: TaskFailedException:
MethodError: no method matching (::var"#55504#55511")(::Int64)
The applicable method may be too new: running in world age 54851, while current world is 54891.
Closest candidates are:
  #55504(::Any) at REPL[1]:4 (method too new to be called from this world context.)
Stacktrace:
 [1] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [2] invokelatest at ./essentials.jl:709 [inlined]
 [3] work(::Int64, ::Int64, ::Int64) at ./REPL[1]:5
 [4] macro expansion at ./REPL[2]:5 [inlined]
 [5] (::var"#1#2"{Int64,Int64})() at ./threadingconstructs.jl:146
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:316
 [2] macro expansion at ./task.jl:335 [inlined]
 [3] main(::Int64, ::Int64) at ./REPL[2]:2
 [4] top-level scope at ./REPL[3]:1

julia> for _ in 1:10; main(5, 10); end   # This is where I called `@btime main(5,10)`

julia> for _ in 1:10; main(5, 10); end   # This is where I called `@btime main(5,10)`

julia> for _ in 1:100; main(5, 10); end   # This is where I called `@btime main(5,10)`

julia> for _ in 1:1000; main(5, 10); end   # This is where I called `@btime main(5,10)`
ERROR: TaskFailedException:
MethodError: no method matching (::var"#111296#111297")(::Int64)
The applicable method may be too new: running in world age 82258, while current world is 82298.
Closest candidates are:
  #111296(::Any) at REPL[1]:4 (method too new to be called from this world context.)
Stacktrace:
 [1] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [2] invokelatest at ./essentials.jl:709 [inlined]
 [3] work(::Int64, ::Int64, ::Int64) at ./REPL[1]:5
 [4] macro expansion at ./REPL[2]:5 [inlined]
 [5] (::var"#1#2"{Int64,Int64})() at ./threadingconstructs.jl:146
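
For readers unfamiliar with the world-age rule behind this `MethodError`, here is a minimal single-threaded sketch; `call_directly` and `call_latest` are hypothetical names. It does not reproduce the multithreaded failure above, where the error surfaces even though `invokelatest` is used. It only shows what `invokelatest` is normally expected to guarantee.

```julia
# Calling a freshly eval'ed function directly from the function that created it
# fails, because the caller is still running in the older world.
function call_directly()
    f = @eval x -> x + 1
    return f(1)                       # throws MethodError (method too new)
end

# invokelatest dispatches in the latest world, so the new method is visible.
function call_latest()
    f = @eval x -> x + 1
    return Base.invokelatest(f, 1)
end

# Capture the exception instead of letting it abort the script.
err = try
    call_directly()
    nothing
catch e
    e
end

println(err isa MethodError)
println(call_latest())
```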