When the following code is run with JULIA_NUM_THREADS=4, I occasionally (roughly one run in three) get a segfault on the second threaded loop invocation. I can't reproduce this with JULIA_NUM_THREADS=1.
MWE:
function f()
    Threads.@threads for i in 1:Threads.nthreads()
        sleep(1)
    end
end

println("First loop")
f()
println("Second loop")
f()
println("Done")
Occasional segfault message:
> JULIA_NUM_THREADS=4 julia-nightly tloop.jl
First loop
Second loop
signal (11): Segmentation fault
in expression starting at /home/jpsamaroo/tloop.jl:10
heap_remove at /workspace/srcdir/libuv/src/heap-inl.h:196
uv_timer_stop at /workspace/srcdir/libuv/src/timer.c:97
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:160
uv_run at /workspace/srcdir/libuv/src/unix/core.c:375
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:421
poptaskref at ./task.jl:568
wait at ./task.jl:575
wait at ./condition.jl:104
stream_wait at ./stream.jl:47
wait at ./asyncevent.jl:116
sleep at ./asyncevent.jl:187 [inlined]
macro expansion at /home/jpsamaroo/tloop.jl:3 [inlined]
#2#threadsfor_fun at ./threadingconstructs.jl:64
#2#threadsfor_fun at ./threadingconstructs.jl:31
unknown function (ip: 0x7faa244052ec)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2043 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2213
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1624 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:594
unknown function (ip: 0xffffffffffffffff)
Allocations: 152504 (Pool: 152439; Big: 65); GC: 0
Segmentation fault (core dumped)
EDIT: Corrected versioninfo():
julia> versioninfo()
Julia Version 1.3.0-DEV.289
Commit e39f498aca (2019-05-25 00:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_REVISE_POLL = 1
  JULIA_LOAD_PATH = @dev:@:@v#.#:@stdlib
  JULIA_PKG3_PRECOMPILE = 1
@JeffBezanson
Only output is permitted on threads or in threaded regions; no other I/O operation should be attempted.
Once the release is final, it would be good to document what is allowed in threaded sections.
To many programmers, sleep might not seem like an I/O operation (I had to look it up and think about it).
For the sake of completeness: in Julia, sleep is defined here; it constructs a Timer and calls wait on it.
While it's clear that the Timer is being opened, digging deeper than this would take some work.
Two questions:
1. If someone really needs to sleep in a thread, is there a way to do it?
2. Would making a system call to sleep or usleep (or Sleep) in a thread be safe?

We just need to fix it.
Only output is permitted on threads or in threaded regions; no other I/O operation should be attempted.
It would be nice if we could allow I/O to be called from threads, even if that I/O is actually executed by the main thread or by libuv.
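One way to approximate that today is to keep all I/O on a single task and have worker threads hand it messages over a Channel, so that only that one task ever calls into libuv. The function and message names below are made up; this is just a workaround sketch, assuming Channel operations are safe to call from multiple threads (they take an internal lock).

function threaded_work_with_io()
    ch = Channel{String}(Inf)                  # unbounded message queue
    consumer = @async foreach(println, ch)     # the only task that performs I/O
    Threads.@threads for i in 1:Threads.nthreads()
        put!(ch, "iteration $i done")          # workers only enqueue messages
    end
    close(ch)       # lets the consumer drain the remaining messages and exit
    wait(consumer)
end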
@kmsquire Yes, libc sleep() is thread-safe; the libuv event loop used by Timer and Base I/O is not.
The wait(::Timer) method protects libuv access with a lock, so it seems OK to use it in threaded blocks if you construct the Timer outside the threaded region, as shown in this related example on Discourse. I'd like to know if I've violated any rules there.
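For completeness, here is one concrete way to take the libc route discussed above, using Libc.systemsleep (Julia's wrapper around usleep/Sleep). It blocks the OS thread via the C library rather than yielding to the scheduler, so it never touches the libuv timer machinery. Treat it as an illustrative workaround, not something endorsed in this thread.

function f_blocking()
    Threads.@threads for i in 1:Threads.nthreads()
        # Blocks the OS thread in libc instead of registering a libuv Timer,
        # so no shared libuv state is touched from worker threads.
        Libc.systemsleep(1)
    end
end

The trade-off is that each OS thread is blocked for the full second, so no other tasks can run on it in the meantime.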
There is going to be a long tail of operations that are not threadsafe. The plan is to fix them all, not to document that they are broken, so in general there will be no rule to remember about which things are safe or unsafe; instead, people should just file issues when they find anything that isn't threadsafe.
I'm poking around at this problem to see if I can fix it. I have a question about the following stacktrace:
signal (11): Segmentation fault
in expression starting at /home/jpsamaroo/tloop.jl:10
heap_remove at /workspace/srcdir/libuv/src/heap-inl.h:196
uv_timer_stop at /workspace/srcdir/libuv/src/timer.c:97
uv__run_timers at /workspace/srcdir/libuv/src/timer.c:160
uv_run at /workspace/srcdir/libuv/src/unix/core.c:375
jl_task_get_next at /opt/julia-dev/src/partr.c:421
poptaskref at ./task.jl:568
wait at ./task.jl:575
wait at ./condition.jl:104
stream_wait at ./stream.jl:47
wait at ./asyncevent.jl:116
sleep at ./asyncevent.jl:187 [inlined]
macro expansion at /home/jpsamaroo/tloop.jl:3 [inlined]
It seems like the segfault occurs strictly within libuv itself (although it is invoked from jl_task_get_next, which probably shouldn't be changed to fix this). Does this imply that libuv itself needs to be modified somehow to account for multithreaded sleep? If so, does anyone have a recommendation on how I can go about accessing jl_uv_mutex from within libuv (which I assume is what I'd need to lock/unlock when fiddling with timers)?
No, I don't think this implies that libuv itself needs to be modified. My guess is that its internal data structures are getting corrupted here due to access from multiple threads. That is what jl_uv_mutex is for; we wrap calls into libuv with it. We've talked about more invasive or finer-grained locking inside libuv, but that shouldn't be strictly necessary, except possibly for performance at some point.
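To make that pattern concrete, here is a Julia-level analogue of what the runtime does in C with jl_uv_mutex: every call that touches libuv state is serialized behind a single lock. The lock name and the uv_call placeholder are hypothetical; this only illustrates the locking idea, not the actual runtime code.

const UV_LOCK = ReentrantLock()    # hypothetical stand-in for jl_uv_mutex

function locked_uv_call(uv_call, args...)
    lock(UV_LOCK) do
        uv_call(args...)   # only one thread at a time may touch libuv state
    end
end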