I'm having trouble finding the cause of a rare deadlock in my parallel quicksort. I tried my best, with a lot of help from @chriselrod, to track down the problem, but I still have no idea how to fix it.
I am reporting this here since I think there is _some_ chance that there is a bug in julia
(scheduler? compiler?). (In other words, I'm hopelessly lost now and all I can do is "blame" the upstream.)
MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Then running the following code eventually causes a deadlock when `JULIA_NUM_THREADS`
is _appropriate_ (for me it's 13; for @chriselrod it's 8 but not 7):
```julia
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
```
I've tried `git bisect` with this script and found that #32599 changes the way the problem(s) manifest:
With f5dbc471efc3334a8c532f56fb40b06427fa52c7, I observed this segmentation fault:
```
seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....
signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817
```
However, I don't understand how this function, `partition_sizes!`, can cause a segfault.
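For context, here is an illustrative sketch (a hypothetical function, not ThreadsX's actual code) of why a plain loop over a `SubArray` can segfault rather than throw: `@inbounds` elides the bounds check, so indexing compiles down to a raw load. If the view's parent array were freed or moved prematurely (e.g. by a GC bug), the loop would dereference a dangling pointer directly, which would also be consistent with `--check-bounds=yes` (which overrides `@inbounds`) behaving differently:

```julia
# Hypothetical sketch, not from ThreadsX: with @inbounds, v[i] is a raw
# memory access. If v's parent storage were freed underneath us, this
# would read freed memory and could segfault instead of erroring.
function sum_view(v::AbstractVector{Float64})
    s = 0.0
    @inbounds for i in eachindex(v)
        s += v[i]
    end
    return s
end

xs = view(fill(1.0, 10), 2:9)  # a SubArray, like the `xs` in the stacktrace
sum_view(xs)                    # fine while the parent array stays rooted
```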
@chriselrod also observed another segfault with 65b8e7ed72541d1dd5bc1f4dcabc5245e1dba718:
```
signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811
```
It is puzzling to me how the code involved in the above stacktrace can cause a segfault:

```
- at ./int.jl:52
unsafe_length at ./range.jl:517
unsafe_indices at ./abstractarray.jl:99
_indices_sub at ./subarray.jl:409
axes at ./subarray.jl:404
axes1 at ./abstractarray.jl:95
eachindex at ./abstractarray.jl:267
eachindex at ./abstractarray.jl:270
eachindex at ./abstractarray.jl:260
partition_sizes! at ThreadsX.jl/src/quicksort.jl:163
```
We (@chriselrod and I) tried to get a better error report by running the MWE with `--check-bounds=yes`, but we couldn't reproduce the hang or segfault even after running the MWE for hours.
I'd highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build an ASAN-enabled julia
with help from @vchuravy (https://github.com/JuliaLang/julia/issues/35338), hoping that it gives us some information. But I'm not sure how much I can discover by doing this, as I don't know anything about LLVM or C++.
If you can reproduce it under rr, that'll be easiest to debug.
Thanks! But https://github.com/mozilla/rr/wiki/Usage says

> _rr forces your application's threads to execute on a single core_, so your application can see an additional slowdown if it takes advantage of multicore parallelism.

So I guess it cannot be used to debug a multi-threading bug? We never observed the bug with a single thread.
I tried it anyway, but it failed to start:
```
root@459e9488a96a:/julia# rr --version
rr version 5.3.0
root@459e9488a96a:/julia# rr record ./julia -e 'using BenchmarkTools, ThreadsX, Random; seed = Ref(0); while true; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000))); end'
[FATAL /home/roc/rr/rr/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x3b)[0x5f229b]
rr(_ZN2rr15notifying_abortEv+0x47)[0x5f2327]
rr[0x52cba5]
rr[0x52dede]
rr(_ZN2rr12PerfCounters23default_ticks_semanticsEv+0x1a)[0x52e9ca]
rr(_ZN2rr7SessionC1Ev+0x139)[0x5bc369]
rr(_ZN2rr13RecordSessionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorIS6_SaIS6_EESD_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEiNS_7BindCPUES8_PKNS_9TraceUuidE+0x4d)[0x54069d]
rr(_ZN2rr13RecordSession6createERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EESB_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEhNS_7BindCPUERKS7_PKNS_9TraceUuidE+0xba9)[0x541769]
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xe14)[0x535c24]
rr(main+0x353)[0x4ac743]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f935c454b97]
rr(_start+0x29)[0x4acb59]
=== End rr backtrace
Aborted (core dumped)
```
(Edit: it looks like this is because I'm using docker? https://github.com/mozilla/rr/wiki/Docker)
Ah sorry, I missed that https://docs.julialang.org/en/v1/devdocs/debuggingtips/#Reproducing-concurrency-bugs-with-rr-1 mentions `rr record --chaos`.
> [FATAL /home/roc/rr/rr/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter

This might be due to docker; you can try using a privileged container to get access to the perf counters.
Thanks! I followed https://github.com/mozilla/rr/wiki/Docker and then just started the MWE with

```
JULIA_NUM_THREADS=13 rr record --num-cores=13 --chaos ./julia -e 'using BenchmarkTools, ThreadsX, Random; seed = Ref(0); while true; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000))); end'
```
It runs like usual (though of course slowly). Let's hope it hangs.
So I ran this for more than three hours (up to seed = 1000) and rr generated 17 GB of trace, but it didn't cause a hang or segfault.
If before it hung only after 45 minutes, you'll probably need to run it at least 10 hours with that many threads. You may also want to kick off multiple runs in parallel to maximize your chances of catching it.
The hang can happen within a few minutes if I'm lucky; it's the segfault that is super rare. I was worried that rr would generate a large amount of data and eat up all the disk space, so I limited the loop when I started it yesterday. Now that I know how much space it'll take, I'll run more of them.
With ASAN (https://github.com/JuliaLang/julia/issues/35338), running the MWE with

```
JULIA_NUM_THREADS=13 ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ./julia -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
```

now somehow gives me the segfault that @chriselrod observed (see the OP) reliably within a short amount of time (a few minutes):
```
...
seed = 13
signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
size at ./subarray.jl:71 [inlined]
length at ./abstractarray.jl:206 [inlined]
#89 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:664
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:77
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_apply at /julia/src/./julia.h:1687 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 99406804 (Pool: 0; Big: 99406804); GC: 1180
Segmentation fault
```
Though nothing specific seems to be printed by the ASAN runtime itself.
I get a different kind of segfault too:
```
...
12.312 ms (21628 allocations: 19.69 MiB)
seed = 17
signal (11): Segmentation fault
signal (11): Segmentation fault
in expression starting at none:1
in expression starting at none:1
signal (11): Segmentation fault
in expression starting at none:1
in expres11): Segmentation fault
in expression starting at none:1
in expression starting at none:1
signal (11): Segmentation fault
in expression starting at none:1
signal (11): Segmentation fault
signal (11): Segmentation fault
in expression starting at none:1
in expression starting at none:1
in expression starting at none:1
signal (11): Segmentation fault
in expression starting at none:1
signal (11): Segmentation fault
in expression starting at none:1
signal (11): Segmentation fault
in expression starting at none:1
unknown function (ip: 0x7fa637bee020)
unknown function (ip: 0x7fa637bee020)
Segmentation fault
```
Running without multi-threading (without `JULIA_NUM_THREADS=13`) now gives me:
```
seed = 6
42.666 ms (19937 allocations: 19.42 MiB)
seed = 7
=================================================================
==21519==ERROR: AddressSanitizer: heap-use-after-free on address 0x610017ba6578 at pc 0x7f4ce9302a5b bp 0x7f4caa960570 sp 0x7f4caa960568
READ of size 8 at 0x610017ba6578 thread T0
#0 0x7f4ce9302a5a (/julia/usr/bin/../lib/libjulia.so.1+0x2b3a5a)
#1 0x7f4ce930672c (/julia/usr/bin/../lib/libjulia.so.1+0x2b772c)
#2 0x7f4ce9304ab0 (/julia/usr/bin/../lib/libjulia.so.1+0x2b5ab0)
#3 0x7f4ce92f9def (/julia/usr/bin/../lib/libjulia.so.1+0x2aadef)
#4 0x7f4cbfb230dc (/memfd:julia-codegen (deleted)+0x260dc)
#5 0x7f4cbfb2c1d4 (/memfd:julia-codegen (deleted)+0x2f1d4)
#6 0x7f4cbfb3009d (/memfd:julia-codegen (deleted)+0x3309d)
#7 0x7f4cbfb32d8b (/memfd:julia-codegen (deleted)+0x35d8b)
#8 0x7f4ce92942b4 (/julia/usr/bin/../lib/libjulia.so.1+0x2452b4)
0x610017ba6578 is located 56 bytes inside of 128-byte region [0x610017ba6540,0x610017ba65c0)
freed by thread T0 here:
#0 0x4a89d0 (/julia/usr/bin/julia+0x4a89d0)
#1 0x7f4ce9310f77 (/julia/usr/bin/../lib/libjulia.so.1+0x2c1f77)
#2 0x7f4ce930e57f (/julia/usr/bin/../lib/libjulia.so.1+0x2bf57f)
#3 0x7f4ce9307cb0 (/julia/usr/bin/../lib/libjulia.so.1+0x2b8cb0)
#4 0x7f4ce9304ab0 (/julia/usr/bin/../lib/libjulia.so.1+0x2b5ab0)
#5 0x7f4ce92f9def (/julia/usr/bin/../lib/libjulia.so.1+0x2aadef)
#6 0x7f4cbfb2bf90 (/memfd:julia-codegen (deleted)+0x2ef90)
#7 0x7f4cbfb3009d (/memfd:julia-codegen (deleted)+0x3309d)
#8 0x7f4cbfb32d8b (/memfd:julia-codegen (deleted)+0x35d8b)
previously allocated by thread T0 here:
#0 0x4a993d (/julia/usr/bin/julia+0x4a993d)
#1 0x7f4ce92f9e4d (/julia/usr/bin/../lib/libjulia.so.1+0x2aae4d)
#2 0x7f4cbfb230dc (/memfd:julia-codegen (deleted)+0x260dc)
SUMMARY: AddressSanitizer: heap-use-after-free (/julia/usr/bin/../lib/libjulia.so.1+0x2b3a5a)
Shadow bytes around the buggy address:
0x0c2082f6cc50: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082f6cc60: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2082f6cc70: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082f6cc80: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2082f6cc90: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c2082f6cca0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
0x0c2082f6ccb0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082f6ccc0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2082f6ccd0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082f6cce0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2082f6ccf0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==21519==ABORTING
```
What's the best next step? I'm trying this with `rr record` ATM.
If you use a debug build and set ASAN_SYMBOLIZER_PATH to the llvm-symbolizer you've built in the process, you'll get debug info in the ASAN trace. Also see https://github.com/google/sanitizers/wiki/AddressSanitizerAndDebugger, ASAN_OPTIONS=abort_on_error=1 makes it easier to use ASAN with rr.
Thank you @maleadt!
So here is an output with `llvm-symbolizer`. It looks like ASAN is detecting some error from the GC:
```
root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
seed = 1
45.948 ms (19980 allocations: 19.43 MiB)
seed = 2
45.620 ms (20313 allocations: 19.48 MiB)
seed = 3
48.309 ms (21386 allocations: 19.66 MiB)
seed = 4
46.958 ms (21283 allocations: 19.63 MiB)
seed = 5
46.954 ms (20681 allocations: 19.54 MiB)
seed = 6
46.379 ms (19937 allocations: 19.42 MiB)
seed = 7
47.260 ms (20512 allocations: 19.52 MiB)
seed = 8
=================================================================
==23720==ERROR: AddressSanitizer: heap-use-after-free on address 0x610000c1ee78 at pc 0x7fd6fa45ff42 bp 0x7fd6c73219e0 sp 0x7fd6c73219d8
READ of size 8 at 0x610000c1ee78 thread T0
#0 0x7fd6fa45ff41 in gc_try_setmark /julia/src/gc.c:1642:24
#1 0x7fd6fa45f218 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
#2 0x7fd6fa457224 in gc_mark_loop /julia/src/gc.c:2117:9
#3 0x7fd6fa462bea in _jl_gc_collect /julia/src/gc.c:2903:5
#4 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
#5 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
#6 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
#7 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
#8 0x7fd6d08185b3 (/memfd:julia-codegen (deleted)+0x325b3)
#9 0x7fd6d0818f32 (/memfd:julia-codegen (deleted)+0x32f32)
#10 0x7fd6d081ed89 (/memfd:julia-codegen (deleted)+0x38d89)
#11 0x7fd6fa32f342 in jl_fptr_args /julia/src/gf.c:2009:12
#12 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
#13 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
#14 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
#15 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19
0x610000c1ee78 is located 56 bytes inside of 128-byte region [0x610000c1ee40,0x610000c1eec0)
freed by thread T0 here:
#0 0x4a8a20 in free /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:123
#1 0x7fd6fa46c181 in jl_free_aligned /julia/src/gc.c:255:5
#2 0x7fd6fa46c652 in sweep_big_list /julia/src/gc.c:942:13
#3 0x7fd6fa46bc49 in sweep_big /julia/src/gc.c:954:9
#4 0x7fd6fa46aa7c in gc_sweep_other /julia/src/gc.c:1407:5
#5 0x7fd6fa46341c in _jl_gc_collect /julia/src/gc.c:3013:5
#6 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
#7 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
#8 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
#9 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#10 0x7fd6d0818311 (/memfd:julia-codegen (deleted)+0x32311)
#11 0x7fd6d0818f32 (/memfd:julia-codegen (deleted)+0x32f32)
#12 0x7fd6d081ed89 (/memfd:julia-codegen (deleted)+0x38d89)
#13 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
#14 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
#15 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
#16 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19
previously allocated by thread T0 here:
#0 0x4a998d in posix_memalign /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:226
#1 0x7fd6fa45423a in jl_malloc_aligned /julia/src/gc.c:235:9
#2 0x7fd6fa453e6d in jl_gc_big_alloc /julia/src/gc.c:891:30
#3 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#4 0x7fd6d080ed92 (/memfd:julia-codegen (deleted)+0x28d92)
SUMMARY: AddressSanitizer: heap-use-after-free /julia/src/gc.c:1642:24 in gc_try_setmark
Shadow bytes around the buggy address:
0x0c208017bd70: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c208017bd80: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c208017bd90: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c208017bda0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c208017bdb0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
=>0x0c208017bdc0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
0x0c208017bdd0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c208017bde0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c208017bdf0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c208017be00: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c208017be10: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==23720==ABORTING
```
Here is a backtrace in `gdb`:
```
root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer gdb --args usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from usr/bin/julia-debug...done.
(gdb) r
Starting program: /julia/usr/bin/julia-debug -e using\ BenchmarkTools,\ ThreadsX,\ Random\;\ for\ seed\ in\ 1:5000\;\ @btime\ ThreadsX.sort\(\$\(rand\(MersenneTwister\(@show\(seed\)\),\ 0:0.01:1,\ 1_000_000\)\)\)\;\ end
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffef933700 (LWP 23800)]
[New Thread 0x7fffe247e700 (LWP 23801)]
[New Thread 0x7fffe1c67700 (LWP 23802)]
[New Thread 0x7fffdf450700 (LWP 23803)]
[New Thread 0x7fffdcc39700 (LWP 23804)]
[New Thread 0x7fffda422700 (LWP 23805)]
[New Thread 0x7fffd7c0b700 (LWP 23806)]
[New Thread 0x7fffd53f4700 (LWP 23807)]
seed = 1
44.897 ms (19980 allocations: 19.43 MiB)
seed = 2
44.839 ms (20313 allocations: 19.48 MiB)
seed = 3
46.575 ms (21386 allocations: 19.66 MiB)
seed = 4
46.976 ms (21283 allocations: 19.63 MiB)
seed = 5
47.982 ms (20681 allocations: 19.54 MiB)
seed = 6
Thread 1 "julia-debug" received signal SIGSEGV, Segmentation fault.
julia_partition_sizes!_569 (xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>, cs=..., pivot=4581421828931458171, order=...) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
163 @inbounds for i in eachindex(xs, cs)
(gdb) bt
#0 julia_partition_sizes!_569 (xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>, cs=..., pivot=4581421828931458171, order=...) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
#1 0x00007fffcd643789 in #97 () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158
#2 julia_#3_567 () at threadingconstructs.jl:126
#3 0x00007fffcd6437e7 in jfptr_#3_568 ()
#4 0x00007ffff71728a6 in _jl_invoke (F=0x61001445a980, args=0x617000297510, nargs=0, mfunc=0x610000666280, world=27564) at gf.c:2230
#5 0x00007ffff71744bb in jl_apply_generic (F=0x61001445a980, args=0x617000297510, nargs=0) at gf.c:2414
#6 0x00007ffff71d679e in jl_apply (args=0x617000297508, nargs=1) at ./julia.h:1687
#7 0x00007ffff71db8fe in start_task () at task.c:687
#8 0x0000000000000000 in ?? ()
```
Does `xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>` mean that the object `xs` is somehow GC'ed prematurely?
> ASAN_OPTIONS=abort_on_error=1 makes it easier to use ASAN with rr.
Thanks for the tips! I tried `rr`, but it didn't even hit the first `@show` statement after one hour, so I terminated the process.
It looks like I was very lucky in the first runs. I often get a segfault rather than the abort from ASAN:
```
...
seed = 8
46.673 ms (19959 allocations: 19.42 MiB)
seed = 9
signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
size at ./subarray.jl:71 [inlined]
length at ./abstractarray.jl:206 [inlined]
#89 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:664
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:77
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_fptr_args at /julia/src/gf.c:2009
_jl_invoke at /julia/src/gf.c:2230
jl_apply_generic at /julia/src/gf.c:2414
jl_apply at /julia/src/./julia.h:1687
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 44257146 (Pool: 0; Big: 44257146); GC: 643
Segmentation fault
```
> Does `xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>` mean somehow the object `xs` is GC'ed prematurely?
Potentially; I'd look to see if you're maybe accessing it somewhere without rooting it. That said, an rr trace is still the best way to debug these things, so I'd just kick it off in the background and let it run even if it takes a day or two to crash.
> I'd look to see if you're maybe accessing it somewhere without rooting it.
I think the only "unsafe" thing I'm doing is `@inbounds` (e.g., there is no `unsafe_wrap`). But I'll double-check the code.
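For anyone following along, "rooting" here means keeping an object visibly alive so the GC won't collect it while GC-invisible accesses (raw pointers, etc.) are still in flight. A minimal sketch of the pattern using Julia's `GC.@preserve` (the function and names below are illustrative, not from ThreadsX):

```julia
# Illustrative sketch: without GC.@preserve, nothing visible to the GC
# necessarily keeps `xs` alive once we only touch it through the raw
# pointer `p`, so the array could in principle be collected mid-loop.
function checksum(xs::Vector{UInt8})
    p = pointer(xs)
    GC.@preserve xs begin   # roots xs for the duration of the block
        s = UInt(0)
        for i in 1:length(xs)
            s += unsafe_load(p, i)
        end
        s
    end
end

checksum(UInt8[1, 2, 3])  # UInt(6)
```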
I also started `rr` again. I'll also see if I can speed it up somehow (it's suspicious that it doesn't even hit the first `@show`).
I can try running it myself to see if there's some sort of pathological rr performance problem, but I suspect it's probably just the high thread count.
Thanks! Actually, with ASAN I can reproduce the problem with a single thread. Maybe rr is slow with ASAN? Currently I'm using:

```
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1:abort_on_error=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer rr record usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
```
Without ASAN, `rr record ./julia ...` was printing the `@show` outputs.
It's possible that there's a bad interaction with rr. Let me take a look.
Thanks! FYI, I needed a bit of manual intervention to build current julia with ASAN:
https://github.com/JuliaLang/julia/issues/35338#issuecomment-607557859
https://github.com/JuliaLang/julia/issues/35338#issuecomment-607593280
The incompatibility between asan/LLVM/rr is resolved by https://reviews.llvm.org/D70581. You can grab LLVM 10 from Yggdrasil and build with that to get an appropriate version of compiler-rt that has this fix.
Also, if you apply https://github.com/mozilla/rr/pull/2488 to your rr installation, you'll get significantly lower overhead on this workload (2x rather than 4x). Unfortunately, I still can't reproduce the crash here (I saw it once, but didn't have rr attached).
> You can grab LLVM 10 from Yggdrasil and build with that to get an appropriate version of compiler-rt that has this fix.
I installed `LLVM_full_jll` via `] add LLVM_full_jll@10` and then ran

```
tooldir=/root/.julia/artifacts/8cd56e3e81e5770720c31304026f3b23f129153e/tools
make clean
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 make CC=$tooldir/clang CXX=$tooldir/clang LLVM_CONFIG=$tooldir/llvm-config USECLANG=1 SANITIZE=1
```
but it failed with the following error:
```
...
LINK usr/bin/julia
clang-10: warning: argument unused during compilation: '-mllvm -asan-stack=0' [-Wunused-command-line-argument]
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Intrinsic::getIntrinsicInfoTableEntries(unsigned int, llvm::SmallVectorImpl<llvm::Intrinsic::IITDescriptor>&)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyAsmParser'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateMemCpy(llvm::Value*, llvm::MaybeAlign, llvm::Value*, llvm::MaybeAlign, llvm::Value*, bool, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::LegacyJITSymbolResolver::lookup(std::set<llvm::StringRef, std::less<llvm::StringRef>, std::allocator<llvm::StringRef> > const&, llvm::unique_function<void (llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >)>)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::ExceptionBehaviorToStr(llvm::fp::ExceptionBehavior)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Intrinsic::getDeclaration(llvm::Module*, unsigned int, llvm::ArrayRef<llvm::Type*>)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::AttrBuilder::addAlignmentAttr(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::GlobalObject::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Attribute::getWithAlignment(llvm::LLVMContext&, llvm::Align)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateIntrinsic(unsigned int, llvm::ArrayRef<llvm::Type*>, llvm::ArrayRef<llvm::Value*>, llvm::Instruction*, llvm::Twine const&)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::DIBuilder::createTypedef(llvm::DIType*, llvm::StringRef, llvm::DIFile*, unsigned int, llvm::DIScope*, unsigned int)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTargetMC'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyAsmPrinter'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::MachineModuleInfoWrapperPass::MachineModuleInfoWrapperPass(llvm::LLVMTargetMachine const*)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::MCContext::MCContext(llvm::MCAsmInfo const*, llvm::MCRegisterInfo const*, llvm::MCObjectFileInfo const*, llvm::SourceMgr const*, llvm::MCTargetOptions const*, bool)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::LoadInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Module::global_objects()'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::ConstantExpr::getAsInstruction() const'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::StoreInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::AllocaInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTargetInfo'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::RoundingModeToStr(llvm::fp::RoundingMode)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateMemSet(llvm::Value*, llvm::Value*, llvm::Value*, llvm::MaybeAlign, bool, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTarget'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyDisassembler'
clang-10: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:86: recipe for target '/julia/usr/bin/julia' failed
make[1]: *** [/julia/usr/bin/julia] Error 1
Makefile:78: recipe for target 'julia-ui-release' failed
make: *** [julia-ui-release] Error 2
```
Is this a correct way to use LLVM from Yggdrasil?
I also tried using

```
cd $tooldir
ln -s clang-10 clang++
cd -
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 PATH="$tooldir:$PATH" make USECLANG=1 SANITIZE=1
```

but I got the same error.
`... PATH="$tooldir:$PATH" make ...` worked after `make clean`.
I'm trying rr+ASAN with rr master and LLVM 10. I'm now seeing the `@show` outputs.
I can get the errors under `rr` (with `--chaos`; maybe it helps), but `julia` does not exit on error. Forcefully terminating `rr` seems to leave the trace in an incomplete state (it leaves `~/.local/share/rr/julia-*/incomplete`).
I tried both with and without `abort_on_error=1` in `ASAN_OPTIONS`, but it didn't change anything. I tried `kill`ing `julia` with several signals (TERM, HUP, INT, QUIT, PIPE), but it didn't help.
You can sigkill Julia or sigterm rr
Thanks. SIGKILL works, but it prints some "fatal" messages from `rr`:
```
root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer /obj/bin/rr record --chaos ./julia -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:100; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
rr: Saving execution to trace directory `/root/.local/share/rr/julia-11'.
seed = 1
63.287 ms (19980 allocations: 19.43 MiB)
....
signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x6100130ec97f)
unknown function (ip: (nil))
Allocations: 31615155 (Pool: 0; Big: 31615155); GC: 444
# SIGKILL provided here
[FATAL /rr/src/Task.cc:216:~Task() errno: ECHILD]
(task 7440 (rec:7440) at time 135429)
-> Assertion `syscallbuf_child.is_null()' failed to hold.
Tail of trace dump:
{
real_time:13755396.958021 global_time:135409, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7440, ticks:46502025650
rax:0x26 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x26 rsi:0x67cc8320 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x640000839e1c r10:0x0 r11:0x246 r12:0x26 r13:0xb r14:0x67965900 r15:0x54dd707a46a0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x54dd707ac280 gs_base:0x0
}
...
=== Start rr backtrace:
/obj/bin/rr(_ZN2rr13dump_rr_stackEv+0x44)[0x557c329a0b1e]
/obj/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x1a2)[0x557c3281b1b6]
/obj/bin/rr(+0x31b30e)[0x557c3284130e]
/obj/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x62)[0x557c32841568]
/obj/bin/rr(_ZN2rr4TaskD1Ev+0xca)[0x557c32963eb8]
/obj/bin/rr(_ZN2rr10RecordTaskD1Ev+0x3a9)[0x557c328e24cd]
/obj/bin/rr(_ZN2rr10RecordTaskD0Ev+0x18)[0x557c328e2500]
/obj/bin/rr(+0x356b44)[0x557c3287cb44]
/obj/bin/rr(_ZN2rr13RecordSession11record_stepEv+0x3ae)[0x557c328864e4]
/obj/bin/rr(+0x352ef0)[0x557c32878ef0]
/obj/bin/rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x2d5)[0x557c328799bf]
/obj/bin/rr(main+0x21b)[0x557c329b9dde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7ff32196cb97]
/obj/bin/rr(_start+0x2a)[0x557c3278938a]
=== End rr backtrace
Launch gdb with
gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:7440' /julia/usr/bin/julia
^C
root@919704e44cd5:/julia# ps
PID TTY TIME CMD
7205 pts/4 00:00:00 bash
7512 pts/4 00:00:00 ps
```
Though it seems `rr replay` actually works (I can do `continue` and then `bt` in it), so maybe it was OK?
I'm trying the same thing with `julia-debug` now. Once it's done, I guess I can `rr pack` the trace and upload it somewhere so that you can debug this on your machine?
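For reference, the pack-and-upload step could be sketched roughly as below. The trace directory name here is hypothetical; substitute whatever `rr record` actually printed (e.g. the `/root/.local/share/rr/julia-debug-5` directory mentioned further down):

```shell
#!/bin/sh
# Hypothetical trace location; use the directory `rr record` reported.
trace_dir="${1:-$HOME/.local/share/rr/julia-debug-5}"

if command -v rr >/dev/null 2>&1 && [ -d "$trace_dir" ]; then
    # `rr pack` copies the mmapped files (binaries, shared libraries)
    # into the trace so it can be replayed on another machine.
    rr pack "$trace_dir"
    # Archive for upload; the recipient unpacks and runs `rr replay <dir>`.
    tar -C "$(dirname "$trace_dir")" -czf julia-debug-trace.tar.gz "$(basename "$trace_dir")"
    echo "packed $trace_dir"
else
    echo "nothing to pack (need rr and an existing trace dir)"
fi
```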
Here is an `rr` trace that exits via an error from ASAN:
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$tooldir/llvm-symbolizer /obj/bin/rr record --chaos usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:100; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
Original output
seed = 1
63.822 ms (19980 allocations: 19.43 MiB)
seed = 2
92.539 ms (20313 allocations: 19.48 MiB)
seed = 3
65.948 ms (21386 allocations: 19.66 MiB)
seed = 4
63.588 ms (21283 allocations: 19.63 MiB)
seed = 5
66.618 ms (20681 allocations: 19.54 MiB)
seed = 6
65.059 ms (19937 allocations: 19.42 MiB)
seed = 7
67.628 ms (20512 allocations: 19.52 MiB)
seed = 8
66.372 ms (19959 allocations: 19.42 MiB)
seed = 9
66.207 ms (20755 allocations: 19.55 MiB)
seed = 10
65.702 ms (20001 allocations: 19.43 MiB)
seed = 11
68.869 ms (21644 allocations: 19.69 MiB)
seed = 12
65.738 ms (19931 allocations: 19.43 MiB)
seed = 13
67.790 ms (20943 allocations: 19.59 MiB)
seed = 14
64.881 ms (19992 allocations: 19.43 MiB)
seed = 15
71.427 ms (20245 allocations: 19.48 MiB)
seed = 16
=================================================================
==9500==ERROR: AddressSanitizer: heap-use-after-free on address 0x610017fbfc78 at pc 0x7ffdfd7be312 bp 0x3c9352b6c9e0 sp 0x3c9352b6c9d8
READ of size 8 at 0x610017fbfc78 thread T0
#0 0x7ffdfd7be311 in gc_try_setmark /julia/src/gc.c:1642:24
#1 0x7ffdfd7bd5e8 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
#2 0x7ffdfd7b55f4 in gc_mark_loop /julia/src/gc.c:2117:9
#3 0x7ffdfd7c0fba in _jl_gc_collect /julia/src/gc.c:2903:5
#4 0x7ffdfd7c0545 in jl_gc_collect /julia/src/gc.c:3109:13
#5 0x7ffdfd7b2526 in maybe_collect /julia/src/gc.c:827:9
#6 0x7ffdfd7b2178 in jl_gc_big_alloc /julia/src/gc.c:883:5
#7 0x7ffdfd7b341d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
#8 0x24d0400ebf06 (/memfd:julia-codegen (deleted)+0x31f06)
#9 0x24d0400ed082 (/memfd:julia-codegen (deleted)+0x33082)
#10 0x24d0400f2ebd (/memfd:julia-codegen (deleted)+0x38ebd)
#11 0x7ffdfd68dd32 in jl_fptr_args /julia/src/gf.c:2009:12
#12 0x7ffdfd6a4295 in _jl_invoke /julia/src/gf.c:2230:31
#13 0x7ffdfd6a5eaa in jl_apply_generic /julia/src/gf.c:2414:12
#14 0x7ffdfd70810d in jl_apply /julia/src/./julia.h:1687:12
#15 0x7ffdfd70d26d in start_task /julia/src/task.c:687:19
0x610017fbfc78 is located 56 bytes inside of 128-byte region [0x610017fbfc40,0x610017fbfcc0)
freed by thread T0 here:
#0 0x4a9fb0 in free /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:123
#1 0x7ffdfd7ca501 in jl_free_aligned /julia/src/gc.c:255:5
#2 0x7ffdfd7ca9d2 in sweep_big_list /julia/src/gc.c:942:13
#3 0x7ffdfd7c9fc9 in sweep_big /julia/src/gc.c:954:9
#4 0x7ffdfd7c8dfc in gc_sweep_other /julia/src/gc.c:1407:5
#5 0x7ffdfd7c17ec in _jl_gc_collect /julia/src/gc.c:3013:5
#6 0x7ffdfd7c0545 in jl_gc_collect /julia/src/gc.c:3109:13
#7 0x7ffdfd7b2526 in maybe_collect /julia/src/gc.c:827:9
#8 0x7ffdfd7b2178 in jl_gc_big_alloc /julia/src/gc.c:883:5
#9 0x7ffdfd7b341d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#10 0x24d0400ebea1 (/memfd:julia-codegen (deleted)+0x31ea1)
#11 0x24d0400ed082 (/memfd:julia-codegen (deleted)+0x33082)
#12 0x24d0400f2ebd (/memfd:julia-codegen (deleted)+0x38ebd)
#13 0x7ffdfd6a4295 in _jl_invoke /julia/src/gf.c:2230:31
#14 0x7ffdfd6a5eaa in jl_apply_generic /julia/src/gf.c:2414:12
#15 0x7ffdfd70810d in jl_apply /julia/src/./julia.h:1687:12
#16 0x7ffdfd70d26d in start_task /julia/src/task.c:687:19
previously allocated by thread T0 here:
#0 0x4aaf1d in posix_memalign /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:226
#1 0x7ffdfd7b260a in jl_malloc_aligned /julia/src/gc.c:235:9
#2 0x7ffdfd7b223d in jl_gc_big_alloc /julia/src/gc.c:891:30
#3 0x7ffdfd7b341d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#4 0x24d0400e3312 (/memfd:julia-codegen (deleted)+0x29312)
SUMMARY: AddressSanitizer: heap-use-after-free /julia/src/gc.c:1642:24 in gc_try_setmark
Shadow bytes around the buggy address:
0x0c2082feff30: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c2082feff40: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c2082feff50: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c2082feff60: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c2082feff70: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
=>0x0c2082feff80: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
0x0c2082feff90: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082feffa0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2082feffb0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2082feffc0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c2082feffd0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==9500==ABORTING
Here is a trace with `abort_on_error=1`. It's maybe easier to look at, as you can `continue` + `backtrace`. I needed to SIGKILL the `julia-debug` process after it printed the stacktrace.
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1:abort_on_error=1 ASAN_SYMBOLIZER_PATH=$tooldir/llvm-symbolizer /obj/bin/rr record --chaos usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:20; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
Original output
seed = 1
63.931 ms (19980 allocations: 19.43 MiB)
seed = 2
65.639 ms (20313 allocations: 19.48 MiB)
seed = 3
91.164 ms (21386 allocations: 19.66 MiB)
seed = 4
61.937 ms (21283 allocations: 19.63 MiB)
seed = 5
66.391 ms (20681 allocations: 19.54 MiB)
seed = 6
63.516 ms (19937 allocations: 19.42 MiB)
seed = 7
63.724 ms (20512 allocations: 19.52 MiB)
seed = 8
63.623 ms (19959 allocations: 19.42 MiB)
seed = 9
63.800 ms (20755 allocations: 19.55 MiB)
seed = 10
63.342 ms (20001 allocations: 19.43 MiB)
seed = 11
92.616 ms (21644 allocations: 19.69 MiB)
seed = 12
=================================================================
==10918==ERROR: AddressSanitizer: heap-use-after-free on address 0x61000bc09e78 at pc 0x7fb56b9e0312 bp 0x6576803a49e0 sp 0x6576803a49d8
READ of size 8 at 0x61000bc09e78 thread T0
#0 0x7fb56b9e0311 in gc_try_setmark /julia/src/gc.c:1642:24
#1 0x7fb56b9df5e8 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
#2 0x7fb56b9d75f4 in gc_mark_loop /julia/src/gc.c:2117:9
#3 0x7fb56b9e2fba in _jl_gc_collect /julia/src/gc.c:2903:5
#4 0x7fb56b9e2545 in jl_gc_collect /julia/src/gc.c:3109:13
#5 0x7fb56b9d4526 in maybe_collect /julia/src/gc.c:827:9
#6 0x7fb56b9d4178 in jl_gc_big_alloc /julia/src/gc.c:883:5
#7 0x7fb56b9d541d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
#8 0x44b532265f37 (/memfd:julia-codegen (deleted)+0x31f37)
#9 0x44b53226a152 (/memfd:julia-codegen (deleted)+0x36152)
#10 0x44b53226ceed (/memfd:julia-codegen (deleted)+0x38eed)
#11 0x7fb56b8afd32 in jl_fptr_args /julia/src/gf.c:2009:12
#12 0x7fb56b8c6295 in _jl_invoke /julia/src/gf.c:2230:31
#13 0x7fb56b8c7eaa in jl_apply_generic /julia/src/gf.c:2414:12
#14 0x7fb56b92a10d in jl_apply /julia/src/./julia.h:1687:12
#15 0x7fb56b92f26d in start_task /julia/src/task.c:687:19
0x61000bc09e78 is located 56 bytes inside of 128-byte region [0x61000bc09e40,0x61000bc09ec0)
freed by thread T0 here:
#0 0x4a9fb0 in free /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:123
#1 0x7fb56b9ec501 in jl_free_aligned /julia/src/gc.c:255:5
#2 0x7fb56b9ec9d2 in sweep_big_list /julia/src/gc.c:942:13
#3 0x7fb56b9ebfc9 in sweep_big /julia/src/gc.c:954:9
#4 0x7fb56b9eadfc in gc_sweep_other /julia/src/gc.c:1407:5
#5 0x7fb56b9e37ec in _jl_gc_collect /julia/src/gc.c:3013:5
#6 0x7fb56b9e2545 in jl_gc_collect /julia/src/gc.c:3109:13
#7 0x7fb56b9d4526 in maybe_collect /julia/src/gc.c:827:9
#8 0x7fb56b9d4178 in jl_gc_big_alloc /julia/src/gc.c:883:5
#9 0x7fb56b9d541d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#10 0x44b532265ea1 (/memfd:julia-codegen (deleted)+0x31ea1)
#11 0x44b53226a152 (/memfd:julia-codegen (deleted)+0x36152)
#12 0x44b53226ceed (/memfd:julia-codegen (deleted)+0x38eed)
#13 0x7fb56b8c6295 in _jl_invoke /julia/src/gf.c:2230:31
#14 0x7fb56b8c7eaa in jl_apply_generic /julia/src/gf.c:2414:12
#15 0x7fb56b92a10d in jl_apply /julia/src/./julia.h:1687:12
#16 0x7fb56b92f26d in start_task /julia/src/task.c:687:19
previously allocated by thread T0 here:
#0 0x4aaf1d in posix_memalign /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:226
#1 0x7fb56b9d460a in jl_malloc_aligned /julia/src/gc.c:235:9
#2 0x7fb56b9d423d in jl_gc_big_alloc /julia/src/gc.c:891:30
#3 0x7fb56b9d541d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#4 0x44b53225ced2 (/memfd:julia-codegen (deleted)+0x28ed2)
SUMMARY: AddressSanitizer: heap-use-after-free /julia/src/gc.c:1642:24 in gc_try_setmark
Shadow bytes around the buggy address:
0x0c2081779370: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c2081779380: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c2081779390: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c20817793a0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c20817793b0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
=>0x0c20817793c0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
0x0c20817793d0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c20817793e0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c20817793f0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c2081779400: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c2081779410: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==10918==ABORTING
signal (6): Aborted
in expression starting at none:1
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
Abort at /workspace/srcdir/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp:155
Die at /workspace/srcdir/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_termination.cpp:58
~ScopedInErrorReport at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_report.cpp:186 [inlined]
ReportGenericError at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_report.cpp:470
__asan_report_load8 at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_rtl.cpp:120
gc_try_setmark at /julia/src/gc.c:1642
gc_mark_scan_obj8 at /julia/src/gc.c:1836
gc_mark_loop at /julia/src/gc.c:2117
_jl_gc_collect at /julia/src/gc.c:2903
jl_gc_collect at /julia/src/gc.c:3109
maybe_collect at /julia/src/gc.c:827
jl_gc_big_alloc at /julia/src/gc.c:883
jl_gc_pool_alloc at /julia/src/gc.c:1140
InvasiveLinkedList at ./linked_list.jl:7 [inlined]
GenericCondition at ./condition.jl:67 [inlined]
Task at ./task.jl:5 [inlined]
Task at ./task.jl:5 [inlined]
macro expansion at ./threadingconstructs.jl:130 [inlined]
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:50 [inlined]
macro expansion at ./task.jl:334 [inlined]
maptasks at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_fptr_args at /julia/src/gf.c:2009
_jl_invoke at /julia/src/gf.c:2230
jl_apply_generic at /julia/src/gf.c:2414
jl_apply at /julia/src/./julia.h:1687
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 38364318 (Pool: 0; Big: 38364318); GC: 516
(`kill -9` on the `julia-debug` process here)
[FATAL /rr/src/Task.cc:216:~Task() errno: ECHILD]
(task 10918 (rec:10918) at time 153937)
-> Assertion `syscallbuf_child.is_null()' failed to hold.
Tail of trace dump:
{ real_time:13762576.188770 global_time:153917, event:`EXIT' tid:10922, ticks:1447 { tid:10922, addr:0x794b2e3ce9d0, length:0x4 }
}
{
real_time:13762576.188967 global_time:153918, event:`EXIT' tid:10921, ticks:1419
{ tid:10921, addr:0x22300a59d9d0, length:0x4 }
}
{ real_time:13762576.189084 global_time:153919, event:`EXIT' tid:10920, ticks:2999
{ tid:10920, addr:0x4b0b7a8ce9d0, length:0x4 } } {
real_time:13762576.189195 global_time:153920, event:`SYSCALL: read' (state:EXITING_SYSCALL) tid:10946, ticks:43197650
rax:0xfffffffffffffe00 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1000 rsi:0x78cc20 rdi:0x0 rbp:0x0 rsp:0x681ffe10 r8:0x1 r
9:0x357b35c628d0 r10:0x7f4c7d812480 r11:0x246 r12:0x357b35c5c760 r13:0x357b35c5d2a0 r14:0x357b35c60a00 r15:0x3ff rip:0x70000
002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x0 fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.189338 global_time:153921, event:`SIGNAL: SIGWINCH(async)' tid:10946, ticks:43197650
rax:0xfffffffffffffe00 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1000 rsi:0x78cc20 rdi:0x0 rbp:0x0 rsp:0x681ffe10 r8:0x1 r
9:0x357b35c628d0 r10:0x7f4c7d812480 r11:0x246 r12:0x357b35c5c760 r13:0x357b35c5d2a0 r14:0x357b35c60a00 r15:0x3ff rip:0x70000
002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x0 fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.189626 global_time:153922, event:`SIGNAL_DELIVERY: SIGWINCH(async)' tid:10946, ticks:43197650
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1000 rsi:0x78cc20 rdi:0x0 rbp:0x0 rsp:0x681ffe10 r8:0x1 r9:0x357b35c628d
0 r10:0x7f4c7d812480 r11:0x246 r12:0x357b35c5c760 r13:0x357b35c5d2a0 r14:0x357b35c60a00 r15:0x3ff rip:0x70000000 eflags:0x24
6 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x0 fs_base:0x7f4c7d812480 gs_base:0x0
{ tid:10946, addr:0x681ffe10, length:(nil) }
} {
real_time:13762576.189808 global_time:153923, event:`SYSCALL: restart_syscall' (state:ENTERING_SYSCALL) tid:10946, ticks:4
3197650
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1000 rsi:0x78cc20 rdi:0x0 rbp:0x0 rsp:0x681ffe10 r8:0x1 r
9:0x357b35c628d0 r10:0x7f4c7d812480 r11:0x246 r12:0x357b35c5c760 r13:0x357b35c5d2a0 r14:0x357b35c60a00 r15:0x3ff rip:0x70000
002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x0 fs_base:0x7f4c7d812480 gs_base:0x0
}
{ [49/1819]
real_time:13762576.189924 global_time:153924, event:`EXIT' tid:10919, ticks:2422
{ tid:10919, addr:0x7fb56d1619d0, length:0x4 }
}
{
real_time:13762576.327127 global_time:153925, event:`SYSCALL: restart_syscall' (state:EXITING_SYSCALL) tid:10946, ticks:43
197650
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1000 rsi:0x78cc20 rdi:0x0 rbp:0x0 rsp:0x681ffe10 r8:0x1 r9:0x357b35c628d
0 r10:0x7f4c7d812480 r11:0x246 r12:0x357b35c5c760 r13:0x357b35c5d2a0 r14:0x357b35c60a00 r15:0x3ff rip:0x70000002 eflags:0x24
6 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x0 fs_base:0x7f4c7d812480 gs_base:0x0
{ tid:10946, addr:0x78cc20, length:(nil) }
}
{
real_time:13762576.327828 global_time:153926, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:10946, ticks:43198908
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x749010 rsi:0x926bb8 rdi:0x2ef317b59000 rbp:0xb rsp:0x681f
fe10 r8:0x1 r9:0x357b35c628d0 r10:0xb r11:0x246 r12:0x78dd40 r13:0x7ffed1f51820 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 e
flags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.328012 global_time:153927, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:10946, ticks:43198908
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x749010 rsi:0x926bb8 rdi:0x2ef317b59000 rbp:0xb rsp:0x681ffe10 r8:0x1 r9:
0x357b35c628d0 r10:0xb r11:0x246 r12:0x78dd40 r13:0x7ffed1f51820 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 eflags:0x246 cs:
0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.328331 global_time:153928, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:10946, ticks:43198995
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7b8670 rsi:0x1e4b9f0 rdi:0x2aca4765e000 rbp:0xb rsp:0x681
ffe10 r8:0x1 r9:0x357b35c628d0 r10:0xb r11:0x246 r12:0x0 r13:0x7ffed1f51820 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 eflag
s:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.328642 global_time:153929, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:10946, ticks:43198995
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7b8670 rsi:0x1e4b9f0 rdi:0x2aca4765e000 rbp:0xb rsp:0x681ffe10 r8:0x1 r9
:0x357b35c628d0 r10:0xb r11:0x246 r12:0x0 r13:0x7ffed1f51820 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 eflags:0x246 cs:0x33
ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.329662 global_time:153930, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:10946, ticks:43342828
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x6ffff000 rsi:0x101000 rdi:0x6feff000 rbp:0xb rsp:0x681ffe
10 r8:0x31 r9:0xbdb800 r10:0x357b35c60ca0 r11:0x246 r12:0x6feff010 r13:0x6feff000 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002
eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.329774 global_time:153931, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:10946, ticks:43342828
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x6ffff000 rsi:0x101000 rdi:0x6feff000 rbp:0xb rsp:0x681ffe10 r8:0x31 r9:0
xbdb800 r10:0x357b35c60ca0 r11:0x246 r12:0x6feff010 r13:0x6feff000 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 eflags:0x246 c
s:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.330122 global_time:153932, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:10946, ticks:43343837
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x26ca49fc5000 rsi:0x181000 rdi:0x26ca49f45000 rbp:0xb rsp:
0x681ffe10 r8:0x80 r9:0x80 r10:0x357b35c60ca0 r11:0x246 r12:0x26ca49f45010 r13:0x26ca49f45000 r14:0x7ffed1f51660 r15:0x3a ri
p:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.330258 global_time:153933, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:10946, ticks:43343837
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x26ca49fc5000 rsi:0x181000 rdi:0x26ca49f45000 rbp:0xb rsp:0x681ffe10 r8:0
x80 r9:0x80 r10:0x357b35c60ca0 r11:0x246 r12:0x26ca49f45010 r13:0x26ca49f45000 r14:0x7ffed1f51660 r15:0x3a rip:0x70000002 ef
lags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f4c7d812480 gs_base:0x0
}
{
real_time:13762576.331327 global_time:153934, event:`PATCH_SYSCALL' tid:10946, ticks:43517084
rax:0xe7 rbx:0x357b35c5c740 rcx:0xffffffffffffffff rdx:0x0 rsi:0x3c rdi:0x0 rbp:0x0 rsp:0x7ffed1f51cf8 r8:0xe7 r9:0xffffffff
ffffff50 r10:0x461a3c8f02a0 r11:0x246 r12:0x357b35c5c740 r13:0x478 r14:0x357b35c65628 r15:0x0 rip:0x357b35959e04 eflags:0x24
6 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f4c7d812480 gs_base:0x0
{ tid:10946, addr:0x357b35c671da, length:0x4f }
{ tid:10946, addr:0x357b35959e04, length:0x5 }
{ tid:10946, addr:0x357b35959e09, length:0x3 }
}
{
real_time:13762576.332061 global_time:153935, event:`SYSCALL: exit_group' (state:ENTERING_SYSCALL) tid:10946, ticks:435170
90
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x0 rsi:0x3c rdi:0x0 rbp:0xe7 rsp:0x681ffe10 r8:0xe7 r9:0xf
fffffffffffff50 r10:0x461a3c8f02a0 r11:0x246 r12:0x357b35c5c740 r13:0x478 r14:0x357b35c65628 r15:0x0 rip:0x70000002 eflags:0
x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe7 fs_base:0x7f4c7d812480 gs_base:0x0
{ tid:10946, addr:0x21ce535bb12c, length:0x1 }
{ tid:10946, addr:0x21ce535bb3b2, length:0x1 }
}
{
real_time:13762576.333597 global_time:153936, event:`EXIT' tid:10946, ticks:43517090
}
=== Start rr backtrace:
/obj/bin/rr(_ZN2rr13dump_rr_stackEv+0x44)[0x55f453600b1e]
/obj/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x1a2)[0x55f45347b1b6]
/obj/bin/rr(+0x31b30e)[0x55f4534a130e]
/obj/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x62)[0x55f4534a1568]
/obj/bin/rr(_ZN2rr4TaskD1Ev+0xca)[0x55f4535c3eb8]
/obj/bin/rr(_ZN2rr10RecordTaskD1Ev+0x3a9)[0x55f4535424cd]
/obj/bin/rr(_ZN2rr10RecordTaskD0Ev+0x18)[0x55f453542500]
/obj/bin/rr(+0x356b44)[0x55f4534dcb44]
/obj/bin/rr(_ZN2rr13RecordSession11record_stepEv+0x3ae)[0x55f4534e64e4]
/obj/bin/rr(+0x352ef0)[0x55f4534d8ef0]
/obj/bin/rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x2d5)[0x55f4534d99bf]
/obj/bin/rr(main+0x21b)[0x55f453619dde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f193ff66b97]
/obj/bin/rr(_start+0x2a)[0x55f4533e938a]
=== End rr backtrace
Launch gdb with
gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:10918' /julia/usr/bin/julia-debug
^C
Backtrace
(rr) backtrace
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00005ff86156a801 in __GI_abort () at abort.c:79
#2 0x00000000004d88eb in __sanitizer::Abort ()
at /workspace/srcdir/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp:155
#3 0x00000000004d6b48 in __sanitizer::Die ()
at /workspace/srcdir/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_termination.cpp:58
#4 0x00000000004b765e in __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x6576803a3d76,
__in_chrg=<optimized out>) at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_report.cpp:186
#5 __asan::ReportGenericError (pc=<optimized out>, bp=bp@entry=111559631849952, sp=sp@entry=111559631849944,
addr=<optimized out>, is_write=is_write@entry=false, access_size=access_size@entry=8, exp=0, fatal=true)
at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_report.cpp:470
#6 0x00000000004b8018 in __asan::__asan_report_load8 (addr=<optimized out>)
at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_rtl.cpp:120
#7 0x00007fb56b9e0312 in gc_try_setmark (obj=0x61000bc09e80, nptr=0x784220c298b0, ptag=0x6576803a6298,
pbits=0x6576803a6297 "\001\200`(\004\020a") at gc.c:1642
#8 0x00007fb56b9df5e9 in gc_mark_scan_obj8 (ptls=0x404919615520, sp=0x6576803a62c0, obj8=0x784220c29898,
parent=0x61000bc0a280 "\200\236\300\v", begin=0x6030001cddb4 "", end=0x6030001cddb6 "\276\276\064",
pnew_obj=0x6576803a62a0, ptag=0x6576803a6298, pbits=0x6576803a6297 "\001\200`(\004\020a") at gc.c:1836
#9 0x00007fb56b9d75f5 in gc_mark_loop (ptls=0x404919615520, sp=...) at gc.c:2117
#10 0x00007fb56b9e2fbb in _jl_gc_collect (ptls=0x404919615520, collection=JL_GC_AUTO) at gc.c:2903
#11 0x00007fb56b9e2546 in jl_gc_collect (collection=JL_GC_AUTO) at gc.c:3109
#12 0x00007fb56b9d4527 in maybe_collect (ptls=0x404919615520) at gc.c:827
#13 0x00007fb56b9d4179 in jl_gc_big_alloc (ptls=0x404919615520, sz=32) at gc.c:883
#14 0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1424, osize=32) at gc.c:1140
#15 0x000044b532265f38 in InvasiveLinkedList () at linked_list.jl:7
#16 GenericCondition () at condition.jl:67
#17 Task () at task.jl:5
#18 Task () at task.jl:5
#19 macro expansion () at threadingconstructs.jl:130
#20 macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:50
#21 macro expansion () at task.jl:334
#22 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0xbbbbbbbbd2cc463b>, xs=...)
at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#23 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000',
mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#24 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#25 japi1_#92_589 () at threadingconstructs.jl:126
#26 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#27 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606)
at gf.c:2230
#28 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#29 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#30 0x00007fb56b92f26e in start_task () at task.c:687
#31 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
Here is a trace where `julia-debug` got a segmentation fault before ASAN printed anything. I think this failure mode is actually more frequent. I needed to SIGKILL the `julia-debug` process after it printed the stacktrace.
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1:abort_on_error=1 ASAN_SYMBOLIZER_PATH=$tooldir/llvm-symbolizer /obj/bin/rr record --chaos usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:100; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
Original output
rr: Saving execution to trace directory `/root/.local/share/rr/julia-debug-5'.
seed = 1
68.843 ms (19980 allocations: 19.43 MiB)
seed = 2
68.356 ms (20313 allocations: 19.48 MiB)
seed = 3
76.487 ms (21386 allocations: 19.66 MiB)
seed = 4
67.958 ms (21283 allocations: 19.63 MiB)
seed = 5
67.588 ms (20681 allocations: 19.54 MiB)
seed = 6
64.807 ms (19937 allocations: 19.42 MiB)
seed = 7
signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
size at ./subarray.jl:71 [inlined]
length at ./abstractarray.jl:206 [inlined]
#89 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:664
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:77
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_fptr_args at /julia/src/gf.c:2009
_jl_invoke at /julia/src/gf.c:2230
jl_apply_generic at /julia/src/gf.c:2414
jl_apply at /julia/src/./julia.h:1687
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 25924611 (Pool: 0; Big: 25924611); GC: 312
[FATAL /rr/src/Task.cc:216:~Task() errno: ECHILD]
(task 9537 (rec:9537) at time 106741)
-> Assertion `syscallbuf_child.is_null()' failed to hold.
Tail of trace dump:
{
real_time:13759625.102632 global_time:106721, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:9535, ticks:51520240909
rax:0x29 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x29 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x73 r10
:0x0 r11:0x246 r12:0x29 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.114627 global_time:106722, event:`SYSCALL: mmap' (state:ENTERING_SYSCALL) tid:9535, ticks:51525917273
rax:0xffffffffffffffda rbx:0x22 rcx:0xffffffffffffffff rdx:0x3 rsi:0x22000 rdi:0x0 rbp:0xffffffffffffffff rsp:0x7f0b77a2bb20
r8:0xffffffffffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x0 r13:0x800 r14:0x21000 r15:0x7 rip:0x4cf5b0 eflags:0x246 cs:0x33 ss:
0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.114699 global_time:106723, event:`SYSCALL: mmap' (state:EXITING_SYSCALL) tid:9535, ticks:51525917273
rax:0x2904276fd000 rbx:0x22 rcx:0xffffffffffffffff rdx:0x3 rsi:0x22000 rdi:0x0 rbp:0xffffffffffffffff rsp:0x7f0b77a2bb20 r8:
0xffffffffffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x0 r13:0x800 r14:0x21000 r15:0x7 rip:0x4cf5b0 eflags:0x246 cs:0x33 ss:0x2b
ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x736a3190a280 gs_base:0x0
{ map_file:"<ZERO>", addr:0x2904276fd000, length:0x22000, prot_flags:"rw-p", file_offset:0x0, device:0, inode:0, data_file
:"", data_offset:0x0, file_size:0x22000 }
}
{
real_time:13759625.115101 global_time:106724, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:9535, ticks:51525943018
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x26 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0
x0 r9:0x73 r10:0x0 r11:0x246 r12:0x26 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 c
s:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.115153 global_time:106725, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:9535, ticks:51525943018
rax:0x26 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x26 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x73 r10
:0x0 r11:0x246 r12:0x26 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.129024 global_time:106726, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:9535, ticks:51531499743
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x24 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0
x0 r9:0x73 r10:0x0 r11:0x246 r12:0x24 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 c
s:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.129077 global_time:106727, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:9535, ticks:51531499743
rax:0x24 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x24 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x73 r10
:0x0 r11:0x246 r12:0x24 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.129364 global_time:106728, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:9535, ticks:51531500924
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1d rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0
x0 r9:0x70 r10:0x0 r11:0x246 r12:0x1d r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 c
s:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.129415 global_time:106729, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:9535, ticks:51531500924
rax:0x1d rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1d rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x70 r10
:0x0 r11:0x246 r12:0x1d r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{ [133/1936]
real_time:13759625.129691 global_time:106730, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:9535, ticks:51531502238
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x38 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0
x0 r9:0x7f0b77a2cc35 r10:0x0 r11:0x246 r12:0x38 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 efla
gs:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.129741 global_time:106731, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:9535, ticks:51531502238
rax:0x38 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x38 rsi:0x75a06ad39ae0 rdi:0x2 rbp:0x1 rsp:0x681ffe10 r8:0x0 r9:0x7f0b77
a2cc35 r10:0x0 r11:0x246 r12:0x38 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x70000002 eflags:0x246 cs:0x
33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.130012 global_time:106732, event:`SYSCALLBUF_FLUSH' tid:9535, ticks:51531502400
{ syscall:'rt_sigprocmask', ret:0x0, size:0x18 }
}
{
real_time:13759625.130038 global_time:106733, event:`SYSCALL: rt_sigaction' (state:ENTERING_SYSCALL) tid:9535, ticks:51531
502400
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7f0b77a2d720 rsi:0x7f0b77a2d680 rdi:0xb rbp:0xd rsp:0x681
ffe10 r8:0x7f0b77a2d870 r9:0x17 r10:0x8 r11:0x246 r12:0x736a3190a280 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a0
0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.130048 global_time:106734, event:`SYSCALLBUF_RESET' tid:9535, ticks:51531502400
}
{
real_time:13759625.130105 global_time:106735, event:`SYSCALL: rt_sigaction' (state:EXITING_SYSCALL) tid:9535, ticks:515315
02400
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x7f0b77a2d720 rsi:0x7f0b77a2d680 rdi:0xb rbp:0xd rsp:0x681ffe10 r8:0x7f0b
77a2d870 r9:0x17 r10:0x8 r11:0x246 r12:0x736a3190a280 r13:0x6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x7000000
2 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x736a3190a280 gs_base:0x0
{ tid:9535, addr:0x7f0b77a2d680, length:0x20 }
{ tid:9535, addr:0x7f0b77a2d720, length:0x20 }
}
{
real_time:13759625.130400 global_time:106736, event:`SYSCALL: rt_sigreturn' (state:ENTERING_SYSCALL) tid:9535, ticks:51531
502412
rax:0xffffffffffffffda rbx:0x610009318e80 rcx:0xffffffffffffffff rdx:0x75a06a55aab0 rsi:0x7f0b77a2d680 rdi:0x0 rbp:0x55f97fd
d5890 rsp:0x7f0b77a2dac0 r8:0x7f0b77a2d870 r9:0x17 r10:0x8 r11:0x246 r12:0x736a3190a280 r13:0x6100092d5c80 r14:0x610009318c8
0 r15:0x657c01cb1a00 rip:0x6768714da899 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xf fs_base:0x736a
3190a280 gs_base:0x0
}
{
real_time:13759625.130446 global_time:106737, event:`SYSCALL: rt_sigreturn' (state:EXITING_SYSCALL) tid:9535, ticks:515315
02412
rax:0x610003d6a580 rbx:0x610009318e80 rcx:0xbbbbbbbbbbbbbbbb rdx:0xbbbbbbbbbbbbbbbb rsi:0xefc10e3067cd8800 rdi:0xc20007ad4a9
rbp:0x55f97fdd5890 rsp:0x55f97fdd5800 r8:0xc20807a54a8 r9:0x1 r10:0xc20807a54b8 r11:0xc20807a54a8 r12:0x736a3190a280 r13:0x
6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x51a16dc0872e eflags:0x10283 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs
:0x0 orig_rax:0xffffffffffffffff fs_base:0x736a3190a280 gs_base:0x0 st0:0x0 st1:0x0 st2:0x0 st3:0x0 st4:0x0 st5:0x4000c00000
0000000000 st6:0x4000c000000000000000 st7:0x4001e000000000000000 ymm0:0xeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ymm1:0x3857181fc000
0000001da620b ymm2:0x1900000736a00000001 ymm3:0x61a0002793e0000061a0002790c0 ymm4:0x0 ymm5:0x0 ymm6:0x0 ymm7:0x0 ymm8:0xc000
000000000000c ymm9:0x0 ymm10:0xbfef0a3d70a3d70abfef0a3d70a3d70a ymm11:0xffffffffffffffffffffffffffffffff ymm12:0x0 ymm13:0x3
fef0a3d70a3d70a3fef0a3d70a3d70a ymm14:0xbfef0a3d70a3d70abfef0a3d70a3d70a ymm15:0x10000000000000001
}
{
real_time:13759625.130638 global_time:106738, event:`SIGNAL: SIGSEGV(det)' tid:9535, ticks:51531502412
rax:0x610003d6a580 rbx:0x610009318e80 rcx:0xbbbbbbbbbbbbbbbb rdx:0xbbbbbbbbbbbbbbbb rsi:0xefc10e3067cd8800 rdi:0xc20007ad4a9
rbp:0x55f97fdd5890 rsp:0x55f97fdd5800 r8:0xc20807a54a8 r9:0x1 r10:0xc20807a54b8 r11:0xc20807a54a8 r12:0x736a3190a280 r13:0x
6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x51a16dc0872e eflags:0x10283 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs
:0x0 orig_rax:0xffffffffffffffff fs_base:0x736a3190a280 gs_base:0x0
}
{
real_time:13759625.130745 global_time:106739, event:`SIGNAL_DELIVERY: SIGSEGV(det)' tid:9535, ticks:51531502412
rax:0x610003d6a580 rbx:0x610009318e80 rcx:0xbbbbbbbbbbbbbbbb rdx:0xbbbbbbbbbbbbbbbb rsi:0xefc10e3067cd8800 rdi:0xc20007ad4a9
rbp:0x55f97fdd5890 rsp:0x55f97fdd5800 r8:0xc20807a54a8 r9:0x1 r10:0xc20807a54b8 r11:0xc20807a54a8 r12:0x736a3190a280 r13:0x
6100092d5c80 r14:0x610009318c80 r15:0x657c01cb1a00 rip:0x51a16dc0872e eflags:0x10283 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs
:0x0 orig_rax:0xffffffffffffffff fs_base:0x736a3190a280 gs_base:0x0
{ tid:9535, addr:0x55f97fdd5800, length:(nil) }
}
{
real_time:13760099.755174 global_time:106740, event:`EXIT' tid:9538, ticks:2769
{ tid:9538, addr:0x40b901fd29d0, length:0x4 }
}
=== Start rr backtrace:
/obj/bin/rr(_ZN2rr13dump_rr_stackEv+0x44)[0x561601026b1e]
/obj/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x1a2)[0x561600ea11b6]
/obj/bin/rr(+0x31b30e)[0x561600ec730e]
/obj/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x62)[0x561600ec7568]
/obj/bin/rr(_ZN2rr4TaskD1Ev+0xca)[0x561600fe9eb8]
/obj/bin/rr(_ZN2rr10RecordTaskD1Ev+0x3a9)[0x561600f684cd]
/obj/bin/rr(_ZN2rr10RecordTaskD0Ev+0x18)[0x561600f68500]
/obj/bin/rr(+0x356b44)[0x561600f02b44]
/obj/bin/rr(_ZN2rr13RecordSession11record_stepEv+0x3ae)[0x561600f0c4e4]
/obj/bin/rr(+0x352ef0)[0x561600efeef0]
/obj/bin/rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x2d5)[0x56160
0eff9bf]
/obj/bin/rr(main+0x21b)[0x56160103fdde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f697fd39b97]
/obj/bin/rr(_start+0x2a)[0x561600e0f38a]
=== End rr backtrace
Launch gdb with
gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:9537' /julia/usr/bin/julia-debug
Received SIGTERM while an earlier one was pending. We're probably wedged.
=== Start rr backtrace:
/obj/bin/rr(_ZN2rr13dump_rr_stackEv+0x44)[0x561601026b1e]
/obj/bin/rr(_ZN2rr15notifying_abortEv+0x53)[0x561601026ad5]
/obj/bin/rr(+0x3524f1)[0x561600efe4f1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f6980c8a890]
/lib/x86_64-linux-gnu/libc.so.6(__poll+0x49)[0x7f697fe2cbf9]
/obj/bin/rr(+0x2d75a5)[0x561600e835a5]
/obj/bin/rr(+0x2d765e)[0x561600e8365e]
/obj/bin/rr(_ZN2rr13GdbConnection14read_data_onceEv+0x3e)[0x561600e836c4]
/obj/bin/rr(_ZN2rr13GdbConnection11read_packetEv+0x4b)[0x561600e84441]
/obj/bin/rr(_ZN2rr13GdbConnection11get_requestEv+0x1a4)[0x561600e8a092]
/obj/bin/rr(_ZN2rr9GdbServer25process_debugger_requestsENS0_11ReportStateE+0x59)[0x561600e9d21d]
/obj/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x255)[0x561600ea1269]
/obj/bin/rr(+0x31b30e)[0x561600ec730e]
/obj/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x62)[0x561600ec7568]
/obj/bin/rr(_ZN2rr4TaskD1Ev+0xca)[0x561600fe9eb8]
/obj/bin/rr(_ZN2rr10RecordTaskD1Ev+0x3a9)[0x561600f684cd]
/obj/bin/rr(_ZN2rr10RecordTaskD0Ev+0x18)[0x561600f68500]
/obj/bin/rr(+0x356b44)[0x561600f02b44]
/obj/bin/rr(_ZN2rr13RecordSession11record_stepEv+0x3ae)[0x561600f0c4e4]
/obj/bin/rr(+0x352ef0)[0x561600efeef0]
/obj/bin/rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x2d5)[0x56160
0eff9bf]
/obj/bin/rr(main+0x21b)[0x56160103fdde]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f697fd39b97]
/obj/bin/rr(_start+0x2a)[0x561600e0f38a]
=== End rr backtrace
Aborted (core dumped)
Backtrace
(rr) bt
#0 iterate () at generator.jl:47
#1 japi1_collect_605 (itr=...) at array.jl:664
#3 0x000051a16dc0eece in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#6 0x000075a06a410296 in _jl_invoke (F=0x6100092c3880, args=0x617000b2a790, nargs=0, mfunc=0x61001102ae80, world=27571)
#8 0x000075a06a47410e in jl_apply (args=0x617000b2a788, nargs=1) at ./julia.h:1687
#9 0x000075a06a47926e in start_task () at task.c:687
#10 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x55f97fdd6000
Hmm... I tried rr replay on my laptop but it didn't work (it didn't reproduce the execution like I can in the remote docker container).
$ rr replay julia-debug-8/
Trace XCR0 value 0x7 != our XCR0 value 0x1f; Replay will probably fail because glibc dynamic loader examines XCR0
...
Reading symbols from /tmp/tmp.EK4E99SCYP/julia-debug-8/mmap_pack_29_julia-debug...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:63926
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
0x00007fb56c528090 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
[FATAL ../src/ReplaySession.cc:473:cont_syscall_boundary()]
(task 260535 (rec:10918) at time 448)
-> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0}
...
=== End rr backtrace
Launch gdb with
gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:63927' /tmp/tmp.EK4E99SCYP/julia-debug-8/mmap_pack_29_julia-debug
In another terminal:
$ gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:63927' /tmp/tmp.EK4E99SCYP/julia-debug-8/mmap_pack_29_julia-debug
...
(gdb) run
Starting program: /tmp/tmp.EK4E99SCYP/julia-debug-8/mmap_pack_29_julia-debug
/tmp/tmp.EK4E99SCYP/julia-debug-8/mmap_pack_29_julia-debug: error while loading shared libraries: libjulia-debug.so.1: cannot open shared object file: No such file or directory
[Inferior 1 (process 260808) exited with code 0177]
Is this expected?
Yeah, replaying on another machine needs a kernel patch.
Is reproducing the crash still unreliable on your machine? With ASAN, _some_ type of crash (segfault or abort via ASAN) almost always occurs within ~20 iterations for me. It doesn't require setting JULIA_NUM_THREADS.
I'm using:
docker run -it --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ubuntu:latest
] add LLVM_full_jll@10
rr built from https://github.com/mozilla/rr/commit/03cafff77d9786a1fc2064c1d55cbf77fe6fd015
julia at dc0c5c539c267064ad45f11ec725d27d94ff4b89 with the following patch:
diff --git a/src/aotcompile.cpp b/src/aotcompile.cpp
index 15623efe5a..c33010ccb0 100644
--- a/src/aotcompile.cpp
+++ b/src/aotcompile.cpp
@@ -23,6 +23,7 @@
 #include <llvm/Transforms/Vectorize.h>
 #if defined(JL_ASAN_ENABLED)
 #include <llvm/Transforms/Instrumentation.h>
+#include <llvm/Transforms/Instrumentation/AddressSanitizer.h>
 #endif
 #include <llvm/Transforms/Scalar/GVN.h>
 #include <llvm/Transforms/IPO/AlwaysInliner.h>
with CXXFLAGS += -DNDEBUG in Make.user, and make invoked as in https://github.com/JuliaLang/julia/issues/35341#issuecomment-609130650.
Now that we have an rr trace, debugging it will be quite easy. I'll take a look in the morning.
Awesome. Thanks in advance!
Ok, I've finally managed to get my setup back into a state where I can successfully replay this. However, the debug symbols seem to be missing from the trace, so it's a bit hard to debug (not impossible, but it would be easier with symbols). Did you rr pack the trace directory before uploading it? I was in particular looking at julia-debug-8.
Also, could you show the output of rr sources on one of the trace directories, on the machine that recorded it?
Alright, got it working again. Here's what rr spits out. The object in question is a SubArray; here's its life:
Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78
Old value = -1094795586
New value = -286331154
0x00005ff8616b8f95 in ?? ()
#0 0x00005ff8616b8f95 in ?? ()
#1 0x00000000004a950c in __asan_memset (block=0x61000bc09e40, c=238, size=128) at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:26
#2 0x00007fb56b9d43c4 in jl_gc_big_alloc (ptls=0x404919615520, sz=48) at gc.c:899
#3 0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1448, osize=48) at gc.c:1140
#4 0x000044b53225ced3 in SubArray () at subarray.jl:22
#5 SubArray () at subarray.jl:38
#6 SubArray () at subarray.jl:28
#7 _maybe_reindex () at subarray.jl:189
#8 _maybe_reindex () at subarray.jl:185
#9 _maybe_reindex () at subarray.jl:179
#10 unsafe_view () at subarray.jl:178
#11 view () at subarray.jl:164
#12 julia_iterate_524 (itr=..., state=<optimized out>) at iterators.jl:1137
#13 0x000044b532265e34 in iterate () at iterators.jl:1135
#14 _zip_iterate_some () at iterators.jl:351
#15 _zip_iterate_all () at iterators.jl:343
#16 iterate () at iterators.jl:333
#17 macro expansion () at task.jl:334
#18 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#19 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#20 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#21 japi1_#92_589 () at threadingconstructs.jl:126
#22 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#23 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#24 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#25 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#26 0x00007fb56b92f26e in start_task () at task.c:687
#27 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78
Old value = -286331154
New value = 1552986768
0x000044b53225cee7 in SubArray () at subarray.jl:22
22 subarray.jl: No such file or directory.
#0 0x000044b53225cee7 in SubArray () at subarray.jl:22
#1 SubArray () at subarray.jl:38
#2 SubArray () at subarray.jl:28
#3 _maybe_reindex () at subarray.jl:189
#4 _maybe_reindex () at subarray.jl:185
#5 _maybe_reindex () at subarray.jl:179
#6 unsafe_view () at subarray.jl:178
#7 view () at subarray.jl:164
#8 julia_iterate_524 (itr=..., state=<optimized out>) at iterators.jl:1137
#9 0x000044b532265e34 in iterate () at iterators.jl:1135
#10 _zip_iterate_some () at iterators.jl:351
#11 _zip_iterate_all () at iterators.jl:343
#12 iterate () at iterators.jl:333
#13 macro expansion () at task.jl:334
#14 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#15 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#16 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#17 japi1_#92_589 () at threadingconstructs.jl:126
#18 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#19 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#20 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#21 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#22 0x00007fb56b92f26e in start_task () at task.c:687
#23 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78
Value = 1552986768
0x00007fb56b9ec609 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:914
914 gc.c: No such file or directory.
#0 0x00007fb56b9ec609 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:914
#1 0x00007fb56b9ebfca in sweep_big (ptls=0x404919615520, sweep_full=0) at gc.c:954
#2 0x00007fb56b9eadfd in gc_sweep_other (ptls=0x404919615520, sweep_full=0) at gc.c:1407
#3 0x00007fb56b9e37ed in _jl_gc_collect (ptls=0x404919615520, collection=JL_GC_AUTO) at gc.c:3013
#4 0x00007fb56b9e2546 in jl_gc_collect (collection=JL_GC_AUTO) at gc.c:3109
#5 0x00007fb56b9d4527 in maybe_collect (ptls=0x404919615520) at gc.c:827
#6 0x00007fb56b9d4179 in jl_gc_big_alloc (ptls=0x404919615520, sz=32) at gc.c:883
#7 0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1424, osize=32) at gc.c:1140
#8 0x000044b532265ea2 in _zip_iterate_interleave () at iterators.jl:363
#9 _zip_iterate_all () at iterators.jl:347
#10 iterate () at iterators.jl:333
#11 macro expansion () at task.jl:334
#12 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#13 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#14 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#15 japi1_#92_589 () at threadingconstructs.jl:126
#16 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#17 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#18 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#19 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#20 0x00007fb56b92f26e in start_task () at task.c:687
#21 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78
Old value = 1552986768
New value = -1145324613
0x00005ff8616b8f95 in ?? ()
#0 0x00005ff8616b8f95 in ?? ()
#1 0x00000000004a950c in __asan_memset (block=0x61000bc09e40, c=187, size=128) at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:26
#2 0x00007fb56b9ec920 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:938
#3 0x00007fb56b9ebfca in sweep_big (ptls=0x404919615520, sweep_full=0) at gc.c:954
#4 0x00007fb56b9eadfd in gc_sweep_other (ptls=0x404919615520, sweep_full=0) at gc.c:1407
#5 0x00007fb56b9e37ed in _jl_gc_collect (ptls=0x404919615520, collection=JL_GC_AUTO) at gc.c:3013
#6 0x00007fb56b9e2546 in jl_gc_collect (collection=JL_GC_AUTO) at gc.c:3109
#7 0x00007fb56b9d4527 in maybe_collect (ptls=0x404919615520) at gc.c:827
#8 0x00007fb56b9d4179 in jl_gc_big_alloc (ptls=0x404919615520, sz=32) at gc.c:883
#9 0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1424, osize=32) at gc.c:1140
#10 0x000044b532265ea2 in _zip_iterate_interleave () at iterators.jl:363
#11 _zip_iterate_all () at iterators.jl:347
#12 iterate () at iterators.jl:333
#13 macro expansion () at task.jl:334
#14 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#15 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#16 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#17 japi1_#92_589 () at threadingconstructs.jl:126
#18 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#19 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#20 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#21 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#22 0x00007fb56b92f26e in start_task () at task.c:687
#23 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
=================================================================
==10918==ERROR: AddressSanitizer: heap-use-after-free on address 0x61000bc09e78 at pc 0x7fb56b9e0312 bp 0x6576803a49e0 sp 0x6576803a49d8
READ of size 8 at 0x61000bc09e78 thread T0
#0 0x7fb56b9e0311 in gc_try_setmark /julia/src/gc.c:1642:24
#1 0x7fb56b9df5e8 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
#2 0x7fb56b9d75f4 in gc_mark_loop /julia/src/gc.c:2117:9
#3 0x7fb56b9e2fba in _jl_gc_collect /julia/src/gc.c:2903:5
#4 0x7fb56b9e2545 in jl_gc_collect /julia/src/gc.c:3109:13
#5 0x7fb56b9d4526 in maybe_collect /julia/src/gc.c:827:9
#6 0x7fb56b9d4178 in jl_gc_big_alloc /julia/src/gc.c:883:5
#7 0x7fb56b9d541d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
#8 0x44b532265f37 (/memfd:julia-codegen (deleted)+0x31f37)
#9 0x44b53226a152 (/memfd:julia-codegen (deleted)+0x36152)
#10 0x44b53226ceed (/memfd:julia-codegen (deleted)+0x38eed)
#11 0x7fb56b8afd32 in jl_fptr_args /julia/src/gf.c:2009:12
#12 0x7fb56b8c6295 in _jl_invoke /julia/src/gf.c:2230:31
#13 0x7fb56b8c7eaa in jl_apply_generic /julia/src/gf.c:2414:12
#14 0x7fb56b92a10d in jl_apply /julia/src/./julia.h:1687:12
#15 0x7fb56b92f26d in start_task /julia/src/task.c:687:19
This definitely looks like the compiler is failing to insert a GC root. However, when I just ask it for the LLVM IR, the allocation whose root gets deleted doesn't happen (it's optimized out), so there must be something else going on.
Yeah, this is a regression in LateLowerGCFrame, caused by https://github.com/JuliaLang/julia/commit/4776a8def48bb5e42b1fd2f4d12f92e774e37b0e#diff-1b7282e2ae2a9c71d694e66b58d43e07
Reproducer:
define i8 @lost_select_decayed(i1 %arg1) {
; CHECK-LABEL: @lost_select_decayed
; CHECK: %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
; CHECK: [[GEP0:%.*]] = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
; CHECK: store %jl_value_t addrspace(10)* %obj1, %jl_value_t addrspace(10)** [[GEP0]]
top:
%ptls = call %jl_value_t*** @julia.ptls_states()
%obj1 = call %jl_value_t addrspace(10) *@alloc()
%decayed = addrspacecast %jl_value_t addrspace(10) *%obj1 to %jl_value_t addrspace(11)*
%selected = select i1 %arg1, %jl_value_t addrspace(11)* null, %jl_value_t addrspace(11)* %decayed
%casted = bitcast %jl_value_t addrspace(11)* %selected to i8 addrspace(11)*
call void @jl_safepoint()
%val = load i8, i8 addrspace(11)* %casted
ret i8 %val
}
It'll fail to protect the allocation across the safepoint. Thanks for getting me the rr trace @tkf - this would have been extremely difficult without it, since it doesn't show up by just looking at code_llvm for that function. The fix is fairly straightforward, but it's getting early here, so I'll fix it in the afternoon.
Thank you so, _so much_! I'm glad that rr trace thing was useful. It's magic!
So the bug is now found and there is no need to re-upload the trace? Looking at the shell history, though, it looks like I did rr pack. I can try this again or create a new one.
Yes, you did do rr pack, but ASAN was confusing gdb, so it looked like there were no symbols. I was able to work around that, get it working, and find the bug.
Great!
Is the hang that requires SIGKILLing the tracee reproducible? I'm trying to reproduce, but failing. It'd be good to know what rr is up to there, so I can fix it.
That's julia-debug-8 and julia-debug-5.
You can click "Original output" and see what happened around the time I sent the SIGKILL. The julia-debug processes hang after an abort in julia-debug-8 (from ASAN) and a segfault in julia-debug-5 (not from ASAN). I guess you need to pass those signals through in gdb to debug that? (Is that possible?)
By the way, just in case it helps to guess what happened: I also remember that rr and julia-debug did not show up in top when it happened. So I was imagining it was some kind of deadlock during teardown in julia.
Yes, something weird happened in rr when julia died. I can't reproduce it locally. If you see it again, try to attach gdb to the rr process (by finding it in ps aux - don't use the gdb line rr prints, which tries to use rr to attach to julia). I'd like to know where it deadlocked.
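The suggestion above can be sketched as a couple of shell commands (the PID is illustrative and will differ per run):

```shell
# Locate the recording rr process itself, not the traced julia-debug.
# The [r] bracket trick keeps grep from matching its own command line.
ps aux | grep '[r]r record' || echo "no rr recording found"

# Then attach gdb directly to that PID (illustrative), not via the
# "Launch gdb with ..." line rr prints, which goes through rr itself:
#   gdb -p 10880
#   (gdb) thread apply all bt
```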
I have one process like julia-debug-5 still hanging. Here is its backtrace:
(gdb) attach 10880
Attaching to process 10880
[New LWP 10881]
[New LWP 10882]
[New LWP 10883]
[New LWP 10884]
[New LWP 10885]
[New LWP 10886]
[New LWP 10887]
[New LWP 10888]
[New LWP 10889]
[New LWP 10890]
[New LWP 10891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f215e68a7e8 in __waitid (idtype=P_PID, id=10893, infop=0x7ffcdd62b950, options=16777218)
at ../sysdeps/unix/sysv/linux/waitid.c:29
29 ../sysdeps/unix/sysv/linux/waitid.c: No such file or directory.
(gdb) bt
#0 0x00007f215e68a7e8 in __waitid (idtype=P_PID, id=10893, infop=0x7ffcdd62b950, options=16777218)
at ../sysdeps/unix/sysv/linux/waitid.c:29
#1 0x000055b25a41d86a in rr::Task::wait_exit (this=0x55b25bf2e5c0) at /rr/src/Task.cc:149
#2 0x000055b25a41da4f in rr::Task::proceed_to_exit (this=0x55b25bf2e5c0) at /rr/src/Task.cc:163
#3 0x000055b25a336ea1 in rr::handle_ptrace_exit_event (t=0x55b25bf2e5c0) at /rr/src/RecordSession.cc:198
#4 0x000055b25a3404e4 in rr::RecordSession::record_step (this=0x55b25bf132d0) at /rr/src/RecordSession.cc:2056
#5 0x000055b25a332ef0 in rr::record (args=std::vector of length 3, capacity 8 = {...}, flags=...)
at /rr/src/RecordCommand.cc:617
#6 0x000055b25a3339bf in rr::RecordCommand::run (this=0x55b25a8272b0 <rr::RecordCommand::singleton>,
args=std::vector of length 3, capacity 8 = {...}) at /rr/src/RecordCommand.cc:740
#7 0x000055b25a473dde in main (argc=6, argv=0x7ffcdd62c118) at /rr/src/main.cc:268
Can you show the output of /proc/10893/status?
Name: julia-debug
State: D (disk sleep)
Tgid: 10892
Ngid: 0
Pid: 10893
PPid: 10880
TracerPid: 10880
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 1024
Groups:
NStgid: 10892
NSpid: 10893
NSpgid: 10880
NSsid: 7094
VmPeak: 21476554264 kB
VmSize: 21476304084 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 1974604 kB
VmRSS: 1781988 kB
VmData: 21475976676 kB
VmStk: 0 kB
VmExe: 1168 kB
VmLib: 110412 kB
VmPTE: 108684 kB
VmPMD: 46816 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
Threads: 5
SigQ: 1/515037
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000004226
SigIgn: 0000000000001010
SigCgt: 00000001c0000aea
CapInh: 00000000a80c25fb
CapPrm: 00000000a80c25fb
CapEff: 00000000a80c25fb
CapBnd: 00000000a80c25fb
CapAmb: 0000000000000000
Seccomp: 2
Cpus_allowed: 0000,00000010
Cpus_allowed_list: 4
Mems_allowed: 00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 57
nonvoluntary_ctxt_switches: 0
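The key line here is State: D (uninterruptible sleep): the tracee is stuck inside the kernel, which is consistent with rr being blocked in __waitid above. The interesting fields can be pulled with a quick grep (a sketch; $$ inspects the current shell as a stand-in, substitute the hung tracee's PID):

```shell
# Replace $$ with the hung tracee's PID (e.g. 10893 in this report)
grep -E '^(Name|State|TracerPid|SigBlk|SigPnd|ShdPnd)' /proc/$$/status
```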
Output of lsof -p 10892:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
julia-deb 10892 root cwd DIR 0,56 4096 215 /julia
julia-deb 10892 root rtd DIR 0,56 4096 2 /
julia-deb 10892 root txt REG 0,56 9595832 85057 /julia/usr/bin/julia-debug
julia-deb 10892 root mem REG 0,56 32 80648 /obj/share/rr/rr_page_64
julia-deb 10892 root DEL REG 0,56 85344 /tmp/rr-shared-preload_thread_locals-10892-0
julia-deb 10892 root DEL REG 0,5 233876681 /memfd:julia-codegen
julia-deb 10892 root mem REG 0,56 229957 17909 /julia/usr/lib/libopenlibm.so.3.0
julia-deb 10892 root mem REG 0,56 153989760 85059 /julia/usr/lib/julia/sys-debug.so
julia-deb 10892 root mem REG 0,56 17136 19069 /julia/usr/lib/libsuitesparse_wrapper.so
julia-deb 10892 root mem REG 0,56 655808 17922 /julia/usr/lib/libgmp.so.10.3.2
julia-deb 10892 root mem REG 0,56 2030544 33 /lib/x86_64-linux-gnu/libc-2.27.so
julia-deb 10892 root mem REG 0,56 1995888 22380 /julia/usr/lib/libgfortran.so.4
julia-deb 10892 root mem REG 0,56 47652 19050 /julia/usr/lib/libccolamd.so.2.9.6
julia-deb 10892 root mem REG 0,56 31250 19051 /julia/usr/lib/libcolamd.so.2.9.6
julia-deb 10892 root mem REG 0,56 31767024 85053 /julia/usr/lib/libjulia-debug.so.1.5
julia-deb 10892 root mem REG 0,56 39059 19040 /julia/usr/lib/libamd.so.2.4.6
julia-deb 10892 root mem REG 0,56 2406176 19346 /julia/usr/lib/libmpfr.so.6.0.2
julia-deb 10892 root mem REG 0,56 30301432 19502 /julia/usr/lib/libopenblas64_.0.3.7.so
julia-deb 10892 root mem REG 0,56 1005880 19054 /julia/usr/lib/libcholmod.so.3.0.13
julia-deb 10892 root mem REG 0,56 10896 19072 /julia/usr/lib/libsuitesparseconfig.so.5.4.0
julia-deb 10892 root mem REG 0,56 18080 17907 /julia/usr/lib/libdSFMT.so
julia-deb 10892 root mem REG 0,56 577384 19087 /julia/usr/lib/libpcre2-8.so.0.7.0
julia-deb 10892 root mem REG 0,56 43470 19046 /julia/usr/lib/libcamd.so.2.4.6
julia-deb 10892 root mem REG 0,56 388656 80784 /obj/lib/rr/librrpreload.so
julia-deb 10892 root mem REG 0,56 14560 31 /lib/x86_64-linux-gnu/libdl-2.27.so
julia-deb 10892 root mem REG 0,56 144976 77 /lib/x86_64-linux-gnu/libpthread-2.27.so
julia-deb 10892 root mem REG 0,56 101488 22379 /julia/usr/lib/libgcc_s.so.1
julia-deb 10892 root mem REG 0,56 61870856 21896 /julia/usr/lib/libLLVM-9jl.so
julia-deb 10892 root DEL REG 0,56 85355 /tmp/rr-shared-syscallbuf.10896-10892-5
julia-deb 10892 root DEL REG 0,56 85354 /tmp/rr-shared-syscallbuf.10895-10892-4
julia-deb 10892 root DEL REG 0,56 85353 /tmp/rr-shared-syscallbuf.10894-10892-3
julia-deb 10892 root DEL REG 0,56 85352 /tmp/rr-shared-syscallbuf.10893-10892-2
julia-deb 10892 root mem REG 0,56 170960 26 /lib/x86_64-linux-gnu/ld-2.27.so
julia-deb 10892 root DEL REG 0,56 85345 /tmp/rr-shared-syscallbuf.10892-10892-1
julia-deb 10892 root mem REG 0,56 31680 95 /lib/x86_64-linux-gnu/librt-2.27.so
julia-deb 10892 root mem REG 0,56 1594864 840 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
julia-deb 10892 root mem REG 0,56 1700792 93 /lib/x86_64-linux-gnu/libm-2.27.so
julia-deb 10892 root mem REG 0,56 264208 22381 /julia/usr/lib/libquadmath.so.0
julia-deb 10892 root 0u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 1u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 2u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 3u a_inode 0,11 0 8416 [eventpoll]
julia-deb 10892 root 4r FIFO 0,10 0t0 233876667 pipe
julia-deb 10892 root 5w FIFO 0,10 0t0 233876667 pipe
julia-deb 10892 root 6r FIFO 0,10 0t0 233876668 pipe
julia-deb 10892 root 7w FIFO 0,10 0t0 233876668 pipe
julia-deb 10892 root 8u a_inode 0,11 0 8416 [eventfd]
julia-deb 10892 root 9u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 10u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 11r CHR 1,3 0t0 6 /dev/null
julia-deb 10892 root 12u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 13u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 14u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 15u CHR 136,5 0t0 8 /dev/pts/5
julia-deb 10892 root 16u REG 0,5 134217728 233876681 /memfd:julia-codegen (deleted)
julia-deb 10892 root 17r CHR 1,9 0t0 11 /dev/urandom
julia-deb 10892 root 18r FIFO 0,10 0t0 233876709 pipe
julia-deb 10892 root 19w FIFO 0,10 0t0 233876709 pipe
julia-deb 10892 root 100u a_inode 0,11 0 8416 [perf_event]
julia-deb 10892 root 101u a_inode 0,11 0 8416 [perf_event]
julia-deb 10892 root 102u a_inode 0,11 0 8416 [perf_event]
julia-deb 10892 root 103u a_inode 0,11 0 8416 [perf_event]
julia-deb 10892 root 104u a_inode 0,11 0 8416 [perf_event]
julia-deb 10892 root 999w CHR 1,3 0t0 6 /dev/null
julia-deb 10892 root 1001u unix 0x0000000000000000 0t0 233892917 type=STREAM
rr fix in https://github.com/mozilla/rr/pull/2493