Rust: "died due to signal 11" in collectionstests on arm-android

Created on 11 Nov 2018  路  29Comments  路  Source: rust-lang/rust

Symptom: The arm-android test failed with the following messages:

[01:44:38] died due to signal 11
[01:44:38] error: test failed, to rerun pass '--test collectionstests'
[01:44:38] 
[01:44:38] 
[01:44:38] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "arm-linux-androideabi" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "alloc" "--"
[01:44:38] expected success, got: exit code: 3
[01:44:38] 
[01:44:38] 
[01:44:38] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test --target arm-linux-androideabi
[01:44:38] Build completed unsuccessfully in 1:32:50

Previous instances:

This might be caused by a mis-optimization like #49775.

A-github-actions A-spurious C-tracking-issue E-help-wanted O-android P-medium T-infra

Most helpful comment

Looking at the memory dumps from the tombstone, we can see that the bug is somewhere in realloc:

Abort message: '@@@ ABORTING: invalid address or address of corrupt block 0xb8caf510 passed to dlfree'

memory near r7:
    b8caf4f8 00000000 00000000 00000000 00000000  
    b8caf508 00000000 00000000 00000000 8000000b  
    b8caf518 33323130 37363534 b8ca3938 00000001  
    b8caf528 00000018 00000006 00000000 00000000  
    b8caf538 000069d7 00000001 00000000 00000000  
    b8caf548 00000000 00000000 00000000 b2cf0f00  
    b8caf558 00000000 00000000 00000000 00000000  
    b8caf568 00000000 00000000 00000000 00000000  
    b8caf578 00000000 00000000 00000000 00000000  
    b8caf588 00000000 00000000 00000000 00000000  
    b8caf598 00000000 00000000 00000000 00000000  
    b8caf5a8 00000000 00000000 00000000 00000000  
    b8caf5b8 00000000 00000000 00000000 00000000  
    b8caf5c8 00000000 00000000 00000000 00000000  
    b8caf5d8 00000000 00000000 00000000 00000000  
    b8caf5e8 00000000 00000000 00000000 00000000  

Note the chunk prefix of 8000000b which says "this chunk has a length of 2GB + 8". This is clearly impossible since the address of the block (0xb8caf518) is in the second half of the address space.

Note that this bug is not an issue for modern Android versions since those use jemalloc instead of dlmalloc for the default libc allocator. In light of this I think we should just disable the try_reserve tests on Android.

All 29 comments

I suspect https://github.com/rust-lang/rust/pull/56869#issuecomment-451351624 on i686-musl is the same as this issue. Perhaps we should run the collectiontest in miri.

Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x08166877 in unbin (i=8, c=0xa048008) at src/malloc/malloc.c:195
#1  malloc (n=<optimized out>) at src/malloc/malloc.c:322
#2  0x0814ddd9 in alloc () at src/libstd/sys/unix/alloc.rs:11
#3  __rdl_alloc () at src/libstd/alloc.rs:233
#4  0x08127c2d in alloc () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/alloc.rs:72
#5  exchange_malloc () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/alloc.rs:182
#6  new<core::cell::UnsafeCell<core::option::Option<core::result::Result<(), alloc::boxed::Box<Any>>>>> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/sync.rs:288
#7  spawn_unchecked<closure,()> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/libstd/thread/mod.rs:458
#8  spawn<closure,()> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/libstd/thread/mod.rs:382
#9  test::run_test::run_test_inner::hb0857806220adfa0 () at src/libtest/lib.rs:1450
#10 0x0812645e in test::run_test::h5d2da69c3af16b8e () at src/libtest/lib.rs:1471
#11 0x08120a84 in run_tests<closure> () at src/libtest/lib.rs:1161
#12 test::run_tests_console::hd61c1544d577a32b () at src/libtest/lib.rs:957
#13 0x08119522 in test::test_main::h2355f0b379f819ba () at src/libtest/lib.rs:290
#14 0x08119c07 in test::test_main_static::h90a75711843ac1f7 () at src/libtest/lib.rs:324
#15 0x08106785 in collectionstests::main::h89799a166ae8ba23 ()

The stacktrace doesn't seem meaningful.

I've managed to reproduce this locally, using the same Android emulator used by CI. I'm working on creating a self-contained script to make it easy to test this locally.

I believe I've obtained a backtrace from the emulator:

#0  0x0000662c in ?? ()
#1  0xb6d2c61c in __rust_maybe_catch_panic ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#2  0xb6dc76d0 in test::run_test::run_test_inner::h59a42c89d1b0ae7a ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#3  0xb6dc7368 in test::run_test::hd576cc177e253177 ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#4  0xb6dc49fc in test::run_tests_console::h8730cd4311bcbbcb ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#5  0xb6dc1284 in test::test_main::h343b975aae7c6b90 ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#6  0xb6dc148c in test::test_main_static::hfd242c8df31b861a ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#7  0xb6e6779c in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hf8fbc2e6d4c74fb5 ()
#8  0xb6d22244 in std::panicking::try::do_call::h49216886d8653049 ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#9  0xb6d2c61c in __rust_maybe_catch_panic ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#10 0xb6d206d4 in std::panic::catch_unwind::h9c8f0a4ad65b2a9d ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#11 0xb6cf8a10 in std::rt::lang_start_internal::hc01076141fd50efb ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#12 0xb6e589e4 in main ()

Interestingly enough, it appears that the process actually hangs in the emulator, rather than dying. This allowed me to install gdb and connect the process after the test had failed.

I strongly suspect that this is related to the panic_safe test in src/liballoc/tests/slice.rs. However, I'm not certain, as I'm having difficulty narrowing things down to the offending test.

Some additional results:

I added the following script as a test (it contains some logic from panic_safe):

use std::cell::Cell;
use std::thread;
use std::panic;

thread_local!(static SILENCE_PANIC: Cell<bool> = Cell::new(false));

#[test]
fn test_panic_hook() {
    let prev = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        if !SILENCE_PANIC.with(|s| s.get()) {
            prev(info);
        }
    }));

    for i in 0..1000 {
        let _ = thread::spawn(move || {
            SILENCE_PANIC.with(|s| s.set(true));
            panic!("Panicked from thread: {}", i);
        }).join();
    }
    println!("All done!");
}

I then manually uploaded it to the emulator, and invoked it in a loop with the following Bash script (run from /data/tmp/work):

export LD_LIBRARY_PATH='.'
while ./collectionstests-b33d04f51899f1c2 catch_panic; do :; done

After running this for about 15-20 minutes, I stopped the loop. For reasons I don't yet understand, several of the spawned processes has segfaulted, wthout stopping the loop.

I then uploaded a static GDB binary from here to the emulator (this was the only way I could manage to get GDB to work).

I managed to obtain the following backtrace from my test program:

#0  0xb6c855a4 in __futex_syscall3 () from /system/lib/libc.so
#1  0xb6c7768c in __pthread_cond_timedwait_relative ()
   from /system/lib/libc.so
#2  0xb6c776ec in __pthread_cond_timedwait () from /system/lib/libc.so
#3  0xb6c7b726 in pthread_join () from /system/lib/libc.so
#4  0xb6d03810 in std::sys::unix::thread::Thread::join::h07b837055b9f571f ()
   from libstd-a3b039e1022c4e23.so
#5  0xb6e57ccc in _$LT$std..thread..JoinHandle$LT$T$GT$$GT$::join::hd81a7f2f5464521c ()
#6  0xb6e5e944 in collectionstests::catch_panic::test_panic_hook::h9f08377776546cb3 ()
#7  0xb6dc7ca8 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h3970079b2a2e8325 () from libtest-0c5e281140db6463.so
#8  0xb6d3361c in __rust_maybe_catch_panic () from libstd-a3b039e1022c4e23.so
#9  0xb6dce6d0 in test::run_test::run_test_inner::h59a42c89d1b0ae7a ()
   from libtest-0c5e281140db6463.so
#10 0xb6dce368 in test::run_test::hd576cc177e253177 ()
   from libtest-0c5e281140db6463.so
#11 0xb6dcb9fc in test::run_tests_console::h8730cd4311bcbbcb ()
   from libtest-0c5e281140db6463.so
#12 0xb6dc8284 in test::test_main::h343b975aae7c6b90 ()
   from libtest-0c5e281140db6463.so
#13 0xb6dc848c in test::test_main_static::hfd242c8df31b861a ()
   from libtest-0c5e281140db6463.so
#14 0xb6e57424 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hf8fbc2e6d4c74fb5 ()
#15 0xb6d29244 in std::panicking::try::do_call::h49216886d8653049 ()
   from libstd-a3b039e1022c4e23.so
#16 0xb6d3361c in __rust_maybe_catch_panic () from libstd-a3b039e1022c4e23.so
#17 0xb6d276d4 in std::panic::catch_unwind::h9c8f0a4ad65b2a9d ()
   from libstd-a3b039e1022c4e23.so
#18 0xb6cffa10 in std::rt::lang_start_internal::hc01076141fd50efb ()
   from libstd-a3b039e1022c4e23.so
#19 0xb6e75078 in main ()

From what I can see, the panic hook appears to be executing, and then somehow jumping back into the body of the loop. EDIT: I believe that this is actually showing the normal invocation of my test function.

Note that this was the only thread running. I'm not sure if this is related to the original issue, but it suggets that something weird is going on with the panicking threads.

My current hypothesis is that there's some sort of concurrency and/or codegen issue with the the std::panic::catch_unwind logic (maybe the try compiler intrinsic?).

Further evidence for panicking being involved:

Another run crashed with this log (I wasn't able to obtain a backtrace):

test vec::drain_filter_complex ... ok
test str::slice_index::simple_big ... ok
test str::test_unsafe_slice ... ok
test string::test_unsized_to_string ... ok
died due to signal 11

The test immediately following string::test_unsized_to_string is string::test_str_truncate_split_codepoint. This test is marked with #[should_panic], so it seems that simply pancking while a test is running may trigger the issue.

Thanks @Aaron1011!

We discussed this in the infra team meeting today and decided that we would disable the should_panic tests on Android. This'll basically be a #[cfg(not(target_os = "android"))] gate on each of those tests. Note that android is a tier-2 platform so we (the Rust team) ourselves don't plan to invest time in fixing it, but if someone wants to attempt to narrow this down further we'll gladly review patches.

I'm temporarily assigning this to myself -- I plan to add that cfg soon. Of course, if someone wants to take that on and save me some time, I'd be happy :)

I'm going to use the functionality added in https://github.com/rust-lang/rust/pull/58689 instead of a cfg-gate across all of std/test/alloc -- that seems quite a bit cleaner.

Posted #58900 and unassigning myself. That PR won't fix all of the problems here so it doesn't fix this bug (it doesn't properly inform rustdoc to skip should_panic tests) so it should not close this issue.

While this happened rarely in Azure Pipelines (at least in the recent months), I noticed it happens a lot on GitHub Actions. We should find a workaround before we start gating on GHA.

@Aaron1011 Do you still have instructions for reproducing this bug locally? I will look into this issue.

@Amanieu: Unfortunately, I don't think that I do (I really should have documented what I previously did...)

I'll see if I can reconstruct what I did, and post instructions.

I ran the liballoc tests in the arm-android container in a loop overnight, and couldn't reproduce the issue.

@pietroalbini: Could you link to some of the failed Github Actions jobs?

It turns out that I was accidentally running the liballoc test with my local config.toml, which had debug assertions enabled. With the config.toml that gets built by the CI scripts, I was able to reproduce the issue.

@Amanieu: Run crash.sh from this branch. This script runs the liballoc tests in a loop until a crash occurs.

I was finally able to obtain an android tombstone file: https://gist.github.com/Aaron1011/94e97e3b5a7ba62f024ce4fda8211bdc

This was quite difficult to obtain - the emulator appears to get force-killed when a test fails (the outer Docker contain shuts down with the emulator still running), which caused the tombstone written to /data/tombstones/ to get lost. I needed to modify the remote-test-server tool to not exit when a binary failed, which allowed me to attach to the running Docker container: https://github.com/Aaron1011/rust/commit/ae403a656dc4b20d1327318e3de520f96e1d1923

I came across the following code:
https://github.com/rust-lang/rust/blob/83f8c02eb92b0dfc46ae27cb71421d7bbcd7f30f/src/libstd/sys/unix/alloc.rs#L55-L76

I don't know if it's related, but it's something to keep in mind.

API level 9 means Android 2.3-2.3.3 released 9 years ago which seems overly conservative.
From what I could find posix_memalign was added by API level 16 which is Android 4.1.x released almost 8 years ago.
Bumping API level locally and enabling posix_memalign seems worth giving a try. Rust should probably consider defaulting to some newer API as well.

It might be worth switching it to plain malloc temporarily to see if this fixes the crash (even though it is technically wrong to ignore the alignment).

Unlike x86 on ARM alignment is very important. Normal load and store instructions can't access unaligned memory.

Unlike x86 on ARM alignment is very important. Normal load and store instructions can't access unaligned memory.

Sure but the default malloc alignment is enough for that.

I think it's going to be extremely difficult to debug this further in the android vm. I'm currently trying to reproduce this without android, using a combination of dlmalloc and collectionstest built for the armv7-unknown-linux-gnueabihf target. However, it's possible that there's something Android-specific going on here (e.g. the emulator, the QEMU arm backend, or some #[cfg(android)] code.

So, here's what I got so far. I've narrowed it down to this command line, which runs only 35 tests:

$ export RUST_TEST_THREADS=44
$ export RUSTC_BOOTSTRAP=1
$ test1/collectionstests-3293dcd521e26794 -Zunstable-options --exclude-should-panic string::test_
running 35 tests
test string::test_push ... ok
test string::test_push_str ... ok
test string::test_push_bytes ... ok
test string::test_replace_range ... ok
test string::test_pop ... ok
test string::test_into_boxed_str ... ok
test string::test_from_utf8_lossy ... ok
test string::test_from_utf8 ... ok
test string::test_from_utf16_lossy ... ok
test string::test_from_str ... ok
test string::test_from_iterator ... ok
test string::test_from_cow_str ... ok
test string::test_extend_ref ... ok
test string::test_drain ... ok
test string::test_add_assign ... ok
test string::test_replace_range_empty ... ok
test string::test_replace_range_inclusive_range ... ok
test string::test_replace_range_unbounded ... ok
test string::test_retain ... ok
test string::test_simple_types ... ok
test string::test_slicing ... ok
test string::test_split_off_ascii ... ok
test string::test_split_off_empty ... ok
test string::test_split_off_unicode ... ok
test string::test_str_add ... ok
test string::test_str_clear ... ok
test string::test_str_truncate ... ok
test string::test_str_truncate_invalid_len ... ok
test string::test_unsized_to_string ... ok
test string::test_utf16_invalid ... ok
test string::test_vectors ... ok
test string::test_from_utf16 ... ok
test string::test_reserve_exact ... ok
test string::test_try_reserve ... ok
test string::test_try_reserve_exact ... ok

test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 616 filtered out

Crashes are rare, but when it does crash it seems to always be in test_try_reserve_exact (confirmed by disassembling the address in the tombstone):

running 35 tests
test string::test_add_assign ... ok
test string::test_drain ... ok
test string::test_extend_ref ... ok
test string::test_from_cow_str ... ok
test string::test_from_iterator ... ok
test string::test_from_str ... ok
test string::test_from_utf16 ... ok
test string::test_from_utf16_lossy ... ok
test string::test_from_utf8 ... ok
test string::test_from_utf8_lossy ... ok
test string::test_into_boxed_str ... ok
test string::test_pop ... ok
test string::test_push_bytes ... ok
test string::test_push_str ... ok
test string::test_replace_range ... ok
test string::test_replace_range_empty ... ok
test string::test_replace_range_inclusive_range ... ok
test string::test_replace_range_unbounded ... ok
test string::test_reserve_exact ... ok
test string::test_retain ... ok
test string::test_simple_types ... ok
test string::test_slicing ... ok
test string::test_split_off_ascii ... ok
test string::test_split_off_empty ... ok
test string::test_split_off_unicode ... ok
test string::test_str_add ... ok
test string::test_str_clear ... ok
test string::test_str_truncate ... ok
test string::test_str_truncate_invalid_len ... ok
test string::test_unsized_to_string ... ok
test string::test_utf16_invalid ... ok
test string::test_vectors ... ok
Segmentation fault 

The segmentation fault is misleading, that's just what Android's libc uses for abort() instead of SIGABRT. The abort message is:

@@@ ABORTING: invalid address or address of corrupt block 0xb7bda648 passed to dlfree

From the looks of it, there is no problem in our code: this is purely a bug in Android's dlmalloc. My guess is that there is some integer overflow issue in dlmalloc's internal size calculations, but dependent on some internal malloc state (hence the non-determinism).

Looking at the memory dumps from the tombstone, we can see that the bug is somewhere in realloc:

Abort message: '@@@ ABORTING: invalid address or address of corrupt block 0xb8caf510 passed to dlfree'

memory near r7:
    b8caf4f8 00000000 00000000 00000000 00000000  
    b8caf508 00000000 00000000 00000000 8000000b  
    b8caf518 33323130 37363534 b8ca3938 00000001  
    b8caf528 00000018 00000006 00000000 00000000  
    b8caf538 000069d7 00000001 00000000 00000000  
    b8caf548 00000000 00000000 00000000 b2cf0f00  
    b8caf558 00000000 00000000 00000000 00000000  
    b8caf568 00000000 00000000 00000000 00000000  
    b8caf578 00000000 00000000 00000000 00000000  
    b8caf588 00000000 00000000 00000000 00000000  
    b8caf598 00000000 00000000 00000000 00000000  
    b8caf5a8 00000000 00000000 00000000 00000000  
    b8caf5b8 00000000 00000000 00000000 00000000  
    b8caf5c8 00000000 00000000 00000000 00000000  
    b8caf5d8 00000000 00000000 00000000 00000000  
    b8caf5e8 00000000 00000000 00000000 00000000  

Note the chunk prefix of 8000000b which says "this chunk has a length of 2GB + 8". This is clearly impossible since the address of the block (0xb8caf518) is in the second half of the address space.

Note that this bug is not an issue for modern Android versions since those use jemalloc instead of dlmalloc for the default libc allocator. In light of this I think we should just disable the try_reserve tests on Android.

Thanks for investigating this :heart:

Looking at the memory dumps from the tombstone, we can see that the bug is somewhere in realloc:

EDIT: I misunderstood how realloc_fallback is being used. The realloc function is in fact called by libstd in certain cases, so @Amanieu's analysis seems correct.

@Amanieu: How do you know it's caused by realloc? The tombstone file I captured contained the following backtrace for one of the threads (not the crashing one):

    #00  pc 0001c5a4  /system/lib/libc.so (__futex_syscall3+8)
    #01  pc 0000de88  /system/lib/libc.so
    #02  pc 0000f56d  /system/lib/libc.so (dlmalloc+48)
    #03  pc 000108a7  /system/lib/libc.so
    #04  pc 0000d9fb  /system/lib/libc.so (memalign+10)
    #05  pc 0006dcb0  /data/tmp/work/libstd-7d8e6950da20b76f.so (std::sys_common::alloc::realloc_fallback::h05f7cc89deafdb57+36)
    #06  pc 00073a50  /data/tmp/work/libstd-7d8e6950da20b76f.so (__rdl_realloc+40)
    #07  pc 0012a390  /data/tmp/work/test1/collectionstests-8c9066a5b2aa9a17 (alloc::raw_vec::RawVec$LT$T$C$A$GT$::reserve_exact::h5da8ae41586c5215+104)
    #08  pc 0014c50c  /data/tmp/work/test1/collectionstests-8c9066a5b2aa9a17 (collectionstests::vec::overaligned_allocations::h298719e516d2db4a+112)

It looks like Rust isn't actually calling into realloc from dlmalloc - instead, it's falling back to freeing and re-allocating memory. If realloc is getting called, then it's either being done by the standard library making an explicit call to an extern C function (which I think never happens for realloc), or it's being done by bionic libc.

Was this page helpful?
0 / 5 - 0 ratings