Upstream issue: https://bugs.llvm.org/show_bug.cgi?id=36005
The version of libunwind used in the x86_64-unknown-linux-musl (and possibly the i686-...-musl one too?) standard library has a bug where it will sometimes walk off the end of the segment containing the .eh_frame section and segfault.
I've struggled to reproduce this except in some proprietary code, but the backtrace looks like:
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000fa733b in libunwind::LocalAddressSpace::get32(unsigned long) ()
(gdb) bt
#0 0x0000000000fa733b in libunwind::LocalAddressSpace::get32(unsigned long) ()
#1 0x0000000000faa0a2 in libunwind::CFI_Parser<libunwind::LocalAddressSpace>::findFDE(libunwind::LocalAddressSpace&, unsigned long, unsigned long, unsigned int, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info*, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*) ()
#2 0x0000000000fa983d in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) ()
#3 0x0000000000fa923d in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister(bool) ()
#4 0x0000000000fa8fff in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step() ()
#5 0x0000000000fa820e in unw_step ()
#6 0x0000000000fa6ac0 in _Unwind_Backtrace ()
There's a proposed fix in the linked LLVM issue, which can possibly be patched into libunwind before building the musl target if taking an updated libunwind isn't possible.
I ran into this as well.
There doesn't seem to be any movement on the LLVM issue since I raised it, not even an assignee. Do you know if this is normal behaviour for them? Or if I've mis-raised the issue?
Running into this as well at OneSignal
I can reproduce this on latest stable on an open source project:
docker run -it ubuntu:18.04 bash
apt update
apt install curl git gcc make musl-tools file
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
rustup target add x86_64-unknown-linux-musl
git clone https://github.com/mozilla/sccache.git
cd sccache
export TARGET=x86_64-unknown-linux-musl && export OPENSSL_DIR=/openssl-musl
./scripts/travis-musl-openssl.sh
cargo build --target x86_64-unknown-linux-musl
RUST_LOG=sccache=debug RUST_BACKTRACE=1 SCCACHE_NO_DAEMON=1 SCCACHE_START_SERVER=1 $(pwd)/target/x86_64-unknown-linux-musl/debug/sccache &
RUST_LOG=sccache=debug $(pwd)/target/x86_64-unknown-linux-musl/debug/sccache gcc -c src/test/test.c -o /tmp/test.o
RUST_LOG=sccache=debug $(pwd)/target/x86_64-unknown-linux-musl/debug/sccache gcc -c src/test/test.c -o /tmp/test.o
On the second run of the final command, the background process will segfault.
After much investigation, I think there are two bugs here: one is that LLVM's unwinder will consider unreadable memory as part of the .eh_frame section, and the other is that the rust compiler creates invalid (unusual?) unwind information.
The crash specifically happens when trying to find unwind information for the frame above main (not crate::main but the "c-runtime" main). Since the frames above the C entry point are provided by the runtime (musl in this case), they are contained in crti.o/crt1.o, which are shipped as part of rust's musl-targeting stdlib. The provided object files have no unwind information in them:
$ readelf -S ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-musl/lib/crti.o
There are 19 section headers, starting at offset 0x4a0:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000000000 00000040
0000000000000000 0000000000000000 AX 0 0 1
[ 2] .data PROGBITS 0000000000000000 00000040
0000000000000000 0000000000000000 WA 0 0 1
[ 3] .bss NOBITS 0000000000000000 00000040
0000000000000000 0000000000000000 WA 0 0 1
[ 4] .init PROGBITS 0000000000000000 00000040
0000000000000001 0000000000000000 AX 0 0 1
[ 5] .fini PROGBITS 0000000000000000 00000041
0000000000000001 0000000000000000 AX 0 0 1
[ 6] .note.GNU-stack PROGBITS 0000000000000000 00000042
0000000000000000 0000000000000000 0 0 1
[ 7] .debug_line PROGBITS 0000000000000000 00000042
0000000000000056 0000000000000000 0 0 1
[ 8] .rela.debug_line RELA 0000000000000000 000002e0
0000000000000030 0000000000000018 I 17 7 8
[ 9] .debug_info PROGBITS 0000000000000000 00000098
0000000000000049 0000000000000000 0 0 1
[10] .rela.debug_info RELA 0000000000000000 00000310
0000000000000048 0000000000000018 I 17 9 8
[11] .debug_abbrev PROGBITS 0000000000000000 000000e1
0000000000000012 0000000000000000 0 0 1
[12] .debug_aranges PROGBITS 0000000000000000 00000100
0000000000000040 0000000000000000 0 0 16
[13] .rela.debug_arang RELA 0000000000000000 00000358
0000000000000048 0000000000000018 I 17 12 8
[14] .debug_ranges PROGBITS 0000000000000000 00000140
0000000000000040 0000000000000000 0 0 16
[15] .rela.debug_range RELA 0000000000000000 000003a0
0000000000000060 0000000000000018 I 17 14 8
[16] .shstrtab STRTAB 0000000000000000 00000400
000000000000009f 0000000000000000 0 0 1
[17] .symtab SYMTAB 0000000000000000 00000180
0000000000000150 0000000000000018 18 12 8
[18] .strtab STRTAB 0000000000000000 000002d0
000000000000000d 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
This in and of itself isn't a huge problem: the libunwind code stops trying to unwind when it can't find unwind information for the next frame (https://github.com/llvm-mirror/libunwind/blob/release_39/src/UnwindCursor.hpp#L1323-L1324), so it should stop after main, which would be fine so long as we can safely fail to find the unwind information.
The issue is that libunwind's logic for searching for the FDE (frame description entry) for the parent frame is pretty all-encompassing and runs in many phases, and one of these crashes (sometimes) if the searched-for address is not present in the unwind information. At a high level the process looks like (https://github.com/llvm-mirror/libunwind/blob/release_39/src/UnwindCursor.hpp#L1196):
For Rust x86_64 musl binaries, the compiler has provided DWARF unwind information, so we fall into the second bullet (https://github.com/llvm-mirror/libunwind/blob/release_39/src/UnwindCursor.hpp#L866):
- .eh_frame_hdr section for an index for the frame - not present
- .eh_frame section for the address of interest - CRASH!
Digging further, we see that the logic for scanning the .eh_frame section looks like (https://github.com/llvm-mirror/libunwind/blob/release_39/src/DwarfParser.hpp#L175):
In each loop we check that the newly found CIE entry is in a reasonable place (between ehSectionStart and ehSectionEnd). If it's not, bail out for safety and fail the lookup.
Unfortunately, as described in the linked issue, ehSectionEnd is set too large (it's set to ehSectionStart + **segmentLength**, which is almost always wrong), so the unwind code will step right off the end of the .eh_frame section and into one of:
- .gcc_except_table (LSDA) - This starts with a signature that is approximately 0x9c9bff, which is bigger than the error introduced by the ehSectionEnd miscalculation, so (after failing to parse the LSDA as a CFI entry) it jumps forward out of the range ehSectionStart..ehSectionEnd and bails out because of the "reasonable location" check I mentioned above.
- .gcc_except_table (if the .eh_frame did not have a length that's a multiple of 32 and thus there's padding between the sections). In this case when it tries to read the next entry one of three things happens:
  - u32) and we're in a similar position to the case where there's no padding: fail to parse the CFI entry, try to jump to the next one, realise we're jumping out of the allowed range and bail out for safety.
  - u32, in this case, after the failed parse, the unwinder jumps far enough to be off the end of the loaded segment, but not far enough to be past the ehSectionEnd, so the sanity check passes and the unwinder attempts to read the next CFI entry, reads a 32-bit number from an invalid address and... :boom:
  - u32, and the unwinder jumps into a random part of the LSDA and we're back in a situation that's functionally similar to when we're reading the padding bytes, and the same three options are still applicable in the next lookup.
So, what's the bug(s) here?
- end-of-records marker at the end of .eh_frame
Either of these would fix this issue, probably both should be done?
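The failure mode described above can be modelled with a toy walker. This is a minimal sketch, not libunwind's code: it only follows the 32-bit length prefix of each record, and a read outside the byte slice stands in for the segfault. The names `Walk`, `walk`, `read_u32` and `demo_memory` are made up for illustration.

```rust
/// Outcome of walking a simulated `.eh_frame` section.
#[derive(Debug, PartialEq)]
enum Walk {
    /// Hit a zero-length record (the terminator).
    Terminator,
    /// The cursor reached the claimed end of the section.
    ReachedEnd,
    /// Tried to read outside "mapped" memory at this offset (the segfault).
    Fault(usize),
}

/// Read a little-endian u32, or None if the read leaves the mapped bytes.
fn read_u32(mapped: &[u8], off: usize) -> Option<u32> {
    let end = off.checked_add(4)?;
    let b = mapped.get(off..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walk length-prefixed records from `start` until `claimed_end`.
fn walk(mapped: &[u8], start: usize, claimed_end: usize) -> Walk {
    let mut p = start;
    while p < claimed_end {
        let len = match read_u32(mapped, p) {
            Some(len) => len as usize,
            None => return Walk::Fault(p),
        };
        if len == 0 {
            return Walk::Terminator;
        }
        p += 4 + len; // skip the length field and the record body
    }
    Walk::ReachedEnd
}

/// Two 8-byte records (16 bytes of pretend `.eh_frame`), an optional
/// terminator, then 8 bytes standing in for whatever follows the section.
fn demo_memory(with_terminator: bool) -> Vec<u8> {
    let mut mem = Vec::new();
    for _ in 0..2 {
        mem.extend_from_slice(&4u32.to_le_bytes());
        mem.extend_from_slice(&[0xAA; 4]);
    }
    if with_terminator {
        mem.extend_from_slice(&0u32.to_le_bytes());
    }
    mem.extend_from_slice(&100u32.to_le_bytes()); // junk that parses as length 100
    mem.extend_from_slice(&[0u8; 4]);
    mem
}

fn main() {
    let mem = demo_memory(false);
    println!("{:?}", walk(&mem, 0, 16)); // correct section end
    println!("{:?}", walk(&mem, 0, 4096)); // end derived from the segment length
    println!("{:?}", walk(&demo_memory(true), 0, 4096)); // terminator present
}
```

With the correct end the walk stops cleanly. With the oversized end the junk after the section is read as a length, the cursor lands outside the mapped bytes while still being below the claimed end, and the next read faults. A terminator stops the walk regardless of how wrong the end is.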
Ah, there's another possible route for the crash:
- ehSectionEnd leading to an invalid read and a :boom:
I've run into this issue as well. What's the status of this?
I've been looking at a proper fix for this and neither of the two paths I proposed above is yielding simple results:
- .eh_frame nor .eh_frame_ptr section contains a length field so we're not simply told the length
- /proc/self/exe) and parse the section table? Might not be readable to the current user. Might not even exist (not sure if this is possible, can you create a linux executable from an in-memory image?)
- .eh_frames section
- rustc allows you to bring your own linker
- ld and gold both seem inconsistent about whether they add terminator entries, sometimes they do, sometimes they don't
Another option would be for LLVM to skip walking the .eh_frame section if the lookup in the .eh_frame_ptr fails. It's unlikely to ever reveal anything new, and it seems that this section is just un-walkable with any kind of reliability.
Here are two tested workarounds, neither very pretty, but both work:
- rustflags = ["-C", "-Wl,--verbose"] to your Cargo config
Find the lines:
.eh_frame : ONLY_IF_RO { KEEP (*(.eh_frame)) *(.eh_frame.*) }
...
.eh_frame : ONLY_IF_RW { KEEP (*(.eh_frame)) *(.eh_frame.*) }
Change them to:
.eh_frame : ONLY_IF_RO { KEEP (*(.eh_frame)) *(.eh_frame.*) LONG(0x0) }
...
.eh_frame : ONLY_IF_RW { KEEP (*(.eh_frame)) *(.eh_frame.*) LONG(0x0) }
(0x00000000 is the CIE terminator)
- rustflags = ["-C", "-Wl,-T<script>"] in Cargo config (probably under the Musl target)
Create an object containing just the .eh_frame terminator:
; Create a terminator entry for the `.eh_frame` section of rust binaries.
;
; See https://github.com/rust-lang/rust/issues/47551 for details
;
; You can build this with:
;
; nasm -f elf64 eh_frame_terminator.asm
;
section .eh_frame
; The terminator is a 0-length CIE, the first field of which is the length
; as a 32-bit number.
dd 0x00000000
2. Add rustflags = ["-C", "link-args=-Wl,<path/to/>eh_frame_terminator.o"] to your Cargo config (probably under the Musl target)
Hi! Is this still an issue? If so, is there perhaps a simpler way to reproduce this than the instructions linked above?
The setup that's needed is:
- .text and .eh_frame)
- .eh_frame section needs to be deep into the segment
- u32 of the .gcc_except_table section
So something like the following should trigger the fault:
#[derive(Clone, Copy)]
struct Foo {
array: [u64; 10240],
}
impl Foo {
const fn new() -> Self {
Self {
array: [0x1122_3344_5566_7788; 10240]
}
}
}
static BAR: [Foo; 10240] = [Foo::new(); 10240];
fn main() {
let bt = backtrace::Backtrace::new();
println!("Hello, world! {:?}", bt);
println!("{:x}", BAR[0].array[0]);
}
This builds a huge .rodata section, which lives before the .eh_frame section so should lead to the crash.
Hedging words because I can't reproduce the issue on my development machine (I run out of RAM and the build gets OOM-killed).
Hmm, that doesn't seem to be sufficient (tested on the current stable and nightly). I get this output when building and running this on Arch Linux:
Hello, world! stack backtrace:
0: unwind_repro::main
at src/main.rs:19
1: std::rt::lang_start::{{closure}}
at /rustc/5c5b8afd80e6fa1d24632153cb2257c686041d41/src/libstd/rt.rs:61
2: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:48
std::panicking::try::do_call
at src/libstd/panicking.rs:287
3: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:86
4: std::panicking::try
at src/libstd/panicking.rs:265
std::panic::catch_unwind
at src/libstd/panic.rs:395
std::rt::lang_start_internal
at src/libstd/rt.rs:47
5: std::rt::lang_start
at /rustc/5c5b8afd80e6fa1d24632153cb2257c686041d41/src/libstd/rt.rs:61
6: main
1122334455667788
I also get the same output when using std::backtrace::Backtrace::force_capture() to capture the backtrace.
I noticed the wrong ehSectionEnd calculation independently and filed https://bugs.llvm.org/show_bug.cgi?id=46829.
It looks like the gcc driver on Alpine adds the /usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/crtendS.o file to the link, which includes a terminator in .eh_frame. It seems that the rustc driver doesn't add something like this? (LLVM has a compiler-rt/lib/crt/crtend.c that has an .eh_frame terminator, so maybe that's relevant to rustc.)
FWIW, it looks like LLVM's LLD linker automatically adds a terminator to the end of .eh_frame, even if one isn't present in the linker inputs, so that could be a workaround:
$ rustc hello.rs && objdump -Wf hello | grep ZERO
$ rustc hello.rs -C link-args=-fuse-ld=lld && objdump -Wf hello | grep ZERO
00004d50 ZERO terminator
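For checking a binary without objdump, the same question ("is there a zero terminator?") can be answered by walking the raw section bytes. A rough sketch, assuming you have already extracted exactly the bytes of .eh_frame from a little-endian ELF (for example with objcopy); `find_terminator` and `record` are hypothetical helper names, and only the length prefixes are interpreted.

```rust
/// Return the offset of the zero-length terminator record, if any.
/// `section` must hold exactly the bytes of `.eh_frame` (little-endian).
fn find_terminator(section: &[u8]) -> Option<usize> {
    let mut p = 0usize;
    loop {
        // Running out of bytes before seeing a zero length means "no terminator".
        let end = p.checked_add(4)?;
        let b = section.get(p..end)?;
        let len32 = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
        if len32 == 0 {
            return Some(p);
        }
        let (header, body) = if len32 == 0xffff_ffff {
            // 64-bit DWARF: the real length follows in the next 8 bytes.
            let raw = section.get(end..end.checked_add(8)?)?;
            let mut buf = [0u8; 8];
            buf.copy_from_slice(raw);
            (12usize, u64::from_le_bytes(buf) as usize)
        } else {
            (4usize, len32 as usize)
        };
        p = p.checked_add(header)?.checked_add(body)?;
    }
}

/// Build a fake record: 32-bit length followed by that many filler bytes.
fn record(body_len: u32) -> Vec<u8> {
    let mut v = body_len.to_le_bytes().to_vec();
    v.resize(4 + body_len as usize, 0xAA);
    v
}

fn main() {
    let mut section = [record(8), record(4)].concat();
    println!("{:?}", find_terminator(&section)); // no terminator yet
    section.extend_from_slice(&[0; 4]);
    println!("{:?}", find_terminator(&section)); // terminator appended
}
```

The offset returned for a real section should correspond to the one objdump prints next to "ZERO terminator", though I have only exercised this on synthetic bytes.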
I think libunwind should stop scanning .eh_frame when it finds unwind info from .eh_frame_hdr (i.e. the GNU_EH_FRAME segment).
Rustc's musl-targeting link includes:
crt1.o crti.o [rust/native objects...] crtn.o
I don't exactly know where the crtX.o files come from (they're shipped with the compiler toolchain, e.g. mine are in ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-musl/lib/self-contained/) but looking at the musl codebase, there's no terminator in crtn.o:
https://git.musl-libc.org/cgit/musl/tree/crt/x86_64/crtn.s
.section .init
pop %rax
ret
.section .fini
pop %rax
ret
_(Intriguingly there's also no crt1.s in the musl codebase, suggesting Rustc gets these objects from somewhere else?)_
I wouldn't expect to see a crtbeginS.o or crtend(S).o since the rustc musl target doesn't include a C++ runtime and musl libc doesn't define either of these objects for the C runtime either.
Intriguingly there's also no crt1.s in the musl codebase...
It looks like musl's crt1 is a C file, not assembly. crt/{S,r,}crt1.c
When crtbegin and crtend were added to LLVM (https://reviews.llvm.org/D28791), there was some discussion about which project is responsible for which aspects of the CRT begin/end files. Apparently the status quo on Linux is that:
That is the case on my gLinux (i.e. Debian) system:
The .eh_frame terminator is usually in crtend, which is part of libgcc (not glibc or musl). LLVM's compiler-rt also provides crtbegin/crtend files.
The gcc and clang drivers link a crtend object even for C programs. On Alpine, the gcc driver links a libgcc crtend object.
It looks like rustc typically invokes the cc driver, but when it's targeting musl, it enters a mode where it provides the start files explicitly, and in this mode, it doesn't include crtbegin/crtend. For Alpine, I see -nostartfiles, rcrt1.o, crti.o, and crtn.o. For rustc/glibc, I don't see any of those flags, so the cc driver uses the default CRT begin/end/i/n/1 files.
I see that rustc documents a lot of this at compiler/rustc_target/src/spec/crt_objects.rs
//! Unlike native toolchains, rustc only currently adds the libc's objects during linking,
//! but not gcc's. As a result rustc cannot link with C++ static libraries (#36710)
//! when linking in self-contained mode.
FWIW, it looks like the MinGW rustc target has rsbegin.rs and rsend.rs files. rsend.rs has the .eh_frame terminator.
The unwinder needs the .eh_frame terminator, because the .eh_frame_hdr search table is optional, and because .eh_frame_hdr has the start address of .eh_frame, but not its size. (https://reviews.llvm.org/D86256 / https://reviews.llvm.org/D87750). I don't think it's reasonable to open the ELF file indicated by dlpi_name -- in principle, that duplicates the work done by the loader, and I can think of many specific things that wouldn't work. (e.g. the vdso, the executable w/glibc, Android's android_dlopen_ext and ability to map DSOs from zip files, unlinked files, etc).
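To make the "start address but no size" point concrete, here is a small sketch that decodes the fixed prefix of .eh_frame_hdr. The field order (version, three encoding bytes, eh_frame_ptr, fde_count) follows the Linux Standard Base description of the section. The sketch assumes the encodings linkers commonly emit (pcrel|sdata4 for the pointer, udata4 for the count) and rejects anything else; `EhFrameHdr` and `parse_eh_frame_hdr` are illustrative names, not libunwind's.

```rust
/// Fixed-size prefix of `.eh_frame_hdr`, decoded for the common encodings.
#[derive(Debug, PartialEq)]
struct EhFrameHdr {
    /// Absolute address where `.eh_frame` starts.
    eh_frame_start: u64,
    /// Number of entries in the binary search table.
    fde_count: u32,
}

const DW_EH_PE_UDATA4: u8 = 0x03;
const DW_EH_PE_PCREL_SDATA4: u8 = 0x1b; // DW_EH_PE_pcrel | DW_EH_PE_sdata4

/// `hdr_addr` is the address at which `bytes` is mapped.
fn parse_eh_frame_hdr(bytes: &[u8], hdr_addr: u64) -> Option<EhFrameHdr> {
    let b = bytes.get(..12)?;
    let (version, ptr_enc, count_enc) = (b[0], b[1], b[2]);
    if version != 1 || ptr_enc != DW_EH_PE_PCREL_SDATA4 || count_enc != DW_EH_PE_UDATA4 {
        return None; // other encodings are out of scope for this sketch
    }
    let rel = i32::from_le_bytes([b[4], b[5], b[6], b[7]]);
    let fde_count = u32::from_le_bytes([b[8], b[9], b[10], b[11]]);
    // pcrel values are relative to the address of the encoded field itself,
    // which sits 4 bytes into the header.
    let eh_frame_start = hdr_addr.wrapping_add(4).wrapping_add(rel as i64 as u64);
    Some(EhFrameHdr { eh_frame_start, fde_count })
}

fn main() {
    // Synthetic header: version 1, common encodings, pointer +0x100, 3 FDEs.
    let mut hdr = vec![1, DW_EH_PE_PCREL_SDATA4, DW_EH_PE_UDATA4, 0x3b];
    hdr.extend_from_slice(&0x100i32.to_le_bytes());
    hdr.extend_from_slice(&3u32.to_le_bytes());
    println!("{:?}", parse_eh_frame_hdr(&hdr, 0x2000));
}
```

Nothing in the decoded prefix says where .eh_frame ends, so a linear scan has to rely on a terminator or on bounds obtained some other way.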
Using self-contained mode is not generally recommended for serious work.
If the compiler heuristics enable it by default, then it can be explicitly disabled with -C link-self-contained=no, then gcc will be used for linking without -nostartfiles and will find all the necessary CRT objects by itself.
Then it's important to make sure that gcc for the right target (musl in this case) is used by rustc.
If the compiler heuristics enable it by default
Here's a FIXME for improving the heuristic for musl targets:
https://github.com/rust-lang/rust/blob/6d3acf5129767db78a3d9d62e814ec86b8870d75/compiler/rustc_codegen_ssa/src/back/link.rs#L1240-L1243
@petrochenkov Thanks for that, very interesting. I think there are two separate use cases for musl-targeting builds:
- self-contained and should let the toolchain sort it out for us though I can't see an easy way to tell if a real toolchain is available
- musl-gcc no longer works as the "linker" for musl targets, since it doesn't support -pie which is enabled in rustc as of the last release)
I'm worrying about the latter case and I _think_ from above the fix for that is:
- rsend.rs to add the terminator in linux-musl builds as well as windows-gnu ones
- rsend.o and include it in the x86_64-unknown-linux-musl target package alongside the crtX.o (so rustup etc fetch it)
- rsend.o before crtn.o
Some side questions:
- rsbegin.rs too? (I think not, but I'm no expert :shrug:)
- i686 musl?
In the latter case it would make sense to me for rustc to invoke an actual linker (ld/lld/gold/etc) since at the moment it goes via a compiler but has to tell the compiler to not do anything and just to pass the link args to the linker anyway. This is off topic for this discussion though.
If we need to improve the self-contained mode specifically, then I think we can just ship the begin/end objects and link to them like gcc does obsoleting the comment cited in https://github.com/rust-lang/rust/issues/47551#issuecomment-697132667. That appears to be the simplest solution.
@bossmc
If that doesn't work out (e.g. due to licensing), then the necessary parts can be added to rsend.rs and linked as rsend.o, as you suggested above.
Which crtbegin/crtend objects were you thinking? gcc's? LLVM's? If you're right and it's the compiler that ships the begin/end then maybe it's right for rustc to have its own (though the link might contain code compiled by random compilers (from build.rs scripts) as well as code compiled by rustc, so which is "the compiler" for the sake of this discussion)?
@bossmc
gcc's? LLVM's?
I don't know, but looks like they should be compatible.
At least on Ubuntu both gcc and clang link to gcc's objects.
The gcc ones can be just copied by rustbuild if the license allows, the LLVM ones can be used if we need to build them by ourselves (they are a part of compiler-rt).