I've been working on a project called Rootbox that runs Alpine Linux inside a chroot, and I thought a great, easy way to show it off would be to statically compile a Crystal project inside of it.
Key word: thought
Of course, I quickly realized that libgc and libevent aren't available statically under Alpine Linux. I built them manually, running into a Boehm GC bug in the process. And then I made it work!! And then I realized that IT DOESN'T ACTUALLY WORK!!
Basically, here's how I built Boehm GC:
sudo apk add libatomic_ops-dev
git clone https://github.com/ivmai/bdwgc.git
cd bdwgc
./autogen.sh
export CFLAGS="-D_GNU_SOURCE -DNO_GETCONTEXT -DUSE_MMAP -DIGNORE_DYNAMIC_LOADING"
./configure --prefix=/usr --datadir=/usr/share/doc/gc --enable-cplusplus \
--disable-parallel-mark --disable-shared --enable-munmap
make
sudo cp .libs/libgc.a /usr/lib
and then libevent:
curl -Lfo libevent.tgz $LIBEVENT_URL
tar xvf libevent.tgz
cd $LIBEVENT_STEM
./configure --prefix=/usr --sysconfdir=/etc
make
sudo cp .libs/libevent.a /usr/lib
and now the broken program itself:
echo 'raise "hello!"' > x.cr
crystal build -o x x.cr --link-flags -static
Then, when I run it, I get this:
DevPC-LX:~/w$ ./x
Invalid memory access (signal 11) at address 0x0
Segmentation fault (core dumped)
DevPC-LX:~/w$
The unwind handler in libgcc_eh is crashing. Great. Then I tried using LLVM's libunwind instead. Results:
DevPC-LX:~/w$ crystal build -o x x.cr --link-flags "-static -L/home/user/w"
DevPC-LX:~/w$ ./x
Failed to raise an exception: END_OF_STACK
[0] ???
DevPC-LX:~/w$
If you want to reproduce in the Rootbox, here's how (I swear, this isn't just a plug ;):
Run:
rootbox init
sudo rootbox image.add 3.5
sudo rootbox box.new stack-bug -f url:https://hastebin.com/raw/exovisedun
This will compile...uhh...everything.
Now run:
sudo rootbox box.run stack-bug
This pops you inside a shell. The buggy binaries and source file are inside the bug directory (fitting name, eh?), e.g.:
DevPC-LX:~$ cd bug
DevPC-LX:~/bug$ ls
libgcc_eh.a x.cr x_libgcc x_libunwind
DevPC-LX:~/bug$ ./x_libgcc
Invalid memory access (signal 11) at address 0x0
Segmentation fault (core dumped)
DevPC-LX:~/bug$ ./x_libunwind
Failed to raise an exception: END_OF_STACK
[0] ???
DevPC-LX:~/bug$
C++ uses the same unwinding libraries but works perfectly when statically built.
Nice! You even made a PR to ivmai/bdwgc! Until the PR you sent to bdwgc is merged, should the GC_URL in the https://hastebin.com/raw/exovisedun script be your own fork/branch?
I had the "Failed to raise an exception: END_OF_STACK" issue when working on the ARM port. The problem was that we were raising from a libxml2 callback, but the libxml2 library (from Raspbian) didn't have unwind tables, so the unwinder failed to unwind the stack and aborted (or crashed).
@bcardiff A workaround is to define -DIGNORE_DYNAMIC_LOADING when building. (FWIW I tested this with the patched bdwgc, too, and it still failed.)
@ysbaddaden Well, the buggy program is literally just raise "123", so I'm not sure what could really go wrong...
The weirdest part is that if I build a dynamic executable but tell the linker to use all the static libraries, it works. It's just -static that fails...
I reproduced the issue easily. I can statically compile a fully working Crystal compiler (no GC issue). It can compile itself, its specs, shards, minitest... But it segfaults whenever an exception is raised, unable to unwind the stack. Trying to return a backtrace with p caller segfaults, too.
AFAIK it seems to happen here:
https://github.com/gcc-mirror/gcc/blob/gcc-6_3_0-release/libgcc/unwind-dw2.c#L1568
And more specifically on this line, a call to pthread_once:
https://github.com/gcc-mirror/gcc/blob/gcc-6_3_0-release/libgcc/gthr-posix.h#L699
Hypothesis: a conflict between the GC (Crystal uses the GC_pthread_* wrappers) and musl-libc that only happens when linked statically? We'll need a null garbage collector to verify that.
Though it may not be the pthread_once callback that actually segfaults. It could be init_dwarf_reg_size_table or dwarf_reg_size_table[0].
What I found is that the %rip register used to have a value, but suddenly becomes 0x0...
I could also try using the libunwind code from C++ when linked with the Boehm GC. (FWIW throwing C++ exceptions works perfectly, but I haven't tested C++ and the Boehm GC.)
FWIW: disabling boehm gc for a null gc fixes the segfault. We fail to link instruction pointers to symbols, but at least it won't crash anymore:
alpine ~ $ crystal bug.cr
["0x7f27edb580e6: __crystal_main at /home/julien/bug.cr 1:1",
"0x7f27edb67384: main at /home/julien/crystal-0.23.0/src/main.cr 12:15",
"0x7f27ed8ad950: __libc_start_main at ??"]
alpine ~ $ crystal bug.cr --link-flags -static
["0x7fe8955c59ab: ??? at ??",
"0x7fe8955c593a: ??? at ??",
"0x7fe8955c590a: ??? at ??",
"0x7fe8955c33e2: ??? at ??",
"0x7fe8955b4016: ??? at ??",
"0x7fe8955c32b4: ??? at ??",
"0x7fe89562714b: ??? at ??"]
alpine ~ $ crystal bug.cr --link-flags "-static -lunwind"
[]
This https://github.com/crystal-lang/crystal/issues/4719#issuecomment-324596938 piqued my curiosity, so I tried again:
Trying to unwind the stack in a static binary on alpine 3.6 still segfaults:
$ echo 'p caller' > bug.cr
$ bin/crystal run --static bug.cr
Invalid memory access (signal 11) at address 0x0
But if I statically compile Crystal itself on the same alpine 3.6, then it can unwind the stack... no segfaults:
$ CRYSTAL_PATH=$PWD/src crystal build src/compiler/crystal.cr \
--static -o .build/crystal -Dwithout_openssl -Dwithout_zlib
$ echo 'invalid call' > bug.cr
$ bin/crystal run --static bug.cr
Error in bug.cr:1: undefined local variable or method 'call'
invalid call
^~~~
I verified in a gdb session that the compiler does raise an exception and does unwind the stack, so it should segfault, but it doesn't.
I tried to run the minitest.cr test suite, statically linked, but it segfaults. Only the compiler seems safe. Maybe it has something to do with linking libstdc++?
There is something fishy happening here.
I really wish I could statically compile in Alpine and port the binary from the build container to a redistributable scratch-based container. I've noticed stack traces do not work, and using the Crystal Alpine package to build programs with -static requires more hand-holding. I get a stack trace with every HTTP request coming in to my binary, which is static and copied from another machine, but all debug symbols are missing, so I can't easily debug why things are failing.
Perhaps these things are all related? Static compilation in alpine fails unless using release flag, unwind is not working... etc
So I think this really isn't just Crystal: https://github.com/ldc-developers/ldc/issues/2341
I'm guessing musl does something fundamentally different regarding how the stack is laid out.
So I found out that simply require "llvm" solves this. It's ugly, but it works.
Specifically, require "llvm/lib_llvm" (and require "llvm/enums" to make the former compile) works.
I confirm that this issue is still there: END_OF_STACK when using -lunwind, and a segmentation fault without it, when an exception is raised. After a little investigation, the problem appears to occur when performing CallStack.new. Despite this, I don't see problems when not using static linking.
I believe the LLVM hack exposed above fixes the issue because we end up linking libstdc++
which implements . Statically linked C++ programs probably don't fail at raising an exception for this very reason (they link libstdc++).
Trying to unwind the stack in a C program will probably reproduce the bug.
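For anyone who wants to try that, here's a minimal sketch of what such a C reproduction could look like. It walks the stack through _Unwind_Backtrace from <unwind.h>, the unwinder interface that both libgcc and LLVM's libunwind provide. The names walk_state, on_frame and count_stack_frames are made up for illustration, and I'm not claiming this reproduces the crash, only that it exercises the unwinder from plain C:

```c
#include <stdint.h>
#include <stdio.h>
#include <unwind.h>

/* Hypothetical helper: state shared with the unwinder callback. */
struct walk_state {
    int frames;     /* frames visited so far */
    int max_frames; /* hard cap so a broken walk cannot run forever */
};

/* Called by the unwinder once per stack frame. */
static _Unwind_Reason_Code on_frame(struct _Unwind_Context *ctx, void *arg)
{
    struct walk_state *st = arg;
    uintptr_t ip = (uintptr_t)_Unwind_GetIP(ctx);

    /* Stop on a null instruction pointer or when the cap is reached. */
    if (ip == 0 || st->frames >= st->max_frames)
        return _URC_END_OF_STACK;

    fprintf(stderr, "[%d] %#lx\n", st->frames, (unsigned long)ip);
    st->frames++;
    return _URC_NO_REASON; /* keep walking */
}

/* Walks the current stack and returns the number of frames visited. */
__attribute__((noinline)) int count_stack_frames(void)
{
    struct walk_state st = { 0, 64 };
    _Unwind_Backtrace(on_frame, &st);
    return st.frames;
}
```

Call count_stack_frames() from main, then build the same file once dynamically and once with -static on a musl system and compare the two outputs.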
Sadly we can't just -lstdc++, but maybe we can use a symbol from libstdc++ to force linking against it, and see whether the issue is still present or not.
require "llvm" or anything involving static linking and LLVM ends up with:
/usr/lib/gcc/x86_64-alpine-linux-musl/6.4.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lLLVM-5.0
Compiling LLVM again just to get a libLLVM-5.0.a isn't really a solution. Here are some files that are used for static compilation, but there is no libLLVM-5.0.a nor libLLVM.a. Maybe this is just an incorrect name and a symlink will do the trick, I don't know.
We should find another way. I've also tried with libc++: it works like libstdc++ but doesn't solve the issue.
@j8r you need to install llvm-dev to get llvm-config which you need to use to get the right link flags to statically link llvm. Crystal asks llvm-config for static flags when you pass --static to the crystal compiler.
@RX14 Yes, that's what I've already done, installing llvm llvm-static llvm-dev:
llvm-config --libs
-lLLVM-5.0
@j8r are you passing --static to crystal build? If not, it won't work because llvm-config needs the --link-static flag to return the correct link flags for linking statically.
@RX14 It works, thanks! I was still trying with flags like -static, -lunwind, -lstdc++. Note that replacing -lstdc++ with -lc++ (libc++) when performing the cc linking doesn't work for now. Now let's add this to the docs :)
@ysbaddaden Did you manage to -lstdc++ on Alpine? It seems it can't find the lib.
This issue has led to some issues while trying to mitigate it (see #7480). I think we should make fixing this a priority because it affects all programs statically linked with musl-libc.
This isn't a priority: it only affects programs statically linked against musl-libc. Maybe choosing to distribute Crystal/Shards statically linked against musl-libc despite this issue wasn't a good idea in the first place.
The issue may be:
Trying to reproduce the issue in C would be interesting; looking at what LDC (the LLVM-based D compiler/runtime) does and how libgcc, libstdc++ and libc++ implement the feature would also be helpful. Reading the libgcc source code helped me fix unwinding the stack on ARM EHABI, for example.
Trying the https://musl.cc images could be interesting, too.
it only affects programs statically linked against musl-libc.
Yeah, but that's the only way to link a static libc and the main reason for linking against musl-libc anyway.
I think it's more about supporting musl-libc based Linux distributions, such as Alpine; allowing portable statically linked binaries comes second.
Using libexecinfo may help for generating backtraces.
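As a rough idea of what that would look like, here's a small sketch using backtrace() and backtrace_symbols() from <execinfo.h>. glibc ships these itself; on musl the assumption is that the header and library come from the libexecinfo package and that you link with -lexecinfo. print_backtrace and MAX_FRAMES are names made up for this example, and whether the result is any better than ??? in a static musl binary is exactly what would need testing:

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32 /* arbitrary cap chosen for this sketch */

/* Prints the current call stack to stderr and returns the frame count. */
int print_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    /* backtrace_symbols returns one malloc'd block; NULL on failure. */
    char **symbols = backtrace_symbols(frames, n);
    for (int i = 0; i < n; i++)
        fprintf(stderr, "from %s\n", symbols ? symbols[i] : "???");

    free(symbols);
    return n;
}
```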
Does this only fail on Alpine, I presume so?
@rdp Yes and no, because static linking is only supported with musl, and Alpine is the most popular distribution based on it.
I'm pretty sure it's not specific to Alpine, but fails on any dist when linking statically against musl.
So random fun, stack traces are still non-existent for statically linked binaries:
Unhandled exception: Errno: Error connecting to 'localhost:6379': Connection refused (Redis::CannotConnectError)
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
from ???
But if I run it in gdb it segfaults in the garbage collector:
$ gdb ./mubot
GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mubot...
(gdb) r
Starting program: /var/home/ryan/code/useless-bot/mubot
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7d0f270 in GC_find_limit_with_bound ()
(gdb) bt
#0 0x00007ffff7d0f270 in GC_find_limit_with_bound ()
#1 0x00007ffff7d0f329 in GC_init_linux_data_start ()
#2 0x00007ffff7d0e1d5 in GC_init ()
#3 0x00007ffff75859e3 in init () at /usr/share/crystal/src/gc/boehm.cr:127
#4 0x00007ffff79b6c34 in main (
argc=<error reading variable: Cannot access memory at address 0x7fff00000001>,
argv=0x7fffffffdb42) at /usr/share/crystal/src/crystal/main.cr:35
#5 0x00007ffff743612b in main (
argc=<error reading variable: Cannot access memory at address 0xf7d3178b00000001>,
argv=0x7fffffffdb42) at /usr/share/crystal/src/crystal/main.cr:115
(gdb)
which brings me full circle back to https://github.com/crystal-lang/crystal/issues/6934. However, if I link with libunwind (via -Wl,--allow-multiple-definition -lunwind), I still get an error here, except now it's SIGBUS:
gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
--Type <RET> for more, q to quit, c to continue without paging--
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mubot...
(gdb) r
Starting program: /var/home/ryan/code/useless-bot/mubot
Program received signal SIGBUS, Bus error.
0x00007ffff7d53270 in GC_find_limit_with_bound ()
(gdb) bt
#0 0x00007ffff7d53270 in GC_find_limit_with_bound ()
#1 0x00007ffff7d53329 in GC_init_linux_data_start ()
#2 0x00007ffff7d521d5 in GC_init ()
#3 0x00007ffff77c0418 in init () at /usr/share/crystal/src/gc/boehm.cr:127
#4 main (argc=<error reading variable: Cannot access memory at address 0x1>,
argv=0x7fffffffdb42) at /usr/share/crystal/src/crystal/main.cr:35
#5 main (argc=<error reading variable: Cannot access memory at address 0x1>,
argv=0x7fffffffdb42) at /usr/share/crystal/src/crystal/main.cr:115
(gdb)
EDIT: Wait no I'm just an idiot:
If the fault occurred in GC_find_limit, or with incremental collection enabled, this is probably normal. The collector installs handlers to take care of these. You will not see these unless you are using a debugger. Your debugger should allow you to continue. It's often preferable to tell the debugger to ignore SIGBUS and SIGSEGV ("handle SIGSEGV SIGBUS nostop noprint" in gdb, "ignore SIGSEGV SIGBUS" in most versions of dbx) and set a breakpoint in abort.
So with that in mind, if I break on __crystal_raise in LLDB and GDB it does show debugging symbols (this is after linking with LLVM's libunwind):
$ gdb ./mubot
GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mubot...
(gdb) handle SIGSEGV SIGBUS nostop noprint
Signal Stop Print Pass to program Description
SIGBUS No No Yes Bus error
SIGSEGV No No Yes Segmentation fault
(gdb) b __crystal_raise
Breakpoint 1 at 0x683f0: file /usr/share/crystal/src/raise.cr, line 191.
(gdb) r
Starting program: /var/home/ryan/code/useless-bot/mubot
[New LWP 28169]
[New LWP 28170]
[New LWP 28171]
[New LWP 28172]
[New LWP 28173]
[New LWP 28174]
[New LWP 28175]
Thread 1 "mubot" hit Breakpoint 1, __crystal_raise (unwind_ex=0x0) at /usr/share/crystal/src/raise.cr:191
191 {% unless flag?(:win32) %}
(gdb) bt
#0 __crystal_raise (unwind_ex=0x0) at /usr/share/crystal/src/raise.cr:191
#1 0x00007ffff77bdce5 in raise (exception=0xc6) at /usr/share/crystal/src/raise.cr:191
#2 0x00007ffff79e7e47 in initialize (self=0x125, host=<optimized out>, port=<optimized out>) at /usr/share/crystal/src/socket/tcp_socket.cr:73
#3 new:dns_timeout:connect_timeout (__arg0=0x7ffff7717b20, __arg1=6379) at /usr/share/crystal/src/socket/tcp_socket.cr:27
#4 0x00007ffff79e5053 in initialize (self=0x293) at /var/home/ryan/code/useless-bot/lib/redis/src/redis/connection.cr:30
#5 new (host=<optimized out>, port=<optimized out>) at /var/home/ryan/code/useless-bot/lib/redis/src/redis/connection.cr:8
#6 connect (self=0x7ffff64fff00) at /var/home/ryan/code/useless-bot/lib/redis/src/redis.cr:157
#7 0x00007ffff79619cf in initialize:url (self=0x292, url=<optimized out>) at /var/home/ryan/code/useless-bot/lib/redis/src/redis.cr:117
#8 new:url (__temp_4845=<optimized out>) at /var/home/ryan/code/useless-bot/lib/redis/src/redis.cr:95
#9 initialize (self=0x2b6, config=<optimized out>) at /var/home/ryan/code/useless-bot/src/services/cache.cr:8
#10 new:config (__temp_4844=<optimized out>) at /var/home/ryan/code/useless-bot/src/services/cache.cr:7
#11 resolve! () at /var/home/ryan/code/useless-bot/src/services.cr:13
#12 resolve! () at /var/home/ryan/code/useless-bot/src/modules.cr:18
#13 main () at /var/home/ryan/code/useless-bot/src/mubot.cr:38
#14 0x00007ffff77c042e in main_user_code (argc=0, argv=<error reading variable: Cannot access memory at address 0x1>) at /usr/share/crystal/src/crystal/main.cr:106
#15 main (argc=0, argv=<error reading variable: Cannot access memory at address 0x1>) at /usr/share/crystal/src/crystal/main.cr:92
#16 main (argc=0, argv=<error reading variable: Cannot access memory at address 0x1>) at /usr/share/crystal/src/crystal/main.cr:115
(gdb)
and LLDB:
$ lldb ./mubot
(lldb) target create "./mubot"
Current executable set to './mubot' (x86_64).
(lldb) b main
Breakpoint 1: where = mubot`main + 25, address = 0x000000000006b409
(lldb) b __crystal_raise
Breakpoint 2: where = mubot`__crystal_raise + 1, address = 0x00000000000683f1
(lldb) r
Process 28267 launched: '/home/ryan/code/useless-bot/mubot' (x86_64)
Process 28267 stopped
* thread #1, name = 'mubot', stop reason = breakpoint 1.1
frame #0: 0x00007ffff77c0409 mubot`main at boehm.cr:124:5
121 end
122
123 def self.init
-> 124 {% unless flag?(:win32) %}
125 LibGC.set_handle_fork(1)
126 {% end %}
127 LibGC.init
(lldb) pr h -n false -s false -p true SIGSEGV SIGBUS
NAME PASS STOP NOTIFY
=========== ===== ===== ======
SIGSEGV true false false
SIGBUS true false false
(lldb) c
Process 28267 resuming
Process 28267 stopped
* thread #1, name = 'mubot', stop reason = breakpoint 2.1
frame #0: 0x00007ffff77bd3f1 mubot`__crystal_raise at raise.cr:191:1
188 end
189 {% end %}
190
-> 191 {% unless flag?(:win32) %}
192 # :nodoc:
193 @[Raises]
194 fun __crystal_raise(unwind_ex : LibUnwind::Exception*) : NoReturn
(lldb) bt
* thread #1, name = 'mubot', stop reason = breakpoint 2.1
* frame #0: 0x00007ffff77bd3f1 mubot`__crystal_raise at raise.cr:191:1
frame #1: 0x00007ffff77bdce5 mubot`*raise<Errno>:NoReturn at raise.cr:191:1
frame #2: 0x00007ffff79e7e47 mubot`*TCPSocket::new:dns_timeout:connect_timeout<String, Int32, (Time::Span | Nil), (Time::Span | Nil)>:TCPSocket at tcp_socket.cr:73:15
frame #3: 0x00007ffff79e5053 mubot`*Redis#connect:(Array(Redis::RedisValue) | Int64 | Redis::Future | String | Nil) at connection.cr:30:21
frame #4: 0x00007ffff79619cf mubot`*main:Nil at redis.cr:117:5
frame #5: 0x00007ffff77c042e mubot`main at main.cr:106:5
frame #6: 0x00007ffff7d727a9 mubot`libc_start_main_stage2(main=(mubot`main), argc=1, argv=0x00007fffffffd7b8) at __libc_start_main.c:94:2
frame #7: 0x00007ffff77b769b mubot`_start + 22
(lldb)
This works with both LLVM's libunwind and GCC default unwind system.
@refi64 The segfaults in GC_init are expected. This is how bdwgc determines the start address of the current stack on musl-libc: it sets a segfault handler, then reads from the stack until the segfault (on glibc it relies on some specific symbols instead). Just continue the program execution or compile with -Dgc_none.
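To make that "read until it faults" idea concrete, here's a simplified sketch of the technique. This is not bdwgc's actual code: find_limit, on_fault and probe_demo are names invented for the example, and instead of probing the real stack it probes a small mmap'd region whose last page is made unreadable, so the result is predictable:

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

/* Fault handler: jump back into find_limit instead of crashing. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Reads one byte per page starting at `start` and returns the address of
 * the first page that faults (or the end of the range if none does). */
static char *find_limit(char *start, size_t page, size_t max_pages)
{
    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    volatile size_t i = 0; /* volatile so the value survives siglongjmp */
    if (sigsetjmp(probe_env, 1) == 0) {
        for (; i < max_pages; i++) {
            volatile char c = *(volatile char *)(start + i * page);
            (void)c;
        }
    }

    /* Restore whatever handlers were installed before probing. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return start + i * page;
}

/* Maps three pages, protects the third, and counts the readable ones. */
size_t probe_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *region = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return 0;
    mprotect(region + 2 * page, page, PROT_NONE);

    char *limit = find_limit(region, page, 16);
    size_t readable = (size_t)(limit - region) / page;
    munmap(region, 3 * page);
    return readable;
}
```

probe_demo() should report two readable pages. Run it under gdb without the handle SIGSEGV SIGBUS nostop noprint setting and the debugger stops on the deliberate fault, which is the same thing that looked like a crash in GC_find_limit_with_bound above.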
@ysbaddaden Indeed, hence the edit at the end of the original comment 😅 The second one got the backtraces properly, you can actually see in the output where I ignored those signals