Describe the bug
Following the default build and test instructions doesn't seem to work on 64-bit Arm Ubuntu...
To Reproduce
root@nanopc-t4-1:~/fluent-bit-broke/build# bin/fluent-bit -i cpu -o stdout
Fluent Bit v1.1.2
Copyright (C) Treasure Data
[2019/06/05 13:25:55] [ info] [storage] initializing...
[2019/06/05 13:25:55] [ info] [storage] in-memory
[2019/06/05 13:25:55] [ info] [storage] normal synchronization mode, checksum disabled
[2019/06/05 13:25:55] [ info] [engine] started (pid=9119)
[2019/06/05 13:25:55] [ info] [sp] stream processor started
[engine] caught signal (SIGSEGV)
Aborted
Your Environment
Linux nanopc-t4-1 4.4.174-rk3399 #31 SMP Sun Feb 10 00:37:23 CET 2019 aarch64 aarch64 aarch64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
Additional context
I verified that the same code runs fine with a 32-bit Arm user space on a 64-bit kernel, so it's something to do with the 64-bit user space.
There are often issues specific to arm64 and LuaJIT, particularly around storing pointers in lightuserdata.
You may find this issue of interest: https://github.com/LuaJIT/LuaJIT/issues/49 as this has been a common and enduring issue, with some workarounds.
I have not dug into the code yet.
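For context on the lightuserdata concern: LuaJIT keeps values in 64-bit NaN-tagged slots, which as far as is commonly documented leaves 47 bits for a lightuserdata pointer, while arm64 Linux can hand out user-space addresses above that range. A minimal sketch of such a range check follows. The helper name `fits_lightuserdata_47` is hypothetical and is not part of LuaJIT or Fluent Bit.

```c
#include <stdint.h>

/* Hypothetical helper, not a LuaJIT API: returns 1 if the pointer value
 * fits in the low 47 bits, 0 if any higher bit is set. The 47-bit limit
 * is an assumption taken from the commonly described LuaJIT restriction. */
int fits_lightuserdata_47(const void *p)
{
    uint64_t addr = (uint64_t)(uintptr_t)p;
    return (addr >> 47) == 0;
}
```

A pointer that fails this check is the kind of value that could not be stored as lightuserdata under that limit.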
I managed to build it on arm64 using the following Dockerfile: https://github.com/danacr/fluent-bit-docker-image/blob/1.1/Dockerfile
Looks like the problem is the compiler. If I compile with clang-3.9, there are no issues:
$ export CC=clang-3.9
$ export CXX=clang++-3.9
@fujimotos has troubleshot and fixed the root cause of the issue; more details here:
https://github.com/edsiper/flb_libco/pull/4
Those changes in flb_libco will be merged into Fluent Bit shortly.
Thanks @edsiper - if you find that this is a novel bug for GCC, I'm happy to help relay it upstream; if it's one that's known already it would be good to chase down the original report.
The changes are already merged into Git master.
@vielmetti here is the report on GCC Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90907
Per their latest comments, I'm not sure whether it is worth investing more time in it. At least we got our fix :)
I see those comments @edsiper - it looks like we're in undefined territory here, worth keeping this in mind to test on future compiler releases that might or might not behave the same way.
Yep, I will add compiler info to the built binary for debugging purposes.
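One way to record compiler info in a binary is to build a string from the compiler's predefined macros at compile time. This is a minimal sketch under that assumption, not the change that went into Fluent Bit. The function name `build_compiler_info` is hypothetical.

```c
/* Hypothetical helper: describes the compiler that built this file.
 * __clang_version__ and __VERSION__ are predefined string-literal macros
 * in clang and GCC respectively. clang also defines __GNUC__, so it is
 * checked first. */
const char *build_compiler_info(void)
{
#if defined(__clang__)
    return "clang " __clang_version__;
#elif defined(__GNUC__)
    return "gcc " __VERSION__;
#else
    return "unknown compiler";
#endif
}
```

Printing the result at startup, for example with `printf("built with %s\n", build_compiler_info());`, would show in a crash report like the one above which compiler produced the binary.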
@edsiper @vielmetti
"it looks like we're in undefined territory here, worth keeping this in mind to test on future compiler releases that might or might not behave the same way"
After thinking about gcc's response last night, I posted a more robust
fix for the issue to https://github.com/edsiper/flb_libco/pull/6.
In essence, this avoids the undefined behaviour by declaring the section
to store the assembly code explicitly (and then popping the section back
after that).
In this way, we can be certain that co_switch_aarch64() gets stored in
the right section in the binary, so it should always work reliably.
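The idea of naming the section explicitly and then restoring the previous one can be sketched with the GNU assembler's `.pushsection` / `.popsection` directives in a top-level `__asm__` block. This is an illustration under stated assumptions, not the contents of the pull request: it assumes a GNU-compatible toolchain targeting ELF, and the function `answer_from_asm` is a hypothetical stand-in for co_switch_aarch64().

```c
/* Hypothetical stand-in for an assembly routine; it just returns 42. */
int answer_from_asm(void);

#if defined(__ELF__) && defined(__aarch64__)
__asm__(
    ".pushsection .text\n"      /* switch to .text, saving the current section */
    ".globl answer_from_asm\n"
    ".type answer_from_asm, %function\n"
    "answer_from_asm:\n"
    "    mov w0, #42\n"
    "    ret\n"
    ".size answer_from_asm, .-answer_from_asm\n"
    ".popsection\n"             /* restore whatever section was active before */
);
#elif defined(__ELF__) && defined(__x86_64__)
__asm__(
    ".pushsection .text\n"
    ".globl answer_from_asm\n"
    ".type answer_from_asm, @function\n"
    "answer_from_asm:\n"
    "    movl $42, %eax\n"
    "    ret\n"
    ".size answer_from_asm, .-answer_from_asm\n"
    ".popsection\n"
);
#else
/* Fallback for other targets so the sketch still builds. */
int answer_from_asm(void) { return 42; }
#endif
```

Because the assembly names its own section, its placement no longer depends on which section the compiler happened to be emitting into at that point in the translation unit.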