Fluent-bit: SIGSEGV on arm64 (following build instructions on 1.1)

Created on 5 Jun 2019 · 11 Comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
Following the default build and test instructions doesn't seem to work on 64-bit ARM Ubuntu...

To Reproduce

root@nanopc-t4-1:~/fluent-bit-broke/build# bin/fluent-bit -i cpu -o stdout
Fluent Bit v1.1.2
Copyright (C) Treasure Data

[2019/06/05 13:25:55] [ info] [storage] initializing...
[2019/06/05 13:25:55] [ info] [storage] in-memory
[2019/06/05 13:25:55] [ info] [storage] normal synchronization mode, checksum disabled
[2019/06/05 13:25:55] [ info] [engine] started (pid=9119)
[2019/06/05 13:25:55] [ info] [sp] stream processor started
[engine] caught signal (SIGSEGV)
Aborted

Your Environment

Linux nanopc-t4-1 4.4.174-rk3399 #31 SMP Sun Feb 10 00:37:23 CET 2019 aarch64 aarch64 aarch64 GNU/Linux

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

Additional context

bug fixed

All 11 comments

Verified that the same code runs okay with a 32-bit ARM user space and a 64-bit kernel, so it's something to do with the 64-bit user space.

There are often issues specific to arm64 and LuaJIT, particularly when pointers are stored in lightuserdata.

You may find this issue of interest: https://github.com/LuaJIT/LuaJIT/issues/49 as this has been a common and enduring issue, with some workarounds.
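
As a rough illustration of the lightuserdata concern: LuaJIT has historically required lightuserdata pointers to fit in the low 47 bits of a 64-bit address, which arm64 user-space addresses do not always satisfy. The sketch below only shows that range check. The function name `addr_fits_in_47_bits` is hypothetical and is not part of LuaJIT or Fluent Bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a LuaJIT API): returns true when a 64-bit
 * address has no bits set at or above bit 47, i.e. when it would fit
 * within a 47-bit lightuserdata limit. */
bool addr_fits_in_47_bits(uint64_t addr)
{
    return (addr >> 47) == 0;
}
```

A pointer can be checked by passing `(uint64_t)(uintptr_t)ptr`.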

I have not dug into the code yet.

I managed to build it on arm64 using the following Dockerfile: https://github.com/danacr/fluent-bit-docker-image/blob/1.1/Dockerfile

Looks like the problem is the compiler. If I compile with clang-3.9, there are no issues:

$ export CC=clang-3.9
$ export CXX=clang++-3.9

@fujimotos has troubleshot and fixed the root cause of the issue; more details here:

https://github.com/edsiper/flb_libco/pull/4

Those changes in flb_libco will be merged into Fluent Bit shortly.

Thanks @edsiper - if you find that this is a novel bug for GCC, I'm happy to help relay it upstream; if it's one that's known already it would be good to chase down the original report.

The changes are already merged into Git master.

@vielmetti here is the report on GCC Bugzilla:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90907

Per their latest comments, I'm not sure whether it is worth investing more time in it. At least we got our fix :)

I see those comments @edsiper - it looks like we're in undefined territory here, worth keeping this in mind to test on future compiler releases that might or might not behave the same way.

Yep, I will add compiler info to the built binary for debugging purposes.
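
One way such compiler info could be embedded is through the version macros that GCC and Clang predefine. This is a minimal sketch, not the code that ended up in Fluent Bit. The function name `build_compiler_info` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Fluent Bit): writes a short
 * description of the compiler that built this file into buf and
 * returns the snprintf() result. __clang__ is tested first because
 * Clang also defines __GNUC__. */
int build_compiler_info(char *buf, size_t len)
{
#if defined(__clang__)
    return snprintf(buf, len, "clang %d.%d.%d",
                    __clang_major__, __clang_minor__, __clang_patchlevel__);
#elif defined(__GNUC__)
    return snprintf(buf, len, "gcc %d.%d.%d",
                    __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
    return snprintf(buf, len, "unknown compiler");
#endif
}
```

Printing the resulting string in the startup banner would make it easier to tell a GCC build from a Clang build when a crash like this one is reported.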

@edsiper @vielmetti

> it looks like we're in undefined territory here, worth keeping this in mind to test on future compiler releases that might or might not behave the same way

After thinking about gcc's response last night, I posted a more robust
fix for the issue to https://github.com/edsiper/flb_libco/pull/6.

In essence, this avoids the undefined behaviour by declaring the section
to store the assembly code explicitly (and then popping the section back
after that).

In this way, we can be certain that co_switch_aarch64() gets stored in
the right section in the binary, so it should always work reliably.
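
The general technique can be sketched as follows. This is not the code from the pull request: the routine `add_one` is a hypothetical stand-in for `co_switch_aarch64()`, and the choice of the `.pushsection`/`.popsection` directive pair is an assumption about how "declaring the section and popping it back" might be written for an ELF target.

```c
#include <stdint.h>

/* Hypothetical routine, implemented in top-level assembly below. */
uint64_t add_one(uint64_t x);

#if defined(__aarch64__) && defined(__ELF__)
__asm__(
    ".pushsection .text\n"        /* enter .text, saving the current section */
    ".globl add_one\n"
    ".type add_one, %function\n"
    "add_one:\n"
    "    add x0, x0, #1\n"        /* first argument and return value are in x0 */
    "    ret\n"
    ".size add_one, . - add_one\n"
    ".popsection\n"               /* restore the section the compiler was using */
);
#elif defined(__x86_64__) && defined(__ELF__)
__asm__(
    ".pushsection .text\n"
    ".globl add_one\n"
    ".type add_one, %function\n"
    "add_one:\n"
    "    leaq 1(%rdi), %rax\n"    /* System V ABI: argument in rdi, result in rax */
    "    ret\n"
    ".size add_one, . - add_one\n"
    ".popsection\n"
);
#else
/* Plain C fallback for targets this sketch does not cover. */
uint64_t add_one(uint64_t x) { return x + 1; }
#endif
```

Because the assembly names its section itself, the placement of the routine no longer depends on which section the compiler happens to be emitting into at that point in the file.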
