Node: addons/openssl-binding/test very flaky on ARM

Created on 4 Dec 2019 · 29 comments · Source: nodejs/node

(Edit: see comment below for recommendations if you are running into related problems on your ARMv7 platforms)

arm-fanned has been offline for quite some time now since they've needed some heavy work, https://github.com/nodejs/build/issues/1840

but I'm bringing them back online now, and it seems that at least one thing has crept in while it was offline. addons/openssl-binding/test is flaky on the Pi 3's. So far I've only seen it fail there but it could be elsewhere too, there haven't been enough runs to be sure.

Error output (@guybedford's run just now) from https://ci.nodejs.org/job/node-test-binary-arm-12+/2928/

Error in `out/Release/node': corrupted double-linked list: 0x045ed160

/cc @addaleax and @danbev since e66a2acc4c looks like a possible candidate.

CI / flaky test arm build

All 29 comments

@rvagg Is there anything in particular that makes you suspect that commit?

valgrind doesn’t report anything suspicious for me.

Can you maybe point me to a host where this reproduces consistently on which I could debug?

@addaleax a very rough guess points me to that commit. arm-fanned has been offline since about July so I was looking at the history of openssl-binding during that time and that's it: https://github.com/nodejs/node/commits/master/test/addons/openssl-binding

Of course it could be elsewhere, or in OpenSSL itself. I'll try and get you instructions for getting in and reproducing but it's a pretty complicated setup for both getting in and running stuff, so give me some time to sort that out. It will probably be made easier when you have nodejs/build onboarding and have your .ssh/config setup properly from Ansible so you have ip addresses, ssh keys and jump hosts all configured for you.

Opened a PR to mark the test flaky for now: https://github.com/nodejs/node/pull/30838

Stack trace:

#0  0x76c01894 in __GI___libc_free (mem=0x4ed5877) at malloc.c:3115
#1  0x005b79b4 in node::native_module::NativeModuleLoader::~NativeModuleLoader() ()
#2  0x76bbc670 in __run_exit_handlers (status=<optimized out>, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#3  0x76bbc798 in __GI_exit (status=<optimized out>) at exit.c:139
#4  0x76ba571c in __libc_start_main (main=0x7e9cc5a4, argc=1993179136, argv=0x76ba571c <__libc_start_main+272>, init=<optimized out>, fini=0x1689314 <__libc_csu_fini>, rtld_fini=0x76f344c4 <_dl_fini>, 
    stack_end=0x7e9cc5a4) at libc-start.c:342
#5  0x004ee458 in _start () at ../sysdeps/arm/start.S:124

Will try to dig more tomorrow, but this might give a hint around what kind of change might have caused this

It seems like, in some way, the buffer->GetBackingStore() call in the addon is problematic; the addon crashes when the rest of the function is commented out, but doesn’t crash when that line is also commented out.

I’m still trying to find out more, but sadly my ability to read ARM assembly isn’t close to where it is for x86/x64, so it might take a while (esp. given that re-compiling Node.js isn’t practical on the CI machines on which we run the tests).

We build the Node.js binary for arm-fanned CI with -march=armv7-a, but not the addons. Apparently, for std::shared_ptr these two options are ABI-incompatible (I’m assuming they use different kinds of atomics/locks depending on whether the flag is set, but I haven’t dug in that far). I don’t know what this looks like in the release CI.

I think this means we either have to find a way to hide this implementation detail – and that may be somewhat costly – or we have to pick one ABI, either -march=armv7-a or not, and then stick with that. Choosing to always enable the flag would imply dropping armv6 support entirely, instead of it being experimental, I think.

@nodejs/build @nodejs/platform-arm

FWIW I'm still seeing this addon test failure in CI, despite the commit to mark it as flaky having landed awhile back.


@mscdex That job did mark the openssl-binding test as flaky (notice https://ci.nodejs.org/job/node-test-binary-arm-12+/4122/RUN_SUBSET=addons,label=pi2-docker/ is yellow and not red).

The job failed because these subjobs hit an NFS issue:
https://ci.nodejs.org/job/node-test-binary-arm-12+/4122/RUN_SUBSET=0,label=pi2-docker/console
https://ci.nodejs.org/job/node-test-binary-arm-12+/4122/RUN_SUBSET=0,label=pi3-docker/console
https://ci.nodejs.org/job/node-test-binary-arm-12+/4122/RUN_SUBSET=3,label=pi3-docker/console

20:18:22 + git clean -fdx
20:18:26 warning: failed to remove out/Release/.nfs00000000002ca76a00000793: Device or resource busy
20:18:26 Removing out/junit

FYI, looking into fixing the -march inconsistency is high(ish) on my TODO list; I don't have time right now, but it's on my list at least. I don't want to block anyone else from doing it (please!), but I thought it worth flagging that this isn't being ignored.

@rvagg @nodejs/platform-arm I’d like to see us make a decision before v14.0.0 because we’ll be stuck with that decision for the next three years.

@addaleax so do you think that it's defaulting to compiling addons targeting plain armv7? I don't even know where that would be set for addons, in gyp perhaps?

armv7-a does bring a bunch more features along with it over plain armv7: https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#index-march-2 but it's also likely that we end up with less optimal binaries on a platform where performance is felt acutely. We've been getting away with armv7-a without too many compatibility constraints, so I think it's safe to assume that on most ARMv7 systems where our binaries run, the CPUs are compatible at the -a level.

Switching -march will mean tweaking https://github.com/nodejs/build/blob/8b1f0a6d530063f0aa6b9ea6d5ff579010914e1d/ansible/roles/cross-compiler/files/cc-selector.sh to handle versioning, which probably means introducing another label in Jenkins and VersionSelectorSwitch to switch on it properly _unless_ we're willing to force this change across all Node versions.

The corollary is that if we change addons to build with armv7-a (somehow) then that'd be a breaking change for node-gyp, and that probably wouldn't make it into 14 anyway because npm is still stuck on node-gyp 5 (while we're planning on pushing out a 7!).

I've got the Pi cluster back online, just running some tests to make sure it's solid before I re-enable it. But I don't really know how we should solve this problem.

The corollary is that if we change addons to build with armv7-a (somehow) then that'd be a breaking change for node-gyp, and that probably wouldn't make it into 14 anyway because npm is still stuck on node-gyp 5

For better or worse (mostly worse but in this case perhaps better), add-ons inherit compiler flags from node's common.gypi. We could just force the right compiler flags onto add-ons in v14.x.

@bnoordhuis What are the “right” compiler flags in this context?

so do you think that it's defaulting to compiling addons targeting plain armv7?

No – armv6, if I remember the investigation I described above correctly. I’m assuming that that’s not a trivial difference.

I guess another option would be to ship two different binaries for arm?

What are the “right” compiler flags in this context?

armv7-a+neon-vfpv3 for -march= and -mcpu= - all Cortex-A's that can run V8 support at least that.

where is armv6 coming from? I can't find that in gyp, do we have it hardwired somewhere?

It's set in config.gypi by configure.py through compiler-based feature detection:

https://github.com/nodejs/node/blob/78e6d2484fcacdee0cc0a1303e49a5328f3470dd/configure.py#L969-L973

The builds on https://unofficial-builds.nodejs.org/ use an armv6 toolchain.

Right, but we're not talking about unofficial-builds as far as I know. This is about flaky tests in https://ci.nodejs.org/job/node-test-binary-arm-12+/ which only has armv7. But @addaleax is saying that they're compiled with armv6, but I don't know of anywhere in the compile process that it would be doing that.

configure.py looks to me like it should be setting arm_version to 7 for these machines. Does that flow down to an explicit -march=armv7 somewhere via gyp that we could tweak?

Kind of. V8's toolchain.gypi sets -march=armv7-a when arm_version==7 but we don't set that globally.

It sounds like the cross-compile toolchain for node-test-binary-arm-12+ defaults to armv6 when it's absent. (arm-rpi-linux-gnueabihf - seems plausible.)
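One way to confirm what architecture a toolchain targets by default is to dump its predefined macros; this is a generic sketch (substitute the cross compiler in question, e.g. arm-rpi-linux-gnueabihf-gcc, for cc):

```shell
# Dump the compiler's predefined macros and look for the ARM architecture
# version; __ARM_ARCH is 6 or 7 depending on the effective -march default.
echo | cc -dM -E - | grep __ARM_ARCH || echo "no __ARM_ARCH macro (not targeting ARM)"
```

On a Raspbian-default toolchain this would be expected to report version 6, while passing -march=armv7-a bumps it to 7.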

I'll open a PR to set -march=... globally and see how it fares.

@rvagg configure.py doesn’t have any effects that would propagate through to addons, which only use common.gypi, though.

Right, sorry, this is about the compiler _on_ the machine building addons; somehow that had escaped my thought process and I was fixated on the cross compilers alone.

The Raspberry Pi's use Docker to run the tests but they are using Raspbian base images. Raspbian still uses a toolchain that defaults to armv6 because it works across all their devices (including the relatively new Zero, which is armv6, so this probably isn't going to change any time soon). So that would explain how we're getting armv6 addons but armv7-a node.

Don't mind me, just catching up to you two!

@bnoordhuis any chance we could slip something in before 14.x that forced -march=armv7-a when armv7 is supplied to configure? I see you have https://github.com/bnoordhuis/io.js/commit/64a49107ebe429baeb95bce854014fca83633c2e, did that work out?

https://github.com/nodejs/node/pull/32704#issuecomment-611950860 - it's causing weird floating-point stability issues that I don't quite understand.

I suggested to @mhdawson that a workaround might be to tell addon authors to prefix their builds (likely using node-gyp) with CC="gcc -march=armv7-a" CXX="g++ -march=armv7-a". He thought this might make a good doc addition for users about this issue if we can show that's a workable solution.

I've added V=1 CC='gcc -march=armv7-a' CXX='g++ -march=armv7-a' as a prefix to the entry into the docker container that builds addons on node-test-binary-arm-12+ and you can see, thanks to V=1, the execution in action: https://ci.nodejs.org/job/node-test-binary-arm-12+/5504/RUN_SUBSET=addons,label=pi3-docker/consoleFull

I'll leave this in for now and maybe it'll demonstrate a temporary workaround that we can suggest to users. It's probably too late to include it in the 14.x release notes. It's also not ideal, node-gyp should be doing this itself as per @bnoordhuis' suggestion.

Perhaps it would be better to _append_ to CFLAGS/CXXFLAGS instead, to avoid overriding the compiler: CFLAGS="$CFLAGS -march=armv7-a" CXXFLAGS="$CXXFLAGS -march=armv7-a"

Good call @mscdex, and that passes nicely through node-gyp. I've updated the CI job to use roughly this and it's working fine: https://ci.nodejs.org/job/node-test-binary-arm-12+/5525/RUN_SUBSET=addons,label=pi3-docker/consoleFull

So the current recommendations for people having trouble with std::shared_ptr on ARMv7 platforms:

  • Our official armv7l binaries available from nodejs.org are compiled explicitly for armv7-a
  • If you are using the Raspberry Pi toolchain (or possibly some other third-party toolchain) to compile addons on your ARMv7 platform, then it's going to compile down to ARMv6. RPi does this to be maximally compatible with their hardware, including the recent Pi Zero, which is still ARMv6.
  • There is some kind of ABI incompatibility when it comes to the use of std::shared_ptr between our armv7-a binaries and natively compiled armv6 (most likely armv6zk on RPi) binaries which can cause problems.
  • The workaround, for now at least, is to prefix any addon compiles with CFLAGS="$CFLAGS -march=armv7-a" CXXFLAGS="$CXXFLAGS -march=armv7-a".
  • This works by adding the flags in front of npm install or node-gyp, or whatever other process you're using to build addons.
  • This will make your addons build for ARMv7 only; they won't be compatible with ARMv6 platforms.
  • An alternative, if you want maximum compatibility and straightforward compiles, is to use the unofficial armv6l binaries available at https://unofficial-builds.nodejs.org/download/release/. Just be aware that these are not officially supported and you can't depend on them continuing to be available or not breaking. These binaries are compiled for armv6zk, so they should work across all Raspberry Pis with modern(ish) Linuxes.
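To make the workaround concrete, a minimal sketch (the npm install / node-gyp lines are illustrative; any build process that respects CFLAGS/CXXFLAGS works the same way):

```shell
# Append -march=armv7-a rather than overriding CC/CXX, so any
# pre-existing flags survive.
export CFLAGS="${CFLAGS:-} -march=armv7-a"
export CXXFLAGS="${CXXFLAGS:-} -march=armv7-a"
echo "CFLAGS=$CFLAGS"
echo "CXXFLAGS=$CXXFLAGS"
# Then build addons as usual, e.g.:
#   npm install
#   node-gyp rebuild
```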

Ping @rvagg ... does this need to remain open? I've been thinking that we need a better place for documenting these kinds of persistent issues. Maybe a doc/known-issues folder where they can be checked in...

I would keep this open. This is not just a flaky test, that’s just a symptom of the underlying “real” problem.

