Node: CI failures with no output/stack trace

Created on 9 Apr 2018 · 13 comments · Source: nodejs/node

  • Version: 10.0.0-pre
  • Platform: various Linux platforms
  • Subsystem: test

We've been seeing these for quite some time. They look like this:

not ok 123 parallel/test-some-test
  ---
  duration_ms: 0.456
  severity: fail
  stack: |-

No content for stdout or stderr, but the test failed. Huh? Wha?

Might be a problem in tools/test.py.

@BridgeAR and @addaleax have looked to varying degrees at this issue but haven't discovered anything as far as I know.

@refack and I have a PR open to add the exit code to the output when this happens, in case that helps: https://github.com/nodejs/node/pull/19855

@nodejs/testing @nodejs/build

Labels: CI / flaky test, build, libuv, test, tools

All 13 comments

Last round of investigation happened in https://github.com/nodejs/node/issues/18998

Two things I found out, which don’t give a lot of insight but might narrow things down a bit:

  • This appears to happen more frequently when tests are not being run in parallel
  • Output that is inserted through process._rawDebug at the beginning of the script is shown by the test runner

Does dmesg show anything afterwards?

@bnoordhuis Unfortunately, no. :/ There are only a handful of lines reported after boot on the debian-x86 machine: a few from invalid incoming external UDP packets, and 7 segfault notices for node; but those don’t match the times (or the frequency) with which I could reproduce the issue on that machine.

@bnoordhuis Okay, this turns out to be a pretty deep rabbit hole.

The reported exit code according to https://github.com/nodejs/node/pull/19855 is -6, which would be SIGABRT – that would explain why we don’t see any output – HasCrashed() in tools/test.py explicitly excludes SIGABRT:

https://github.com/nodejs/node/blob/5e6817261c559390f8a6736fb33e7715f6dfbd16/tools/test.py#L582-L583
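As background, a minimal C sketch (not part of tools/test.py, purely illustrative) of why the exit code shows up as -6: a child killed by SIGABRT is reported to its parent through the wait status, and runners that follow the subprocess convention surface that as minus the signal number.

/* Illustration only: a child that dies from SIGABRT is reported to the
 * parent via WIFSIGNALED()/WTERMSIG(); runners following the subprocess
 * convention surface this as a negative exit code, e.g. -6 for SIGABRT. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid < 0)
    return 1;
  if (pid == 0)
    abort();                      /* child dies with SIGABRT, like the failing tests */

  int status;
  waitpid(pid, &status, 0);
  if (WIFSIGNALED(status))
    printf("killed by signal %d\n", WTERMSIG(status));  /* prints 6 */
  return 0;
}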

Also, I found out there’s a magical place where all the core dumps go on debian machines with systemd – so we actually have those available, too, we just didn’t know that.

This is the stack trace I’m getting from the most recent core dump that I could produce:

#0  0xb775bd40 in __kernel_vsyscall ()
#1  0xb7453367 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#2  0xb7454a23 in __GI_abort () at abort.c:89
#3  0x0857e86c in uv_sem_post (sem=0xbf83b360) at ../deps/uv/src/unix/thread.c:520
#4  0x0856edff in worker (arg=0xbf83b360) at ../deps/uv/src/threadpool.c:55
#5  0xb75d8ecb in start_thread (arg=0xb50ffb40) at pthread_create.c:309
#6  0xb7510d0e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:129

(main thread stack)

#0  0x0850cda5 in node::TCPWrap::Connect(v8::FunctionCallbackInfo<v8::Value> const&) ()
#1  0x0866afc9 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#2  0x086de729 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#3  0x086deed1 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#4  0x3fd061be in ?? ()
#5  0x3fd11fd8 in ?? ()
#6  0x3fd0c11d in ?? ()
#7  0x3fd11fd8 in ?? ()
#8  0x3fd11fd8 in ?? ()
#9  0x3fd0c11d in ?? ()
#10 0x3fd11fd8 in ?? ()
#11 0x3fd11fd8 in ?? ()
#12 0x3fd0fe04 in ?? ()
#13 0x3fd09b51 in ?? ()
#14 0x089c589d in v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) ()
#15 0x08658bf6 in v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) ()
#16 0x08475359 in node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) ()
#17 0x0844b4d8 in node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) ()
#18 0x08450522 in node::cares_wrap::(anonymous namespace)::AfterGetAddrInfo(uv_getaddrinfo_s*, int, addrinfo*) ()
#19 0x085772e0 in uv__getaddrinfo_done (w=0xb342e94, status=0) at ../deps/uv/src/unix/getaddrinfo.c:134
#20 0x0856f23c in uv__work_done (handle=0x9ba2da0 <default_loop_struct+96>) at ../deps/uv/src/threadpool.c:251
#21 0x08571210 in uv__async_io (loop=0x9ba2d40 <default_loop_struct>, w=0x9ba2e30 <default_loop_struct+240>, events=1) at ../deps/uv/src/unix/async.c:118
#22 0x08581ff8 in uv__io_poll (loop=0x9ba2d40 <default_loop_struct>, timeout=-1) at ../deps/uv/src/unix/linux-core.c:400
#23 0x08571b59 in uv_run (loop=0x9ba2d40 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:368
#24 0x08480456 in node::Start(uv_loop_s*, int, char const* const*, int, char const* const*) ()
#25 0x0847f1da in node::Start(int, char**) ()
#26 0x084416da in main ()

(Also, in test-tls-buffersize, this always happens after the listen callback is finished, but before the socket is registered as connected. I assume that it’s related to moving the DNS lookup to the thread pool.)

The errno for the failing sem_post() call is EINVAL.

It seems this was introduced by https://github.com/libuv/libuv/commit/647fbc026e367cbf0ca2df2b0f72f668e41dccdf – we picked this up in #18260 on Jan 24 on master, which roughly matches my feeling of when these things started happening. Also, reverting it seems to resolve the issue. (I got 7/20000 failures before, and now it’s down to 0/20000.)

I don’t see anything wrong with that commit. However, more debugging, and digging through the glibc source code and its git history brings up https://sourceware.org/bugzilla/show_bug.cgi?id=12674, which seems like it’s exactly the bug that we’re experiencing. (Here’s a pure-C reproduction that is similar to the libuv code and is fixed by https://github.com/bminor/glibc/commit/042e1521c794a945edc43b5bfa7e69ad70420524.)
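To make the race concrete, here is a minimal, hypothetical C sketch of the post/wait/destroy pattern that glibc bug 12674 breaks. It mirrors the shape of libuv’s threadpool startup (a worker posts a semaphore that the main thread waits on and then destroys); it is not the reproduction mentioned above. On a fixed glibc the loop runs cleanly; on an affected one the late sem_post() can observe a destroyed semaphore and fail with EINVAL.

/* Illustrative only; build with: cc -pthread repro.c
 * On an affected glibc, sem_post() may still touch the semaphore's memory
 * after it has woken the waiter, so the sem_destroy() below can race with
 * the tail end of sem_post() on the worker thread. */
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

static sem_t sem;

static void* worker(void* arg) {
  (void) arg;
  /* Signal "thread started", like libuv's threadpool worker does. */
  if (sem_post(&sem) != 0)
    abort();                       /* the abort seen in the crashing stack */
  return NULL;
}

int main(void) {
  for (int i = 0; i < 100000; i++) {
    pthread_t t;
    if (sem_init(&sem, 0, 0) != 0) abort();
    if (pthread_create(&t, NULL, worker, NULL) != 0) abort();

    /* Wait for the worker, then immediately destroy the semaphore; with the
     * pre-2.21 glibc behaviour this destroy can overlap the worker's post. */
    if (sem_wait(&sem) != 0) abort();
    sem_destroy(&sem);

    pthread_join(t, NULL);
  }
  return 0;
}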

The fix is included in glibc >= 2.21, but I’m not sure what that means for us going forward.

nice work!

For the record, the cores collected by systemd are only there because I had LimitCORE=unlimited turned on and an upgraded systemd installed (via jessie-backports) on the debian 8 machines, because @joyeecheung requested core dumps. I've since removed LimitCORE (yesterday, via https://github.com/nodejs/build/pull/1224), but there hasn't been a restart yet, so they will keep collecting cores until that happens. If someone needs cores again in the future we'll have to turn them back on. I think the problem that @joyeecheung ran into, though, was that the matching binaries don't persist, so unless you're capturing the cores on the machine yourself, having them generated via our CI runs is not as useful.

> I think the problem that @joyeecheung ran into, though, was that the matching binaries don't persist, so unless you're capturing the cores on the machine yourself, having them generated via our CI runs is not as useful.

We could work around that by writing a custom script that saves the binary somewhere and using it in /proc/sys/kernel/core_pattern, although that could be pretty involved operationally. When https://github.com/nodejs/node/pull/19661 lands we might also turn on node-report in tests like this to get the JS stack and native stack, among other useful information.
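For anyone who wants to go the core_pattern route, here is a rough, hypothetical sketch of what such a handler could look like (the handler name, the /var/cores directory, and the install line are all made up for illustration): the kernel pipes the core dump to the handler's stdin, and because the crashing process still exists while the handler runs, its /proc/<pid>/exe link can be copied to keep the matching binary next to the core.

/* Hypothetical core_pattern pipe handler that stores the core together with
 * the crashing binary. Installed with something like:
 *   echo '|/usr/local/bin/save-core %p %e' > /proc/sys/kernel/core_pattern
 * (path and name are illustrative). The kernel writes the core to stdin. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  if (argc < 3)
    return 1;
  const char* pid = argv[1];   /* %p: PID of the crashing process */
  const char* exe = argv[2];   /* %e: executable name, e.g. "node" */

  char core_path[256], cmd[512];
  snprintf(core_path, sizeof(core_path), "/var/cores/core.%s.%s", exe, pid);

  /* Copy the core dump from stdin into /var/cores. */
  FILE* out = fopen(core_path, "wb");
  if (out == NULL)
    return 1;
  char buf[1 << 16];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
    fwrite(buf, 1, n, out);
  fclose(out);

  /* The crashing process is still alive while this handler runs, so its
   * /proc/<pid>/exe link points at the exact binary that produced the core. */
  snprintf(cmd, sizeof(cmd), "cp /proc/%s/exe /var/cores/%s.%s.bin", pid, exe, pid);
  return system(cmd) == 0 ? 0 : 1;
}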

@addaleax - I was playing around with your C reproduction, but it always succeeded - is there any precondition for it to fail? My system's glibc is 2.11, and it has an ample number of CPUs.

@gireeshpunathil The glibc implementation is hardware-dependent, so I guess that might be an issue? Also, weirdly enough, this problem only seemed to reproduce when the machine was not under load…

@addaleax - thanks - that is a great clue. Less load means the two threads stand a good chance of running on two real CPUs in truly parallel fashion, increasing the chance of interleaved access. On the other hand, if the two sem_* threads share a CPU, the scheduler tends to switch them at meaningful boundaries such as system calls, which masks the problem.

Re-opening until the libuv update lands on master

@bnoordhuis @santigimeno @addaleax do you think this is a big enough issue that it's worth getting a libuv release out today?

@cjihrig I think we can guess the impact of this from looking at the frequency on the CI machine: On Node.js v8.x and v9.x, when used on an OS that ships those old glibc versions, every 3000th Node.js process or so crashes once it tries to do something interesting and the machine isn’t under load.

It’s not the end of the world, but I think it would be good if we could get a fix out in Node.js this month or so, in particular because this has also ended up in an LTS branch.
