Node: exit race between main and worker threads

Created on 13 Dec 2018 · 46 comments · Source: nodejs/node

  • Version: v11.0.0
  • Platform: all
  • Subsystem: worker, process, src

Sample test case to reproduce the issue:

'use strict'
const { Worker, isMainThread, parentPort } = require('worker_threads')

if (isMainThread) {
  const count = Number(process.argv[2])
  for (let i = 0; i < count; i++)
    new Worker(__filename)
  process.exit(0)
} else {
  setInterval(() => {
    parentPort.postMessage('Hello, world!')
  }, 1)
}

The flakiness of the test is largely influenced by the thread scheduling order / number of CPUs / load on the system.

First reported on AIX and Linux through sequential/test-cli-syntax.js. The more you run it, the more variety of scenarios you get: SIGSEGV, SIGABRT, SIGILL... depending on where the main and the worker threads are at that point.

The root cause is that there is no specified order of destruction, and no identified ownership, for C++ global objects that are destructed across threads.
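
For illustration, here is a minimal standalone sketch (not Node code; the names are made up) of that class of bug: a process-wide C++ object with a non-trivial destructor is torn down by exit() on the main thread while another thread is still using it.

#include <cstdlib>
#include <map>
#include <string>
#include <thread>

// A global with a non-trivial destructor; exit() destroys it via the C++
// runtime's exit handlers on the main thread.
static std::map<std::string, std::string> global_table = {{"a", "1"}};

int main() {
  std::thread worker([] {
    for (;;)
      global_table.count("a");  // another thread keeps touching the global
  });
  worker.detach();
  // exit() runs the global destructors while the detached thread may still be
  // reading global_table: SIGSEGV / SIGABRT depending on timing.
  std::exit(0);
}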

Refs: https://github.com/nodejs/node/issues/24403

Labels: C++, confirmed-bug, lib / src, process

All 46 comments

/cc @nodejs/workers @nodejs/process

On a side note: why would the worker module be loaded even in the absence of the --experimental-worker flag? I believe the answer is that all the internal modules are loaded at bootstrap, irrespective of their requirement at runtime.

Can we change that? At least in this case, it would save a lot of resources (threads, memory) for use cases that do not require workers?

@bengl was looking at this earlier but I don't know if he has anything to add (and if he did, he'd probably put it in the other issue). Pinging him here anyway just in case...

why would the worker module be loaded even in the absence of the --experimental-worker flag?

As far as I can tell, it’s only the native binding that’s loaded unconditionally (to figure out whether we’re in a worker or not during bootstrap).

I believe the answer is that all the internal modules are loaded at bootstrap, irrespective of their requirement at runtime.

That’s not the case; test/parallel/test-bootstrap-modules.js tests this.

At least in this case, it would save a lot of resources (threads, memory) for use cases that do not require workers?

The memory overhead is probably not huge, and just loading the worker module does not spawn any threads on its own.

$ gdb ./node_g
(gdb) b pthread_create
Breakpoint 1 at 0xd2fe40
(gdb) r
Starting program: ./node_g 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, 0x0000000000d2fe40 in pthread_create@plt ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libstdc++-4.8.5-28.el7.x86_64
(gdb) bt
#0  0x0000000000d2fe40 in pthread_create@plt ()
#1  0x0000000000fd8acf in uv_thread_create (tid=0x3396790, 
    entry=0xe9be78 <node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*)>, arg=0x3395f60) at ../deps/uv/src/unix/thread.c:213
#2  0x0000000000e9bf1a in node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start (
    this=0x3395f60) at ../src/node_platform.cc:63
#3  0x0000000000e998e8 in node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner (
    this=0x3395bf0, thread_pool_size=4) at ../src/node_platform.cc:178
#4  0x0000000000ea795f in __gnu_cxx::new_allocator<node::WorkerThreadsTaskRunner>::construct<node::WorkerThreadsTaskRunner<int&> > (this=0x7fffffffdcfd, __p=0x3395bf0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/ext/new_allocator.h:120
#5  0x0000000000ea68c4 in std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::_S_construct<node::WorkerThreadsTaskRunner<int&> >(std::allocator<node::WorkerThreadsTaskRunner>&, std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::__construct_helper*, (node::WorkerThreadsTaskRunner<int&>&&)...) (__a=..., __p=0x3395bf0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:253
#6  0x0000000000ea54d5 in std::allocator_traits<std::allocator<node::WorkerThreadsTaskRunner> >::construct<node::WorkerThreadsTaskRunner<int&> >(std::allocator<node::WorkerThreadsTaskRunner>&, node::WorkerThreadsTaskRunner<int&>*, (node::WorkerThreadsTaskRunner<int&>&&)...) (__a=..., 
    __p=0x3395bf0) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:399
#7  0x0000000000ea37aa in std::__shared_ptr<node::WorkerThreadsTaskRunner, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<node::WorkerThreadsTaskRunner>, int&> (
    this=0x7fffffffde30, __tag=..., __a=...)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:1124
#8  0x0000000000ea1692 in std::shared_ptr<node::WorkerThreadsTaskRunner>::shared_ptr<std::alloca---Type <return> to continue, or q <return> to quit---
tor<node::WorkerThreadsTaskRunner>, int&> (this=0x7fffffffde30, __tag=..., __a=...)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:316
#9  0x0000000000e9fe4e in std::allocate_shared<node::WorkerThreadsTaskRunner, std::allocator<node::WorkerThreadsTaskRunner>, int&> (__a=...)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:588
#10 0x0000000000e9e7fb in std::make_shared<node::WorkerThreadsTaskRunner, int&> ()
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:604
#11 0x0000000000e9a18f in node::NodePlatform::NodePlatform (this=0x3395b00, 
    thread_pool_size=4, tracing_controller=0x3395920) at ../src/node_platform.cc:293
#12 0x0000000000dbbdcc in Initialize (this=0x3359680 <node::v8_platform>, thread_pool_size=4)
    at ../src/node.cc:239
#13 0x0000000000dc26d4 in node::InitializeV8Platform (thread_pool_size=4)
    at ../src/node.cc:1893
#14 0x0000000000dc2cd7 in node::Start (argc=1, argv=0x338b4d0) at ../src/node.cc:2122
#15 0x0000000001e13199 in main (argc=1, argv=0x7fffffffe048) at ../src/node_main.cc:126
(gdb) 

@addaleax - this is what I see, am I missing something?

ok, apologies; those are normal worker threads created every time at bootup. I got confused with worker_threads' threads; my bad.

@gireeshpunathil Okay, that clears it up :)

I think worker_thread’s threads are affected as well, though…

ok, I edited the description to remove the worker module from the limelight. I was misguided by the word worker in WorkerThreadsTaskRunner - that led to writing this test too.

So just to clarify (for myself and others): this test case now only serves to easily recreate / amplify the issue through many worker threads, but workers are not the real culprits.

As far as I can tell, it’s only the native binding that’s loaded unconditionally (to figure out whether we’re in a worker or not during bootstrap).

I believe that should be unnecessary - opened https://github.com/nodejs/node/pull/25017

As part of debugging #24921 I happened to run the entire CI with _exit replacing the normal exit in Environment::Exit. The only test that fails is pseudo-tty/test-set-raw-mode-reset-process-exit. I am not advocating _exit as the favored fix, just stating it here.

On another note: if we want to restrict global destructors from being invoked from helper threads (avoiding the word worker here, so as not to confuse them with worker_threads), what would be a starting point? For example we could do:

if (! env->is_main_thread()) return;

leaving that action for the main thread. However, how do we identify such destructors? Do we have a static list somewhere? /cc @addaleax @joyeecheung
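
A rough sketch of what such a guard could look like (purely hypothetical and standalone; the next comment explains why this is not considered feasible in Node):

#include <thread>

// The main thread's id, captured during static initialization at startup.
static const std::thread::id kMainThreadId = std::this_thread::get_id();

struct SomeGlobalState {
  ~SomeGlobalState() {
    // Skip cleanup when the destructor runs on a helper thread.
    if (std::this_thread::get_id() != kMainThreadId)
      return;
    // ... real cleanup ...
  }
};

static SomeGlobalState some_global_state;

int main() {}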

@gireeshpunathil I don’t think that’s a feasible solution… the problem occurs mostly when the main thread exits, as far as I have seen so far. Also, I don’t think there’s a safe way for us to tell from a destructor whether we are on the main thread or not (and in the context of embedders there might be multiple main threads).

As for a list of object destructions that are potentially problematic… here’s the best thing I’ve come up with so far based on looking at the symbol tables:

Potentially problematic symbols

$ nm node| grep '[0-9a-f]* [dDbB]'|egrep -v 'args(_[0-9]*_*)*$'|egrep -v '_ZZ?N2v8'|grep -v '_ZZN6icu_63'|grep N4node|grep -v '_ZTV'|grep -v trace_event_unique_atomic|grep -v available_category| awk '{print $3}'|c++filt|sort
guard variable for node::crypto::NewRootCertStore()::root_certs_vector
guard variable for node::crypto::NewRootCertStore()::root_certs_vector_mutex
node::(anonymous namespace)::Parser::settings
node::(anonymous namespace)::Parser::settings
node::cares_wrap::(anonymous namespace)::ares_library_mutex
node::crypto::extra_root_certs_loaded
node::crypto::Initialize(v8::Local<v8::Object>, v8::Local<v8::Value>, v8::Local<v8::Context>, void*)::init_once
node::crypto::NewRootCertStore()::root_certs_vector
node::crypto::NewRootCertStore()::root_certs_vector_mutex
node::crypto::NodeBIO::GetMethod()::method
node::crypto::root_certs
node::crypto::root_cert_store
node::debug_symbols_generated
node::Environment::kNodeContextTagPtr
node::Environment::Start(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool)::init_once
node::Environment::thread_local_env
node::environ_mutex
node::fs::kPathSeparator
node::http2::Headers::Headers(v8::Isolate*, v8::Local<v8::Context>, v8::Local<v8::Array>)::zero
node::http2::Http2Session::callback_struct_saved
node::http2::Origins::Origins(v8::Isolate*, v8::Local<v8::Context>, v8::Local<v8::String>, unsigned long)::zero
node::http_parser_version
node::init_modpending_once
node::inspector::(anonymous namespace)::start_io_thread_async
node::inspector::(anonymous namespace)::start_io_thread_semaphore
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordAsMuchAsPossible
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordContinuously
node::inspector::protocol::NodeTracing::TraceConfig::RecordModeEnum::RecordUntilFull
node::inspector::protocol::StringUtil::kNotFound
node::linux_at_secure
node::llhttp_version
node::loader::EXTENSIONS
node::modlist_addon
node::modlist_builtin
node::modlist_internal
node::modlist_linked
node::node_is_initialized
node::node_isolate
node::node_isolate_mutex
node::options_parser::DebugOptionsParser::instance
node::options_parser::EnvironmentOptionsParser::instance
node::options_parser::PerIsolateOptionsParser::instance
node::options_parser::PerProcessOptionsParser::instance
node::performance::performance_node_start
node::performance::performance_v8_start
node::performance::timeOrigin
node::performance::timeOriginTimestamp
node::per_process_loader
node::per_process::metadata
node::per_process_opts
node::per_process_opts_mutex
node::process_mutex
node::prog_start_time
node::PromiseRejectCallback(v8::PromiseRejectMessage)::rejectionsHandledAfter
node::PromiseRejectCallback(v8::PromiseRejectMessage)::unhandledRejections
node::provider_names
node::reverted
node::SigintWatchdogHelper::instance
node::thread_local_modpending
node::tracing::g_agent
node::url::(anonymous namespace)::hex
node::v8_initialized
node::v8_is_profiling
node::v8_platform
node::worker::(anonymous namespace)::next_thread_id
node::worker::(anonymous namespace)::next_thread_id_mutex

thanks @addaleax. this is a pretty huge list!

so what viable options exist for us, in your opinion? At the moment it looks like many tests are being affected, and though it started with harmless-looking tests on AIX, there is evidence that it has spread to some Linux variants.

Also just wondering what recent changes could have triggered this. The current AIX CI which I work with seems to have the best recreate frequency, so do you advise a bisect route? It is a laborious thing to do and could be inconclusive, but we might get some vital hints.

@gireeshpunathil Most of the things in this list are not even C++ classes but rather primitive types, so those could be omitted…

I don’t have a clue, as for what might have caused this, or how to bisect this best. And in the worst case, it might just be a timing-based race condition that we caused through some performance changes. :/

Okay, this one looks like a real bug that could be part of this:

==12846== Possible data race during write of size 8 at 0x61E64E0 by thread #1
==12846== Locks held: 1, at address 0x61FD960
==12846==    at 0x7EB227: node::tracing::TracingController::~TracingController() (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x82F11F: node::tracing::Agent::~Agent() (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x74E74C: node::._233::~._233() (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x5DEF040: __run_exit_handlers (exit.c:108)
==12846==    by 0x5DEF139: exit (exit.c:139)
==12846==    by 0x728F40: node::Environment::Exit(int) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x74E880: node::Exit(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x9C8E0C: v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x9CA157: v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x1658B2D: ??? (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x17F39368816D: ???
==12846==    by 0x17F39368816D: ???
==12846==
==12846== This conflicts with a previous read of size 8 by thread #11
==12846== Locks held: 2, at addresses 0x649A618 0x649B9A0
==12846==    at 0x709BC5: node::AsyncWrap::AsyncReset(double, bool) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x72EC0A: node::HandleWrap::HandleWrap(node::Environment*, v8::Local<v8::Object>, uv_handle_s*, node::AsyncWrap::ProviderType) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x7C7A3D: node::worker::MessagePort::MessagePort(node::Environment*, v8::Local<v8::Context>, v8::Local<v8::Object>) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x7C7BCE: node::worker::MessagePort::New(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x9C8682: v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<true>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x9CAA6C: v8::internal::Builtins::InvokeApiFunction(v8::internal::Isolate*, bool, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*, v8::internal::Handle<v8::internal::HeapObject>) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0xCEE0BF: v8::internal::Execution::New(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x95D723: v8::Function::NewInstanceWithSideEffectType(v8::Local<v8::Context>, int, v8::Local<v8::Value>*, v8::SideEffectType) const (in /home/sqrt/src/node/master/out/Release/node)
==12846==  Address 0x61e64e0 is 0 bytes inside a block of size 96 alloc'd
==12846==    at 0x4C3184A: operator new(unsigned long) (vg_replace_malloc.c:334)
==12846==    by 0x82EB06: node::tracing::Agent::Agent() (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x754AF7: node::InitializeV8Platform(int) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x7561D5: node::Start(int, char**) (in /home/sqrt/src/node/master/out/Release/node)
==12846==    by 0x5DCDB96: (below main) (libc-start.c:310)
==12846==  Block was alloc'd by thread #1
==12846==

We probably should at least keep v8_platform alive while other threads are running…?

There’s also at least one other possible race condition in the platform code, which I think has been caused by e273abc01be9f04511e44cacea20c4f4935b970f, but which I don’t quite understand:

==13124== Possible data race during read of size 1 at 0x1FFEFFF7F0 by thread #1
==13124== Locks held: none
==13124==    at 0x4C33ECD: my_memcmp (hg_intercepts.c:211)
==13124==    by 0x4C341B8: mutex_destroy_WRK (hg_intercepts.c:850)
==13124==    by 0x4C382D1: pthread_mutex_destroy (hg_intercepts.c:873)
==13124==    by 0x8D4CC8: uv_mutex_destroy (thread.c:279)
==13124==    by 0x7E99C7: node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) (in /home/sqrt/src/node/master/out/Release/node)
==13124==    by 0x7E9BC1: node::NodePlatform::NodePlatform(int, node::tracing::TracingController*) (in /home/sqrt/src/node/master/out/Release/node)
==13124==    by 0x754BD3: node::InitializeV8Platform(int) (in /home/sqrt/src/node/master/out/Release/node)
==13124==    by 0x7561D5: node::Start(int, char**) (in /home/sqrt/src/node/master/out/Release/node)
==13124==    by 0x5DCDB96: (below main) (libc-start.c:310)
==13124== 
==13124== This conflicts with a previous write of size 4 by thread #4
==13124== Locks held: none
==13124==    at 0x5B9E192: __lll_unlock_wake (lowlevellock.S:365)
==13124==    by 0x5B987DE: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:54)
==13124==    by 0x5B987DE: pthread_mutex_unlock (pthread_mutex_unlock.c:345)
==13124==    by 0x4C349B3: mutex_unlock_WRK (hg_intercepts.c:1097)
==13124==    by 0x4C382FD: pthread_mutex_unlock (hg_intercepts.c:1115)
==13124==    by 0x8D4D38: uv_mutex_unlock (thread.c:305)
==13124==    by 0x7E7016: node::(anonymous namespace)::PlatformWorkerThread(void*) (in /home/sqrt/src/node/master/out/Release/node)
==13124==    by 0x4C36FF7: mythread_wrapper (hg_intercepts.c:389)
==13124==    by 0x5B946DA: start_thread (pthread_create.c:463)
==13124==  Address 0x1ffefff7f0 is on thread #1's stack
==13124==  in frame #4, created by node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) (???:)

thanks @addaleax - I just reverted the changes pertinent to e273abc01be9f04511e44cacea20c4f4935b970f and it did not fail in 100+ runs!

I will let it run for more time tonight and share the result tomorrow.

@gireeshpunathil I just opened #25061 to take up the __run_exit_handlers() problem again…

I just reverted the changes pertinent to e273abc01be9f04511e44cacea20c4f4935b970f and it did not fail in 100+ runs!

I’ve stared at this code for an hour now, becoming somewhat convinced that it’s correct… this seems like bad news? Does “not fail in 100+ runs” mean that it’s likely to cause these issues here? :/

/cc @ofrobots

@addaleax - yes, 500+ runs now, and no failure. So the said commit is very likely to have caused the issues.

@gireeshpunathil 9f7e3a404096db60438f6dc305b0a5fa4ecddae7 also changed a lot of the code introduced in e273abc01be9f04511e44cacea20c4f4935b970f … do you think you can tell if the former might be the cause rather than the latter?

That would be nice, because it's just cleanup that we could revert without issues (although it would of course be nice to understand this first)

@addaleax - I have run the tests test/sequential/test-child-process-execsync, test/parallel/test-child-process-execfile and test/parallel/test-child-process-exit-code a sufficient number of times while toggling the said commits, and reached this understanding:

e273abc01be9f04511e44cacea20c4f4935b970f : introduces the issues. Reverting this and 9f7e3a404096db60438f6dc305b0a5fa4ecddae7 together resolves all the failures. Reverting this alone without 9f7e3a404096db60438f6dc305b0a5fa4ecddae7 is neither possible nor meaningful.

9f7e3a404096db60438f6dc305b0a5fa4ecddae7: Does not change anything drastically. Some tests do not fail, but the results are inconclusive. Reverting this alone did not have any visible impact.

https://github.com/addaleax/node/commit/7b38154baf346642ae98d34696ac0f089b889e64 : applied on top of the master, solves all the instances of issues!

I have specific test observations on https://github.com/nodejs/node/pull/25061 , that I will post there. thanks!

@addaleax - I have identified a new pattern of failure. Again, it may be a pre-existing one; but it persists even with patches from #25061, #25076 and #25079

(gdb) where
#0  0x00007fac19826207 in raise () from /lib64/libc.so.6
#1  0x00007fac198278f8 in abort () from /lib64/libc.so.6
#2  0x0000000000abdf34 in uv_mutex_lock (mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:286
#3  0x00000000009e2272 in node::EnvGetter(v8::Local<v8::Name>, v8::PropertyCallbackInfo<v8::Value> const&) ()
#4  0x000000000109c15c in v8::internal::(anonymous namespace)::GetPropertyWithInterceptorInternal(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::InterceptorInfo>, bool*) ()
#5  0x00000000010ad8ef in v8::internal::Object::GetProperty(v8::internal::LookupIterator*, v8::internal::OnNonExistent) ()
#6  0x0000000000ff08dd in v8::internal::LoadIC::Load(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Name>) ()
#7  0x0000000000ff3ac7 in v8::internal::Runtime_LoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*) ()
#8  0x000000000194b8ae in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#9  0x000000000197182a in Builtins_LdaNamedPropertyHandler ()
#10 0x000026f2fe98816e in ?? ()
#11 0x00000b6bb6c825a1 in ?? ()
#12 0x00000b6bb6c825a1 in ?? ()
#13 0x0000396a29b83051 in ?? ()
#14 0x00002dbad210ecd1 in ?? ()
#15 0x0000396a29b836f9 in ?? ()
#16 0x00002dbad21743e1 in ?? ()
#17 0x00000b6bb6c825a1 in ?? ()
#18 0x00000b6bb6c825a1 in ?? ()
#19 0x00000b6bb6c825a1 in ?? ()
#20 0x0000007600000000 in ?? ()
#21 0x00002d348874baf1 in ?? ()
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) thr 5
[Switching to thread 5 (Thread 0x7fac1ac03780 (LWP 80857))]
#0  0x00007fac19870faf in _int_free () from /lib64/libc.so.6
(gdb) where
#0  0x00007fac19870faf in _int_free () from /lib64/libc.so.6
#1  0x0000000000955902 in std::_Rb_tree<std::string, std::pair<std::string const, std::string>, std::_Select1st<std::pair<std::string const, std::string> >, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, std::string> >*) ()
#2  0x0000000000955c8f in node::native_module::NativeModuleLoader::~NativeModuleLoader() ()
#3  0x00007fac19829b69 in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007fac19829bb7 in exit () from /lib64/libc.so.6
#5  0x00000000009206a1 in node::Environment::Exit(int) ()
#6  0x0000000000949e50 in node::Exit(v8::FunctionCallbackInfo<v8::Value> const&) ()
#7  0x0000000000bdd738 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#8  0x0000000000bde156 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#9  0x000000000194b8ae in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#10 0x00002f046088816e in ?? ()
#11 0x00002983ce2025a1 in ?? ()
#12 0x000002b5757a2d41 in ?? ()
#13 0x0000000600000000 in ?? ()
#14 0x00002983ce202681 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) 

Another one, a different variant of the previous: the main thread fails to destroy the scoped mutex:

#0  0x00007fd5e2a94207 in raise () from /lib64/libc.so.6
#1  0x00007fd5e2a958f8 in abort () from /lib64/libc.so.6
#2  0x0000000000abdf14 in uv_mutex_destroy (mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:280
#3  0x00007fd5e2a97b69 in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007fd5e2a97bb7 in exit () from /lib64/libc.so.6
#5  0x00000000009206a1 in node::Environment::Exit(int) ()
#6  0x0000000000949e50 in node::Exit(v8::FunctionCallbackInfo<v8::Value> const&) ()

The first means a worker wants to acquire a mutex that has already been destroyed (by the main thread);
the second means the main thread wants to destroy a mutex that is still in use (by a worker).

From this, I infer that this could have come from #25061, as we are force-stopping entities while threads are still alive?
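
For illustration, a minimal standalone sketch (hypothetical, not Node's code) of those two failure modes, using the libuv primitives that appear in the traces above:

// Build with: g++ sketch.cc -luv -lpthread
#include <unistd.h>
#include <uv.h>

static uv_mutex_t global_mutex;  // stands in for a mutex owned by a C++ global

static void helper(void*) {
  for (;;) {
    // Locking a mutex the main thread has already destroyed is undefined
    // behaviour; in the traces above it surfaced as an abort() in uv_mutex_lock.
    uv_mutex_lock(&global_mutex);
    uv_mutex_unlock(&global_mutex);
  }
}

int main() {
  uv_mutex_init(&global_mutex);
  uv_thread_t t;
  uv_thread_create(&t, helper, nullptr);
  usleep(1000);  // let the helper start hammering the mutex
  // Exit path: destroying a mutex a helper may still hold is the second
  // failure mode; uv_mutex_destroy aborts when pthread reports it as busy.
  uv_mutex_destroy(&global_mutex);
  return 0;  // process exits while the helper thread is still running
}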

Further testing revealed that these are not introduced by #25061; they are pre-existing ones.

However, we need a better mechanism to quiesce workers. Probably shutting down the event loop and stopping JS execution is not enough; we should prohibit threads from continuing to play in the C++ land too?

Right now most of the CHECK / abort patterns are based on a single-threaded model, whereas upon failure these checks should be capable of detecting that an exit sequence is in progress and returning (long jump) safely and silently, rather than aborting?

@gireeshpunathil @addaleax Is there anything someone else can do to help either of you move this/#25061 forward at this point? Or is the best thing to do to sit back for a bit and be patient while one or both of you sort this out?

@Trott -
these are the things to be understood (by me)

  • how threads used to co-ordinate with each other, safely (prior to #23233)
  • what is the order in which threads used to leave the space (again prior to #23233)
  • why #23233 changed the co-ordination, and the order
  • Is #25061 a complete solution for the issue that was introduced
  • what are the issues that pre-date #23233
  • and what solution exists for those

I guess @addaleax may have answers to a few or most of these.

My take is:

  • once we gain confidence on #25061 as remedy for the recent failures, land it
  • revisit #23233 and the issue it was trying to solve, to see if it hints at wider, pre-existing issues
  • identify and develop fixes for those pre-existing issues.
  • use test cases in #25076 #25083 #25085 or anything that we custom develop, as verifiers for the fixes

I am happy to continue engaging on this; I have been reading node_platform.cc to get a grip on things, but not with great success. It would be nice if someone who knows the design could brief me on it.

I am happy to continue engaging on this; I have been reading node_platform.cc to get a grip on things, but not with great success. It would be nice if someone who knows the design could brief me on it.

/ping @addaleax @ak239 @matthewloring @ofrobots @fhinkel

As far as #23233 is concerned, I think anybody with experience in multi-threading/synchronization can feel invited to take a look (this is the closest information we have: https://github.com/nodejs/node/issues/25007#issuecomment-447580133 – not sure I still know how to reproduce that :confused:).

one theory.

there is nothing wrong with #23233, other than the fact that it introduced a delay (a couple of milliseconds) for the main thread to bootstrap and proceed, relative to the helpers. (After all, its changes are strictly confined to the start sequences and have nothing to do with the exit route.)

this delay has allowed the helpers to advance further than ever before, relative to the main thread.

when I profiled (--prof) a test without and with #23233, I see that the ticks in the main isolate have reduced in the latter case:

good (prior to #23233):
isolate-0x1fe1d30-v8.log.out:Statistical profiling result from isolate-0x1fe1d30-v8.log, (974 ticks, 1 unaccounted, 0 excluded).

bad (current):
isolate-0x217ee00-v8.log.out:Statistical profiling result from isolate-0x217ee00-v8.log, (751 ticks, 0 unaccounted, 0 excluded).

in small tests such as the one in this issue description, this is critical. When the main thread exits, helpers were previously probably still in their own early booting sequence (kernel / glibc / C++ start routines etc.). With the wait-and-proceed logic of #23233, helpers are spread all over their execution space, including JS, when the main thread goes for exit.

Almost all of the reported CI failures in this story have 'fast exit' as the common theme.

I removed everything from #23233 and replaced the cond wait with a single 5 ms delay, and started getting crashes (attempt to lock a mutex that is already destroyed ...). This Linux box has 120 CPUs, so all helpers were free to run.

But still not the kind of failures we saw on AIX (AIX has peculiar scheduling characteristics wherein a newly spawned thread / process may get the CPU on which the creator was running).

Next, I plan to attempt this on an AIX box where we had an easy recreate of the crashes. If we get all the execsync, execfile, inspector-debug-end sort of failures with this simple delay, I guess we need to close the current chapter through #25061, and revisit pre-existing issues separately.

Suggestions welcome.

@gireeshpunathil Does that mean you're thinking we can/should land #25061 at this time, or would you prefer to wait a bit and gather some more information first?

@Trott - I prefer to defer it until we gather evidence of the above possibility, and one or more SMEs agree that it is the right thing to do!

Hi All. Sorry for not being able to pay attention to this thread sooner.

why #23233 changed the co-ordination, and the order

The previous ordering was unsafe. A JS application can call process.exit(), and if that happens while some essential background threads are still initializing, the process would crash - depending on the operating system. It is possible (likely) that there are still more races that are now exposed by the timing changes, but I do think #23233 solves a problem.

thanks @ofrobots for the clarification. I too believe #23233 only modified some timing window / order of execution between the main and helper threads, which gets exposed on platforms / load patterns where these changes opened up a race condition in which C++ objects are accessed while being destroyed.

Unfortunately I have not been able to substantiate this with evidence: my approach was to remove the change pertinent to #23233 and introduce an artificial delay (on the order of less than a millisecond, matching the overall time the main thread waits for the helpers to be ready):

  • in main thread after spawning all the helpers
  • in helpers when those get spawned (after uv_thread_create)
  • sequence the helpers through a dummy mutex (force threads to acquire and release a lock)

none of these yielded the reported race situations - neither on AIX nor on Linux, the platforms where the issues were found.

At this point I am looking for new ideas.

It's looking like this has been fixed, or at least things have changed such that it is infrequent again.

[screenshot from 30 Dec 2018 omitted]

Any idea what might have done it? Various C++ changes that have landed lately? More powerful or otherwise different AIX box in CI? (@nodejs/build) Something else?

@Trott - AFAIK this is still an issue, and I can reproduce this on AIX on master in a few iterations.

On the other hand, I was looking at this piece of code and wondering:
https://github.com/nodejs/node/blob/ee12607c1cc12f6b366de563004d984a929297b5/src/node_platform.cc#L173

  • why is this lock required? Only the main thread would be executing this block of code?
  • what would be the life span of the lock object thus created, especially in the threads that are created from within? Does the lock have any bearing on those threads at all? In my experiments, having or not having this lock has proven (with a high degree of consistency) to be the inflection point.

I guess I got the answer to the above query myself: it is a mechanism to communicate with the new threads, with the mutex used as the synchronization primitive for the pthread_cond_t.
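
In other words, the usual condition-variable handshake. A minimal standalone sketch (hypothetical names, assuming that pattern) of a main thread waiting for its platform workers to signal readiness, with the mutex serving only as the guard for the condition variable's predicate:

#include <uv.h>

struct Startup {
  uv_mutex_t mutex;
  uv_cond_t cond;
  int ready = 0;  // the predicate guarded by the mutex
};

static void platform_worker(void* arg) {
  Startup* s = static_cast<Startup*>(arg);
  uv_mutex_lock(&s->mutex);
  s->ready++;                // announce that this thread is up
  uv_cond_signal(&s->cond);
  uv_mutex_unlock(&s->mutex);
  // ... pull tasks from the task queue ...
}

int main() {
  constexpr int kThreads = 4;
  Startup s;
  uv_mutex_init(&s.mutex);
  uv_cond_init(&s.cond);
  uv_thread_t threads[kThreads];
  for (int i = 0; i < kThreads; i++)
    uv_thread_create(&threads[i], platform_worker, &s);
  uv_mutex_lock(&s.mutex);
  while (s.ready < kThreads)          // main thread blocks until all workers checked in
    uv_cond_wait(&s.cond, &s.mutex);  // atomically releases and re-acquires the mutex
  uv_mutex_unlock(&s.mutex);
  for (int i = 0; i < kThreads; i++)
    uv_thread_join(&threads[i]);
  uv_cond_destroy(&s.cond);
  uv_mutex_destroy(&s.mutex);
  return 0;
}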

does the mutex platform_workers_mutex_ get unlocked twice? Once when the ScopedLock wrapper goes out of its scope, and again when the unique_ptr object that was made to point to the PlatformWorkerData structure at the thread entry point goes out of its scope?

ok, all previous theories are invalidated by the new findings. I rebased onto 804138093abb2c63cfaba5ee058e18df19b38f3e, which has the maximum recreate frequency, and got this crash (the consistent pattern of failures in many tests):

_ZN2v88internal20ArrayBufferCollector11FreeingTask11RunInternalEv(??) at 0x1007cdf1c
_ZN2v88internal14CancelableTask3RunEv(??) at 0x100074fb8
node_platform._ZN4node12_GLOBAL__N_1L20PlatformWorkerThreadEPv(??) at 0x10029cec4

The v8::internal::ArrayBufferCollector::FreeingTask::RunInternal code then looks like this:

  void RunInternal() final {
    TRACE_BACKGROUND_GC(
        heap_->tracer(),
        GCTracer::BackgroundScope::BACKGROUND_ARRAY_BUFFER_FREE);
    heap_->array_buffer_collector()->FreeAllocations();
  }

expanding the macro TRACE_BACKGROUND_GC we get a lot of code, but the crashing point is:
v8::internal::tracing::TraceEventHelper::GetTracingController()->GetCategoryGroupEnabled("disabled-by-default-v8.gc");

with the return value from GetTracingController being stale or garbage.

0x1007cdbec (RunInternal)       7c0802a6        mflr   r0
...
0x1007cde9c (RunInternal+0x2b0) 4b8a3679          bl   0x100071514 (_ZN2v88internal7tracing16TraceEventHelper20GetTracingControllerEv)    
0x1007cdea0 (RunInternal+0x2b4) 60000000         ori   r0,r0,0x0
0x1007cdea4 (RunInternal+0x2b8) e9420898          ld   r10,0x898(r2)
0x1007cdea8 (RunInternal+0x2bc) e9230000          ld   r9,0x0(r3)
0x1007cdeac (RunInternal+0x2c0) e9290010          ld   r9,0x10(r9)
0x1007cdeb0 (RunInternal+0x2c4) 7fa95000         cmp   cr7,0x1,r9,r10
0x1007cdeb4 (RunInternal+0x2c8) 409e0068         bne   cr7,0x1007cdf1c (RunInternal+0x330) 
...

at this point the expectation is:
r3: return value of GetTracingController() - a node::tracing::TracingController object
r9: vtable of object in r3

0x1007cdf1c (RunInternal+0x330) f8410028         std   r2,0x28(r1)    // store old TOC
0x1007cdf20 (RunInternal+0x334) e8827eb8          ld   r4,0x7eb8(r2)  // get string param into r4 ("disabled-by-default-v8.gc")
0x1007cdf24 (RunInternal+0x338) e9490000          ld   r10,0x0(r9)    // get target code entry into r10 (GetCategoryGroupEnabled)
0x1007cdf28 (RunInternal+0x33c) e9690010          ld   r11,0x10(r9)   // get code bounds into r11
0x1007cdf2c (RunInternal+0x340) 7d4903a6       mtctr   r10            // move code entry into ctr reg.
0x1007cdf30 (RunInternal+0x344) e8490008          ld   r2,0x8(r9)     // load new TOC
0x1007cdf34 (RunInternal+0x348) 4e800421       bctrl                  // branch to ctr reg, invoke GetCategoryGroupEnabled

because r3 was stale / garbage, everything derived from it is bad: r2, r9, r10, r11, ... and ctr was NULL, so we got SIGILL.

So basically this points to

https://github.com/nodejs/node/blob/882e0225d5855a17dafa10820bf1a0b833df8dac/src/node_platform.cc#L461

being corrupt or stale.

Not sure whether anything has changed since then that masks / fixes this.

Further testing confirmed that node::tracing::TracingController::~TracingController() had run, through node::tracing::Agent::~Agent(), prior to the concurrent collection tracing that crashed.

So #25061, which reverses the order (dispose of the platform and thereby quiesce the workers prior to the agent and tracing disposal), should solve this type of issue.
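
A rough sketch (hypothetical types, not Node's actual classes) of that ordering idea: the platform owns and joins the background threads, so it has to be disposed of before the tracing controller those threads call into.

#include <atomic>
#include <thread>
#include <vector>

// Stand-in for the tracing controller that V8 background tasks consult.
struct FakeTracingController {
  const char* GetCategoryGroupEnabled(const char*) { return ""; }
};

struct FakePlatform {
  FakeTracingController* tracing;
  std::atomic<bool> stop{false};
  std::vector<std::thread> workers;

  explicit FakePlatform(FakeTracingController* t) : tracing(t) {
    workers.emplace_back([this] {
      while (!stop)  // background task (e.g. GC) keeps consulting tracing
        tracing->GetCategoryGroupEnabled("disabled-by-default-v8.gc");
    });
  }
  ~FakePlatform() {  // quiesces and joins the worker threads
    stop = true;
    for (auto& w : workers) w.join();
  }
};

int main() {
  auto* tracing = new FakeTracingController();
  auto* platform = new FakePlatform(tracing);
  // Pre-#25061 order (racy): `delete tracing; delete platform;` would let the
  // worker dereference a destroyed controller.
  delete platform;  // stop and join the background threads first...
  delete tracing;   // ...then it is safe to tear down tracing.
  return 0;
}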

Why e273abc01be9f04511e44cacea20c4f4935b970f introduced the issue remains inconclusive, but it looks immaterial now, as this race condition has always existed and would probably have been exposed by the timing changes it brought in.

btw, this question
does the mutex platform_workers_mutex_ get unlocked twice?
still lingers in my mind.

So #25061, which reverses the order (dispose of the platform and thereby quiesce the workers prior to the agent and tracing disposal), should solve this type of issue.

This makes sense. Thanks for the investigation :+1:.

@addaleax @gireeshpunathil @ofrobots Based on https://github.com/nodejs/node/issues/25007#issuecomment-451154661, I'd be inclined to land https://github.com/nodejs/node/pull/25061 in its current state (removing the WIP label). Would you agree, or is it preferable to figure out more about why e273abc amplified the existing problem (on AIX, at least)?

I am in favor of landing #25061 as a remedy for the current exit-race situations, while addressing pre-existing issues on a case-by-case basis.

Almost all of the tests that had manifested exit-race issues are back to normal, so if those all stay happy for the next week or so in regular CI, we should be good with this issue.

This pattern is not observed anymore in CI, thanks to #25061. So, good to close.
