I came across segmentation faults in OpenSSL internals while working on changes to our crypto module. I was able to reproduce them on master by modifying test/parallel/test-crypto-keygen.js to throw an error:
diff --git a/test/parallel/test-crypto-keygen.js b/test/parallel/test-crypto-keygen.js
index ebbac76..62475e7 100644
--- a/test/parallel/test-crypto-keygen.js
+++ b/test/parallel/test-crypto-keygen.js
@@ -276,6 +276,7 @@ const sec1EncExp = (cipher) => getRegExpForPEM('EC PRIVATE KEY', cipher);
}, /bad decrypt|asn1 encoding routines/);
testSignVerify(publicKey, { key: privateKey, passphrase: 'secret' });
+ throw new Error();
}));
}
Output:
~/node$ ./node_g test/parallel/test-crypto-keygen.js
/home/tniessen/node/test/parallel/test-crypto-keygen.js:279
throw new Error();
^
Error
at AsyncWrap.common.mustCall (/.../node/test/parallel/test-crypto-keygen.js:279:11)
at AsyncWrap.<anonymous> (/.../node/test/common/index.js:370:15)
at AsyncWrap.wrap.ondone (internal/crypto/keygen.js:52:14)
Segmentation fault (core dumped)
Trace in GDB:
Core was generated by `./node_g test/parallel/test-crypto-keygen.js'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___pthread_rwlock_wrlock (rwlock=0x0) at pthread_rwlock_wrlock.c:100
100 pthread_rwlock_wrlock.c: No such file or directory.
[Current thread is 1 (Thread 0x7fd5e2bfd700 (LWP 25121))]
(gdb) bt
#0 __GI___pthread_rwlock_wrlock (rwlock=0x0) at pthread_rwlock_wrlock.c:100
#1 0x0000000001dcc442 in CRYPTO_THREAD_write_lock (lock=0x0) at ../deps/openssl/openssl/crypto/threads_pthread.c:73
#2 0x0000000001da12c6 in RAND_get_rand_method () at ../deps/openssl/openssl/crypto/rand/rand_lib.c:744
#3 0x0000000001da14a7 in RAND_priv_bytes (buf=0x7fd5d0006710 "", num=64) at ../deps/openssl/openssl/crypto/rand/rand_lib.c:816
#4 0x0000000001cd6224 in bnrand (flag=PRIVATE, rnd=0x7fd5d0006380, bits=512, top=-1, bottom=0) at ../deps/openssl/openssl/crypto/bn/bn_rand.c:46
#5 0x0000000001cd6688 in bnrand_range (flag=PRIVATE, r=0x7fd5d0006380, range=0x7fd5d0006350) at ../deps/openssl/openssl/crypto/bn/bn_rand.c:162
#6 0x0000000001cd672d in BN_priv_rand_range (r=0x7fd5d0006380, range=0x7fd5d0006350) at ../deps/openssl/openssl/crypto/bn/bn_rand.c:184
#7 0x0000000001cd4cbf in BN_is_prime_fasttest_ex (a=0x7fd5d0005ab0, checks=5, ctx_passed=0x7fd5d0005a20, do_trial_division=0, cb=0x0) at ../deps/openssl/openssl/crypto/bn/bn_prime.c:220
#8 0x0000000001cd47d5 in BN_generate_prime_ex (ret=0x7fd5d0005ab0, bits=512, safe=0, add=0x0, rem=0x0, cb=0x0) at ../deps/openssl/openssl/crypto/bn/bn_prime.c:103
#9 0x0000000001daa85e in rsa_builtin_keygen (rsa=0x7fd5d00060b0, bits=1024, primes=2, e_value=0x7fd5d0000a10, cb=0x0) at ../deps/openssl/openssl/crypto/rsa/rsa_gen.c:164
#10 0x0000000001daa2b7 in RSA_generate_multi_prime_key (rsa=0x7fd5d00060b0, bits=1024, primes=2, e_value=0x7fd5d0000a10, cb=0x0) at ../deps/openssl/openssl/crypto/rsa/rsa_gen.c:61
#11 0x0000000001db2076 in pkey_rsa_keygen (ctx=0x7fd5d00008c0, pkey=0x7fd5d0000b50) at ../deps/openssl/openssl/crypto/rsa/rsa_pmeth.c:742
#12 0x0000000001d6a0fd in EVP_PKEY_keygen (ctx=0x7fd5d00008c0, ppkey=0x7fd5e2bfce00) at ../deps/openssl/openssl/crypto/evp/pmeth_gn.c:108
#13 0x0000000000fbe3df in node::crypto::GenerateKeyPairJob::GenerateKey (this=0x510a840) at ../src/node_crypto.cc:5793
#14 0x0000000000fbe2de in node::crypto::GenerateKeyPairJob::DoThreadPoolWork (this=0x510a840) at ../src/node_crypto.cc:5770
#15 0x0000000000e06a79 in node::ThreadPoolWork::ScheduleWork()::{lambda(uv_work_s*)#1}::operator()(uv_work_s*) const (__closure=0x0, req=0x510a850) at ../src/node_internals.h:233
#16 0x0000000000e06afd in node::ThreadPoolWork::ScheduleWork()::{lambda(uv_work_s*)#1}::_FUN(uv_work_s*) () at ../src/node_internals.h:234
#17 0x0000000001020edc in uv__queue_work (w=0x510a8a8) at ../deps/uv/src/threadpool.c:321
#18 0x000000000102068f in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#19 0x00007fd5eaa206ba in start_thread (arg=0x7fd5e2bfd700) at pthread_create.c:333
#20 0x00007fd5ea75641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
My best guess is that we are shutting down OpenSSL after the error on the main thread occurs, causing background tasks such as key generation to run into problems when trying to use internal resources that have been cleaned up already.
Yes:
[...]
==20103== Possible data race during write of size 4 at 0x23BBF48 by thread #1
==20103== Locks held: 1, at address 0x61F03D0
==20103== at 0x12FDA2F: OPENSSL_cleanup (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x5DEF040: __run_exit_handlers (exit.c:108)
==20103== by 0x5DEF139: exit (exit.c:139)
==20103== by 0x74BA71: node::Environment::Exit(int) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x7A3C9D: node::FatalException(v8::Isolate*, v8::Local<v8::Value>, v8::Local<v8::Message>) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x77143D: node::OnMessage(v8::Local<v8::Message>, v8::Local<v8::Value>) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0xE6C6C0: v8::internal::MessageHandler::ReportMessageNoExceptions(v8::internal::Isolate*, v8::internal::MessageLocation const*, v8::internal::Handle<v8::internal::Object>, v8::Local<v8::Value>) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0xE6C8B3: v8::internal::MessageHandler::ReportMessage(v8::internal::Isolate*, v8::internal::MessageLocation const*, v8::internal::Handle<v8::internal::JSMessageObject>) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0xE2CF86: v8::internal::Isolate::ReportPendingMessagesImpl(bool) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0xD218D7: v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x98E88E: v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x730CC3: node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) (in /home/sqrt/src/node/master/out/Release/node)
==20103==
==20103== This conflicts with a previous read of size 4 by thread #10
==20103== Locks held: none
==20103== at 0x12FDCC0: OPENSSL_init_crypto (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x12D58D2: ERR_get_state (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x12D7018: ERR_set_mark (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x13349AE: RSA_generate_multi_prime_key (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x133A896: pkey_rsa_keygen (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x12F8812: EVP_PKEY_keygen (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x8AABDF: node::crypto::GenerateKeyPairJob::DoThreadPoolWork() (in /home/sqrt/src/node/master/out/Release/node)
==20103== by 0x8F3537: worker (threadpool.c:122)
==20103== Address 0x23bbf48 is 0 bytes inside data symbol "stopped"
[...]
@addaleax Do you have an idea how to fix this? The only way I can think of would be to wait for background tasks to finish before performing the cleanup, but I don't know whether that's actually a good idea. You might have more useful insights :)
@tniessen I haven’t caught up on @davisjam’s work on threadpool customization in libuv, but that might be worth looking into as a “real” fix – moving the threadpool implementation into Node.js and draining all currently executing tasks, similar to how we shut down the V8 platform threads before exiting.
Without libuv support, I guess the best thing we could do is to wait for currently executing background tasks to stop before exiting, yes.
Maybe we only need to remove the #ifndef around https://github.com/libuv/libuv/blob/a9e6cc7b9967308cac61414aa907070412461c8f/src/threadpool.c#L163-L185?
@addaleax I don't think I understand. How would that affect the behavior on platforms such as Ubuntu?
@tniessen Oh, I think I misread – it is already enabled except on Windows…
So I guess this might be a static deinitialization order problem – if the libuv cleanup code runs first, everything is fine, because all worker threads will have been joined by the time OpenSSL tears down its state. If the OpenSSL cleanup code runs first, we have a problem, because worker threads may still be running.
Maybe adjusting the priority of the cleanup hook is a way to address this (assuming cross-platform support for that exists)?
@addaleax I believe I am still waiting on a code review on the libuv side from e.g. @bnoordhuis or @cjihrig.
@davisjam Can you ping me in the PR(s)? Maybe I can find time to review things sooner rather than later :)
@addaleax Are we using atexit() for libuv? OpenSSL uses it and there is no way to specify priorities as far as I know, so we will need to perform libuv cleanup ourselves before calling exit(). (I kind of expected that to happen already.)
@tniessen No, we’re not – what we have there is a static destructor. I would assume that these run after atexit(), but I don’t know for sure.
@addaleax That makes sense. So we will probably need to change that and explicitly handle thread shutdown before calling exit().
> I would assume that these run after atexit(), but I don’t know for sure.
@addaleax You are right :) This is the first thing that happens when NodePlatform::Shutdown runs:
#0 uv_thread_join (tid=0x37f41e0) at ../deps/uv/src/unix/thread.c:250
#1 0x0000000000f0d37f in node::WorkerThreadsTaskRunner::Shutdown (this=0x37f36a0) at ../src/node_platform.cc:217
#2 0x0000000000f0dd13 in node::NodePlatform::Shutdown (this=0x37f3610) at ../src/node_platform.cc:321
#3 0x0000000000ddb750 in node::V8Platform::Dispose (this=0x37ae580 <node::per_process::v8_platform>) at ../src/node_v8_platform-inl.h:100
#4 0x0000000000ddb802 in node::DisposePlatform () at ../src/node_v8_platform-inl.h:179
#5 0x0000000000dd737c in node::Environment::Exit (this=0x7fffffffd650, exit_code=0) at ../src/env.cc:880
Then, the OpenSSL atexit handler runs. This is what causes the segmentation fault.
#0 OPENSSL_cleanup () at ../deps/openssl/openssl/crypto/init.c:459
#1 0x00007ffff6b7cff8 in __run_exit_handlers (status=0, listp=0x7ffff6f075f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#2 0x00007ffff6b7d045 in __GI_exit (status=<optimized out>) at exit.c:104
#3 0x0000000000dd7386 in node::Environment::Exit (this=0x7fffffffd650, exit_code=0) at ../src/env.cc:881
Finally, libuv's cleanup function runs as an ELF destructor via glibc's _dl_fini:
#0 uv_thread_join (tid=0x37af480 <default_threads>) at ../deps/uv/src/unix/thread.c:250
#1 0x000000000105575e in cleanup () at ../deps/uv/src/threadpool.c:173
#2 0x00007ffff7de7de7 in _dl_fini () at dl-fini.c:235
#3 0x00007ffff6b7cff8 in __run_exit_handlers (status=0, listp=0x7ffff6f075f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#4 0x00007ffff6b7d045 in __GI_exit (status=<optimized out>) at exit.c:104
#5 0x0000000000dd7386 in node::Environment::Exit (this=0x7fffffffd650, exit_code=0) at ../src/env.cc:881