Since last year I've been having issues switching from 6.x to 10.x because of rare crashes that give no error. I have several production servers with hundreds of users, and so far it's been too difficult to reproduce with a local server because the crash seems to only happen under load, and it's fairly rare.
There's a large amount of code that I can't really reduce for testing without affecting my users. Since I haven't had the time or ability to debug this properly, I was hoping someone could point me in the right direction, or that this would be helpful to the developers if it's a bug in 10.x or V8.
There are no errors to report, but here are some of the recent segfaults from various syslogs; they only happen when a server is on Node 10:
Feb 12 21:56:29 EU1 kernel: [22262355.042532] node[20512]: segfault at 3d77ac406468 ip 0000000000efade2 sp 00007f263e259c10 error 4 in node[400000+1e90000]
Mar 13 19:39:09 SEA1 kernel: [2908566.872204] node[4145]: segfault at 23e4891a2fb0 ip 0000000000efa272 sp 00007fbbf4dd0c10 error 4 in node[400000+1e8f000]
Mar 16 08:59:55 SEA1 kernel: [3129393.630360] node[14805]: segfault at 1ac3c9db79c8 ip 0000000000efa272 sp 00007f10629f0c10 error 4 in node[400000+1e8f000]
Mar 16 20:25:29 USW1 kernel: [3173535.851715] node[31823]: segfault at 13aa4aac9a78 ip 0000000000efa272 sp 00007fdb85380c10 error 4 in node[400000+1e8f000]
Mar 17 00:26:56 USE2 kernel: [25489067.874929] node[17011]: segfault at 93788ef0108 ip 0000000000efade2 sp 00007fc14bffec10 error 4 in node[400000+1e90000]
Mar 19 22:10:11 USW1 kernel: [3438995.257871] node[11791]: segfault at 7b0370d05c8 ip 0000000000efa272 sp 00007f88d3403c10 error 4 in node[400000+1e8f000]
Mar 21 11:46:28 USW1 kernel: [3574361.032453] node[18756]: segfault at 10f0170f9b8 ip 0000000000efa272 sp 00007fdb9e281c10 error 4 in node (deleted)[400000+1e8f000]
Mar 27 18:55:30 USE2 kernel: [26419545.476970] node[21011]: segfault at 706d2e8f0b8 ip 0000000000efade2 sp 00007f2b59fadc10 error 4 in node[400000+1e90000]
Apr 1 20:39:24 SEA1 kernel: [319450.383166] node[8710]: segfault at 16f9cfd832b0 ip 0000000000efa272 sp 00007f7850aa2c10 error 4 in node[400000+1e8f000]
Apr 1 20:23:53 USE1 kernel: [26742046.491931] node[4466]: segfault at 3253e97d4310 ip 0000000000efa272 sp 00007f1b2fffec10 error 4 in node[400000+1e8f000]
Apr 2 14:58:52 EU1 kernel: [26470558.192840] node[27273]: segfault at 1bbcf96ba358 ip 0000000000efade2 sp 00007f1e95a03c10 error 4 in node[400000+1e90000]
The module versions have varied over the last year and across servers, but here is what's currently in use on USW1 as an example:
I haven't done much testing on Node 8.x due to a major garbage collection delay, which is another issue altogether. Similarly, I haven't had success on 11.x. I was able to reproduce the crash on my local Windows machine once upon a time, after days of experimentation on Node 10, but not reliably, and I can't seem to get it to occur anymore. It's possible some newer changes have reduced the frequency of this crash, because I remember it happening more often in 2018.
I've mostly been using the `n` version switcher to test different versions. Let me know if there is any other info I can provide, or some way to look up these segfault addresses to narrow things down. Thanks!
Have you collected core files? They can be used to get more information about the crash.
Can you add https://github.com/nodejs/node-report to your app? It will give you loggable text output on a crash, which will help a lot. It's built into node now, but for 10.x you'll have to add it to your app yourself; it's not hard: https://github.com/nodejs/node-report#usage
Also, do you use native addons, and if so, which? (`find node_modules -name '*.node'` might help.) As far as I can tell, the instruction pointers at the time of the crash refer to two places in the garbage collection code.
To me, that makes it seem like an issue where a core dump would be more important than node-report output, although even the former might be hard to work with, because it's likely to be some sort of memory corruption issue where it's hard to figure out the cause from the crash site.
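Roughly, wiring up node-report on 10.x looks like this (an untested sketch; `server.js` stands in for your actual entry point):

```shell
# Add the standalone node-report module to the app's dependencies
npm install node-report

# Preload it so a human-readable report file is written on a crash or
# fatal signal, without having to touch the application code itself
node -r node-report server.js
```

Alternatively you can `require('node-report')` near the top of the app, per the README linked above.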
Thanks for the help! For now I've tried enabling core dumps on one of the 10.x servers with ulimit... assuming I did it right, it could take days or weeks to see the crash though.
I forgot to mention that I use PM2 to restart the server process whenever it goes down, hopefully that doesn't matter.
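In case I botched the ulimit part, here's roughly what I ran (a sketch; if PM2 is launched by systemd or an init script, the limit probably has to be set there as well):

```shell
# Allow core files of unlimited size in this shell and its children;
# the node process has to inherit this limit to dump a core
ulimit -c unlimited

# Verify the limit took effect
ulimit -c

# On Linux, check where the kernel writes cores; if this names a pipe
# (e.g. apport or systemd-coredump), the dump won't land in the cwd
if [ -r /proc/sys/kernel/core_pattern ]; then
  cat /proc/sys/kernel/core_pattern
fi
```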
As for native addons, I'm not sure... I ran that find command and it just shows:
node_modules/v8-profiler/build/profiler/v5.7.0/node-v48-linux-x64/profiler.node
That module isn't loaded in by default though, so I'm assuming it isn't related to the crash. I have to load it manually at run-time, and I've only used it in some rare situations where I needed to profile for CPU bottlenecks. The basic modules I use other than ws and mongodb are http, https, os, and fs.
I'm not familiar with memory corruption issues, but let me know if there's anything else I can do that might help. I wondered if it might be something in the garbage collector... I've noticed that the GC in 10.x will have a rare delay spike of 2000-3000 ms out of nowhere. On 6.x a sweep usually only takes around 300 ms, but they happen more often.
The current flags used (on all node versions):
--max-old-space-size=2400 --nouse-idle-notification --expose-gc
I got one pretty quickly! A core file from this crash:
Apr 6 16:35:05 USE1 kernel: [27160290.897495] node[5357]: segfault at 1f6b91e12128 ip 0000000000efa272 sp 00007fea19e45c10 error 4 in node[400000+1e8f000]
It's about 1 GB in size (90 MB compressed). I'd love for someone to take a look at it, though I'm not sure of the best way to send it, and I'd like to avoid posting it publicly if possible. I added an email to my profile if someone can get in touch :)
@jvictor85 Can you get the stack trace? `gdb /path/to/node /path/to/core`, then `bt`.
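Or non-interactively, if that's easier (same placeholder paths):

```shell
# -batch exits after running the given commands; -ex 'bt' prints the
# backtrace of the crashing thread from the core file
gdb -batch -ex 'bt' /path/to/node /path/to/core
```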
Thanks! I tried that, and it said it loaded symbols from node, though it just shows these addresses:
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x00001b50e646b619 in ?? ()
[Current thread is 1 (LWP 32374)]
(gdb) bt
#0 0x00001b50e646b619 in ?? ()
#1 0x00000005000001ba in ?? ()
#2 0x0000000000000000 in ?? ()
Apparently I had some core files from before and didn't even realize it. Here is the stack trace from the Feb 12th segfault above; it seems to be one of two different types of crashes I get. This one is at address `0000000000efade2` and has some better symbols in it:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f263e25a700 (LWP 20512))]
(gdb) bt
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4 0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5 0x00007f263ee546db in start_thread (arg=0x7f263e25a700) at pthread_create.c:463
#6 0x00007f263eb7d88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This EU server is running v10.15.0.
Okay, so it's a crash during GC. That probably means memory corruption, but unfortunately the real bug is usually not anywhere near the site of the crash, in either time or space.
What does `info maps` print?
Sorry, I'm not familiar with gdb, but when I run that it says: Undefined info command: "maps".
If it's memory corruption during GC, I'm guessing this isn't anything I can fix in my own code then... If it helps, I remember trying to isolate this last year, and removing the check `if (typeof x === "object")` in a distance calculation function reduced the frequency of crashes by a lot. It made no sense to me, but it went from crashing every 10-60 minutes to how it is now: about once a week and only under load. I'm not sure I can even reproduce that frequent crash anymore though; it was so unpredictable that simply adding a line to output something to the console would prevent the crash from happening in some cases.
Sorry, I was thinking of `info proc mappings`, but that doesn't work with core files, only with live processes. It'd be useful to check whether the backtrace is always the same, i.e., whether the crash site is consistent.
Aside: `--nouse-idle-notification` is a bit of lore that survived from Ye Olden Days, but it's a no-op nowadays; you don't need it.
Thanks, I'll phase out that flag going forward. I tracked down some more core files and collected the backtraces below, along with the Node version and date. For the SEA server I wasn't sure of the exact version used, so I included backtraces for both possibilities.
SA1 (Node v10.2.1) May 27 2018
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x0000275f5cc74c60 in ?? ()
[Current thread is 1 (Thread 0x7f1966f83740 (LWP 2979))]
(gdb) bt
#0 0x0000275f5cc74c60 in ?? ()
#1 0x00007fff661431a0 in ?? ()
#2 0x0000275f5c9b0d79 in ?? ()
etc...
USW1 (Node v10.4.0) June 9 2018
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x000017b9ddac0db9 in ?? ()
[Current thread is 1 (Thread 0x7ff9daa14740 (LWP 21738))]
(gdb) bt
#0 0x000017b9ddac0db9 in ?? ()
#1 0x00001bbbc8e822e1 in ?? ()
#2 0x00001bbbc8e822e1 in ?? ()
...
USE2 (Node v10.15.0) March 16 2019
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fc14bfff700 (LWP 17011))]
(gdb) bt
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4 0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5 0x00007fc1519e86db in start_thread (arg=0x7fc14bfff700) at pthread_create.c:463
#6 0x00007fc15171188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
USW2 (Node v10.13.0) November 18 2018
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x000021a9ad996aa4 in ?? ()
[Current thread is 1 (Thread 0x7f83a4de3740 (LWP 2306))]
(gdb) bt
#0 0x000021a9ad996aa4 in ?? ()
#1 0x00000000032b6ce0 in ?? ()
#2 0x0000000000000001 in ?? ()
#3 0x0000000000000000 in ?? ()
SEA1 (Node v10.15.?) February 17 2019
Node v10.15.0:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f13fa7fc700 (LWP 10353))]
(gdb) bt
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4 0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5 0x00007f14010796db in start_thread (arg=0x7f13fa7fc700) at pthread_create.c:463
#6 0x00007f1400da288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Node v10.15.1:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efade2 in void v8::internal::BodyDescriptorApply<v8::internal::CallIterateBody, void, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*>(v8::internal::InstanceType, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*) [clone .constprop.755] ()
[Current thread is 1 (LWP 10353)]
(gdb) bt
#0 0x0000000000efade2 in void v8::internal::BodyDescriptorApply<v8::internal::CallIterateBody, void, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*>(v8::internal::InstanceType, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*) [clone .constprop.755] ()
#1 0x0000000000ee266b in v8::internal::MarkCompactCollector::~MarkCompactCollector() ()
#2 0x0000000000edfd0f in v8::internal::ItemParallelJob::Run(std::shared_ptr<v8::internal::Counters>) ()
#3 0x0000000000bc7166 in v8::internal::Cancelable::~Cancelable() ()
#4 0x00000000009625b9 in node::os::Initialize(v8::Local<v8::Object>, v8::Local<v8::Value>, v8::Local<v8::Context>) ()
#5 0x00007f14010796db in ?? ()
#6 0x0000000000000000 in ?? ()
UST1 (Node v10.15.0) February 28 2019
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7ff5a77fe700 (LWP 19333))]
(gdb) bt
#0 0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4 0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5 0x00007ff5ad9de6db in start_thread (arg=0x7ff5a77fe700) at pthread_create.c:463
#6 0x00007ff5ad70788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Hopefully someone can make sense of these or find a pattern that leads to something. We would love to finally be able to switch to Node 10 for good; performance is critical for us. Let me know if there's anything else I can do. Thanks!
Also just got a new one today:
USE1 (Node v10.15.2) April 12 2019
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f4bd59e4700 (LWP 2001))]
(gdb) bt
#0 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#4 0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd59e4700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Many of them seem to point to this same function, though on this server and a couple of others it's at the `efa272` address rather than `efade2`.
One thing worth mentioning is the presence of `TRACE_EVENT0` in this routine, which internally accesses the trace agent; in the past we have seen an exit race between the main thread and the helper threads, wherein the trace agent was cleaned up while other threads still had pending work that used it.
To confirm, could you please dump the stack of every thread? `(gdb) info thr`, then switch to each thread with `thr 1`, `thr 2`, etc., and issue `bt` in each.
If the main thread is seen in the exit path with some progress into its cleanup actions, then we are looking at what I suspect.
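As a shortcut, gdb can also dump every thread's stack in one go (placeholder paths again):

```shell
# 'thread apply all bt' runs bt in each thread of the core file,
# so you don't have to switch threads by hand
gdb -batch -ex 'thread apply all bt' /path/to/node /path/to/core
```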
Sure thing, here are all 11 threads for the latest core dump from above:
(gdb) info thr
Id Target Id Frame
* 1 Thread 0x7f4bd59e4700 (LWP 2001) 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
2 Thread 0x7f4bd69e6700 (LWP 1999) 0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00,
maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
3 Thread 0x7f4bd7f68700 (LWP 2004) 0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
4 Thread 0x7f4bbffff700 (LWP 2005) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
5 Thread 0x7f4bbf7fe700 (LWP 2006) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x24a7fcc <cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
6 Thread 0x7f4bbeffd700 (LWP 2007) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
7 Thread 0x7f4bbe7fc700 (LWP 2008) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
8 Thread 0x7f4bd61e5700 (LWP 2000) __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
9 Thread 0x7f4bd7f5b740 (LWP 1998) 0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
10 Thread 0x7f4bd51e3700 (LWP 2002) 0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647,
futex_word=0x3b13d94) at ../sysdeps/unix/sysv/linux/futex-internal.h:231
11 Thread 0x7f4bd49e2700 (LWP 2003) __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
[Switching to thread 1 (Thread 0x7f4bd59e4700 (LWP 2001))]
#0 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
(gdb) bt
#0 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#4 0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd59e4700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 2 (Thread 0x7f4bd69e6700 (LWP 1999))]
#0 0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00, maxevents=maxevents@entry=1024,
timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
42 ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory.
(gdb) bt
#0 0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00, maxevents=maxevents@entry=1024,
timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x0000000000a56b6f in uv__io_poll (loop=loop@entry=0x3b143e8, timeout=-1) at ../deps/uv/src/unix/linux-core.c:275
#2 0x0000000000a4691b in uv_run (loop=0x3b143e8, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:370
#3 0x0000000000966508 in node::BackgroundTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*) ()
#4 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd69e6700) at pthread_create.c:463
#5 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 3 (Thread 0x7f4bd7f68700 (LWP 2004))]
#0 0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>, abstime=0x0)
at sem_waitcommon.c:111
#2 0x00007f4bd6de87c8 in __new_sem_wait_slow (
sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007f4bd6de8839 in __new_sem_wait (sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
at sem_wait.c:42
#4 0x0000000000a53dc2 in uv__sem_wait (sem=<optimized out>) at ../deps/uv/src/unix/thread.c:579
#5 uv_sem_wait (sem=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>) at ../deps/uv/src/unix/thread.c:635
#6 0x00000000009c420a in node::inspector::(anonymous namespace)::StartIoThreadMain(void*) ()
#7 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd7f68700) at pthread_create.c:463
#8 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 4 (Thread 0x7f4bbffff700 (LWP 2005))]
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3 0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
at ../deps/uv/src/unix/thread.c:735
#4 0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbffff700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 5 (Thread 0x7f4bbf7fe700 (LWP 2006))]
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fcc <cond+44>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fcc <cond+44>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3 0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
at ../deps/uv/src/unix/thread.c:735
#4 0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbf7fe700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 6 (Thread 0x7f4bbeffd700 (LWP 2007))]
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3 0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
at ../deps/uv/src/unix/thread.c:735
#4 0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbeffd700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 7 (Thread 0x7f4bbe7fc700 (LWP 2008))]
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3 0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
at ../deps/uv/src/unix/thread.c:735
#4 0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbe7fc700) at pthread_create.c:463
#6 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 8 (Thread 0x7f4bd61e5700 (LWP 2000))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f4bd6de2023 in __GI___pthread_mutex_lock (mutex=0x3b13d58) at ../nptl/pthread_mutex_lock.c:78
#2 0x0000000000a539c9 in uv_mutex_lock (mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:285
#3 0x0000000000962bc1 in node::BackgroundRunner(void*) ()
#4 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd61e5700) at pthread_create.c:463
#5 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 9 (Thread 0x7f4bd7f5b740 (LWP 1998))]
#0 0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
(gdb) bt
#0 0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000edfcd7 in v8::internal::ItemParallelJob::Run(std::shared_ptr<v8::internal::Counters>) ()
#4 0x0000000000eebe68 in v8::internal::MarkCompactCollector::UpdatePointersAfterEvacuation() ()
#5 0x0000000000ef68d4 in v8::internal::MarkCompactCollector::Evacuate() ()
#6 0x0000000000ef7342 in v8::internal::MarkCompactCollector::CollectGarbage() ()
#7 0x0000000000ecf191 in v8::internal::Heap::MarkCompact() ()
#8 0x0000000000ecf881 in v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) ()
#9 0x0000000000ed0484 in v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) ()
#10 0x0000000000ed18bf in v8::internal::Heap::HandleGCRequest() ()
#11 0x0000000000e70c3a in v8::internal::StackGuard::HandleInterrupts() ()
#12 0x000000000113bde5 in v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*) ()
#13 0x000025b8b33dbe1d in ?? ()
#14 0x0000214054f03dc1 in ?? ()
#15 0x000025b8b33dbd81 in ?? ()
#16 0x00007ffdbf6ff590 in ?? ()
#17 0x0000000000000006 in ?? ()
#18 0x00007ffdbf6ff6a8 in ?? ()
#19 0x000025b8b66b6767 in ?? ()
#20 0x4032000000000000 in ?? ()
#21 0x402a000000000000 in ?? ()
#22 0x0000000000000000 in ?? ()
[Switching to thread 10 (Thread 0x7f4bd51e3700 (LWP 2002))]
#0 0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647, futex_word=0x3b13d94)
at ../sysdeps/unix/sysv/linux/futex-internal.h:231
231 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647, futex_word=0x3b13d94)
at ../sysdeps/unix/sysv/linux/futex-internal.h:231
#1 __condvar_dec_grefs (private=<optimized out>, g=1, cond=0x3b13d80) at pthread_cond_wait.c:160
#2 __pthread_cond_wait_common (abstime=0x0, mutex=0x3b13d58, cond=0x3b13d80) at pthread_cond_wait.c:559
#3 __pthread_cond_wait (cond=0x3b13d80, mutex=0x3b13d58) at pthread_cond_wait.c:655
#4 0x0000000000a53fe9 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:735
#5 0x0000000000962b64 in node::BackgroundRunner(void*) ()
#6 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd51e3700) at pthread_create.c:463
#7 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 11 (Thread 0x7f4bd49e2700 (LWP 2003))]
#0 __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
3101 malloc.c: No such file or directory.
(gdb) bt
#0 __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
#1 0x0000000000f15ab8 in void v8::internal::MemoryChunk::ReleaseSlotSet<(v8::internal::RememberedSetType)1>() ()
#2 0x0000000000efa3dc in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#3 0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#4 0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#5 0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#6 0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#7 0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd49e2700) at pthread_create.c:463
#8 0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Pretty neat. I wonder if there is anything I can change about my code or environment that would encourage this crash, or avoid it. Or maybe this is just something in V8, like the exit race you mentioned? I appreciate the help!
Thanks for the quick reply! We don't see the main thread in the list. Except for thread 9, whose bottom frames are truncated (probably due to JS code), everything else is a helper thread.
In summary, this is not conclusive of what I was suspecting.
I'm looking for a way to get hold of the core dump. Can you send it to my email directly, or upload it to a private gist? (I'm not sure whether the mailbox / gist allows 90 MB or not.)
Ah, too bad it's not what you suspected... but yes, I'm happy to email it. I'll upload it to my site and send the link shortly.
@jvictor85 - while gdb's interpretation of some instructions, the missing runtime libraries, etc. are posing challenges in interpreting the failing context, I see this sequence in my local build:
(gdb) x/30i 0xf4c900
0xf4c900 <v8::internal::RememberedSetUpdatingItem::Process+1456>: add BYTE PTR [rdi],cl
0xf4c902 <v8::internal::RememberedSetUpdatingItem::Process+1458>: test DWORD PTR [rcx-0x2],ecx
0xf4c905 <v8::internal::RememberedSetUpdatingItem::Process+1461>: (bad)
0xf4c906 <v8::internal::RememberedSetUpdatingItem::Process+1462>: jmp <internal disassembler error>
0xf4c908 <v8::internal::RememberedSetUpdatingItem::Process+1464>: cdq
0xf4c909 <v8::internal::RememberedSetUpdatingItem::Process+1465>: (bad)
0xf4c90a <v8::internal::RememberedSetUpdatingItem::Process+1466>: (bad)
0xf4c90b <v8::internal::RememberedSetUpdatingItem::Process+1467>: jmp <internal disassembler error>
0xf4c90d <v8::internal::RememberedSetUpdatingItem::Process+1469>: (bad)
0xf4c90e <v8::internal::RememberedSetUpdatingItem::Process+1470>: fldcw WORD PTR [rsi]
0xf4c910 <v8::internal::RememberedSetUpdatingItem::Process+1472>: add BYTE PTR [rax-0x75],cl
0xf4c913 <v8::internal::RememberedSetUpdatingItem::Process+1475>: adc BYTE PTR [rcx-0x44],al
0xf4c916 <v8::internal::RememberedSetUpdatingItem::Process+1478>: pop rsp
0xf4c917 <v8::internal::RememberedSetUpdatingItem::Process+1479>: in al,dx
0xf4c918 <v8::internal::RememberedSetUpdatingItem::Process+1480>: push rcx
0xf4c919 <v8::internal::RememberedSetUpdatingItem::Process+1481>: add cl,BYTE PTR [rax-0x75]
0xf4c91c <v8::internal::RememberedSetUpdatingItem::Process+1484>: push rdx
0xf4c91d <v8::internal::RememberedSetUpdatingItem::Process+1485>: adc BYTE PTR [rax-0x7f],cl
0xf4c920 <v8::internal::RememberedSetUpdatingItem::Process+1488>: cli
0xf4c921 <v8::internal::RememberedSetUpdatingItem::Process+1489>: push rax
0xf4c922 <v8::internal::RememberedSetUpdatingItem::Process+1490>: or BYTE PTR [rsi-0x41f28c00],ch
0xf4c928 <v8::internal::RememberedSetUpdatingItem::Process+1496>: nop
0xf4c929 <v8::internal::RememberedSetUpdatingItem::Process+1497>: pushf
0xf4c92a <v8::internal::RememberedSetUpdatingItem::Process+1498>: mov ch,0x1
0xf4c92c <v8::internal::RememberedSetUpdatingItem::Process+1500>: mov rdi,rax
0xf4c92f <v8::internal::RememberedSetUpdatingItem::Process+1503>: call rdx
0xf4c931 <v8::internal::RememberedSetUpdatingItem::Process+1505>: mov r12,rax
The call to rdx corresponds to a call to v8::platform::tracing::TracingController::GetCategoryGroupEnabled with the tracing controller as the receiving object (obtained through v8::internal::tracing::TraceEventHelper::GetTracingController()). This is a very good match with what I described earlier (regarding the exit race) that has a fix in v10.15.3.
So while I debug further, can you please try v10.15.3 and see if these crashes still occur?
Alright that sounds promising! I'll be pleasantly surprised if it's such an easy fix. Thanks so much for spending the time - I'll start transitioning all the servers over today and let you know if I experience any further crashes within the next week or two.
Got another crash, on v10.15.3:
Apr 17 13:01:52 SEA1 kernel: [1674284.139116] node[20416]: segfault at 1959ec36ea00 ip 0000000000efe4f2 sp 00007f4373ffec10 error 4 in node[400000+1e96000]
Unfortunately it didn't write the core file... I thought I had set ulimit to allow that, but maybe it was a temporary setting. I'll try to get that sorted out and deliver the next one as soon as possible.
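For what it's worth, the usual reason the setting doesn't stick is that ulimit -c only affects the current shell session. A minimal sketch (the file paths are the common Linux defaults, not anything specific to this setup):

```shell
# Raise the core-size limit for this shell and verify it took effect.
# Note: this only lasts for the current session and its children.
ulimit -c unlimited
ulimit -c    # should print "unlimited"

# To persist across logins, the usual place is /etc/security/limits.conf
# (needs root), e.g.:
#   *  soft  core  unlimited
# And to control where core files are written (kernel-wide):
#   echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
```

A systemd-managed process may additionally need LimitCORE=infinity in its unit file, since it never reads the login shell's limits.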
Another avenue you could try is a debug build. Debug builds have a lot more checks so there's a pretty good chance they'll catch the bug closer to the source. Caveat emptor though: they're also a fair bit slower than release builds.
To compile a debug build, run ./configure --debug && make -j8; out/Debug/node is the binary.
Thanks, I'll keep the debug build in mind, though I don't believe we can afford the performance hit.
A couple crashes later and I've finally grabbed a core file again. Here is the log and the trace. I'll send the file shortly. This is from today on 10.15.3:
Apr 21 00:19:55 SEA1 kernel: [1974141.941690] node[3940]: segfault at 109dfd061b98 ip 0000000000efe4f2 sp 00007fc7ebd6ec10 error 4 in node[400000+1e96000]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000efe4f2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fc7ebd6f700 (LWP 3940))]
(gdb) bt
#0 0x0000000000efe4f2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000ee5d7b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000ee341f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bca866 in v8::internal::CancelableTask::Run() ()
#4 0x0000000000963d89 in node::BackgroundRunner(void*) ()
#5 0x00007fc7ec9696db in start_thread (arg=0x7fc7ebd6f700) at pthread_create.c:463
#6 0x00007fc7ec69288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Surprising to note that the crash is in the same method; it makes sense to believe that the upgrade to v10.15.3 did not make any difference.
@jvictor85 sent in another core file with v10.15.3, but the failure sequence is exactly the same:
0xefe4f2 <_ZN2v88internal25RememberedSetUpdatingItemINS0_26MajorNonAtomicMarkingStateEE7ProcessEv+1506>: mov rax,QWORD PTR [rax-0x1]
So it looks like the best way forward is to try a debug build.
@jvictor85 :
though I don't believe we can afford the performance hit.
This is only for problem isolation and determination; definitely not recommended for production.
Thanks for your help!
definitely not recommended for production.
I can try to get a debug build going on a test server; it's just that in my experience this crash tends to show up under load with many real clients connected. It has happened during testing, but there it can take several months to occur once, while in production it happens once or twice a week (across many servers).
I haven't had much luck simulating load to reproduce it more often. Though it does seem to occur more on particular servers under certain conditions, so I'll try grabbing database backups from before a crash and see if I can get something to happen. I also haven't tested v12 yet, so I'll tinker with that to see if there is a difference.
I remember seeing some issues similar to mine (like #24955), but it seems to be rare... not sure if that helps narrow things down.
Just to follow up - I've had a couple servers on v12.1.0 for over a week, and recently started moving a few others over as well and so far they haven't experienced this crash at all.
The servers on v10.x still continue to crash. So I'm wondering if something in the v8 code in version 10 (or it could be anything between versions 6 and 12) is causing this.
I'll post an update if the crash ever happens on 12, and will be keeping an eye on updates to 10. I would prefer to use that when possible since I've noticed a bit better GC performance with 10 compared to 12 so far.
@jvictor85 - thanks for the update. Given that we don't have any further clues and a debug-build dump is unavailable, your action plan looks good to me.
Tested with 10.16.0 and it's still crashing in the same manner. Version 12 hasn't crashed in a month, so the bug seems to be fixed there. Unfortunately performance is noticeably worse in 12 (same issue with 6), so I try to stick with 10 when possible so the users don't complain about anything aside from the crashes.
May 30 17:48:24 SEA1 kernel: [480538.493040] node[889]: segfault at 4687fcf0af0 ip 0000000000f2aa52 sp 00007fbbc6202c10 error 4 in node[400000+2040000]
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000f2aa52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fbbc6203700 (LWP 889))]
(gdb) bt
#0 0x0000000000f2aa52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000f122db in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000f0f97f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bf5d36 in v8::internal::CancelableTask::Run() ()
#4 0x000000000097fea9 in node::BackgroundRunner(void*) ()
#5 0x00007fbbc7dff6db in start_thread (arg=0x7fbbc6203700) at pthread_create.c:463
#6 0x00007fbbc7b2888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
I was hoping this was a known issue in v8 that was fixed and could be backported to 10, but maybe it was resolved unintentionally through some of the newer changes.
If I can find the time I'll see if I can reproduce this on a test server again, but I'm not too hopeful that I can simulate the right conditions. Maybe someone familiar with the v8 version changes can identify where the fix may have happened?
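One cheap piece of triage data would be mapping each Node release line to the V8 it embeds, since any fix presumably landed in a specific V8 version. A quick sketch (the nvm lines are just an illustration; any way of invoking the two binaries works):

```shell
# Print the embedded V8 version of whatever `node` is on PATH:
node -p 'process.versions.v8'

# With nvm, compare release lines side by side (illustrative):
#   nvm exec 10 node -p process.versions.v8
#   nvm exec 12 node -p process.versions.v8
```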
Alright, maybe I found something - I accidentally stumbled on the right conditions for a crash and was able to reproduce it consistently. I reduced the code down to something that can be worked with. I'm not sure if this is the exact crash as on my production servers, because the syslog reports it differently, but it behaves in a similar way: it quits without an error and only occurs on node 10 and 11. I tested this with node 8 and 12 and don't see the crash. Here's the code:
console.log("Initializing on Node "+process.version);
var seed = 2, clients = [];
var distance = function(x,y,sx,sy) {
    let dx,dy;
    if(typeof x === "object") {
        dx = x.x - y.x;
        dy = x.y - y.y;
    }
    else {
        dx = sx - x;
        dy = sy - y;
    }
    return Math.sqrt(dx*dx + dy*dy);
}
var rand = function(from,to) {
    let x = Math.sin(seed++) * 10000;
    return Math.floor((x - Math.floor(x)) * (to - from + 1)) + from;
}
var make_object = function() {
    var obj = { x:rand(0,100), y:rand(0,100) };
    obj.update = function() {
        for(let i in clients)
            distance(this,clients[i]);
        distance(this.x,this.y,this.dx,this.dy);
    }
    return obj;
}
for(var i=0;i<50;i++)
    clients.push(make_object());
setInterval(function() {
    console.log('Up:'+process.uptime()+' Mem:'+process.memoryUsage().heapUsed);
    for(var i=0;i<clients.length;i++)
        clients[i].update();
},100);
It looks a little odd because it's a remnant of several thousand lines of code, but I left in the lines that cause a quick crash. If you comment out either of the distance calls, the program will run fine, even though they don't really do anything.
Tested on Windows and Linux. I'm hoping this is at least related to the production crash I'm experiencing, but let me know if I'm missing something obvious here...
@jvictor85 - this is great news! Thanks for doing this extra work. I am able to see the crash exactly matching your description (10, 11: crash; 8, 12: no crash). Stay tuned, I hope we can make progress now!
@jvictor85
I was working with @gireeshpunathil on this. I am able to reproduce this with a debug build as well, so I went ahead and did a git bisect and finally isolated it to 88786fecff336342a56e6f2e7ff3b286be716e47, which is https://github.com/nodejs/node/pull/18453
As this is now reproducible with pure JS code, I guess it now needs v8 expertise /cc @nodejs/v8
Bisecting V8 shows that this CL fixed the crash in V8 7.3.49: https://chromium-review.googlesource.com/c/1365274
Unfortunately it cannot be applied directly and needs to be backported.
Edit: actually, the conflict was easy to solve. Building with the patch applied...
I opened https://github.com/nodejs/node/pull/28061 which hopefully fixes this without introducing other issues. The standalone testcase from https://github.com/nodejs/node/issues/27107#issuecomment-497912591 doesn't crash with this backport.
Thanks a lot for finding the fix for that crash! I have an update on this - good and bad news. I didn't want to wait any longer for the 10.16.1 release so I started using a nightly build for a couple servers, which I assume has this recent commit to fix the crash. I'm still getting crashes though and I don't think it's from the one I found. So I'm thinking there are two separate crashes and I had just discovered one of them. This might explain why in my original post there are two different instruction pointers throughout the segfault logs.
Got a core file today, using v10.16.1-nightly201906268516da1720:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000f2be52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fe373da3700 (LWP 9374))]
(gdb) bt
#0 0x0000000000f2be52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1 0x0000000000f136db in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2 0x0000000000f10d7f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3 0x0000000000bf6f96 in v8::internal::CancelableTask::Run() ()
#4 0x00000000009810e9 in node::BackgroundRunner(void*) ()
#5 0x00007fe37499d6db in start_thread (arg=0x7fe373da3700) at pthread_create.c:463
#6 0x00007fe3746c688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This may be the sort that is hard to reproduce unless the server is under real load. We do have a clue that node version 12 fixes this crash as well, so some other change may need to be found and backported to clear this up... If anyone could find a good candidate and get a build uploaded somewhere, maybe I could try testing it to see if it fixes the remaining crash? I understand that may be too much guesswork though; I'm not really familiar with what changed in the v8 code between versions 10 and 12. It just seems like this one is gonna be a pain to reproduce on my end.
@jvictor85 - thanks for the info. Last time, isolation was easy at our end because of your standalone test case (really great effort from your end!), so not having a similar case this time means progress at our end is also going to be hard.
May I propose this? Because it is crashing for you under real load (easily?), you could actually do a bisect between 12 and 10 and pinpoint the offending commit. Some one-time effort for the build setup, but then it should be a couple of hours' job. Please let me know; if you are OK with it, I can provide the discrete details.
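For reference, the mechanics being proposed are plain git bisect run. The toy repo below is purely illustrative of those mechanics; for the real thing you would clone nodejs/node, mark a v12 tag good and a v10 tag bad, and have the run script build node and replay enough load to trigger the crash:

```shell
set -e
# Build a throwaway repo where the "bug" first appears in commit 4 of 5.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
    echo "$i" > state.txt
    git add state.txt
    git commit -qm "commit $i"
done

# bad = HEAD (commit 5), good = HEAD~4 (commit 1). The run script must
# exit 0 when the checked-out commit is good (here: state < 4); git then
# narrows things down in O(log n) steps and names the first bad commit.
git bisect start HEAD HEAD~4
git bisect run sh -c 'test "$(cat state.txt)" -lt 4'
```

In the real case the run script is where the effort goes: it builds the bisected revision and decides good/bad by whether the crash reproduces under load within some time budget.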
ping @jvictor85 - is this still an outstanding issue for you?
Yes, unfortunately this remaining crash is still very much an issue for me and my users. Performance in version 12 is still slow enough that I end up bouncing between versions 10 and 12 depending on whether I want performance or stability.
Many of my servers with lower user populations almost never experience the crash... maybe every few weeks or months, so I keep them on version 10. More specifically, v10.16.x still has the crash; though I haven't tested 10.17, I expect it is probably the same. I have 2 higher-population servers in particular that experience the crash every couple of days, so I try to keep them on v12.
Sorry I wasn't able to dedicate the time required to bisect on a production server. I understand this is a rare crash that not many people experience, so currently I'm waiting on new versions of 12 (or 13) to solve some of the performance regressions so I can fully switch over. I'll post when that time comes, and let me know if there is any other information I can provide. It's up to you whether you want to close this for now or keep it open. Thanks for all the help so far!
Is this still happening? Should this remain open?
My performance issues were improved enough by version 12.16.2 (and all versions of 14) that I moved every server over to those. Due to that I never got the chance to test on later versions of 10, but the crash was still happening when I switched over. I'm not sure if anyone else is experiencing this crash or can reproduce it, but I shouldn't be seeing it anymore so I'll close this for now. Thanks again!