https://ci.nodejs.org/job/node-test-commit-aix/19533/nodes=aix61-ppc64/console
Host: test-osuosl-aix61-ppc64_be-1
00:47:03 not ok 2377 sequential/test-child-process-execsync
00:47:03 ---
00:47:03 duration_ms: 3.701
00:47:03 severity: fail
00:47:03 exitcode: 1
00:47:03 stack: |-
00:47:03 /bin/sh: iamabadcommand: not found.
00:47:03 assert.js:86
00:47:03 throw new AssertionError(obj);
00:47:03 ^
00:47:03
00:47:03 AssertionError [ERR_ASSERTION]: Expected values to be strictly deep-equal:
00:47:03 + actual - expected
00:47:03
00:47:03 + null
00:47:03 - 'SIGILL'
00:47:03 at spawnSyncKeys.forEach (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:138:14)
00:47:03 at Array.forEach (<anonymous>)
00:47:03 at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:136:19)
00:47:03 at expectedException (assert.js:568:19)
00:47:03 at expectsError (assert.js:663:16)
00:47:03 at Function.throws (assert.js:694:3)
00:47:03 at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
00:47:03 at Module._compile (internal/modules/cjs/loader.js:721:30)
00:47:03 at Object.Module._extensions..js (internal/modules/cjs/loader.js:732:10)
00:47:03 at Module.load (internal/modules/cjs/loader.js:620:32)
00:47:03 ...
This was reported way back in 2015 in https://github.com/nodejs/node-v0.x-archive/issues/9444 and was fixed through https://github.com/nodejs/node/pull/1214; I don't know what has changed recently. I will have a look.
OK, I spent some time on this. There are two puzzles to solve:
1.
The stack message (`/bin/sh: iamabadcommand: not found.`) indicates that the failure is in this section, https://github.com/nodejs/node/blob/4aabd7ed641db5d78fdf7e1c93b93027ffc3a9b4/test/sequential/test-child-process-execsync.js#L63-L65
whereas the actual stack of the failure suggests it is at https://github.com/nodejs/node/blob/4aabd7ed641db5d78fdf7e1c93b93027ffc3a9b4/test/sequential/test-child-process-execsync.js#L138
at which point the command actually run as a child is NOT `iamabadcommand`, but `node -e process.exit(1)`.
So this raises the question of at what point the python parent captures the data reported under `stack: |-`, as it does not seem to be coherent with the failure site.
2.
Assuming the AssertionError shown in the second part is bona fide, it indicates an unexpected and improper exit from the child (`null` where `'SIGILL'` was expected). Though I could not replicate this failure in 1000 local runs, there is a contextual indication that this is the same exit race with workers that is seen frequently in the test-cli-syntax family of tests (a rough sketch of that race follows below). I will run more iterations with `ulimit -c unlimited` to see if I get a core dump, which will most likely tell us a conclusive story.
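For context, the exit race in question is of the following shape; this is a hedged, standalone illustration, not Node or V8 source:

```cpp
// Hedged illustration of the suspected exit race (not Node/V8 code):
// a worker thread keeps touching a global object while the main thread
// calls exit(), which runs that object's static destructor underneath it.
#include <cstdlib>
#include <thread>
#include <vector>

static std::vector<int> g_shared(1024, 42);  // destroyed during exit()

int main() {
  std::thread worker([] {
    for (;;) g_shared[0]++;   // may still be running when the process exits
  });
  worker.detach();
  std::exit(0);               // static destructors run while the worker is live
}
```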
I ran it 10K times locally with no sign of failure. I don't think running more is worthwhile.
So I conclude this has to do with the resources (CPU, memory, and other user limits) on the CI machine.
@Trott - is it possible for you to get me the `ulimit -a` output from the failing CI machine? I will attempt to mimic the same values on my dev box and see how it behaves. Thanks!
`ansible-playbook playbooks/write-ssh-config.yml` fails for me, so I no longer have access to any hosts. Hopefully someone on @nodejs/build can log in and run the command to get you results. (Alternatively, we can have a CI run it by, for example, creating a branch with a test that runs the command. But that's getting pretty convoluted....)
Woot! Fixed my ansible problem. So here you go:
$ ssh test-osuosl-aix61-ppc64_be-1 ulimit -a
time(seconds) unlimited
file(blocks) 2097151
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user) unlimited
$
https://ci.nodejs.org/job/node-test-commit-aix/19688/nodes=aix61-ppc64/consoleText
test-osuosl-aix61-ppc64_be-1
not ok 2383 sequential/test-child-process-execsync
---
duration_ms: 2.473
severity: fail
exitcode: 1
stack: |-
/bin/sh: iamabadcommand: not found.
assert.js:86
throw new AssertionError(obj);
^
AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
null !== 1
at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:134:12)
at expectedException (assert.js:568:19)
at expectsError (assert.js:663:16)
at Function.throws (assert.js:694:3)
at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
at Module._compile (internal/modules/cjs/loader.js:718:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:729:10)
at Module.load (internal/modules/cjs/loader.js:617:32)
at tryModuleLoad (internal/modules/cjs/loader.js:560:12)
at Function.Module._load (internal/modules/cjs/loader.js:552:3)
...
And again. Probably time to mark this as flaky.
https://ci.nodejs.org/job/node-test-commit-aix/19692/nodes=aix61-ppc64/console
test-osuosl-aix61-ppc64_be-2
21:40:41 not ok 2383 sequential/test-child-process-execsync
21:40:41 ---
21:40:41 duration_ms: 5.83
21:40:41 severity: fail
21:40:41 exitcode: 1
21:40:41 stack: |-
21:40:41 /bin/sh: iamabadcommand: not found.
21:40:41 assert.js:86
21:40:41 throw new AssertionError(obj);
21:40:41 ^
21:40:41
21:40:41 AssertionError [ERR_ASSERTION]: Expected values to be strictly deep-equal:
21:40:41 + actual - expected
21:40:41
21:40:41 + null
21:40:41 - 'SIGILL'
21:40:41 at spawnSyncKeys.forEach (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:138:14)
21:40:41 at Array.forEach (<anonymous>)
21:40:41 at Object.assert.throws (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:136:19)
21:40:41 at expectedException (assert.js:568:19)
21:40:41 at expectsError (assert.js:663:16)
21:40:41 at Function.throws (assert.js:694:3)
21:40:41 at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/sequential/test-child-process-execsync.js:127:10)
21:40:41 at Module._compile (internal/modules/cjs/loader.js:718:30)
21:40:41 at Object.Module._extensions..js (internal/modules/cjs/loader.js:729:10)
21:40:41 at Module.load (internal/modules/cjs/loader.js:617:32)
21:40:41 ...
So the first core dump arrived, but before that, I want to clear up the confusion in my earlier comment
https://github.com/nodejs/node/issues/24921#issuecomment-445806792 . The failure is in fact at line 134 only; the string `/bin/sh: iamabadcommand: not found.` is always printed to the console, but the python harness suppresses it on successful runs and prints it when the test fails.
So in summary, `/bin/sh: iamabadcommand: not found.` is valid output, not a failure site.
Now, on the failure:
(dbx) where
.() at 0x0
array-buffer-collector._ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv(??) at 0x1011d770c
_ZN2v88internal14CancelableTask3RunEv(??) at 0x100039304
node_platform._ZN4node12_GLOBAL__N_1L20PlatformWorkerThreadEPv(??) at 0x1001e6990
(dbx)
This is here:
https://github.com/nodejs/node/blob/a6f69ebc05f4033b012b523661f6c3f62f3469b1/deps/v8/src/heap/array-buffer-collector.cc#L45
and we have a wild branch (IAR = 0).
(dbx) x
$r0:0x00000001011d7678 $stkp:0x0000000110365480 $toc:0x0000000000000000
$r3:0x000000011024b150 $r4:0x0000000101f82e60 $r5:0x0000000000e8a71d
$r6:0x00362b2feca85a5f $r7:0x000000000000007d $r8:0x000000000000007d
$r9:0x0000000000000000 $r10:0x0000000000000000 $r11:0x0000000000000000
$r12:0x09001000a0082b60 $r13:0x000000011036d800 $r14:0x0000000000000000
$r15:0x0000000000000000 $r16:0x0000000000000000 $r17:0x0000000000000000
$r18:0x0000000000000000 $r19:0x0000000000000000 $r20:0x0000000000000000
$r21:0x0000000000000000 $r22:0x0000000000000000 $r23:0x0000000000000000
$r24:0x0000000000000000 $r25:0x000000011024b9e0 $r26:0x0000000000000000
$r27:0x000000011024c770 $r28:0x00000001101fb460 $r29:0x0000000110406b30
$r30:0x0000000000000000 $r31:0x0000000110365480
$iar:0x0000000000000000 $msr:0xa00000000000d032 $cr:0x42000228
$link:0x00000001011d7710 $ctr:0x0000000000000000 $xer:0x40000000
The value in the link register, 0x1011d7710, holds the caller's return address. So the wild branch was made at 0x1011d770c (PPC is RISC, so every instruction is 4 bytes long).
(dbx) (0x1011d76fc)/10i
0x1011d76fc (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x35c) e9490000 ld r10,0x0(r9)
0x1011d7700 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x360) e9690010 ld r11,0x10(r9)
0x1011d7704 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x364) 7d4903a6 mtctr r10
0x1011d7708 (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x368) e8490008 ld r2,0x8(r9)
0x1011d770c (_ZN2v88internal20CancelableLambdaTaskIZNS0_20ArrayBufferCollector15FreeAllocationsEvEUlvE_E11RunInternalEv+0x36c) 4e800421 bctrl
(dbx)
Without reading through ~500 instructions it is difficult to pinpoint which construct in the source maps to this. However, a wild branch typically indicates a NULL function pointer, which in this case would be the lambda.
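To illustrate the pattern (a hedged, generic example, not the V8 code in question): an indirect call on POWER compiles to essentially the load/mtctr/bctrl sequence shown above, and if the loaded target address is 0, execution branches to address 0 with the link register pointing just past the call site.

```cpp
// Generic illustration only: calling through a null function pointer is
// undefined behaviour and typically ends in an indirect branch to address 0,
// matching the core dump signature (IAR = 0, link register = address of the
// instruction following the bctrl).
using Fn = void (*)();

int main() {
  Fn fn = nullptr;  // e.g. a callable whose target is gone or never set
  fn();             // undefined behaviour: wild branch to address 0
}
```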
Is it another manifestation of the exit race with destructors, or a new one altogether? I will study more core files.
Unfortunately all subsequent core files show the same pattern at the same location. So the reason for this type of crash is a wild branch to a NULL lambda target.
To establish its relation to the exit race, I changed the `exit` call to `_exit` in `Environment::Exit`, and this type of crash is avoided (in 100 runs). The test still fails occasionally, but at a different point that was also occurring frequently prior to this change:
AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
assert(end < SLEEP)
at Object.<anonymous> (/home/iojs/gireesh/exitrace/node/test/sequential/test-child-process-execsync.js:59:3)
So with that, I want to establish that the root cause is the same issue reported in https://github.com/nodejs/node/issues/25007.
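For clarity, the experiment above amounts roughly to the following; this is a hedged sketch of the idea, not the actual change to `Environment::Exit` (the function name below is a hypothetical stand-in):

```cpp
// Hedged sketch of the experiment: exit() runs atexit handlers and static
// destructors while platform worker threads may still be executing tasks;
// _exit() terminates the process immediately and skips that cleanup.
#include <cstdlib>
#include <unistd.h>

void ExitProcess(int code) {   // hypothetical stand-in for Environment::Exit
  // exit(code);               // original: cleanup races with worker threads
  _exit(code);                 // experiment: bypass the cleanup, avoid the race
}
```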
Seeing a similar-looking failure on FreeBSD: https://ci.nodejs.org/job/node-test-commit-freebsd/23046/
The test has been active for the last week with no failures, so, as expected, #25061 has fixed the underlying issue. Closing.