test-cli-node-options has been failing a lot on arm lately in CI. I assume it's the bug reported in https://github.com/nodejs/node/issues/21383 ("make test: use after free: parallel/test-cli-node-options").
Sample failure: https://ci.nodejs.org/job/node-test-commit-arm/20786/nodes=ubuntu1604-arm64/consoleText
Host: test-packetnet-ubuntu1604-arm64-2
not ok 184 parallel/test-cli-node-options
---
duration_ms: 1.266
severity: fail
exitcode: 1
stack: |-
assert.js:753
throw newErr;
^
AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffe553e558
at ChildProcess.exithandler (child_process.js:294:12)
at ChildProcess.emit (events.js:189:13)
at maybeClose (internal/child_process.js:978:16)
at Socket.stream.socket.on (internal/child_process.js:396:11)
at Socket.emit (events.js:189:13)
at Pipe._handle.close (net.js:612:12)
...
And again. Probably time to mark this as flaky. Will open a PR.
test-packetnet-ubuntu1604-arm64-2
https://ci.nodejs.org/job/node-test-commit-arm/20790/nodes=ubuntu1604-arm64/consoleText
not ok 175 parallel/test-cli-node-options
---
duration_ms: 0.986
severity: fail
exitcode: 1
stack: |-
assert.js:753
throw newErr;
^
AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffcb5fc4d8
at ChildProcess.exithandler (child_process.js:294:12)
at ChildProcess.emit (events.js:189:13)
at maybeClose (internal/child_process.js:978:16)
at Socket.stream.socket.on (internal/child_process.js:396:11)
at Socket.emit (events.js:189:13)
at Pipe._handle.close (net.js:612:12)
...
This is easily recreated if you run this a hundred times on FreeBSD:
node --perf-basic-prof -e 'console.log("B")'
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
(gdb) where
#0 0x00000000015a8eb2 in v8::base::OS::Abort ()
#1 0x0000000000f79516 in v8::internal::PerfBasicLogger::PerfBasicLogger ()
#2 0x0000000000f7fb92 in v8::internal::Logger::SetUp ()
#3 0x0000000000f5b25c in v8::internal::Isolate::Init ()
#4 0x000000000117dd3f in v8::internal::Snapshot::Initialize ()
#5 0x0000000000ace03b in v8::Isolate::Initialize ()
#6 0x00000000008fba32 in node::NewIsolate ()
#7 0x00000000008fdc88 in node::Start ()
#8 0x00000000008fbfdb in node::Start ()
#9 0x00000000008ae095 in _start ()
#10 0x00000008022f0000 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)
this corresponds to:
https://github.com/nodejs/node/blob/a6f69ebc05f4033b012b523661f6c3f62f3469b1/deps/v8/src/log.cc#L294-L296
truss output showed this:
24134: open("/tmp/perf-24134.map",O_WRONLY|O_CREAT|O_TRUNC,0666) ERR#13 'Permission denied'
#
# Fatal error in , line 0
# 24134: write(2,"\n\n#\n# Fatal error in , line 0"...,32) = 32 (0x20)
Check failed: (perf_output_handle_) != nullptr.24134: write(2,"Check failed: (perf_output_handl"...,47) =
47 (0x2f)
#
#
#
#FailureMessage Object: 0x7fffffffdc4024134: write(2,"\n#\n#\n#\n#FailureMessage Objec"...,45) = 45 (0x2d
)
$ l /tmp/perf-24134.map
-rw-r--r-- 1 iojs wheel 68110 May 14 2018 /tmp/perf-24134.map
$ whoami
freebsd
$
so this is basically a temp-file name collision with a file left behind by an earlier run. Unlikely to happen in production. The failing call is in PerfBasicLogger::PerfBasicLogger().
Am I reading this correctly: it's PID reuse which is causing the tmpfile name collision? Is it possible to configure FreeBSD to use a larger PID space so the reuse doesn't occur during our test runs? And does this point to a problem with the tempfile name generation?
@sam-github - yes, this is an issue with PIDs in small ranges that get re-used often; when those PIDs are composed into file names used by different users, that causes permission issues, leading to temp-file creation failure.
I don't know the admin command to change the PID pattern, and do not have permission to try it even if I did; though widening the PID range even by one digit should resolve the collision, I guess.
On a side note: though I recreated this on __freebsd__, and the issue is 100% the same as the original report, the original failure was reported on arm. So I am wondering whether fixing it at the host level would be the right thing or not.
Perhaps the test should use the env to set the temp directory to a per-user location, like $PWD/out/Release, to avoid the conflict? Perhaps in tools/test.py?
Though I'm a bit confused, are multiple users running the node tests at the same time? If there are multiple parallel test runs by the same user, permission problems shouldn't occur, though other conflicts could.
I don't know; let us ask @nodejs/testing and @Trott
For the first one (setting TMPDIR prior to the test) I tried, but it does not look like it is honored; the '/tmp' location seems to be assumed in the V8 code.
For this __freebsd__ failure instance: I used the freebsd user name, whereas the username for CI runs seems to be iojs, and that could have triggered this issue. However, given the failure is exactly the same in CI runs as well, its root cause can also be assumed to be the same; i.e., multiple usernames in CI that conflict when temp files are composed.
I believe FreeBSD uses random PIDs, which increases the likelihood of reuse.
I was hoping --perf-basic-prof could be instructed to put the .map file someplace else than the default (system temp dir) location with the TMPDIR environment variable, but alas, that appears to not be the case. If it were, we could do something like this perhaps:
if (!common.isWindows) {
process.env.TMPDIR = '/dev/null';
expect('--perf-basic-prof', 'B\n');
}
(Reading more carefully now, I see @gireeshpunathil already tried that too.)
@nodejs/v8 Is it possible to have the .map file end up someplace else, or is it pretty much always going to end up in /tmp?
It looks to be hardcoded :(
https://github.com/nodejs/node/blob/b2f74f73f3087f6c55cfab3b25282f890ffa3e14/deps/v8/src/log.cc#L280
IIRC the Linux perf tool (for which this file is generated) looks specifically for files with that name. IOW, the hardcoded pattern is probably necessary. Do we need to test --perf-basic-prof on FreeBSD?
> Do we need to test --perf-basic-prof on FreeBSD?
Failure due to this should never happen if the test is always run with the same user. This might be best as a wontfix. We might be overthinking this.
https://ci.nodejs.org/job/node-test-commit-arm/23063/nodes=ubuntu1604-arm64/testReport/(root)/test/parallel_test_cli_node_options_/
test-packetnet-ubuntu1604-arm64-2
assert.js:768
throw newErr;
^
AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffc0d5e708
at ChildProcess.exithandler (child_process.js:295:12)
at ChildProcess.emit (events.js:193:13)
at maybeClose (internal/child_process.js:1000:16)
at Socket.<anonymous> (internal/child_process.js:405:11)
at Socket.emit (events.js:193:13)
at Pipe.<anonymous> (net.js:593:12)
I see one of these two options as the way out:
1. arm and freebsd systems have a much smaller kern.pid_max than Linux or AIX, which means this test is prone to PID collision on those systems. Manually increase this value on all of those systems: sysctl -w kern.pid_max=122880
2. Skip the --perf-basic-prof case for arm, and preferably freebsd too.

I favor the second option, but would like to know how others think.
https://ci.nodejs.org/job/node-test-commit-arm/23304/nodes=ubuntu1604-arm64/consoleFull
test-packetnet-ubuntu1604-arm64-2
00:07:41 not ok 192 parallel/test-cli-node-options # TODO : Fix flaky test
00:07:41 ---
00:07:41 duration_ms: 1.193
00:07:41 severity: flaky
00:07:41 exitcode: 1
00:07:41 stack: |-
00:07:41 assert.js:769
00:07:41 throw newErr;
00:07:41 ^
00:07:41
00:07:41 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:07:41
00:07:41
00:07:41 #
00:07:41 # Fatal error in , line 0
00:07:41 # Check failed: (perf_output_handle_) != nullptr.
00:07:41 #
00:07:41 #
00:07:41 #
00:07:41 #FailureMessage Object: 0xffffe5e1b4d8
00:07:41 at ChildProcess.exithandler (child_process.js:298:12)
00:07:41 at ChildProcess.emit (events.js:194:13)
00:07:41 at maybeClose (internal/child_process.js:1000:16)
00:07:41 at Socket.<anonymous> (internal/child_process.js:405:11)
00:07:41 at Socket.emit (events.js:194:13)
00:07:41 at Pipe.<anonymous> (net.js:593:12)
00:07:41 ...
The test launches more than 20 asynchronous Node.js processes. That seems like it may be too many for under-powered machines (like Raspberry Pi). I wonder if a sufficient fix is to either split it up into 2 or more separate test files that each launch only 4-10 processes asynchronously, or else simply alter the test to use execFileSync(). That second approach might make the test simpler, and of course will still test NODE_OPTIONS. Opening a PR soon....
https://ci.nodejs.org/job/node-test-commit-arm/23354/nodes=ubuntu1604-arm64/consoleFull
test-packetnet-ubuntu1604-arm64-2
00:06:58 not ok 187 parallel/test-cli-node-options # TODO : Fix flaky test
00:06:58 ---
00:06:58 duration_ms: 1.163
00:06:58 severity: flaky
00:06:58 exitcode: 1
00:06:58 stack: |-
00:06:58 assert.js:769
00:06:58 throw newErr;
00:06:58 ^
00:06:58
00:06:58 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:06:58
00:06:58
00:06:58 #
00:06:58 # Fatal error in , line 0
00:06:58 # Check failed: (perf_output_handle_) != nullptr.
00:06:58 #
00:06:58 #
00:06:58 #
00:06:58 #FailureMessage Object: 0xffffc1535638
00:06:58 at ChildProcess.exithandler (child_process.js:298:12)
00:06:58 at ChildProcess.emit (events.js:194:13)
00:06:58 at maybeClose (internal/child_process.js:998:16)
00:06:58 at Socket.<anonymous> (internal/child_process.js:403:11)
00:06:58 at Socket.emit (events.js:194:13)
00:06:58 at Pipe.<anonymous> (net.js:593:12)
00:06:58 ...
@Trott This seems like it may or may not be a V8 bug. Is there any machine on which this reproduces sufficiently often that one could debug it?
@addaleax test-packetnet-ubuntu1604-arm64-2 is the one it's happened on the last few times so that might be a good candidate.
$ ssh test-packetnet-ubuntu1604-arm64-2 cat /proc/cpuinfo | grep ^processor | wc -l
96
parallelism isn't a problem on these machines fwiw
@addaleax I've added you to [email protected] https://github.com/nodejs/build/issues/1747
If we're seeing errors limited to a single machine then we might be running into hardware problems, we've been tracking these but never been able to nail it firmly down https://github.com/nodejs/node/issues/23913
https://ci.nodejs.org/job/node-test-commit-freebsd/25352/nodes=freebsd11-x64/consoleFull
test-digitalocean-freebsd11-x64-1
00:28:53 not ok 242 parallel/test-cli-node-options
00:28:53 ---
00:28:53 duration_ms: 2.676
00:28:53 severity: fail
00:28:53 exitcode: 1
00:28:53 stack: |-
00:28:53 assert.js:769
00:28:53 throw newErr;
00:28:53 ^
00:28:53
00:28:53 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/out/Release/node -e console.log("B")
00:28:53
00:28:53
00:28:53 #
00:28:53 # Fatal error in , line 0
00:28:53 # Check failed: (perf_output_handle_) != nullptr.
00:28:53 #
00:28:53 #
00:28:53 #
00:28:53 #FailureMessage Object: 0x7fffffffcfe0
00:28:53 at ChildProcess.exithandler (child_process.js:298:12)
00:28:53 at ChildProcess.emit (events.js:194:13)
00:28:53 at maybeClose (internal/child_process.js:998:16)
00:28:53 at Socket.<anonymous> (internal/child_process.js:403:11)
00:28:53 at Socket.emit (events.js:194:13)
00:28:53 at Pipe.<anonymous> (net.js:593:12)
00:28:53 ...
> parallelism isn't a problem on these machines fwiw
I'm not saying that parallelism is necessarily the issue with these tests, but the history of our test issues on CI suggests that a high processor count is suggestive of parallelism being the problem, not the other way around.
When there are tests that fail when run at the same time as lots of other tests, it tends to show up on machines in CI that have very high processor counts. Counterintuitive at first, but then when you start troubleshooting, it becomes more obvious: The processor count determines how many tests are run at the same time. If you are running 96 tests at once, you are far more likely to hit two tests that have a previously-unrecognized incompatibility. If you run 4 tests at once, this is far less likely.
Fwiw, I'm currently trying to debug this on the machine itself, and the test also fails rather frequently when it's being run as a standalone script, without other tests being run at the same time.
> Fwiw, I'm currently trying to debug this on the machine itself, and the test also fails rather frequently when it's being run as a standalone script, without other tests being run at the same time.
In response to the "Fwiw", I'd say that's worth an awful lot. 😄
Oh, yeah, and the results from https://github.com/nodejs/node/pull/26994 sure seem to rule out parallelism being the problem here too. Given that PR, I probably shouldn't have said anything at all. Sorry about the distraction! (EDIT: Although it seemed to feel like the right explanation with the PID re-use thing....)
> Although it seemed to feel like the right explanation with the PID re-use thing
Yup, that's it. All in all, the issue doesn't appear to be super complicated:
When running Node.js with --perf-basic-prof, like this test does (via NODE_OPTIONS, but that's not relevant), it will try to create a file named /tmp/perf-${pid}.map.
https://github.com/nodejs/node/blob/5c2ee4ee8dd8045b3598db219d220988b4c4199e/deps/v8/src/log.cc#L283-L299
If the file already exists and is not writable, the CHECK in L299 crashes, causing these failures. If the file does not exist, it is created, and never cleaned up.
The host to which I have SSH access has a PID range from 0 to 100000, and currently has 52453 files of that format in /tmp/, of which 50016 are owned by root; meaning, it will fail about 50 % of the time, because tests running under the iojs user don't have permission to overwrite these files.
(I'm deleting all of these files now on that host, just so that the test passes a bit more often. You can re-create the issue locally like this: https://gist.github.com/addaleax/e7d6db099ae194a3f56e473c9d4c49a6.)
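The ~50 % estimate follows directly from those figures — a new process gets a PID uniformly in the range, and fails iff that PID's map file is root-owned:

```javascript
// Sanity-check the ~50 % failure estimate using the figures quoted
// above: PIDs range over 0..100000, and 50016 of those PIDs have a
// root-owned /tmp/perf-<pid>.map left behind.
const pidRange = 100000;
const rootOwnedMaps = 50016;

const failureProbability = rootOwnedMaps / pidRange;
console.log(failureProbability.toFixed(2)); // 0.50
```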
So: tests are run sometimes as root and sometimes as iojs, leading to permissions problems.

/cc @nodejs/build @nodejs/testing @nodejs/v8
The only circumstances in which tests are run as root in our infra is if they are run manually. We run everything as 'iojs' and it never has sudo access. So I'm suspecting this is a problem with our manual access processes, maybe we need to make it clear that anyone given manual access needs to run tests as 'iojs' and the root SSH access is mainly for convenience. Or maybe we need to give only 'iojs' SSH access most of the time instead?
I think this can be closed because the running-tests-as-root thing isn't something that happens in regular CI runs, but is operator error? Or is that letting ourselves off the hook too easily?
ncu-ci walk commit only finds one recent example of this error: https://ci.nodejs.org/job/node-test-commit-freebsd/nodes=freebsd11-x64/25647/testReport/(root)/test/parallel_test_cli_node_options/
That was on test-digitalocean-freebsd11-x64-1. For some reason, I am unable to log on to that host to check for root-owned profiling files in /tmp.