I have seen multiple errors from this one on Windows this week. Sample:
```
not ok 35 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.337
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\common\index.js:486:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:599:30)
        at Object.Module._extensions..js (module.js:610:10)
        at Module.load (module.js:518:32)
        at tryModuleLoad (module.js:481:12)
        at Function.Module._load (module.js:473:3)
        at Function.Module.runMain (module.js:640:10)
        at startup (bootstrap_node.js:187:16)
```
cc @nodejs/async_hooks @nodejs/platform-windows
This is happening pretty much constantly as of about 12 hours ago. Yikes! I don't see anything obvious that has landed that would cause this.
@nodejs/testing @nodejs/build
I've had this issue in https://github.com/nodejs/node/pull/15538 for a long time (6 days), where it also appears to be unrelated to my changes.
_edit: fixed wrong link._
Investigating one of the affected CI machines: this never happens when running node directly (cmd or git bash), but always happens when using test.py.
@joaocgreis Taking a guess based on that: This might be because test.py sets up stdio differently than a plain run of Node from the command line, so the stdio objects in Node use different implementations – do you think you could try to verify/disprove that?
@addaleax verified. Redirecting both stdout and stderr to a file makes this happen in cmd. Thanks!
How to check the stderr type, for future reference:
```
Release\node.exe -e "console.log(util.inspect(process.stderr))"
Release\node.exe -e "console.log(util.inspect(process.stderr))" 2>&1 | cat
Release\node.exe -e "console.log(util.inspect(process.stderr))" > file 2>&1 & type file
```
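A programmatic variant of the same check, as a small sketch of my own (the file name stderr-type.js is made up): the constructor name and TTY checks reveal whether stderr is backed by a TTY, a pipe, or a file.

```js
// stderr-type.js (hypothetical helper): report what process.stderr is backed by.
const tty = require('tty');

console.log('constructor:', process.stderr.constructor.name); // e.g. WriteStream vs. Socket
console.log('isTTY:', process.stderr.isTTY === true);
console.log('fd 2 is a tty:', tty.isatty(2));
```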
Some more notes:
The directory where the test is run makes a difference. On the CI machine where I'm running tests, running in c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2012r2\RUN_SUBSET\0 makes this happen every time, but in any parent or subdirectory it never happens. Renaming the directories does not seem to matter; apparently only the path depth does.
I can't reproduce this in a local VM with the same OS and the same CI-generated binaries that fail on the CI worker, even in the same directory.
The current working directory does not seem to matter; running the test by its full path fails from any directory.
Moving the whole test (or just the last group of lines) into a setImmediate makes the test pass.
So, the error is that beforeExit is running twice in this test.
Can anything there be scheduling more async work?
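For readers who don't have the test open, here is a rough reduction of the pattern under discussion. This is an illustrative sketch of my own, not the actual source of test-async-wrap-uncaughtexception: an async callback throws, an uncaughtException handler cleans up its listeners, and a beforeExit handler is expected to fire exactly once; the flaky failure corresponds to it firing twice.

```js
// Illustrative sketch only -- not the real test file.
const assert = require('assert');

let beforeExitCalls = 0;

process.on('uncaughtException', () => {
  // Swallow the error thrown below; the interesting part is what runs afterwards.
  process.removeAllListeners('uncaughtException');
});

process.on('beforeExit', () => {
  // The flaky failure corresponds to this firing twice instead of once.
  assert.strictEqual(++beforeExitCalls, 1);
});

// Async work whose callback throws, so the uncaughtException path runs first.
setImmediate(() => { throw new Error('boom'); });
```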
I've changed the workspace directory of node-test-binary-windows (simplified it to c:\workspace\node-test-binary-windows\) and haven't seen this happen in CI since, but this is still an issue.
Just now: https://ci.nodejs.org/job/node-test-binary-windows/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/12350/console
I think this is not specific to Windows:
https://ci.nodejs.org/job/node-test-commit-linux/13653/nodes=centos7-64/console
https://ci.nodejs.org/job/node-test-commit-linux/13653/nodes=ubuntu1604-64/console
Also, I vaguely remember having seen this on my MacBook, but it does not reproduce often.
Happened twice in a row on Windows 10 in CI:
https://ci.nodejs.org/job/node-test-binary-windows/12364/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console
https://ci.nodejs.org/job/node-test-binary-windows/12365/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console:
```
not ok 36 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.188
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\test\common\index.js:501:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:617:30)
        at Object.Module._extensions..js (module.js:628:10)
        at Module.load (module.js:536:32)
        at tryModuleLoad (module.js:479:12)
        at Function.Module._load (module.js:471:3)
        at Function.Module.runMain (module.js:658:10)
        at startup (bootstrap_node.js:191:16)
```
Can anything there be scheduling more async work?
process.removeAllListeners() emits removeListener events, so maybe that is scheduling more work?
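That behaviour is easy to confirm in isolation (standalone sketch, unrelated to the test source):

```js
const EventEmitter = require('events');

const ee = new EventEmitter();

// 'removeListener' fires once per listener that gets removed...
ee.on('removeListener', (name) => console.log('removed a listener for', name));

ee.on('foo', () => {});
ee.on('foo', () => {});

// ...including when they are removed in bulk:
ee.removeAllListeners('foo');
// Prints "removed a listener for foo" twice.
```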
If we're pretty sure the being-emitted-twice thing is not actually a bug but depends on the state of the machine, the easiest fix might be to use process.once() instead. I'll submit a PR now-ish...
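Sketched out, the process.once() idea looks roughly like this; which listener the actual change would touch is my assumption, so treat the beforeExit example as illustrative only:

```js
// With .once(), the listener is detached after its first invocation, so even
// if 'beforeExit' were emitted a second time the callback would not re-enter.
process.once('beforeExit', () => {
  console.log('runs at most once');
});
```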
Fix (maybe?) in https://github.com/nodejs/node/pull/16598
Seems like we now have a repro: https://ci.nodejs.org/job/node-stress-single-test/1475/ (Ubuntu1604x64).
All the above clues make me think this is a memory corruption bug...
git bisect is suggesting that https://github.com/nodejs/node/commit/e567402aba903f7d828d428f1f5738bbf884636c is the source of this problem.
Here are three CI stress test failures in a row on that commit:
The immediately preceding commit is https://github.com/nodejs/node/commit/ab2c35100fb2b67ecea597dbc29134b5c7f1d684. Here are three CI stress test successes in a row on that commit:
(ping @jasnell)
@Trott that commit landed 6 days ago, while this issue was opened 18 days ago... I think there could be more to it; we might need to audit some crypto error PRs.
@joyeecheung Urm...hmmm... Maybe that commit caused the failures to happen on ubuntu 1604 which only started failing more recently? Maybe the bug has deeper roots for Windows?
I did the stress testing on Ubuntu so it would go faster, but I guess I may need to take a deep breath, set aside several hours, and do it on Windows....
@Trott I looked into e567402 again and AFAICT it does not seem to touch the code path tested in test-async-wrap-uncaughtexception. Maybe we could switch to another async hooks provider in the test to see if the root is in crypto?
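One way to approach that (a sketch of my own; the file name log-providers.js is made up) is to preload a tiny async_hooks logger while running the test, which shows which provider types are actually created and whether a crypto request is among them:

```js
// log-providers.js (hypothetical): run as
//   node -r ./log-providers.js test/parallel/test-async-wrap-uncaughtexception.js
'use strict';
const async_hooks = require('async_hooks');
const fs = require('fs');

async_hooks.createHook({
  init(asyncId, type) {
    // fs.writeSync is safe inside hooks; console.log is not, because it can
    // itself create async resources and re-enter the hook.
    fs.writeSync(2, `init ${type} (${asyncId})\n`);
  },
}).enable();
```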
Hmmm....after 12 hours of nothing but red (presumably due to this issue), Windows is coming back green/yellow/red now...
The constant failures on ubuntu1604-64 stopped too. 🤔
There was a commit that landed around that time that (incidentally) removed the text from the throw Error() in this test...
Argh, nope, still a thing, now showing up on fedora24:
https://ci.nodejs.org/job/node-test-commit-linux/13748/nodes=fedora24/console
```
...
not ok 136 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.138
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/common/index.js:490:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/parallel/test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:641:30)
        at Object.Module._extensions..js (module.js:652:10)
        at Module.load (module.js:560:32)
        at tryModuleLoad (module.js:503:12)
        at Function.Module._load (module.js:495:3)
        at Function.Module.runMain (module.js:682:10)
        at startup (bootstrap_node.js:191:16)
```
BTW I have seen this a few times this week on my MacBook, but have never been able to reproduce it reliably. It only shows up when I run the tests in parallel.
I'm worried about this one at Code + Learn next week because I can totally imagine a third of the room having the test fail locally or something like that....
Alternative: marking the test as flaky in https://github.com/nodejs/node/pull/16694.
@jasnell RE: #16692, we see either a regression between 8.6.0 and 8.7.0 or some level of non-determinism, so IMHO we shouldn't force the test to pass and thereby hide the bug.
I created a new project to track flakes on CI and put a variety of issues related to this test in a column. Please feel free to add more related issues/PRs.
FWIW, moving this test to sequential does not solve the problem (tried it in https://github.com/nodejs/node/pull/16733), so having too much other stuff happening on the system is unlikely to be the cause.
I can replicate this if I use the inspector and step into or out of stuff.
I'm starting to suspect that the extra beforeExit call might be caused either by the changes in the block starting on line 4708 of src/node.cc in https://github.com/nodejs/node/commit/f27b5e4bdaa or (more likely, perhaps?) by the V8 changes that commit is trying to accommodate. I haven't been able to pin it down yet, but if a C++ person wants to do some debugging, please do. :-D (And, of course, I could be totally wrong in my suspicion, just like I've been multiple times already with this issue.)
@trott The only V8 functionality I know of that will require the change is asynchronous WASM compilation, which is currently behind a flag even in V8 master.
It might be related to f27b5e4, but I’d be surprised since all I can see V8 doing when the test is being executed is scheduling background tasks, no foreground or delayed foreground tasks…
This has been fixed and can be closed now.