Node: Flaky test-async-wrap-uncaughtexception

Created on 15 Oct 2017 · 32 comments · Source: nodejs/node

  • Version: master
  • Platform: Windows
  • Subsystem: test

I have seen multiple errors from this one on Windows this week. Sample:

not ok 35 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.337
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\common\index.js:486:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2008r2-vs2017\RUN_SUBSET\0\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:599:30)
        at Object.Module._extensions..js (module.js:610:10)
        at Module.load (module.js:518:32)
        at tryModuleLoad (module.js:481:12)
        at Function.Module._load (module.js:473:3)
        at Function.Module.runMain (module.js:640:10)
        at startup (bootstrap_node.js:187:16)
Labels: CI / flaky test, async_hooks, async_wrap, test

All 32 comments

cc @nodejs/async_hooks @nodejs/platform-windows

This is happening pretty much constantly as of about 12 hours ago. Yikes! I don't see anything obvious that has landed that would cause this.

@nodejs/testing @nodejs/build

I've had this issue in https://github.com/nodejs/node/pull/15538 for a while now (6 days), where it also appears to be unrelated to my changes.

_edit: fixed wrong link._

Investigating one of the affected CI machines: this never happens when running node directly (cmd or git bash), but always happens when using test.py

@joaocgreis Taking a guess based on that: This might be because test.py sets up stdio differently than a plain run of Node from the command line, so the stdio objects in Node use different implementations – do you think you could try to verify/disprove that?

@addaleax verified. Redirecting both stdout and stderr to a file makes this happen in cmd. Thanks!


How to check stderr type for future reference

Attached to a console (stderr is a TTY):
Release\node.exe -e "console.log(require('util').inspect(process.stderr))"

Piped (stderr is backed by a pipe):
Release\node.exe -e "console.log(require('util').inspect(process.stderr))" 2>&1 | cat

Redirected to a file (stderr is backed by a file):
Release\node.exe -e "console.log(require('util').inspect(process.stderr))" > file 2>&1 & type file
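
To do the same check from inside Node without reading the full inspect output (a quick sketch; the exact constructor names vary across Node versions):

Release\node.exe -e "console.log(process.stderr.isTTY, process.stderr.constructor.name)"

isTTY is true only when stderr is attached to a console; a pipe or a file gives a different constructor, which is exactly the difference between a plain run and a run under test.py.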


Some more notes:

  • The directory where the test is run makes a difference. On the CI machine where I'm running tests, running in c:\workspace\node-test-binary-windows\COMPILED_BY\vs2015\RUNNER\win2012r2\RUN_SUBSET\0 makes this happen every time, but in any parent or subdirectory it never happens. Renaming the directories does not seem to matter; apparently only the depth of the path does

  • I can't reproduce this in a local VM with the same OS and the same CI-generated binaries that fail on the CI worker, even in the same directory

  • The current working directory does not seem to matter; running with the full path fails from any directory

  • Moving the whole test (or just the last group of lines) into a setImmediate callback makes the test pass (see the sketch after this list)
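
A minimal sketch of that workaround (onBeforeExit is a hypothetical name standing in for the test's actual listener, not the test's code):

setImmediate(() => {
  // Deferring the listener registration (or the whole test body) one
  // macrotask past module evaluation made the failure go away:
  process.on('beforeExit', common.mustCall(onBeforeExit));
});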

So, the error is that beforeExit is running twice in

https://github.com/nodejs/node/blob/006fdb2fe5d9d1681f25c44009242304740ce154/test/parallel/test-async-wrap-uncaughtexception.js#L14-L19

Can anything there be scheduling more async work?
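
For reference, the shape of that block is roughly the following (paraphrased from the linked lines, not verbatim):

process.on('beforeExit', common.mustCall(() => {
  process.removeAllListeners('uncaughtException');
  hooks.disable();
  // ...assertions on the recorded async_hooks invocations...
}));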

I've changed the workspace directory of node-test-binary-windows (simplified it to c:\workspace\node-test-binary-windows\) and haven't seen this happen in CI since, but this is still an issue.

Just now: https://ci.nodejs.org/job/node-test-binary-windows/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/12350/console

Also, I vaguely remember having seen this on my macbook, but it rarely reproduces there.

Happened twice in a row on Windows 10 in CI:

https://ci.nodejs.org/job/node-test-binary-windows/12364/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console
https://ci.nodejs.org/job/node-test-binary-windows/12365/COMPILED_BY=vcbt2015,RUNNER=win10,RUN_SUBSET=3/console:

not ok 36 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.188
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\test\common\index.js:501:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:617:30)
        at Object.Module._extensions..js (module.js:628:10)
        at Module.load (module.js:536:32)
        at tryModuleLoad (module.js:479:12)
        at Function.Module._load (module.js:471:3)
        at Function.Module.runMain (module.js:658:10)
        at startup (bootstrap_node.js:191:16)

> Can anything there be scheduling more async work?

process.removeAllListeners() emits removeListener events, so maybe that's scheduling more work?
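
That behavior is easy to see in isolation (a standalone sketch, not the test's code):

const EventEmitter = require('events');
const ee = new EventEmitter();
ee.on('removeListener', (name) => console.log('removed a listener for', name));
ee.on('foo', () => {});
ee.on('bar', () => {});
// removeAllListeners() emits one 'removeListener' event per removed listener:
ee.removeAllListeners();
// → removed a listener for foo
// → removed a listener for bar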

If we're pretty sure the being-emitted-twice thing is not actually a bug but dependent on the state of the machine, the easiest fix might be to use process.once() instead. I'll submit a PR now-ish...
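
Something along these lines (a sketch of the idea, not the PR itself; onBeforeExit is a placeholder name):

// once() detaches the listener after the first call, so a spurious second
// 'beforeExit' emit can no longer trip the common.mustCall() counter:
process.once('beforeExit', common.mustCall(onBeforeExit));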

Seems like we now have a repro: https://ci.nodejs.org/job/node-stress-single-test/1475/ (Ubuntu1604x64).
All the above clues make me think this is a memory corruption bug...

@Trott that commit landed 6 days ago, while this issue was opened 18 days ago... I think there could be more to it; we might need to audit some crypto error PRs

@joyeecheung Urm...hmmm... Maybe that commit caused the failures on ubuntu 1604, which only started failing more recently? Maybe the bug has deeper roots on Windows?

I did the stress testing on Ubuntu so it would go faster, but I guess I may need to take a deep breath, set aside several hours, and do it on Windows....

@Trott I looked into e567402 again and AFAICT it does not seem to touch the code path tested in test-async-wrap-uncaughtexception. Maybe we could switch to another async hooks provider in the test to see if the root is in crypto?
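
A hypothetical way to do that, assuming the test currently drives its hooks through a crypto request (the names and the FSREQWRAP provider here are illustrative, not the test's code):

const fs = require('fs');
// Trigger the hooks with an fs request instead of a crypto one, keeping the
// throw inside the callback so the uncaughtException path still runs:
fs.stat(__filename, common.mustCall(() => {
  throw new Error('boom');
}));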

Hmmm....after 12 hours of nothing but red (presumably due to this issue), Windows is coming back green/yellow/red now...

The constant failures on ubuntu1604-64 stopped too. 🤔

There was a commit that landed around that time that (incidentally) removed the text from the throw Error() in this test...

Argh, nope, still a thing, now showing up on fedora24:

https://ci.nodejs.org/job/node-test-commit-linux/13748/nodes=fedora24/console

  ...
not ok 136 parallel/test-async-wrap-uncaughtexception
  ---
  duration_ms: 0.138
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/common/index.js:490:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/fedora24/test/parallel/test-async-wrap-uncaughtexception.js:14:33)
        at Module._compile (module.js:641:30)
        at Object.Module._extensions..js (module.js:652:10)
        at Module.load (module.js:560:32)
        at tryModuleLoad (module.js:503:12)
        at Function.Module._load (module.js:495:3)
        at Function.Module.runMain (module.js:682:10)
        at startup (bootstrap_node.js:191:16)

BTW I have seen this a few times this week on my macbook, but have never been able to reproduce reliably. It only shows up when I run the tests in parallel.

I'm worried about this one at Code + Learn next week because I can totally imagine a third of the room having the test fail locally or something like that....

> I'm worried about this one at Code + Learn next week because I can totally imagine a third of the room having the test fail locally or something like that....

Alternative: marking it as flaky in PR https://github.com/nodejs/node/pull/16694

@jasnell RE: #16692: we see either a regression between 8.6.0 and 8.7.0 or some level of non-determinism, so IMHO we shouldn't force the test to pass and thereby hide the bug

I created a new project to track flakes on CI and put a variety of issues related to this test in a column. Please feel free to add more related issues/PRs

FWIW, moving this test to sequential does not solve the problem (tried it in https://github.com/nodejs/node/pull/16733) so having too much other stuff happening on the system is unlikely to be the problem.

I can replicate this if I use the inspector and step into or out of stuff.
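
For reference, a typical way to run it under the inspector (assuming the standard flag and the test's path in the repo):

node --inspect-brk test/parallel/test-async-wrap-uncaughtexception.js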

I'm starting to suspect that the extra beforeExit call might be caused either by the changes in the block starting on line 4708 of src/node.cc in https://github.com/nodejs/node/commit/f27b5e4bdaa or (more likely, perhaps?) by the V8 changes that commit is trying to accommodate. I haven't been able to pin it down yet, but if a C++ person wants to do some debugging, please do. :-D (And, of course, I could be totally wrong in my suspicion, just like I've been multiple times already with this issue.)

@Trott The only V8 functionality I know of that would require that change is asynchronous WASM compilation, which is currently behind a flag even in V8 master.

It might be related to f27b5e4, but I’d be surprised since all I can see V8 doing when the test is being executed is scheduling background tasks, no foreground or delayed foreground tasks…

This has been fixed and can be closed now.
