The change in https://github.com/nodejs/node/pull/30965 means that Node now relies on process._fatalException
calling emitAfter(asyncId);
However, this cannot be done in userland code, because the emitAfter
method is not exposed.
With the deprecation of domains in Node.js version 4, as far as I've been able to tell, patching process._fatalException
and tracking the process flow through async hooks is the _only_ way to reliably tie an error back to a chain of continuations. (And it doesn't even work 100% reliably, because Promises break async hooks.)
For node-tap, in order to fail the appropriate test when an error is thrown or promise is rejected (most of the time, at least), I'm using async-hook-domain. I used to use domains, but was scared off by the noisy warnings.
Now, a test like this:
const t = require('tap')
t.test('failure', t => {
throw new Error('nope')
})
which should output this:
$ node fail.js
TAP version 13
# Subtest: failure
not ok 1 - nope
---
stack: |
Test.<anonymous> (fail.js:3:9)
at:
line: 3
column: 9
file: fail.js
function: Test.<anonymous>
tapCaught: testFunctionThrow
test: failure
source: |
t.test('failure', t => {
throw new Error('nope')
--------^
})
...
1..1
# failed 1 test
not ok 1 - failure # time=18.611ms
instead (as of 503900b4633a541ecbebc159487f775c2669f54d) outputs this:
$ node fail.js
TAP version 13
# Subtest: failure
not ok 1 - nope
---
stack: |
Test.<anonymous> (fail.js:3:9)
at:
line: 3
column: 9
file: fail.js
function: Test.<anonymous>
tapCaught: testFunctionThrow
test: failure
source: |
t.test('failure', t => {
throw new Error('nope')
--------^
})
...
1..1
# failed 1 test
not ok 1 - failure # time=18.611ms
Error: async hook stack has become corrupted (actual: 18, expected: 1)
1: 0x10000b8d9 node::AsyncHooks::pop_async_id(double) [/usr/local/bin/node]
2: 0x10000189a node::InternalCallbackScope::Close() [/usr/local/bin/node]
3: 0x10000149a node::InternalCallbackScope::~InternalCallbackScope() [/usr/local/bin/node]
4: 0x1000b3f45 node::NodeMainInstance::Run() [/usr/local/bin/node]
5: 0x10005e46f node::Start(int, char**) [/usr/local/bin/node]
6: 0x7fff657932e5 start [/usr/lib/system/libdyld.dylib]
7: 0x2
What has to happen to get error handling and continuation tracking shipped in core? The pieces are all there, they're just not exposed in any kind of useful way. I'm happy to help.
Failing that, I'd suggest reverting 503900b4633a541ecbebc159487f775c2669f54d.
@apapirovski
@nodejs/async_hooks
i can repro this
Hrmmm... this is problematic. Patching _fatalException
in this way also means any errors in nextTick & such will result in the same async stack corruption.
The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.
A couple of options:
Revert, but then the reproduction instead becomes having your test throw in a nextTick. It's still wrong and broken.
Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.
If there are no objections for 2, I'll get something coded up tomorrow.
FWIW, having tap
in CITGM probably would have caught this problem. (npm ci && npm test
on a fresh checkout of the node-tap repo passes with 13.4 and fails with 13.5.) @nodejs/citgm
Hrmmm... this is problematic. Patching _fatalException in this way also means any errors in nextTick & such will result in the same async stack corruption.
This is easy to work around: https://github.com/tapjs/async-hook-domain/blob/master/index.js#L153-L159
And it works today: https://gist.github.com/isaacs/3cc8394c351ead9d208748037b338764
Maybe I'm not sure I understand what you mean?
The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.
"Valid" is a value judgement, and I will freely admit that async-hook-domain is a dirty ugly gross hack that belongs in core rather than userland. If you want me to feel bad about it, fine, I feel bad. But unfortunately, there is no other way to track errors through continuations without triggering deprecation warnings or catching _all_ errors (even those that should not be caught). The state of error handling hasn't changed since I brought this up 2 years ago.
There is a comment from @joyeecheung about maintaining the patchability of process._fatalException
, and noting that investigation should be done to determine whether process._fatalException
can be deprecated. To my knowledge, that investigation has not been done, and until there is some other way to track errors through continuations, I think it's clear that no, it can't be deprecated.
Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.
If you ping me on the PR or post a link here, I'd be happy to review and make sure it un-breaks async-hook-domain and tap.
@Trott what has to happen to get tap in citgm?
Also: can we pull async-hook-domain or something like it into core? Who needs to be in that conversation?
Just to be super clear: I'd _truly love_ to put require('domains')
and process._fatalException
and async-hook-domain
in the trash bin forever, but not until there's something to replace them with and a reasonable transition plan to get there.
what has to happen to get tap in citgm?
PRed into https://github.com/nodejs/citgm/blob/master/lib/lookup.json
Perhaps async-hook-domain should be added in as well?
@nodejs/citgm
@sam-github I think adding ahd would be more annoying than useful. Its tests are pretty frequently broken by perfectly innocuous wording changes in error messages, but it's usually pretty easy to fix. Testing tap is a better integration canary anyway.
Haven't forgotten about this btw. Slightly bigger change but should have it done soon.
This is easy to work around
Sure, but that workaround was required because that also caused a similar async stack corruption.
@apapirovski Any update on this? It's really making it hard to run my tests on node v13, because thrown errors are not handled properly.
libtap is being split from tap. This specific issue is caught by running the libtap tests against node.js 13.5.0+, once tap uses libtap many tests will be removed (moved to libtap). I spoke with @isaacs, we think that it would be useful to also include tap itself as this could catch issues with process spawning and/or stdio handling.
I haven't had a chance to actually look at CITGM, hopefully I'll be able to make time soon.
@isaacs How are you doing with the latest versions of Node.js 12 & 13? https://github.com/nodejs/node/pull/31784 basically reverted the PR you mentioned above.
@addaleax I just retested libtap and node-tap in latest node.js 12 and 13, neither have this crash. Note I never actually saw the crash in any version of node.js 12 (maybe the bug was in 12.16.0 only and I missed it).
It looks like this is fixed in the latest v14 release! Thanks! 馃コ
If there's some other reason why this ticket is left open other than just forgetting to close it, feel free to re-open. But I'm satisfied, thank you very much for getting this done before the stable release :)
Most helpful comment
A couple of options:
Revert, but then the reproduction instead becomes having your test throw in a nextTick. It's still wrong and broken.
Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.
If there are no objections for 2, I'll get something coded up tomorrow.