Node: 13.5 broke tap's error handling

Created on 31 Dec 2019 · 17Comments · Source: nodejs/node

Version: 13.5.0
Platform: Darwin tau.local 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64 (but irrelevant to this issue)
Subsystem: async hooks

The change in https://github.com/nodejs/node/pull/30965 means that Node now relies on process._fatalException calling emitAfter(asyncId);

However, this cannot be done in userland code, because the emitAfter method is not exposed.

With the deprecation of domains in Node.js version 4, as far as I've been able to tell, patching process._fatalException and tracking the process flow through async hooks is the _only_ way to reliably tie an error back to a chain of continuations. (And it doesn't even work 100% reliably, because Promises break async hooks.)

For node-tap, in order to fail the appropriate test when an error is thrown or promise is rejected (most of the time, at least), I'm using async-hook-domain. I used to use domains, but was scared off by the noisy warnings.

Now, a test like this:

const t = require('tap')
t.test('failure', t => {
  throw new Error('nope')
})

which should output this:

$ node fail.js
TAP version 13
# Subtest: failure
    not ok 1 - nope
      ---
      stack: |
        Test.<anonymous> (fail.js:3:9)
      at:
        line: 3
        column: 9
        file: fail.js
        function: Test.<anonymous>
      tapCaught: testFunctionThrow
      test: failure
      source: |
        t.test('failure', t => {
          throw new Error('nope')
        --------^
        })
      ...

    1..1
    # failed 1 test
not ok 1 - failure # time=18.611ms

instead (as of 503900b4633a541ecbebc159487f775c2669f54d) outputs this:

$ node fail.js
TAP version 13
# Subtest: failure
    not ok 1 - nope
      ---
      stack: |
        Test.<anonymous> (fail.js:3:9)
      at:
        line: 3
        column: 9
        file: fail.js
        function: Test.<anonymous>
      tapCaught: testFunctionThrow
      test: failure
      source: |
        t.test('failure', t => {
          throw new Error('nope')
        --------^
        })
      ...

    1..1
    # failed 1 test
not ok 1 - failure # time=18.611ms

Error: async hook stack has become corrupted (actual: 18, expected: 1)
 1: 0x10000b8d9 node::AsyncHooks::pop_async_id(double) [/usr/local/bin/node]
 2: 0x10000189a node::InternalCallbackScope::Close() [/usr/local/bin/node]
 3: 0x10000149a node::InternalCallbackScope::~InternalCallbackScope() [/usr/local/bin/node]
 4: 0x1000b3f45 node::NodeMainInstance::Run() [/usr/local/bin/node]
 5: 0x10005e46f node::Start(int, char**) [/usr/local/bin/node]
 6: 0x7fff657932e5 start [/usr/lib/system/libdyld.dylib]
 7: 0x2

What has to happen to get error handling and continuation tracking shipped in core? The pieces are all there, they're just not exposed in any kind of useful way. I'm happy to help.

Failing that, I'd suggest reverting 503900b4633a541ecbebc159487f775c2669f54d.

async_hooks confirmed-bug

Source

isaacs

Most helpful comment

A couple of options:

Revert, but then the reproduction instead becomes having your test throw in a nextTick. It's still wrong and broken.
Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.

If there are no objections for 2, I'll get something coded up tomorrow.

apapirovski on 31 Dec 2019

👍3

All 17 comments

@apapirovski

Trott on 31 Dec 2019

@nodejs/async_hooks

Trott on 31 Dec 2019

i can repro this

devsnek on 31 Dec 2019

Hrmmm... this is problematic. Patching _fatalException in this way also means any errors in nextTick & such will result in the same async stack corruption.

The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.

apapirovski on 31 Dec 2019

A couple of options:

Revert, but then the reproduction instead becomes having your test throw in a nextTick. It's still wrong and broken.
Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.

If there are no objections for 2, I'll get something coded up tomorrow.

apapirovski on 31 Dec 2019

👍3

FWIW, having tap in CITGM probably would have caught this problem. (npm ci && npm test on a fresh checkout of the node-tap repo passes with 13.4 and fails with 13.5.) @nodejs/citgm

Trott on 31 Dec 2019

Hrmmm... this is problematic. Patching _fatalException in this way also means any errors in nextTick & such will result in the same async stack corruption.

This is easy to work around: https://github.com/tapjs/async-hook-domain/blob/master/index.js#L153-L159

And it works today: https://gist.github.com/isaacs/3cc8394c351ead9d208748037b338764

Maybe I'm not sure I understand what you mean?

The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.

"Valid" is a value judgement, and I will freely admit that async-hook-domain is a dirty ugly gross hack that belongs in core rather than userland. If you want me to feel bad about it, fine, I feel bad. But unfortunately, there is no other way to track errors through continuations without triggering deprecation warnings or catching _all_ errors (even those that should not be caught). The state of error handling hasn't changed since I brought this up 2 years ago.

There is a comment from @joyeecheung about maintaining the patchability of process._fatalException, and noting that investigation should be done to determine whether process._fatalException can be deprecated. To my knowledge, that investigation has not been done, and until there is some other way to track errors through continuations, I think it's clear that no, it can't be deprecated.

Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.

If you ping me on the PR or post a link here, I'd be happy to review and make sure it un-breaks async-hook-domain and tap.

@Trott what has to happen to get tap in citgm?

Also: can we pull async-hook-domain or something like it into core? Who needs to be in that conversation?

isaacs on 31 Dec 2019

Just to be super clear: I'd _truly love_ to put require('domains') and process._fatalException and async-hook-domain in the trash bin forever, but not until there's something to replace them with and a reasonable transition plan to get there.

isaacs on 31 Dec 2019

👍1

what has to happen to get tap in citgm?

PRed into https://github.com/nodejs/citgm/blob/master/lib/lookup.json

Perhaps async-hook-domain should be added in as well?

sam-github on 31 Dec 2019

@nodejs/citgm

sam-github on 31 Dec 2019

@sam-github I think adding ahd would be more annoying than useful. Its tests are pretty frequently broken by perfectly innocuous wording changes in error messages, but it's usually pretty easy to fix. Testing tap is a better integration canary anyway.

isaacs on 2 Jan 2020

Haven't forgotten about this btw. Slightly bigger change but should have it done soon.

This is easy to work around

Sure, but that workaround was required because that also caused a similar async stack corruption.

apapirovski on 8 Jan 2020

👍2

@apapirovski Any update on this? It's really making it hard to run my tests on node v13, because thrown errors are not handled properly.

isaacs on 11 Feb 2020

libtap is being split from tap. This specific issue is caught by running the libtap tests against node.js 13.5.0+, once tap uses libtap many tests will be removed (moved to libtap). I spoke with @isaacs, we think that it would be useful to also include tap itself as this could catch issues with process spawning and/or stdio handling.

I haven't had a chance to actually look at CITGM, hopefully I'll be able to make time soon.

coreyfarrell on 12 Feb 2020

👍1

@isaacs How are you doing with the latest versions of Node.js 12 & 13? https://github.com/nodejs/node/pull/31784 basically reverted the PR you mentioned above.

addaleax on 20 Feb 2020

@addaleax I just retested libtap and node-tap in latest node.js 12 and 13, neither have this crash. Note I never actually saw the crash in any version of node.js 12 (maybe the bug was in 12.16.0 only and I missed it).

coreyfarrell on 20 Feb 2020

It looks like this is fixed in the latest v14 release! Thanks! 🥳

If there's some other reason why this ticket is left open other than just forgetting to close it, feel free to re-open. But I'm satisfied, thank you very much for getting this done before the stable release :)

isaacs on 7 May 2020

👍2

Was this page helpful?

0 / 5 - 0 ratings