Version: 12.13.0
Platform: Alpine 3.9 (Linux x86_64) on AWS ECS
Subsystem: domain
TypeError: Cannot read property 'enter' of undefined
at AsyncHook.before (domain.js:76:20)
at emitHook (internal/async_hooks.js:164:38)
We use @sentry/node
to log our errors in express
, which is using the domain
module here:
https://github.com/getsentry/sentry-javascript/blob/master/packages/node/src/handlers.ts#L271
So far it is extremely hard to reproduce but if we run our production server for around 10 minutes, our application will crash with the above error and no other stack trace. At first, we thought it was related to https://github.com/nodejs/node/issues/28275 but it appears to be different as it looks like the fix for this landed around 12.8.0
and we can confirm that this happens in 12.13.0
still.
Versions 12.2.0
or below do not have this issue.
I believe the actual line is 78, not 76 which can be seen here:
https://github.com/nodejs/node/blob/master/lib/domain.js#L78
EDIT: Thanks @richardlau, line 79 in v12.13.0 https://github.com/nodejs/node/blob/v12.13.0/lib/domain.js#L79
In 12.13.0 the line would appear to be 79:
https://github.com/nodejs/node/blob/v12.13.0/lib/domain.js#L79
@emhagman Is that all the stack trace you get? Is there any chance you could figure out what type of resource the hook is called for?
@addaleax This is the only stack trace that shows up, unfortunately. I can try to add a lot more logging and get some more context around when it happens but that is about it. Can I run node
with any useful DEBUG=
or command-line arguments that would help you out?
@emhagman Can you dump async hooks output to a file (Ă la https://github.com/addaleax/node/blob/ee4027d5bbabe87d4446a62ec0b7b60e7af59c30/test.js#L3-L18 except not /dev/stderr
) and make sure that hook is installed and active before any code that accesses domains? That might already be very helpful.
@addaleax Sure thing I'll get it to you as soon as I can. Thanks for the help!
Did Github delete a comment here? Anyway; just thinking out loud…
The debug log shows that there is no destroy
for the async id in question, so the C++ TLSWRAP
object should still be alive, and thus keep the domain object alive through its .domain
property; but the error message basically says that the domain has been garbage collected and is no longer accessible. The conclusion for me would be that the domain
property on the native TLSWrap
object is un-set at some point, but I really can’t think of a way that that would happen accidentally.
I’m not really sure how to best debug this further – one thing you could try is running node with --abort-on-uncaught-exception
and share the core dump here, if that’s an option for you… that should at least enable verifying whether the property is set at time of the crash or not.
@addaleax Sorry, I deleted the comment because I left the resource
out of the logs. I was trying to get a log for you that included the resource but I currently can't make the app crash again, it really is hard to reproduce. I am doing these tests in one of our staging environments so adding --abort-on-uncaught-exception
won't be a problem.
I'm not sure if it helps but we can anecdotally confirm (other users in that thread) that not using the @sentry/node
middleware removes the issue. I don't know if you saw the way they're using domain
and got any insight. They use it throughout the @sentry/*
ecosystem for Node.
@emhagman I’m not personally familiar with their code, although you can of course also report this as an issue to them if you haven’t already (and ideally add a link this one).
@addaleax I have and they have deemed it an issue with Node, unfortunately. Since this has to do with GC, are there any ways to force a situation where this is more likely to occur? As of now, I have no idea how to reproduce it other than to use our app until it breaks
@emhagman You could set the value of the --gc-interval=n
flag to a low value, if you want to cause more frequent GC; that might help?
@addaleax I have the logs with the resources. Adding --gc-interval=5
sped up the error occurring, thanks for the tip.
ASYNC_HOOK_DEBUG: init {
id: 47259,
type: 'TLSWRAP',
triggerId: 47235,
resource: ReusedHandle {
type: 43,
handle: TLSWrap {
_parent: [TCP],
_parentWrap: undefined,
_secureContext: [SecureContext],
reading: true,
onhandshakestart: [Function: noop],
onhandshakedone: [Function],
onocspresponse: [Function: onocspresponse],
onnewsession: [Function: onnewsessionclient],
onkeylog: [Function: onkeylogclient],
onerror: [Function: onerror],
[Symbol(owner)]: [TLSSocket]
}
}
}
ASYNC_HOOK_DEBUG: before { id: 47259 }
TypeError: Cannot read property 'enter' of undefined
at AsyncHook.before (domain.js:79:20)
at emitHook (internal/async_hooks.js:164:38)
Aborted (core dumped)
Not sure if that helps. Working on getting the dump from my container now
EDIT: We use ECS Fargate on AWS so that will be impossible. I'll have to try and reproduce this locally so I can get access to the dump.
@emhagman Alright, this makes a bit more sense now. The .domain
property isn’t actually set on the real resource, it’s set on a fake resource object used by the HTTP agent that can be garbage collected independently.
I’m assuming that 3d9d1ade2a361f408b116c5bafb2fcd560310f9c is responsible for this … any chance you could verify that the bug was introduced in Node v12.3.0?
Also, if it helps with reproducing: If i’m correct, this will mostly happen when the https.Agent
re-uses sockets for multiple requests to the same host.
@addaleax I believe others have already reproduced this bug in 12.3.0 and mentioned reverting to 12.2.0 fixes it in the Sentry issue thread. If I can make this happen on demand I'll be able to be positive about that.
What you said about the https agent makes perfect sense, we have a client that uses the same https agent with keepAlive on to the same host to try to save on connections to one of our internal services. That definitely helps with me getting closer to a reproducible bug. I'll keep digging, thanks again for the help.
@emhagman If you’re in a position to try out patches to Node.js, could you try this one?
diff --git a/lib/_http_agent.js b/lib/_http_agent.js
index dcb5ed376de8..f4d9bb7fe36a 100644
--- a/lib/_http_agent.js
+++ b/lib/_http_agent.js
@@ -44,10 +44,15 @@ const {
// ClientRequest.onSocket(). The Agent is now *strictly*
// concerned with managing a connection pool.
+const kReusedHandle = Symbol('kReusedHandle');
+
class ReusedHandle {
constructor(type, handle) {
this.type = type;
this.handle = handle;
+ // Tie the lifetime of the two objects together, mostly for the 'domain'
+ // module.
+ handle[kReusedHandle] = this;
}
}
As soon as I can reproduce it easily I'll try the patch 👍
It appears the library in question also uses httpsAgent
with keepAlive: true
so I think you're onto something! https://github.com/getsentry/sentry-javascript/blob/edd08f9fa4e4c387ed9c3ba2500119c78e67be15/packages/raven-node/lib/transports.js#L10
Just come here after just adding sentry to my dependencies.
Here you have multiple userland stack traces I'm seeing;
Hope this helps
TypeError: domain.enter is not a function
at IncomingMessage.EventEmitter.emit (domain.js:498:10)
at emitReadable_ (_stream_readable.js:574:12)
at processTicksAndRejections (internal/process/task_queues.js:79:21)
TypeError: domain.enter is not a function
at IncomingMessage.EventEmitter.emit (domain.js:498:10)
at resOnFinish (_http_server.js:646:7)
at ServerResponse.emit (events.js:215:7)
at ServerResponse.EventEmitter.emit (domain.js:476:20)
at onFinish (_http_outgoing.js:686:10)
at onCorkedFinish (_stream_writable.js:690:5)
at afterWrite (_stream_writable.js:501:3)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
md5-f38a9ce75cc0dfc6d811551d302395d3
TypeError: domain.enter is not a function
at IncomingMessage.EventEmitter.emit (domain.js:498:10)
at endReadableNT (_stream_readable.js:1183:12)
at processTicksAndRejections (internal/process/task_queues.js:80:21)
at runNextTicks (internal/process/task_queues.js:62:3)
at processImmediate (internal/timers.js:412:9)
at process.topLevelDomainCallback (domain.js:131:23)
@jmendiara That seems like an unrelated issue… if you have reason to believe it’s related to Node.js core, please open a separate issue, and if you can, use chrome devtools or some other debugging approach to figure out what kind of object domain
is at the call sites. (I’ll hide your comment and this one so it’s not read as related to this issue.)
@addaleax I am still trying to reproduce it but I noticed I linked to old code. They currently use https://github.com/getsentry/sentry-javascript/blob/master/packages/node/src/transports/https.ts#L18 at Sentry. The keepAlive: false
seems interesting as well as the timeout: 2000
.
Does that help you at all?
I'm following this issue as well as I'm really interested in solving it, just one thing about the keepalive flag and the httpsagent. Ist the agent also responsible for regular requests in express? I'm asking because the domain module is used in the Middleware of express, and as far as I understand the sentry client is using the http agent also for transferring the results back the the sentry backend, but in this case the domain is not that relevant. It is in the middleware though ... Assuming this is correctl, the "right way" of reproducing the issue would be sending parallel requests to your server so that the agent gets reused after some calls.
Not sure though,just my 2 cents.
@simllll That particular agent is not responsible for regular requests in express
. Correct me if I am wrong @addaleax, but this actually has nothing to do with Sentry using domain
in their own codebase but actually has to do with their use of httpsAgent
to send requests to their server.
Here is my current code I am using to try to reproduce the issue. Using --gc-interval=5
and still can't seem to get the error.
const https = require('https');
const fs = require('fs');
const util = require('util');
const crypto = require('crypto');
const async_hooks = require('async_hooks');
// async_hooks debug dump code...
function debug(...args) {
fs.writeFileSync('/dev/stdout', `ASYNC_HOOK_DEBUG: ${util.format(...args)}\n`, { flag: 'a' });
}
async_hooks.createHook({
init(id, type, triggerId, resource) {
debug('init', { id, type, triggerId, resource });
},
before(id) {
debug('before', { id });
},
after(id) {
debug('after', { id });
},
destroy(id) {
debug('destroy', { id });
},
}).enable();
const httpsAgent = new https.Agent({ keepAlive: false, maxSockets: 5, timeout: 2000 });
const httpsOpts = {
protocol: 'https:',
method: 'POST',
hostname: 'postman-echo.com',
path: '/post',
port: 443,
agent: httpsAgent
};
function imitateSentrySend(httpModule, event) {
return new Promise((resolve, reject) => {
const opts = Object.assign(httpsOpts, event.options);
const req = https.request(opts, function(res) {
res.setEncoding('utf8');
if (res.statusCode >= 200 && res.statusCode < 300) {
resolve({ status: res.statusCode });
} else {
reject(new Error(`Request failed with statusCode: ${res.statusCode}`));
}
res.on('data', () => {});
res.on('end', () => {});
});
req.on('error', reject);
req.end(JSON.stringify(event));
});
}
(async function main() {
while (true) {
const id = crypto.randomBytes(8).toString('hex');
const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
const fakeArray = Array(Math.ceil(Math.random() * 4)).fill(Math.random());
const promises = fakeArray.map(value => {
const delay = Math.round(Math.random() * 2);
const opts = delay ? { path: '/delay/' + delay, method: 'GET' } : {};
return imitateSentrySend(https, { ...event, value, options: opts });
});
await Promise.all(promises);
}
})();
Correct me if I am wrong @addaleax, but this actually has nothing to do with Sentry using
domain
in their own codebase but actually has to do with their use ofhttpsAgent
to send requests to their server.
@emhagman I think it’s the combination of both that’s making this problematic.
Here is my current code I am using to try to reproduce the issue. Using --gc-interval=5 and still can't seem to get the error.
Yeah, I don’t think this is going to crash the way it does in your production code without using domains…
I'm following this issue as well as I'm really interested in solving it, just one thing about the keepalive flag and the httpsagent. Ist the agent also responsible for regular requests in express?
@simllll I don’t know for sure, but I wouldn’t think that it is being used for those.
@addaleax Gotcha. Do I need to bind anything in particular or just be running this inside my own domain?
const debugDomain = domain.create();
debugDomain.on('error', err => {
console.log('Caught error in domain', err);
});
(async function main() {
debugDomain.run(async () => {
const id = crypto.randomBytes(8).toString('hex');
const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
const delay = Math.round(Math.random() * 4);
const opts = delay ? { path: '/delay/' + delay, method: 'GET' } : {};
await imitateSentrySend(https, { ...event, options: opts });
await main(); // run-again
});
})();
Sorry if I seem ignorant, I have never really worked with domains directly before. My basic understanding is they're used to hold context and to prevent errors from bubbling up to the main event loop.
Do I need to bind anything in particular or just be running this inside my own domain?
@emhagman I would think that that’s enough but that doesn’t really mean that I’m sure that it’ll work as a reproduction. (I’ll try to put one together myself, too – likely tomorrow though.)
My basic understanding is they're used to hold context and to prevent errors from bubbling up to the main event loop.
Yeah, that was kind of the original idea behind them when they were introduced.
@emhagman I'm only on my phone, but my thoughts would be to send the error (your imitatesentrysend) in the on error handler of the domain, and inside the run async method just throw a lot of errors with some context (e.g. create objects that you attach to the error object). They should get picked up by the error handler, where the agent will try to send the data somewhere. I also would think of a way to make the transfer to the backend as slow as possible so that gc can kick in while sending the data. Maybe try attach some megabytes of Data to the error object so that it will take a while ;)
@simllll Thanks for the tips. I am probably going to take a break for a bit. This is what I currently have if anyone wants to take a stab at it. I've set it up exactly as Sentry, with the request middleware and the error middleware and used a local domain just like they do in their request middleware. It then "sends the errors" to the echo service to imitate Sentry.
I've used the same maxSockets
value, timeout
value, and keepAlive
value that they do in their code.
With all of this, and using --gc-interval=5
, I still can't seem to reproduce the error. @addaleax If I can get a core dump to prove it, would that be enough? It may be painful but I can at least eventually reproduce it in our production environment and I can figure a way out to get the core dump.
Run the app with node --gc-interval=5 app.js
and then run node request.js
to test it out.
app.js
const express = require('express');
const https = require('https');
const fs = require('fs');
const util = require('util');
const crypto = require('crypto');
const async_hooks = require('async_hooks');
const domain = require('domain');
// async_hooks debug dump code...
function debug(...args) {
fs.writeFileSync('/dev/stdout', `ASYNC_HOOK_DEBUG: ${util.format(...args)}\n`, { flag: 'a' });
}
async_hooks.createHook({
init(id, type, triggerId, resource) {
debug('init', { id, type, triggerId, resource });
},
before(id) {
debug('before', { id });
},
after(id) {
debug('after', { id });
},
destroy(id) {
debug('destroy', { id });
},
}).enable();
const httpsAgent = new https.Agent({ keepAlive: false, maxSockets: 30, timeout: 2000 });
const httpsOpts = {
protocol: 'https:',
method: 'POST',
hostname: 'postman-echo.com',
path: '/post',
port: 443,
agent: httpsAgent
};
function imitateSentrySend(httpModule, event) {
return new Promise((resolve, reject) => {
const opts = Object.assign(httpsOpts, event.options);
const req = https.request(opts, function(res) {
res.setEncoding('utf8');
if (res.statusCode >= 200 && res.statusCode < 300) {
resolve({ status: res.statusCode });
} else {
reject(new Error(`Request failed with statusCode: ${res.statusCode}`));
}
res.on('data', () => {});
res.on('end', () => {});
});
req.on('error', reject);
req.end(JSON.stringify(event));
});
}
const app = express();
// Fake Sentry RequestHandler (binds domain)
app.use(function(req, res, next) {
const local = domain.create();
local.bind(req);
local.bind(res);
local.on('error', next);
local.run(() => {
next();
});
});
// Normal express route
app.get('/debug-sentry', function(req, res) {
const error = new Error('Random error');
error.randomData = crypto.randomBytes(1024 * 1024); // 1mb
throw error;
});
// Fake ErrorHandler, always sending event
app.use(function(err, req, res, next) {
if (err) {
const id = crypto.randomBytes(8).toString('hex');
const event = { id, msg: 'Some random error has occurred, sending to Sentry...' };
imitateSentrySend(https, { ...event, data: err.randomData }); // they don't await in their code either
next(err);
} else {
next();
}
});
app.listen(9999);
request.js
const http = require('http');
function r() {
return new Promise((resolve, reject) => {
const req = http.get('http://localhost:9999/debug-sentry', (resp) => resolve(resp));
req.on('error', reject);
});
}
(async function main() {
while (true) {
const promises = Array(100).fill(0).map(() => r());
try {
await Promise.all(promises);
} catch (err) {
console.log('Failed to resolve promises', err);
}
}
})();
@addaleax If I can get a core dump to prove it, would that be enough? It may be painful but I can at least eventually reproduce it in our production environment and I can figure a way out to get the core dump.
Fwiw, I’m somewhat optimistic about being able to put together a reproduction myself, and then using this as a regression test. In the worst case, we could still apply something like the patch above and hope for the best, although it would of course be nice to actually confirm that it does fix a real issue.
@addaleax Sounds good! I didn't know if you needed definitive proof to able to put it into the codebase which is why I was asking. I know I keep saying it but thanks for the help!
This is the only thing preventing us from upgrading to the higher number 12.X and we'd love to be able to use the cgroups
memory setting that landed in 12.7.0 https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#2019-07-23-version-1270-current-targos since we use Docker extensively and also take advantage of the startup time improvements from the V8 snapshots.
@saniagh Ran into this issue without Sentry. Of note, we both use node:alpine
on Docker.
@saniagh Ran into this issue without Sentry. Of note, we both use
node:alpine
on Docker.
Indeed, I've also checked to see if any module uses the https.Agent
as @emhagman suggested and I have found none.
I've upgraded our version of node
to v13.0.1 in development to try to reproduce the bug. ( I chose to match our production version )
I'll be back if I get something.
Alright! Put together a fix and a regression test: https://github.com/nodejs/node/pull/30196
@emhagman Thanks for the bug report and working this out with me!
@addaleax Thanks so much đź’Ż I appreciate you taking the time to look into and fix it. Your test code is very helpful in learning how to debug/reproduce things like this in the future. Thanks!
EDIT: Any idea on how long this will take to get cut into a release for 12.X?
EDIT: Any idea on how long this will take to get cut into a release for 12.X?
@emhagman So, typically the way this works is that the PR takes at least 48 hours to land unless explicitly being fast-tracked (I don’t think this qualifies as a trivial change), plus the time to the next Current/13.x release (Tuesday according to https://github.com/nodejs/Release/issues/487); and then the LTS rules says that commits need to have been released two weeks before being backported into LTS, but if there’s a strong case for doing so sooner, that’s usually not an issue for a low-risk patch like this one.
@addaleax Got it, thanks 👍
Awesome job @addaleax, thank you! :)
We still see the issue with node:12.14.1 (and above), @sentry/node and express
TypeError: Cannot read property 'enter' of undefined
at AsyncHook.before (domain.js:78:20)
at emitHook (internal/async_hooks.js:168:38)
Most helpful comment
Alright! Put together a fix and a regression test: https://github.com/nodejs/node/pull/30196
@emhagman Thanks for the bug report and working this out with me!