Version: 10.16.3
Platform: 4.15.0-1044-aws #46-Ubuntu SMP Thu Jul 4 13:38:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Subsystem: http2
I'm reusing a session for performance reasons and opening 10 new requests per second to a server. On Node 10.16.2 it runs forever with no problems. On 10.16.3 it starts throwing "Error [ERR_HTTP2_STREAM_ERROR]: Stream closed with error code NGHTTP2_ENHANCE_YOUR_CALM" after a minute or two (around 1,000 requests).
Is this a newly introduced bug, or is there a way to prevent it?
Can you share any more information? Is the Node.js program the server, the client, or both? What is the value of the maxSessionMemory option on either side (if it is set)? Does running node with --security-reverts=CVE-2019-9514 help?
(See also https://github.com/nodejs/node/pull/29122#issuecomment-522642986 for some background.)
This is most likely related to the new denial of service attack mitigations. The revert workaround @addaleax suggests should make it work but this is something we're going to need to revisit.
@addaleax I'm not aware of the server-side implementation. You can use this URL to reproduce:
https://api.binance.com/api/v1/depth?symbol=BTCUSDT. maxSessionMemory was not set explicitly. Running node with --security-reverts=CVE-2019-9514 does not solve the issue.
It was probably a different commit that introduced this issue.
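For reference, a minimal sketch of the access pattern being described (one long-lived session, roughly ten requests per second against the URL above); the interval and logging are illustrative rather than the exact repro code:

```js
// Sketch of the reported workload: a single reused HTTP/2 session issuing
// ~10 GET requests per second until a stream fails with ENHANCE_YOUR_CALM.
const http2 = require('http2');

const session = http2.connect('https://api.binance.com');
session.on('error', (err) => console.error('session error:', err));

let count = 0;
setInterval(() => {
  const req = session.request({ ':path': '/api/v1/depth?symbol=BTCUSDT' });
  req.setEncoding('utf8');
  req.on('data', () => {});        // drain the response body
  req.on('end', () => {
    if (++count % 100 === 0) console.log(`${count} requests completed`);
  });
  req.on('error', (err) => {
    console.error(`request failed after ${count} completions:`, err.code, err.message);
    process.exit(1);
  });
  req.end();
}, 100); // 10 requests per second
```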
We are experiencing a similar issue in the Google Cloud Firestore SDK. We have 10 or so customers reporting their production apps hitting this error where our http/2 stream is closed with NGHTTP2_ENHANCE_YOUR_CALM.
I can reproduce the issue with node v10.16.3 but not v10.16.2 so it seems to be a regression. Running node with --security-reverts=CVE-2019-9514 _does not_ help.
I've created a repro project here: https://github.com/mikelehen/node-http2-repro
Unfortunately it is _not_ a minimal repro, but it is what I was able to piece together based on our customer reports. Is it enough to debug the issue and isolate a root cause? We would very much appreciate a workaround and/or a fix to node so that we can unblock our customers. Thank you!
We managed to get a reproducible version of the bug we are experiencing. I can confirm that including the flags --security-reverts=CVE-2019-9514 --security-reverts=CVE-2019-9512 --security-reverts=CVE-2019-9516 --security-reverts=CVE-2019-9518 does not fix the issue.
I was able to bisect and identify this commit as the one causing the failure... @addaleax any thoughts?
/cc @nodejs/http2
A couple of quick tests: can you try setting the maxSessionMemory option significantly higher and see if the problem persists? There are a few other knobs that can be turned as well, but let's try that one first. The default value for maxSessionMemory is 10, i.e. a maximum of ten megabytes per session, which could be eaten away quickly by a large number of streams.
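For anyone trying this, a sketch of where the knob goes; maxSessionMemory is accepted as an option by both http2.connect() and http2.createServer() / http2.createSecureServer(), and the value is in megabytes (the URL and the value 100 below are just examples, not recommendations):

```js
const http2 = require('http2');

// Client side: raise the per-session memory cap from the default of 10 MB.
const session = http2.connect('https://example.com', {
  maxSessionMemory: 100, // megabytes; illustrative value only
});

// Server side: the same option can be passed when creating the server.
const server = http2.createServer({
  maxSessionMemory: 100,
});
```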
> I was able to bisect and identify this commit as the one causing the failure... @addaleax any thoughts?
@MylesBorins Which commit?
The commit @MylesBorins meant to link is https://github.com/nodejs/node/pull/29122/commits/164ac5b241b96089e6bad5bb83ea416966b3245f . 😄
Okay, that’s interesting – I’ll take a look. It still would be good to know if turning the maxSessionMemory knob makes a difference here, as @jasnell suggested.
Based on that commit message, specifically "If we are waiting for the ability to send more output, we should not process more input", I think the overall goal of that commit is a bad idea. With that logic implemented on both sides of a connection, if one side ever pushes back on flow control for any reason, the other side will stop reading and start pushing back the other way, which will result in a deadlock where neither side will clear out their read buffer because they're waiting for the other side to do so.
And if just the client implements that logic, it's easy to imagine a scenario where the client doesn't read enough, resulting in the server being unable to write while still receiving more data. It wouldn't surprise me if some servers send ENHANCE_YOUR_CALM when that happens.
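To make the second scenario concrete, here is a minimal, purely illustrative local sketch (not taken from the issue): a client that never reads its response lets the flow-control window and local buffers fill up, at which point the server's writes start reporting backpressure.

```js
const http2 = require('http2');

const server = http2.createServer();
server.on('stream', (stream) => {
  stream.respond({ ':status': 200 });
  const chunk = Buffer.alloc(16 * 1024);
  const writeUntilBlocked = () => {
    // Keep writing until write() returns false, which happens once the client
    // stops consuming and the flow-control window plus local buffers fill up.
    while (stream.writable && stream.write(chunk)) {}
    if (stream.writable) {
      console.log('server: backpressure, waiting for drain');
      stream.once('drain', writeUntilBlocked);
    }
  };
  writeUntilBlocked();
  stream.on('error', () => {});
});

server.listen(0, () => {
  const client = http2.connect(`http://localhost:${server.address().port}`);
  const req = client.request({ ':path': '/' });
  // Deliberately never read the response: the received data stays buffered
  // until the flow-control window is exhausted and the peer can no longer send.
  req.pause();
  setTimeout(() => {
    req.close();
    client.close();
    server.close();
  }, 2000);
});
```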
@addaleax @jasnell It looks like increasing maxSessionMemory _does_ solve the problem. If I set it to 100, my repro works. If I set it to 1 then it fails much faster.
A couple of notes:
- The repro uses http2.connect(), not http2.createServer().
- FYI: if you need to run my repro in https://github.com/mikelehen/node-http2-repro, you'll need access to my project, which I'll have to grant you out-of-band. But hopefully the info above is enough to resolve the issue? Thanks!
@murgatroid99 ... the scenario you describe is entirely possible and worth exploring to see how much of an issue it is in our implementation. I'm thinking, however, that this specific problem is more of a memory management issue. Specifically: we need to get better at it. The current maxSessionMemory limit was really just a finger-in-the-air guesstimate, and it's looking like we either need to (a) increase it or (b) find other ways of reducing the memory load per Http2Session.
@jasnell Any guesses why https://github.com/nodejs/node/commit/164ac5b241b96089e6bad5bb83ea416966b3245f would have exacerbated the memory management issue?
FWIW, the more I increase maxSessionMemory, the longer my repro runs before failing, even though we are doing fixed-size requests and responses and only have one outstanding at a time, so I would expect the necessary session memory to be pretty much flat. That makes me wonder whether https://github.com/nodejs/node/commit/164ac5b241b96089e6bad5bb83ea416966b3245f introduced a mismatch between the IncrementCurrentSessionMemory() / DecrementCurrentSessionMemory() calls.
Like maybe https://github.com/nodejs/node/commit/164ac5b241b96089e6bad5bb83ea416966b3245f#diff-a2de252da810db89f574ecafa0f8e95aR1901 should instead be:
```cpp
DecrementCurrentSessionMemory(stream_buf_.len - stream_buf_offset_);
```
(this is a wild guess. I don't really understand the code; I'm just trying to apply some pattern matching...)
That definitely sounds like a memory accounting issue then... or a straight-up memory leak. It gives us a good place to start the investigation: the first step is to rule out a memory leak, and if that looks good, we can look at the accounting.
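One simple way to start that check (a sketch, assuming clientSession is the Http2Session created by whatever repro is being run): log process memory alongside the session's public state while the workload runs. The internal maxSessionMemory counter itself isn't exposed, so this only shows whether resident memory actually grows; flat memory plus an eventual ENHANCE_YOUR_CALM would point at accounting drift rather than a real leak.

```js
// Periodically log process memory and HTTP/2 session state while the repro
// workload runs. `clientSession` is assumed to come from the surrounding
// repro code; it is not defined here.
setInterval(() => {
  const { rss, external } = process.memoryUsage();
  console.log({
    rssMB: Math.round(rss / 1024 / 1024),
    externalMB: Math.round(external / 1024 / 1024),
    sessionState: clientSession.state, // window sizes, outbound queue size, etc.
  });
}, 5000).unref();
```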
@jasnell Thanks for all your investigations so far.
I work with @mikelehen on Firestore and was wondering if you could let us know if you had an estimated timeline for a fix? We have a couple of users that are blocked on this. Thank you so much!
With my current workload, it won't be until later this week at the earliest, but definitely haven't forgotten!
In case it helps, I actually tried out my guess above (https://github.com/nodejs/node/issues/29223#issuecomment-537721898) last week and with that tweak, the error seems to go away. That said, I don't know how to verify that all the accounting is actually coming out correct. It's possible I just messed up the accounting in a different way.
Thank you for the update @jasnell! Let us know if you need help verifying a fix.
Hi, is there any news on this issue? Also, could the fix be backported to Node.js 12, since that version recently became the LTS release? Thanks for your efforts!
Any update here? Would it help if I sent a PR based on my guess in https://github.com/nodejs/node/issues/29223#issuecomment-537721898 ? I don't know the code well enough to know if it's correct or how to definitively validate / test it, though. :-)
I've opened a PR (https://github.com/nodejs/node/pull/30684) that addresses the issue. 🤞 we can get it merged? 😄