Hi,
We updated the Node.js version from 10.15.0 to 10.15.3 for a service that runs behind an AWS Application Load Balancer. After the update, our test suite revealed an issue we had not seen before: HTTP 502 errors returned by the load balancer. Previously this happened when the Node.js server closed a connection before the load balancer did, and we solved it by setting `server.keepAliveTimeout = X`, where X is higher than the keep-alive (idle) timeout on the load balancer side.
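For reference, a minimal sketch of that workaround (the port and the 75-second value are illustrative; the only requirement is that it exceeds the ALB idle timeout, which is 60 seconds by default):

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Keep idle keep-alive sockets open longer than the ALB idle timeout (60 s by default),
// so the load balancer always closes the connection first.
server.keepAliveTimeout = 75 * 1000;

server.listen(8080);
```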
With version 10.15.3, setting `server.keepAliveTimeout = X` no longer works, and we see regular 502 errors from the load balancer. I checked the Node.js changelog, and there was a change related to keep-alive connections in 10.15.2 (https://github.com/nodejs/node/commit/1a7302bd48) which might have caused the issue we are seeing.
Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.
// cc @nodejs/http
The Slowloris mitigations only apply to the HTTP header parsing stage. Past that stage, the normal timeouts apply (barring bugs, of course).
Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?
In our test suite, there are about 250 HTTP requests. I ran the suite four times for each of the following Node versions: 10.15.0, 10.15.1, and 10.15.2. For 10.15.0 and 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures (HTTP 502) per run. A different test case fails in every run, so the failures are not deterministic.
I tried to build a simple Node server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce it; the timing and speed of the client might matter.
I guess that `headersTimeout` should be longer than `keepAliveTimeout`, because after the first request of a keep-alive connection, `headersTimeout` is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you set a `headersTimeout` longer than `keepAliveTimeout`?
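Something like this minimal sketch, with illustrative values (the point is only the ordering of the two timeouts):

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Higher than the load balancer's idle timeout, as before.
server.keepAliveTimeout = 75 * 1000;
// Higher than keepAliveTimeout, so the headers timer cannot fire first
// while the connection sits idle between requests.
server.headersTimeout = 80 * 1000;

server.listen(8080);
```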
Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow `headersTimeout` seems to work only when headers are sent in multiple packets.)
To illustrate the issue with an example of two requests on a keep-alive connection: `keepAliveTimeout` works for 4-6 (the period between 4 and 6), while `headersTimeout` works for 3-7. So `headersTimeout` has to be longer than `keepAliveTimeout` in order to keep connections alive until `keepAliveTimeout`.

I wonder whether `headersTimeout` should include 3-6. 6-7 seems more intuitive for the name and should be enough for mitigating the Slowloris DoS, because 3-4 is up to the server and 4-6 is covered by `keepAliveTimeout`.
@shuhei So you mean that `headersTimeout` spans multiple requests on the same connection? I have not tried changing `headersTimeout` because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should be reset when a new request arrives, but instead it is defined by the first request for the connection.
@OrKoN Yes, `headersTimeout` spans parts of two requests on the same connection, including the interval between the two requests. Before https://github.com/nodejs/node/commit/1a7302bd48, it was applied only to the first request. That commit started resetting the headers timer when a request finishes, in order to apply `headersTimeout` to subsequent requests on the same connection.
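To make the interaction concrete, here is a rough sketch of the configuration from the original report (illustrative values; on its own it may not reproduce the 502s reliably, since the headers timer seems to only kick in under certain packet timing): `keepAliveTimeout` is raised above the load balancer's idle timeout while `headersTimeout` stays at its 60-second default, so the re-armed headers timer can expire during the idle period and close the socket before `keepAliveTimeout` does.

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Raised above the load balancer's 60 s idle timeout, as recommended earlier...
server.keepAliveTimeout = 75 * 1000;
// ...but headersTimeout is left at its default of 60 * 1000, so since 10.15.2 the
// re-armed headers timer can tear the idle connection down before keepAliveTimeout,
// i.e. right in the window where the load balancer still considers it usable.

server.listen(8080);
```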
I see. So it looks like an additional place to reset the timer would be at the beginning of a new request? And `parserOnIncoming` is only called once the headers are parsed, so it would need to be some other place then.

P.S. I will run our tests with an increased `headersTimeout` today to see if it helps.
So we have applied the workaround (`headersTimeout > keepAliveTimeout`) and the errors are gone. 🎉
I faced this issue too. I configured my Nginx load balancer to use keep-alive when connecting to Node upstreams. I had already seen it dropping connections once and found the reason, so I was surprised to see it happening again after switching to Node 10: Nginx reports that Node closed the connection unexpectedly and then disables that upstream for a while.

I have not seen the problem since tweaking the header timeout yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on.
Why doesn't anybody else find this bug alarming? My guess is that -
We're having the same problem after upgrading from 8.x to 10.15.3.
However, I don't think it's a regression from 10.15.2 to 10.15.3.
This discussion goes way back to issue #13391.
I forked the example code there and created a new test case that fails on all the following versions:
12.3.1, 10.15.3, 10.15.2, 10.15.1, 10.15.0, 8.11.2
The original code did not fail in a consistent way, which led me to believe there is some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.
So I tweaked the test so that:
The results are pretty consistent:
```
Error: socket hang up
    at createHangUpError (_http_client.js:343:17)
    at Socket.socketOnEnd (_http_client.js:444:23)
    at Socket.emit (events.js:205:15)
    at endReadableNT (_stream_readable.js:1137:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9) {
  code: 'ECONNRESET'
}
```
You can clone the code from yoavain/node8keepAliveTimeout and run:

```
npm install
npm run test -- --keepAliveTimeout 5000
```

(Note that the keepAliveTimeout is also the interval between client requests.)

When setting keepAliveTimeout to 0, the problem is gone:

```
npm run test -- --keepAliveTimeout 5000 --keepAliveDisabled
```
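The gist of the timing, as a stand-alone sketch (not the repo's code; port and values are hypothetical): a keep-alive client reuses the idle socket right when the server's keep-alive window expires, so the request can race the server-side teardown and fail with `ECONNRESET` / socket hang up.

```js
const http = require('http');

const KEEP_ALIVE_TIMEOUT = 5000;

const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = KEEP_ALIVE_TIMEOUT;

const agent = new http.Agent({ keepAlive: true });

function request() {
  return new Promise((resolve, reject) => {
    http
      .get({ port: 3000, agent }, (res) => {
        res.resume();
        res.on('end', resolve);
      })
      .on('error', reject);
  });
}

server.listen(3000, async () => {
  await request(); // first request establishes the keep-alive connection
  // Reuse the idle socket exactly when the server's keep-alive window runs out.
  setTimeout(async () => {
    try {
      await request();
      console.log('second request succeeded');
    } catch (err) {
      console.error('second request failed:', err.code); // e.g. ECONNRESET
    } finally {
      agent.destroy();
      server.close();
    }
  }, KEEP_ALIVE_TIMEOUT);
});
```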
Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS load balancer knew about it. Once I set the ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that one), the problem went away.
> The original code did not fail in a consistent way, which led me to believe there is some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.
I'm pretty sure we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s, and we were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to the Node apps. On all these occasions the problem occurred on connections established by Nginx to Node. Right at the default 5-second keepAliveTimeout on the Node side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically outside the 5-second limit on the Node side, but by less than 2 ms). Node accepted this new request over the existing connection and responded with an ACK packet, then less than 2 ms later followed up with an RST packet closing the connection.

However, stracing the Node process I could see the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection. I would second the thought that there is a slight race condition between the point at which the connection is about to be closed by Node and it still accepting an incoming request.
To avoid it, we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections: http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
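For reference, a sketch of that setup with illustrative numbers (the handler and port are placeholders; nginx's upstream keepalive_timeout defaults to 60 s per the linked docs):

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Keep idle sockets open longer than nginx's upstream keepalive_timeout (60 s by
// default), so nginx always closes the connection first and never reuses a socket
// that Node is about to tear down.
server.keepAliveTimeout = 65 * 1000;
// Depending on the Node version, headersTimeout may also need to stay above
// keepAliveTimeout (see the workaround earlier in the thread).
server.headersTimeout = 70 * 1000;

server.listen(8080);
```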
A screenshot of a packet capture taken on the Node.js side of the connection is attached:
Wow, very interesting thread. I suspect we're facing a similar issue in App Engine Node.js Standard: ~100 502 errors a day out of ~1M requests per day total (~0.01% of all requests).
Can confirm this is still the case in the v10.19.0 release.
We investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read even more in our investigation details. The following configuration lines did indeed solve the issue:

```js
server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000;
```
This is what solved it for us in App Engine:

```js
this.server.keepAliveTimeout = 600 * 1000;
```
@mcollina @OrKoN Is there a plan to release 10.15.4 with the fix?
Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or a similar release.
Would love to see it in 10.20.x or similar if it's at all feasible to port it back.
Note that this also appears to be broken in v12, not just the recent v10s.
Since #32329 was merged, I no longer need to set `headersTimeout` bigger than `keepAliveTimeout`, right?
@yokomotod if you're asking for v12, v12.19.0 contains the fix
https://github.com/nodejs/node/compare/v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0