Node: Regression issue with keep alive connections

Created on 23 Apr 2019 · 23 comments · Source: nodejs/node

  • Version: 10.15.3
  • Platform: Linux
  • Subsystem: http

Hi,

We updated the Node version from 10.15.0 to 10.15.3 for a service which runs behind the AWS Application Load Balancer. After that, our test suite revealed an issue we didn't see before the update: HTTP 502 errors returned by the load balancer. Previously, this happened when the Node.js server closed a connection before the load balancer did. We had solved it by setting server.keepAliveTimeout = X, where X is higher than the keep-alive timeout on the load balancer side.
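
For reference, this is roughly what our workaround looked like, assuming the ALB is left at its default 60-second idle timeout (the port and the 65-second value below are placeholders, not our real configuration):

const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// The ALB idle timeout defaults to 60 s; keeping Node's keep-alive timeout
// above it means the load balancer always closes idle connections first.
server.keepAliveTimeout = 65 * 1000; // placeholder, must exceed the ALB idle timeout

server.listen(3000);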

With version 10.15.3, setting server.keepAliveTimeout = X does not work anymore and we see regular 502 errors from the load balancer. I have checked the changelog for Node.js, and it seems there was a change related to keep-alive connections in 10.15.2 (https://github.com/nodejs/node/commit/1a7302bd48) which might have caused the issue we are seeing.

Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.

All 23 comments

// cc @nodejs/http

The slowloris mitigations only apply to the HTTP headers parsing stage. Past that stage the normal timeouts apply (barring bugs, of course.)

Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?

In our test suite, there are about 250 HTTP requests. I have run the test suite four times for each of the following Node versions: 10.15.0, 10.15.1, and 10.15.2. For 10.15.0 & 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures (HTTP 502) per test suite run. A different test case fails in every run, so the failures are not deterministic.

I tried to build a simple Node server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce it. Timing and the speed of the client might matter.

I guess that headersTimeout should be longer than keepAliveTimeout because, after the first request of a keep-alive connection, headersTimeout is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you put a headersTimeout longer than keepAliveTimeout?

Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow headersTimeout seems to work only when headers are sent in multiple packets.)
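
For anyone who wants to poke at this locally, here is a rough sketch of the kind of raw client that splits the headers across two packets (this is not the actual test case; the port is a placeholder for a local HTTP server):

const net = require('net');

// Connect to a local HTTP server (placeholder port) and send the request
// headers in two separate writes with a pause in between, so that header
// parsing spans more than one packet.
const socket = net.connect(3000, 'localhost', () => {
  socket.write('GET / HTTP/1.1\r\nHost: localhost\r\n');
  setTimeout(() => {
    socket.write('Connection: keep-alive\r\n\r\n');
  }, 100);
});

socket.on('data', (chunk) => process.stdout.write(chunk));
socket.on('end', () => console.log('server closed the connection'));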

To illustrate the issue with an example of two requests on a keep-alive connection:

  1. A connection is made
  2. The server receives the first packet of the first request's headers
  3. The server receives the second packet of the first request's headers
  4. The server sends the response for the first request
  5. (...idle time...)
  6. The server receives the first packet of the second request's headers
  7. The server receives the second packet of the second request's headers

keepAliveTimeout works for 4-6 (the period between 4 and 6). headersTimeout works for 3-7. So headersTimeout should be longer than keepAliveTimeout in order to keep connections until keepAliveTimeout.

I wonder whether headersTimeout should include 3-6. 6-7 seems more intuitive for the name and should be enough for mitigating Slowloris DoS because 3-4 is up to the server and 4-6 is covered by keepAliveTimeout.

@shuhei so you mean that headersTimeout spans multiple requests on the same connection? I haven't tried changing headersTimeout because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should reset when a new request arrives, but it's defined by the first request for a connection.

@OrKoN Yes, headersTimeout spans parts of two requests on the same connection including the interval between the two requests. Before https://github.com/nodejs/node/commit/1a7302bd48, it was only applied to the first request. The commit started resetting the headers timer when a request is done in order to apply headersTimeout to subsequent requests in the same connection.

I see. So it looks like an additional place to reset the timer would be at the beginning of a new request? And parserOnIncoming is only called once the headers are parsed, so it needs to be some other place then.

P.S. I will run our tests with an increased headersTimeout today to see if it helps.

So we have applied the workaround (headersTimeout > keepAliveTimeout) and the errors are gone. 🎉
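
In code, the workaround boils down to something like this (the numbers are only illustrative, not the exact values from our service):

const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Idle keep-alive sockets are kept open for 70 s...
server.keepAliveTimeout = 70 * 1000;
// ...and the headers timer is slightly longer, so it can never fire
// before the keep-alive timeout does on an idle connection.
server.headersTimeout = 75 * 1000;

server.listen(3000);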

I faced this issue too. I configured my Nginx load balancer to use keepalive when connecting to Node upstreams. I already saw it dropping connections and found the reason. I switched to Node 10 after that and was surprised to see this happening again: Nginx reports that Node closed the connection unexpectedly and then Nginx disables that upstream for a while.

I have not seen this problem after tweaking header timeouts yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on.

Why doesn't anybody else find this bug alarming? My guess is that:

  1. there are no traces of it on Node instances itself. No log messages, nothing.
  2. Web users connecting to Node services directly may simply not notice that a few connections are dropped. The rate was not high in my case (maybe a couple dozen per day while we serve millions of connections daily), so the chance of a particular visitor experiencing this is relatively small.
  3. I found the bug indirectly, based on the load balancer's logs, and not everyone keeps a close eye on those logs.

We're having the same problem after upgrading from 8.x to 10.15.3.
However, I don't think it's a regression from 10.15.2 to 10.15.3.
This discussion goes way back to issue #13391.
I forked the example code there and created a new test case that fails on all the following versions:
12.3.1, 10.15.3, 10.15.2, 10.15.1, 10.15.0, 8.11.2

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

So I tweaked the test so that:

  1. The server's time to answer a request is 3 * keepAliveTimeout minus a few (random) milliseconds (keeping 2-3 requests alive).
  2. The client fires not just one request, but a request every (exactly) keepAliveTimeout. This makes sure that the client requests are aligned with the server connection's keepAliveTimeout (a rough sketch of the idea follows).
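
A rough sketch of that setup (not the exact code from the repo linked below; port, delays and error handling are simplified):

const http = require('http');

const KEEP_ALIVE_TIMEOUT = 5000; // matches --keepAliveTimeout 5000 below

const server = http.createServer((req, res) => {
  // Answer after roughly 3 * keepAliveTimeout minus a few random milliseconds,
  // so that 2-3 requests are in flight on the connection at any time.
  const delay = 3 * KEEP_ALIVE_TIMEOUT - Math.floor(Math.random() * 10);
  setTimeout(() => res.end('ok'), delay);
});
server.keepAliveTimeout = KEEP_ALIVE_TIMEOUT;

server.listen(3000, () => {
  const agent = new http.Agent({ keepAlive: true });
  // Fire a request every keepAliveTimeout, aligned with the server's timer.
  setInterval(() => {
    http.get({ port: 3000, agent }, (res) => res.resume())
      .on('error', (err) => console.error(err.code)); // ECONNRESET shows up here
  }, KEEP_ALIVE_TIMEOUT);
});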

The results are pretty consistent:

Error: socket hang up
    at createHangUpError (_http_client.js:343:17)
    at Socket.socketOnEnd (_http_client.js:444:23)
    at Socket.emit (events.js:205:15)
    at endReadableNT (_stream_readable.js:1137:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9) {
  code: 'ECONNRESET'
}

You can clone the code from yoavain/node8keepAliveTimeout

npm install
npm run test -- --keepAliveTimeout 5000
(Note that the keepAliveTimeout is also the client request interval.)

When setting keepAliveTimeout to 0, the problem is gone.

npm run test -- --keepAliveTimeout 5000 --keepAliveDisabled

Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS Load Balancer knew about it. Once I set the ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that last one), the problem went away.

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

I'm pretty sure we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s. We were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to the Node.js apps. On all these occasions the problem occurred on connections established by Nginx to Node. Right at the default 5-second keepAliveTimeout on the Node.js side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically just outside the 5-second timeout limit on the Node side, by <2 ms). Node.js accepted this new request over the existing connection, responded with an ACK packet, then <2 ms later followed up with a RST packet closing the connection.

However, stracing the Node.js process, I could see that the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection. I would second the thought that there is a slight race condition between the point where the connection is about to be closed by Node.js and it still accepting an incoming request.

To avoid this, we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
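
On the Node side this is again a one-liner; the value just needs to be above Nginx's upstream keepalive_timeout, which defaults to 60 seconds (the number below is only an example):

// Nginx closes idle upstream connections after keepalive_timeout (60 s by
// default), so Node only needs a keep-alive timeout a bit higher than that.
server.keepAliveTimeout = 65 * 1000; // example value, > Nginx's keepalive_timeout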

A screenshot of a packet capture taken on the Node.js side of the connection is attached.

Wow, very interesting thread. I have a suspicion that we're facing a similar issue in AppEngine Node.js Standard: ~100 502 errors a day out of ~1M requests per day total (~0.01% of all requests).

Can confirm this is still the case in the v10.19.0 release.

We have investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read even more in our investigation details. The following configuration lines did indeed solve the issue:

server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000;

This is what solved it for us in AppEngine:

this.server.keepAliveTimeout = 600 * 1000

@mcollina @OrKoN Is there a plan to release 10.15.4 with the fix?

Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or a similar release.

Would love to see it in 10.20.x or similar if it's at all feasible to port it back.

Note that this also appears to be broken in v12, not just the recent v10s.

Since #32329 was merged, I no longer need to set headersTimeout bigger than keepAliveTimeout, right?

@yokomotod if you're asking about v12, v12.19.0 contains the fix:

https://github.com/nodejs/node/compare/v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0
